Recently at work, we were performance testing a few use cases of an application which are completely bottlenecked by the database. Actually, we were trying to figure out how much data we could input in the system, while sustaining the required response times. The "input into system" use cases obviously involves a few "insert" operations on the database, but also a few "select" queries on what is already in the database for verification of the inserted data etc. Initially we simply wanted to see how the performance of this "input into system" use case degrades when the database gets filled more and more.
A colleague of mine performed the tests (thanks Bart, if you are reading this), and created some nice graphs of the results. One of these graphs looks like this:
On the vertical axis you can see the response times. On the horizontal axis the number of "input into system" operations that have already been performed.
You notice that, as expected, the performance degrades as the database is filled with more data. But something strange is visible: the response times bounce up and down between some kind of minimum and maximum. Response times for individual requests seem to be either "slow", either "fast". We were convinced that if we could find the cause of the response times sometimes being slow, we could fix it, and all response times would be on the lowest line in the graph.
This is where the discussion started to diverge. We started thinking about all kinds of possible explanations as to why this could be the case. Many graphs where produced and many tests conducted, to try and figure out what the underlying cause could be.
Although this kind of reasoning is highly enjoyable, and watching all these graphs is very intriguing indeed, I proposed a more pragmatic approach: lets just find the slowest query in the system and try to improve that one... without bothering about trying to explain the strange graphs. Or as Kirk Pepperdine would say: "big gains first".
A quick check with vmstat confirmed our initial idea that the database was indeed the bottleneck. The MySQL process was using all of the CPU (user time). This also meant that I/O was not the problem. MySQL probably had all the data cached in memory, but some query (or queries) demanded lots of CPU time to be executed.
Next we added logging of the database queries being executed using P6Spy. We let the test run for a few minutes and then opened the log in IroneEye. This gives a nice overview of all the queries that have been executed, with their execution times. One query came out as being about 5 times slower as the other ones. We had a look at the actual query, and it turned out not only to be slow but also incorrect. The corrected version of the query had to do less joins, and was much faster. Running the test again resulted in a constant linear degradation of the response times, as expected.
Actually, when running the test for a longer time, the bouncing up and down pattern came back. This time it was much less severe though, and it happened only when much more data was already in the system. We'll probably have to hunt down the next slowest query now, in order to try and improve the situation even more.
Taken all together, it took us only 30 minutes or so to perform the test with P6Spy, analyze the logs with IroneEye, find the slowest query, fix the query and run the test again. Compared to endless discussions about possible explanations for the strange behaviour, the pragmatic approach clearly won in this case. Agreed, the pragmatic approach didn't help us to explain why the response times are sometimes slow, sometimes fast... but who cares, it's much faster now anyway, and we fixed an incorrect query in the process.
On a side note... P6Spy and IroneEye seem both rather dead. IroneEye can not even be downloaded anymore. Does anybody know a more alive (and free) alternative? I could probably look into the tools of the database vendors (MySQL in our case) themselves, but they are typically not free and maybe also not so easy to setup?