Wednesday, January 20, 2010

Performance tuning tools for a multi core world

Recently I was performance tuning a Java application on a quad core Intel Xeon X5560. This processor has hyper threading, so it shows up in the operating system as if it were 8 cores.

My application was slower than I wanted it to be, and I was faced with this vmstat output:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 1048660 209840 199608 4123568    0    0   150    24 1577 2599 13  9 77  1  0
 2  0 1048660 210220 199612 4123712    0    0   104    82 1593 2891 13 10 75  2  0
 1  1 1048660 209260 199628 4124596    0    0   484   210 1537 2567 10  7 77  6  0
 1  1 1048660 213940 199612 4120184    0    0  1232    32 1548 2694  9  7 76  8  0

At first sight, I thought it looked ok. That's between 9 and 13 % CPU time in user space, between 7 and 10 % in system space, and 75 to 77 % is idle. So it looked like my application is not using much of the CPU at all...

My application was still slower than I wanted it to be, so I went on looking for reasons why it was not using more of the CPU. I took some thread dumps and used the YourKit profiler to measure the time threads were waiting for locks. I couldn't find anything.

Then I had a look at what "top" had to say, in stead of using vmstat. It showed comparable numbers:
Cpu(s): 14.6%us, 10.5%sy,  0.0%ni, 73.2%id,  1.2%wa,  0.0%hi,  0.5%si,  0.0%st

But it says "Cpu(s)", so I hit "1" to get a view per CPU... now things started to get interesting:
Cpu0  : 12.7%us,  8.7%sy,  0.0%ni, 73.3%id,  3.3%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu1  : 13.2%us,  8.3%sy,  0.0%ni, 77.5%id,  0.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 14.3%us,  8.3%sy,  0.0%ni, 77.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 18.0%us,  9.7%sy,  0.0%ni, 70.7%id,  0.7%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu4  : 18.0%us, 17.3%sy,  0.0%ni, 64.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 15.3%us, 12.0%sy,  0.0%ni, 72.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 11.6%us,  8.9%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 13.3%us,  9.0%sy,  0.0%ni, 77.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st

Now of course, the problem was staring me in the face. The CPU times of vmstat and top (and many other tools) are relative to the total amount of CPUs. With 8 cores, this means that (for a single thread) 12.5% represents a CPU usage of 100% on only one CPU. With this in mind, the numbers of vmstat start to make sense. They mean that more or less 1 CPU is fully busy in user space, and one other CPU is almost fully busy in system space. In other words, my application would probably be faster if it did less I/O and more concurrent processing.

It also means that, with more and more CPUs, tools like vmstat are becoming less usefull. With only one CPU, you get a scale from 0 to 100. With 8 cores, the scale is reduced from 0 to 12. The granularity of the output of vmstat is 1 percentage, so with 100 cores the difference between some random "noise" on an otherwise idle machine and an idle machine except for 1 core which is fully used, will become invisible.

Its not only vmstat of course, the output of the CPU profiling in YourKit for example, has the same problem. The graph stays around 12.5%:


So to conclude:

I think we might need to update our profiling tools. We can also add new tools to our profiling toolbox. I experimented a little with mpstat for example, which gives much of the same information as vmstat, but does it per CPU.

And of course, as we all already know, we also also have to update our applications. Concurrency is not going away any time soon.

1 comment:

Anonymous said...

Good point. The tools for multicore performance are severely limited. I wonder why not many people seem to be affected by it