Wednesday, January 20, 2010

Performance tuning tools for a multi core world

Recently I was performance tuning a Java application on a quad-core Intel Xeon X5560. This processor has hyper-threading, so it shows up in the operating system as 8 cores.

My application was slower than I wanted it to be, and I was faced with this vmstat output:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 1048660 209840 199608 4123568    0    0   150    24 1577 2599 13  9 77  1  0
 2  0 1048660 210220 199612 4123712    0    0   104    82 1593 2891 13 10 75  2  0
 1  1 1048660 209260 199628 4124596    0    0   484   210 1537 2567 10  7 77  6  0
 1  1 1048660 213940 199612 4120184    0    0  1232    32 1548 2694  9  7 76  8  0

At first sight, it looked OK: between 9 and 13% CPU time in user space, between 7 and 10% in system space, and 75 to 77% idle. So it looked like my application was hardly using the CPU at all...

My application was still slower than I wanted it to be, so I went on looking for reasons why it was not using more of the CPU. I took some thread dumps and used the YourKit profiler to measure the time threads were waiting for locks. I couldn't find anything.

Then I had a look at what "top" had to say, instead of vmstat. It showed comparable numbers:
Cpu(s): 14.6%us, 10.5%sy,  0.0%ni, 73.2%id,  1.2%wa,  0.0%hi,  0.5%si,  0.0%st

But it says "Cpu(s)", so I hit "1" to get a view per CPU... now things started to get interesting:
Cpu0  : 12.7%us,  8.7%sy,  0.0%ni, 73.3%id,  3.3%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu1  : 13.2%us,  8.3%sy,  0.0%ni, 77.5%id,  0.7%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu2  : 14.3%us,  8.3%sy,  0.0%ni, 77.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 18.0%us,  9.7%sy,  0.0%ni, 70.7%id,  0.7%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu4  : 18.0%us, 17.3%sy,  0.0%ni, 64.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 15.3%us, 12.0%sy,  0.0%ni, 72.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 11.6%us,  8.9%sy,  0.0%ni, 79.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 13.3%us,  9.0%sy,  0.0%ni, 77.3%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st

Now of course, the problem was staring me in the face. The CPU percentages of vmstat and top (and many other tools) are relative to the total number of CPUs. With 8 cores, this means that (for a single thread) 12.5% represents 100% usage of one CPU. With this in mind, the vmstat numbers start to make sense: roughly one CPU is fully busy in user space, and another CPU is almost fully busy in system space. In other words, my application would probably be faster if it did less I/O and more concurrent processing.
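To make the arithmetic explicit, here is a trivial sketch (the helper name is mine) that converts an aggregate CPU percentage into "cores' worth" of busy time:

```java
public class CpuScale {

    // Translate an aggregate CPU percentage (as reported by vmstat or top)
    // into the equivalent number of fully busy cores.
    static double coresBusy(double aggregatePercent, int logicalCpus) {
        return aggregatePercent * logicalCpus / 100.0;
    }

    public static void main(String[] args) {
        // With 8 logical CPUs, 13% user time is about one fully busy core...
        System.out.println(coresBusy(13.0, 8)); // prints 1.04
        // ...and 12.5% is exactly one core.
        System.out.println(coresBusy(12.5, 8)); // prints 1.0
    }
}
```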

It also means that, with more and more CPUs, tools like vmstat are becoming less useful. With only one CPU, you get a scale from 0 to 100. With 8 cores, the useful scale for a single thread shrinks to 0 to 12.5. The granularity of the output of vmstat is 1 percentage point, so with 100 cores the difference between some random "noise" on an otherwise idle machine and a machine that is idle except for one fully used core will become invisible.

It's not only vmstat, of course: the output of the CPU profiling in YourKit, for example, has the same problem. The graph stays around 12.5%:


So to conclude:

I think we might need to update our profiling tools. We can also add new tools to our profiling toolbox. For example, I experimented a little with mpstat, which gives much the same information as vmstat, but per CPU.

And of course, as we all already know, we also have to update our applications. Concurrency is not going away any time soon.

Sunday, January 3, 2010

A Scala library for IPv4 related concepts

I've been using the Xmas holidays to write lots of Scala. One of the things I wrote, partially just to learn some more Scala, partially because I believe it can be useful, is a library for working with IPv4-related concepts: IP addresses, IP networks, ranges of IP addresses etc.

In this post, I'll give a short overview of the library, and share some thoughts about my Scala experience while writing it.

The library exposes a few core concepts which can be useful when doing calculations with IP addresses etc.

IpAddress

IpAddress represents an IPv4 address.
val address = new IpAddress("192.168.0.1")
IpAddress can be used to make calculations on IP addresses, such as finding the next address.
val address = new IpAddress("192.168.0.1")
val next = address + 1
println(next == new IpAddress("192.168.0.2")) // prints true

IpAddressRange

IpAddressRange represents a range of consecutive IP addresses.
val range = new IpAddressRange(new IpAddress("192.168.0.1"), new IpAddress("192.168.0.5"))
println(range.contains(new IpAddress("192.168.0.3"))) // prints true

IpNetworkMask

IpNetworkMask represents a network mask, to be used in an IpNetwork.
val mask1 = new IpNetworkMask("255.255.255.128")
val mask2 = IpNetworkMask.fromPrefixLength(25)
println(mask1 == mask2) // prints true

val invalid = new IpNetworkMask("255.255.255.100") // throws exception

IpNetwork

An IpNetwork is a range (extends IpAddressRange) that can be expressed as a network address and a network mask.
val network1 = new IpNetwork(new IpAddress("192.168.0.0"), new IpNetworkMask("255.255.255.0"))
val network2 = new IpNetwork("192.168.0.0/24")
println(network1 == network2) // prints true

IpAddressPool

An IpAddressPool is a range (extends IpAddressRange) in which certain addresses are "allocated" and others are "free". It could, for example, be used as part of a DHCP server implementation.
var pool = new IpAddressPool(new IpAddress("1.2.3.4"), new IpAddress("1.2.3.10"))
println(pool.isFree(new IpAddress("1.2.3.6"))) // prints true
pool.allocate(new IpAddress("1.2.3.6")) match {
    case (newPool, allocated) => {
            println(newPool.isFree(new IpAddress("1.2.3.6"))) // prints false
    }
}

And Much More

This was only a short introduction. Much more can be done with these types. Have a look at the scaladoc to get an idea of the possibilities.

What I liked about Scala

I became a fan of Scala some time ago already, but there were a few things which struck me as especially nice or elegant while writing this library:
Lazy lists
The IpAddressRange class has a method to list all addresses in the range. Also, the IpAddressPool class has methods to list all the free and all the allocated addresses. With big ranges, it becomes practically impossible to return an array (or similar) containing all the addresses. Scala has the Stream concept to deal with this. A Stream is effectively a "lazy list": a List whose elements are only evaluated when they are needed. With a recursive implementation, this results in very elegant code:
def addresses(): Stream[IpAddress] = {
    if (first < last) {
        Stream.cons(first, new IpAddressRange(first + 1, last).addresses)
    } else {
        Stream.cons(first, Stream.empty)
    }
}
Functional or Procedural
Although I try to write most of my Scala code in as "functional" a style as possible, sometimes it is nice to be able to fall back on a procedural implementation. To do calculations on IP addresses and network masks, a lot of "bit twiddling" is required. I usually found it easier to write that in a procedural style than in a purely functional one (although that could be due to my limited functional experience as well), e.g.:
def fromLongToPrefixLength(value: Long): Int = {
    val lsb: Long = value & 0xFFFFFFFFL
    var result: Int = 0
    var bit: Long = 1L << 31

    while (((lsb & bit) != 0) && (result < 32)) {
      bit = bit >> 1
      result += 1
    }
    result
}
Option
The Option type needs no introduction, but it's so much more fun than using null values all over the place.
Immutability
Scala favours immutable types. This naturally steered me towards making all types in my library immutable as well. Because you can rely on immutable data structures in the Scala library (List, SortedSet, ...), it is much easier than in Java to get it right.
What I might dislike about Scala
I encountered a few things which seemed a bit awkward or strange. Most of the issues are due to my lack of experience, so I'll give myself some more time to figure out whether they are really things I don't like or not.
Tuples
Tuples, for example, are a concept that I am in doubt about. I used them extensively in the IpAddressPool class. For example, when allocating an address in a pool, the method returns the new pool (because IpAddressPool is immutable) together with an Option of the allocated address (None if the address was not free):
def allocate(toAllocate: IpAddress): (IpAddressPool, Option[IpAddress]) = {...}
This is really easy on the implementation side, but I'm not sure whether I like the client code so much. You either have to access the values in the tuple with "._1" and "._2", which doesn't read easily, or you have to pattern match, which seems a bit verbose. You can also assign the result to a new tuple directly:
val pool = new IpAddressPool(new IpAddress("1.2.3.1"), new IpAddress("1.2.3.10"))
val (newPool, allocatedAddress) = pool.allocate(new IpAddress("1.2.3.6"))
but it doesn't seem to be possible to assign the result of a subsequent invocation to the same tuple again (using vars):
val pool = new IpAddressPool(new IpAddress("1.2.3.1"), new IpAddress("1.2.3.10"))
var (newPool, allocatedAddress) = pool.allocate(new IpAddress("1.2.3.6"))

// this won't compile
(newPool, allocatedAddress) = pool.allocate(new IpAddress("1.2.3.7"))
Collections
I've had some usability issues with the collections library. It is being reworked for the upcoming Scala 2.8 release, so things will definitely improve.
I also found a bug in the TreeSet implementation of SortedSet, causing the tree to become heavily unbalanced, which results in StackOverflowErrors when recursively traversing it (I hope this can be fixed for 2.8 as well).

Conclusion

Besides these minor issues, it was a very positive experience. I happen to have a Java version of this library as well, and I certainly like the Scala version much better (both in using it and in writing it).

Wednesday, December 9, 2009

The Dreadful JMX Tunneling Problem

JMX is notoriously difficult to access remotely. It relies on RMI, which means that you need a connection to an RMI registry and a separate connection to an RMI server. The port (and host name) to be used for the second connection are assigned by the RMI registry, and can thus not be "guessed" in advance. This makes it almost impossible to set up the necessary port forwarding or ssh tunnels to access JMX on a server which is not directly in your network.

We used to use JBoss, which provides an HTML wrapper around the MBean server ("jmx-console"), thus allowing easy access to our MBeans. It's not as nice and graphical as JConsole, but at least HTTP is easy enough to tunnel through ssh or firewalls.

After switching to Tomcat (for various other reasons), I was looking for a solution to this "access our MBeans through ssh tunnels" problem. There are all kinds of workarounds available to make RMI "tunnelable", but everything I tried came with its own set of problems (for example having to append stuff to the class path of JConsole).

Since tunneling turned out not to be a viable option, I started looking for HTML-based alternatives like the "jmx-console" from JBoss. I found a couple of them:
• JManage looks very nice and feature-complete, but it has the downside of being a standalone application running an embedded Jetty, rather than something I can simply deploy in Tomcat.
• On some mailing list I found somebody who once created an application called "jmx-console" which does what I want. I tried it and it seemed to work OK, but it's an old and unmaintained project.
• Coldbeans has an application written entirely in JSP. When testing it, I found it somewhat cumbersome to work with though, and I couldn't seem to access all my MBean operations (as if some of them were "read only").

During my search, I learned that the original reference implementation of JMX (which now lives on as OpenDMK at http://opendmk.dev.java.net/) contains a class named HtmlAdaptorServer, which didn't make it into JDK 5 when JMX was first introduced. The reference implementation is still usable on its own though (available in the Maven 2 repository as com.sun.jdmk:jmxtools:1.2.1). This HtmlAdaptorServer does exactly what I need: expose MBeans in a straightforward HTML view. I simply instantiate it in a ServletContextListener like this:

final HtmlAdaptorServer adapter = new HtmlAdaptorServer(8000); // the port to serve the HTML view on
ManagementFactory.getPlatformMBeanServer().registerMBean(adapter, new ObjectName("my.domain:name=htmladapter,port=8000"));
adapter.start();

Although this is a good enough solution for me right now, I still think some configuration options to make JMX/RMI easily 'tunnelable' (being able to fix the port and host name would probably be enough) are something we should consider adding to the JDK...
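For completeness, the workaround I alluded to (fixing both ports yourself) can be done programmatically by starting your own JMXConnectorServer instead of relying on the built-in agent. This is only a sketch under my own assumptions (port numbers and class name are arbitrary, and I haven't battle-tested it), but with both ports known, two plain ssh -L tunnels suffice:

```java
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;
import java.util.HashMap;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class FixedPortJmx {

    // Start a JMX connector with BOTH ports pinned: the RMI registry port
    // and the RMI server port, so neither is assigned randomly.
    public static JMXConnectorServer start(int registryPort, int serverPort) throws Exception {
        LocateRegistry.createRegistry(registryPort);
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // The first host:port is the RMI server, the second the registry.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi://localhost:" + serverPort
                + "/jndi/rmi://localhost:" + registryPort + "/jmxrmi");
        JMXConnectorServer server = JMXConnectorServerFactory
                .newJMXConnectorServer(url, new HashMap<String, Object>(), mbs);
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        JMXConnectorServer server = start(9999, 9998);
        System.out.println("JMX agent reachable at " + server.getAddress());
    }
}
```

JConsole can then connect through the tunnels using the same service URL.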

Thursday, October 29, 2009

A performance tuning story

Recently at work, we were performance testing a few use cases of an application which are completely bottlenecked by the database. Actually, we were trying to figure out how much data we could input into the system while sustaining the required response times. The "input into system" use case obviously involves a few "insert" operations on the database, but also a few "select" queries on what is already in the database, to verify the inserted data etc. Initially we simply wanted to see how the performance of this "input into system" use case degrades as the database gets filled more and more.

A colleague of mine performed the tests (thanks Bart, if you are reading this), and created some nice graphs of the results. One of these graphs looks like this:

On the vertical axis you can see the response times. On the horizontal axis the number of "input into system" operations that have already been performed.

You notice that, as expected, the performance degrades as the database is filled with more data. But something strange is visible: the response times bounce up and down between some kind of minimum and maximum. Response times for individual requests seem to be either "slow" or "fast". We were convinced that if we could find the cause of the occasionally slow response times, we could fix it, and all response times would be on the lowest line in the graph.

This is where the discussion started to diverge. We started thinking about all kinds of possible explanations as to why this could be the case. Many graphs were produced and many tests conducted, to try and figure out what the underlying cause could be.

Although this kind of reasoning is highly enjoyable, and watching all these graphs is very intriguing indeed, I proposed a more pragmatic approach: let's just find the slowest query in the system and try to improve that one, without bothering to explain the strange graphs. Or as Kirk Pepperdine would say: "big gains first".

A quick check with vmstat confirmed our initial idea that the database was indeed the bottleneck. The MySQL process was using all of the CPU (user time). This also meant that I/O was not the problem. MySQL probably had all the data cached in memory, but some query (or queries) demanded lots of CPU time to be executed.

Next we added logging of the database queries being executed, using P6Spy. We let the test run for a few minutes and then opened the log in IronEye. This gives a nice overview of all the queries that have been executed, with their execution times. One query came out as being about 5 times slower than the others. We had a look at the actual query, and it turned out to be not only slow but also incorrect. The corrected version of the query had to do fewer joins, and was much faster. Running the test again resulted in a constant, linear degradation of the response times, as expected.
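P6Spy does its logging by wrapping the real JDBC driver. The general technique (a proxy that times every call and logs it) can be sketched in a few lines; note that the Dao interface below is just a hypothetical stand-in for the JDBC interfaces P6Spy actually wraps:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// A generic timing proxy, in the spirit of what P6Spy does for JDBC calls.
public class TimingProxy {

    @SuppressWarnings("unchecked")
    public static <T> T wrap(final T target, Class<T> iface) {
        return (T) Proxy.newProxyInstance(
            iface.getClassLoader(),
            new Class<?>[] { iface },
            new InvocationHandler() {
                public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                    long start = System.nanoTime();
                    try {
                        return method.invoke(target, args);
                    } finally {
                        long micros = (System.nanoTime() - start) / 1000;
                        System.out.println(method.getName() + " took " + micros + " us");
                    }
                }
            });
    }

    // Hypothetical stand-in; in reality P6Spy wraps java.sql interfaces.
    public interface Dao {
        int countRows();
    }

    public static void main(String[] args) {
        Dao dao = wrap(new Dao() {
            public int countRows() { return 42; }
        }, Dao.class);
        System.out.println(dao.countRows()); // logs the timing, then prints 42
    }
}
```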

Actually, when running the test for a longer time, the bouncing up and down pattern came back. This time it was much less severe though, and it happened only when much more data was already in the system. We'll probably have to hunt down the next slowest query now, in order to try and improve the situation even more.

All in all, it took us only 30 minutes or so to perform the test with P6Spy, analyze the logs with IronEye, find the slowest query, fix the query and run the test again. Compared to endless discussions about possible explanations for the strange behaviour, the pragmatic approach clearly won in this case. Agreed, the pragmatic approach didn't help us explain why the response times are sometimes slow and sometimes fast... but who cares: it's much faster now anyway, and we fixed an incorrect query in the process.

On a side note... P6Spy and IronEye both seem rather dead. IronEye cannot even be downloaded anymore. Does anybody know a more alive (and free) alternative? I could probably look into the tools of the database vendors (MySQL in our case) themselves, but they are typically not free and maybe also not so easy to set up?

Tuesday, August 11, 2009

Mysterious JVM crashes explained

We've recently suffered from mysterious crashes of the JVM running our application server. The JVM (Sun JDK 1.5.0_06, but it was later reproduced with JDK 1.6.0_14 as well) simply crashed with all sorts of memory-related errors. A few examples:

java.lang.OutOfMemoryError: requested 8388608 bytes for uint64_t in /BUILD_AREA/jdk1.5.0_06/hotspot/src/share/vm/utilities/growableArray.cpp. Out of swap space?
java.lang.OutOfMemoryError: requested 157286476 bytes for JvmtiTagHashmapEntry* in /BUILD_AREA/jdk1.5.0_06/hotspot/src/share/vm/prims/jvmtiTagMap.cpp. Out of swap space?
java.lang.OutOfMemoryError: requested 16 bytes for CHeapObj-new. Out of swap space?

The server was not out of swap space and it had plenty of free memory left, so that was not the problem.

I started googling around and found lots of different mentions of these kinds of problems, but there turned out to be many possible causes.

In our case, the problem turned out to be that we allowed the JVM to use too much memory... The JVM was configured with a 2GB heap (and 256MB of perm size), and was running on a 32-bit machine with 4GB of memory (Linux 2.6.x kernel without hugemem or anything like that). As it turns out, in this kind of configuration, Linux allows each process to allocate at most a little over 3GB. When our application was requiring a lot of memory, the JVM was trying to allocate (perhaps temporarily, to perform garbage collection) more memory than the Linux kernel allows, resulting in one of the above types of crashes.
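A back-of-the-envelope budget makes this concrete. The thread count and native overhead below are rough assumptions on my part, not measured values:

```java
public class MemoryBudget {

    // Sum the main native memory consumers of a JVM process, in MB.
    static long totalMb(long heapMb, long permMb, long stacksMb, long nativeMb) {
        return heapMb + permMb + stacksMb + nativeMb;
    }

    public static void main(String[] args) {
        long heap = 2048;          // -Xmx2048m
        long permGen = 256;        // -XX:MaxPermSize=256m
        long threadStacks = 200;   // assumption: ~200 threads at ~1MB stack each
        long nativeOverhead = 300; // assumption: JIT code cache, GC bookkeeping, NIO buffers, ...

        // prints 2804, already close to the ~3GB per-process limit; any
        // temporary extra allocation can push the process over the edge.
        System.out.println(totalMb(heap, permGen, threadStacks, nativeOverhead));
    }
}
```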

Configuring the JVM with a smaller heap (1.5GB) caused our application to simply run out of memory with a regular "java.lang.OutOfMemoryError: Java heap space". But now at least, it didn't just crash, and after a few hours with a memory profiler (YourKit), the cause of the excessive memory usage in our application was found as well.

I still wonder whether the JVM or the Linux kernel could arrange for a more meaningful error message, though.