Tuning Java Garbage Collection for Web Services

Jacob Kessler

June 13, 2011

Since I've been declared the local Java garbage collection guru^[1], I've found myself answering the same kinds of questions over and over: "Why am I getting frequent GC pauses?", "How do I avoid long GC pauses?", "Concurrent mode failure - how bad is that?" In the spirit of teaching people to fish, I thought I'd put together a blog post on how you (yes, you!) can tune the garbage collector for something like a service at LinkedIn, so that it runs smoothly without lengthy, performance-degrading pauses to clean up^[2]. Please note that this is a guide to tuning for something similar to a LinkedIn service, so we'll be making some assumptions about how the program uses memory. While the basic principles should be broadly applicable, the techniques and analysis may not work well on programs that use their memory differently.

So, the good news is that in many cases, it's enough to simply turn on CMS and leave it at that (don't you love it when the defaults are good enough most of the time?). Of course, that's many cases, not all cases, and it's not good enough to have your website performing well only in many cases. It doesn't help that the situations in which the default settings struggle are the times of highest load on the servers, which are of course the times when you least want to have issues. Another bit of good news is that once you've figured out what you're looking at in the full Java garbage collection logs^[3], tuning for those situations is still fairly easy^[4].

Crash course on reading GC logs

Young Generation collections (most of your total collections)

Log file

Explanation

Now: survivor spaces are somewhat tricky. There are two of them, but (ideally) you only use one at a time. They hold objects that have survived (meaning, were still reachable during) previous garbage collections but that we 'suspect' will die (meaning, become unreachable) soon, so the GC doesn't want to move them into the old generation, where collecting them will be expensive. The important thing to remember, for now, is that keeping things in the survivor spaces is expensive in terms of collection time^[6], but better than moving them to the old gen if they are going to die in the next collection or two. At LinkedIn, where almost all of our memory use is for in-progress requests, we find that promoting things after two or three collections works best.
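To make the aging-and-promotion idea concrete, here's a toy model of a single young generation collection. It is only a sketch, not how the Hotspot collector is implemented: the class and method names are made up for illustration, dead objects are simply left out of the input, and survivor space overflow is ignored.

```java
/** Toy model of survivor aging; the names are illustrative, not a Hotspot API. */
final class SurvivorToyModel {

    /**
     * Simulates one young collection over the objects that are still reachable.
     *
     * @param liveAges ages of the live objects (0 = freshly allocated, never collected)
     * @param maxAge   highest age an object may reach while staying in a survivor space
     * @return new ages of the objects copied into the other survivor space;
     *         every live object missing from the result was promoted to the old gen
     */
    static int[] collect(int[] liveAges, int maxAge) {
        return java.util.Arrays.stream(liveAges)
                .map(age -> age + 1)            // surviving a collection makes an object one older
                .filter(age -> age <= maxAge)   // anything older than maxAge is promoted instead
                .toArray();
    }

    public static void main(String[] args) {
        int[] live = {0, 0, 1, 2};              // two new objects, two already in a survivor space
        int[] survivors = collect(live, 2);
        int promoted = live.length - survivors.length;
        System.out.println("survivor ages: " + java.util.Arrays.toString(survivors)
                + ", promoted: " + promoted);
    }
}
```

With a maximum age of 2, the object that was already at age 2 is the only one promoted; the other three are copied to the other survivor space, which is the copying cost described in footnote [6].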

CMS (hopefully occasional)

CMS is a lot more verbose, so I've cleared out a lot of the 'unimportant' stuff from the logs. It's all useful, but for our purposes only a few of the phases matter.

Log file

Explanation

There's a bunch of other stuff in there (the concurrent mark and sweep), but it isn't as important - mostly what we will be paying attention to is when the collection starts (in terms of how full the old gen is) and how long it takes. If the old gen fills up before the sweep is done, then the VM needs to pause while it finishes the collection. For Ancient Historical Reasons, the collector that it relies on during the pause is the single-threaded old gen collector, which is almost never a good choice.

How to use the data in GC logs

Alright! Those of you who skipped should start reading again. Armed with our newfound knowledge of how to read GC logs, we are going to compute 6 numbers.

Allocation Rate: the size of the young generation divided by the time between young generation collections
Promotion Rate: the change in usage of the old gen over time (excluding collections)
Survivor Death Ratio: when looking at a log, the size of survivors in age N divided by the size of survivors in age N-1 in the previous collection
Old Gen collection times: the total time between a CMS-initial-mark and the next CMS-concurrent-reset. You'll want both your 'normal' and the maximum observed
Young Gen collection times: both normal and maximum. These are just the "total collection time" entries in the logs
Old Gen Buffer: the promotion rate * the maximum Old Gen collection time * (1 + a little bit)
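The arithmetic behind these numbers is simple enough to sketch in a few lines of Java. This is a rough illustration only: the class, the method names, the units (MB and seconds), and the sample values in main are all assumptions made up for the example, and the inputs are expected to be read off the GC log by hand.

```java
/** Back-of-the-envelope GC numbers; every input is read by hand from the GC log. */
final class GcNumbers {

    /** Allocation rate in MB/s: young gen size divided by the time between young collections. */
    static double allocationRate(double youngGenMb, double secondsBetweenYoungGcs) {
        return youngGenMb / secondsBetweenYoungGcs;
    }

    /** Promotion rate in MB/s: old gen growth between two log entries with no old gen collection in between. */
    static double promotionRate(double oldUsedStartMb, double oldUsedEndMb, double elapsedSeconds) {
        return (oldUsedEndMb - oldUsedStartMb) / elapsedSeconds;
    }

    /** Survivor death ratio as defined above: bytes at age N now / bytes at age N-1 in the previous collection. */
    static double survivorDeathRatio(long bytesAtAgeN, long bytesAtPreviousAgeInPreviousGc) {
        return (double) bytesAtAgeN / bytesAtPreviousAgeInPreviousGc;
    }

    /** Old gen buffer in MB: promotion rate * maximum old gen collection time * (1 + a little bit). */
    static double oldGenBuffer(double promotionRateMbPerSec, double maxOldGenCollectionSeconds, double margin) {
        return promotionRateMbPerSec * maxOldGenCollectionSeconds * (1.0 + margin);
    }

    public static void main(String[] args) {
        // Hypothetical values, chosen only to make the arithmetic easy to follow.
        double alloc = allocationRate(1024.0, 4.0);            // 1024 MB every 4 s = 256 MB/s
        double promo = promotionRate(2048.0, 2560.0, 64.0);    // 512 MB over 64 s = 8 MB/s
        double ratio = survivorDeathRatio(4_194_304L, 8_388_608L); // 4 MB left of 8 MB = 0.5
        double buffer = oldGenBuffer(promo, 8.0, 0.25);        // 8 MB/s * 8 s * 1.25 = 80 MB
        System.out.println("allocation rate MB/s: " + alloc);
        System.out.println("promotion rate MB/s: " + promo);
        System.out.println("survivor death ratio: " + ratio);
        System.out.println("old gen buffer MB: " + buffer);
    }
}
```

The margin parameter stands in for the "a little bit" in the buffer formula; how much to add is a judgment call, and 0.25 here is just a placeholder.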

Now, something that I always find myself reminding people of is that everything that the GC cleans up is something that was allocated by the program, and so in addition to tuning the GC you may want to tune the program itself. While that's obviously beyond the scope of this blog post, it's worth noting that application changes are the only way to change the allocation rate and survivor death ratio. Otherwise, you'll have to just deal with whatever you're getting there.

These numbers give us some important limits to tuning the heap. Firstly, you should make sure that you have at least your old gen buffer free after a CMS cycle. Ideally you'll have more, but if you have less than that, you're likely to encounter concurrent mode failures under heavy load, which is a Bad Thing.
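As a quick sanity check, that comparison can be written out as follows. This is a sketch with made-up names and sample values; the capacity and post-CMS usage would come from your own logs, and the buffer from the formula above.

```java
/** Checks whether the old gen has enough room left after a CMS cycle; names are illustrative. */
final class OldGenHeadroom {

    /** Free old gen space in MB right after a CMS cycle has finished. */
    static double freeAfterCmsMb(double oldGenCapacityMb, double usedAfterCmsMb) {
        return oldGenCapacityMb - usedAfterCmsMb;
    }

    /** True if the space left after a CMS cycle is at least the old gen buffer. */
    static boolean coversBuffer(double oldGenCapacityMb, double usedAfterCmsMb, double bufferMb) {
        return freeAfterCmsMb(oldGenCapacityMb, usedAfterCmsMb) >= bufferMb;
    }

    public static void main(String[] args) {
        // Hypothetical heap: 4096 MB old gen, 3072 MB still in use after CMS, 80 MB buffer.
        System.out.println("free after CMS (MB): " + freeAfterCmsMb(4096.0, 3072.0));
        System.out.println("covers buffer: " + coversBuffer(4096.0, 3072.0, 80.0));
    }
}
```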

Knobs you can turn

Once that basic necessity is covered, there are three things that we can change to alter GC performance^[7] (well, technically we can increase or decrease each of them, so six, but you'll almost never want to decrease them). We can:

Increase the young generation size: this allows more garbage to be generated before a collection is needed, and thus decreases the frequency of young generation collections. Because (assuming that you have a LinkedIn-like application) the vast majority of your allocations are for serving requests rather than persisted data, this tends not to affect either the required survivor size (since the number and size of requests in progress at the moment of collection doesn't change) or the collection time^[8]. As long as that assumption holds, it will also reduce your promotion rate, since roughly the same amount of data survives each collection but collections happen less often. However, if the assumption doesn't hold for your application, young gen collection times can increase, so if they are already near what you consider to be acceptable you may want to be careful.
Increase the maximum survivor age: this (hopefully) means that more objects die before being promoted, reducing your promotion rate and thus the frequency of old gen collections and the needed old gen buffer. It can also mean that you need to allocate more survivor space^[9] to hold the surviving objects longer - if the survivor space overflows, it will promote directly to the old gen. Since you'll be copying more objects around, this can also increase your young gen collection time, though ideally not by much. Monitor your Survivor Death Ratio when tuning your survivor age - you typically want to start promoting to the old gen when it hits 50% or so, by which point it should be decently small. At LinkedIn, it's at age 2 or 3.
Increase the old gen size: this means that, regardless of how everything else is working, your old gen collections will happen less frequently. The time per collection will increase, and along with that the required buffer^[10], but thanks to CMS the pause times (initial mark and remark) should be fairly consistent. This is mostly useful to prevent yourself from running out of memory under heavy load, or to ensure that you have sufficient buffer, rather than something to actually increase normal performance. It does, however, make it much less likely that you will suffer a concurrent mode failure if you aren't able to prevent them with either of the other two suggestions.
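Two of these knobs involve a bit of arithmetic that is easy to get backwards, so here is a small sketch of it. The class and method names and the sample sizes are made up for illustration. The survivor calculation takes -XX:SurvivorRatio=x to mean that eden is x times the size of one survivor space, and the occupancy calculation is just one plausible way to turn an old gen buffer into a value for -XX:CMSInitiatingOccupancyFraction.

```java
/** Rough sizing helpers for the knobs above; names and sample values are illustrative. */
final class HeapSizing {

    /**
     * Size in MB of each of the two survivor spaces.
     * Young gen = eden + 2 survivor spaces, and eden = survivorRatio * one survivor space.
     */
    static double survivorSpaceMb(double youngGenMb, int survivorRatio) {
        return youngGenMb / (survivorRatio + 2);
    }

    /**
     * Highest whole-percent old gen occupancy at which at least bufferMb is still free.
     * Rounds down so that CMS starts slightly early rather than slightly late.
     */
    static int initiatingOccupancyPercent(double oldGenMb, double bufferMb) {
        return (int) Math.floor(100.0 * (oldGenMb - bufferMb) / oldGenMb);
    }

    public static void main(String[] args) {
        // Hypothetical 1280 MB young gen: lowering the ratio from 8 to 6 grows each survivor space.
        System.out.println("survivor MB at ratio 8: " + survivorSpaceMb(1280.0, 8)); // 1280 / 10 = 128
        System.out.println("survivor MB at ratio 6: " + survivorSpaceMb(1280.0, 6)); // 1280 / 8 = 160
        // Hypothetical 4096 MB old gen with an 80 MB buffer.
        System.out.println("occupancy percent: " + initiatingOccupancyPercent(4096.0, 80.0));
    }
}
```

Treat the resulting percentage as a starting point to verify against the logs under load rather than a final answer, since a larger old gen also lengthens the collection and therefore the buffer you need.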

Putting it all together

So, in summary, here are the instructions to tune your very own garbage collector!

1. Check to see if it actually needs tuning. If the pause times and frequencies aren't a problem, you likely won't gain anything by trying to tune it. You're done!
2. Compute the numbers. If the allocation or survivor death rates look like they are the problem, get your coders to write code that doesn't allocate memory so carelessly. Once they have assured you that they have done that, go to 1.
3. Figure out what it is you need to change (based on the above three things), make your change, and watch to see if it actually accomplished what you were trying to do. Go to 1.

Thank you for reading; hopefully you found this helpful and interesting.

Footnotes

[1] This title is entirely undeserved
[2] And, of course, without all of the performance-degrading bugs that show up in large-scale programs in languages without memory management
[3] By which, of course, I mean logging with (on the Hotspot VM, which you're probably using) -XX:+PrintTenuringDistribution -XX:+PrintGCDetails -XX:+PrintGCDateStamps -verbose:gc in your command line arguments
[4] Yes, there are plenty of tools out there that mean that you don't need to read the actual raw logs. However, I've yet to find one that doesn't lose some of the resolution and details that you need to diagnose rare situations. Maybe I'm just old fashioned or something.
[5] Technically, this is the eden size, rather than the young gen size. If you know enough to be able to complain about that, you probably shouldn't be reading this section
[6] Survivor spaces are expensive because each collection, all surviving objects are moved from the active survivor space to the inactive one (which then becomes the active one, and the newly-inactive survivor space is erased), which makes the logic of keeping track of what is free and what is not much easier. It also means that if you're keeping 200MB of data in there, you're copying 200MB of memory each collection, which takes time.
[7] This isn't quite true, but for the purposes of introductory GC tuning on LinkedIn-like applications, these are the three that are likely to have consistent and noticeable effects, and the same way that the defaults are fine for the great majority of applications, these three will be fine for the great majority of the remainder.
[8] The young gen collector copies everything that is still alive out of the young gen (to either a survivor space or the old gen), and then declares the entire young gen to be free. Thus, collection time is dependent on the size of surviving objects, not on total size.
[9] By decreasing the -XX:SurvivorRatio setting, among the most confusing Hotspot tuning options in existence. Read -XX:SurvivorRatio=x as "make eden x times the size of each of the two survivor spaces", so each survivor space ends up being 1/(x+2) of the young gen.
[10] You may find that you need to start using the -XX:CMSInitiatingOccupancyFraction option as well here, to get CMS to kick off when you have only your buffer amount free, rather than when the internal ergonomics think that it should.