
Tuning Espresso’s JVM Performance

Co-authors: Shawn Nguyen and Amit Balode

Profile in Espresso

The member profile page is one of the most-visited pages on LinkedIn. It allows other members to learn about a member’s professional identity, experiences, and interests. It’s crucial to make sure that profile views return as fast as possible. In this post, we’ll talk about our approach for handling profile traffic with Espresso and how we make sure our members’ profile retrievals return within milliseconds.

Background

Profile retrievals can easily climb to several hundred thousand queries per second in a single data center. However, whenever a traffic shift is performed (when traffic is shifted from one data center to another), extra load is placed on the data center receiving the additional traffic, causing an uptick in QPS and, ultimately, latency. This can cause requests to eventually time out, resulting in slow profile retrieval times that in turn delay all components that depend on profile retrievals, generating a cascading effect on the entire website’s performance.

There are many ways that a request can get bogged down throughout the request life cycle, but when an application’s performance is not tuned properly, problems can manifest quickly once large numbers of requests are batched and submitted. There are many existing blog posts on CMS (concurrent mark sweep) tuning for Java garbage collection, including Swapnil Ghike’s excellent post on GC tuning for LinkedIn’s feed. We’ll briefly describe our garbage collection performance issues at the data routing layer during peak traffic, followed by our motivation for moving from the CMS collector to the G1 collector. We’ll conclude with our ongoing work to further optimize the router.

Router issues with CMS collector

We store profile data in Espresso, LinkedIn’s distributed, document-oriented, horizontally scalable, and highly available key-value datastore. When users make a read or write request to Espresso, a router first forwards the request to a storage node containing the requested data. The router uses R2 for connection pooling, load balancing, and HTTP transport, and it uses D2 for service discovery. The router maintains an asynchronous pool of connections to storage nodes so it can handle requests that fan out to multiple storage nodes and are returned as a single response. Routers can also apply specific routing rules so that read requests are sent to a combination of primary and secondary storage nodes. Most clients choose to read from primary storage nodes only, to guarantee read-after-write consistency; however, clients also have the option of scaling reads by sending them to secondary nodes at the cost of reading stale data.

Identity Service Architecture

The Identity service, the system that serves profile data, has doubled its maximum queries per second over the past year to serve a growing number of callers and to handle traffic at peak hours during our traffic shifts (when traffic from other data centers is redirected to the receiving data center). The service itself makes hundreds of thousands of queries per second against the storage nodes. The Identity service handles a mix of offline (Hadoop) and online (website) traffic. Offline traffic only occurs for a few minutes at the start of each hour. Because of a combination of regular online traffic, additional campaigns from offline traffic, peak traffic hours, and a larger member base, the Java heap’s old generation started seeing a wide variety of small to large objects. The longer the service stayed up, the more CMS cycles it incurred from ‘heap fragmentation’ issues within the JVM.

Earlier this year, we shifted roughly 75 percent of the website’s traffic into one data center. This resulted in an alarmingly large spike of 503 response codes returned from routers to upstream callers.

alarming spike

A large number of ParNew promotion failures triggered CMS on the JVM heap and caused long pause times on the routers. This forced many clients to attempt multiple retries for connection establishment, resulting in TCP listen queue overflow (the queue where fully established connections wait before being accepted by the application), network traffic spikes, and huge router latencies. This further contributed to router timeouts and an increase in 503 response codes from Espresso. Storage nodes, on the other hand, saw an uptick in latency, but nothing out of the ordinary was captured in terms of their performance. Two parameters stood out in the routers’ JVM settings: “CMSInitiatingOccupancyFraction=75” and “UseCMSInitiatingOccupancyOnly.”

This means that CMS is supposed to be triggered when the old generation reaches 75 percent occupancy (2.25GB). The image below shows that, due to heavy fragmentation in the old generation, a ParNew promotion failure was triggered when the old generation was at around 1.9GB (350MB before the concurrent cycle could trigger), causing a 2.8 second pause.

2.8 second GC pause

We initially thought that doubling the old generation size to 5GB would help with reducing overall fragmentation and collection rates.

After doubling old gen to 5GB

We quickly realized that increasing the old generation capacity only delayed overall cleanup times. Let’s take a closer look to better understand what heap fragmentation means and how it occurs.

Heap Fragmentation in CMS collector

Most garbage collectors have a mark phase to find and mark live objects and a cleanup phase to reclaim dead objects’ memory. CMS is not a copying collector, meaning that during the sweep and cleanup phase, it simply adds the memory of unreachable objects to its free list. No compaction of live objects is done, and this creates holes in memory. The figure below shows how these holes lead to heap fragmentation.

Fragmentation

In this figure, memory blocks are shown in chunks of 1KB. As the heap fills up to roughly 65 percent capacity, CMS is triggered. CMS starts scanning objects from its roots and marks all live objects. The remaining unreachable objects (1KB chunks of unused memory) are added to its managed free list. We can see that just after the initial CMS cycle runs, holes appear within contiguous blocks of memory. As more and more cycles pass, this fragmentation increases.

In the figure below, a request with a 4KB payload size is made. Even though there is sufficient total memory in the heap, the allocation fails because no contiguous block of 4KB or larger is available, resulting in a “ParNew” promotion failure.

4K block can't fit now

CMS Heap configuration & tweaks

Currently, response sizes from profile retrievals range from 0 to 2MB. That means an allocation of up to 512 contiguous pages (2MB divided by the 4KB page size) can be requested when objects that large are promoted from the new generation to the old generation. CMS starts with a single contiguous block of memory, and over time, with objects of various sizes being allocated, fragmentation starts to build up. At some stage, if a request is made with a response payload of 2MB or greater and CMS does not have that much contiguous memory, a full garbage collection cycle will be triggered. A full GC (garbage collection) means a complete halt of the application to clean up its heap. For us, full GC durations have always lasted more than a second. Increasing memory was not a solution, because we would simply see longer garbage collection pauses from CMS.

Original CMS parameters: (MaxHeapSize = 4GB, NewSize = 1.2GB, MaxNewSize = 1.2GB, OldSize = 2.8GB, Eden Space: capacity = 1.2GB, From/To Space: capacity = 50MB). An illustrative command line for these settings appears after the list below.

  • MaxTenuringThreshold = 16; On average, 95 percent of objects died young in the first collection and 5 percent made it to survivor space. Of that 5 percent, 90 percent died before age 15 and 10 percent were promoted to the old generation. Increasing the threshold further did not help much, because the death ratio is more or less stable after age 6 or 7. There was not much compromise in throughput or latency with this increase, but it did not reduce the number of ParNew promotion failures.
  • MaxHeapSize = 4GB; We tried increasing the max heap size to 5GB and 10GB, but it deferred the problem rather than solving it, causing high pause times.
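As referenced above, these parameters map onto JVM flags roughly as shown below. This is an illustrative sketch rather than our exact production invocation: the jar name, GC logging flags, and the 1228m (~1.2GB) new-generation size are assumptions, and the remaining production options are omitted.

# Illustrative only; jar name and logging flags are assumptions.
# MaxTenuringThreshold is shown as reported above (HotSpot caps object age at 15).
java -Xms4g -Xmx4g \
     -XX:NewSize=1228m -XX:MaxNewSize=1228m \
     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
     -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:MaxTenuringThreshold=16 \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -jar espresso-router.jar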

Motivations for moving to G1

In general, the CMS collector wasn’t working out for our performance needs. G1 seemed to have clear advantages, noted below:

  • CMS is not a copying collector, so after marking and remarking are completed, it does not perform compaction, which eventually creates holes in memory and further fragmentation. G1, however, is a copying collector: it performs compaction by copying live objects from various regions into a single region.
  • In G1, the heap is organized into regions, so the effects of fragmentation are much less ‘visible’ than with CMS. Even when fragmentation occurs, it only affects particular sets of regions instead of the entire old generation.
  • Also, long pauses are common with CMS since it has to scan the entire heap for live objects. With G1, scanning is done region by region and concurrently, so pause times are based on the regions collected rather than the entire heap size. Each collection returns several completely empty regions.

G1 Tuning

G1 was canaried (deployed on a few machines for testing purposes) with a 5GB heap (1GB higher than CMS, to account for the additional data structures G1 uses for region management). Router nodes were running Java 8 on Intel Xeon E5-2640 machines with 24 cores at 2.5GHz and 64GB of memory.

1) Heavy Reference Processing

G1 occasionally behaves oddly with soft/weak reference cleanup. Below, you can see that during a young generation cleanup, “Ref Proc” spiked to 101.3 ms, which caused a reduction in Eden’s size. This led to too many premature promotions to the old generation, resulting in multiple concurrent GC runs in a short span.

2015-06-10T12:00:01.854+0000: 711685.290: [GC pause (G1 Evacuation Pause) (young) [Ref Proc: 4.0 ms] [Eden: 2904.0M(2904.0M)->0.0B(2872.0M)
2015-06-10T12:00:05.899+0000: 711689.335: [GC pause (G1 Evacuation Pause) (young) [Ref Proc: 101.3 ms] [Eden: 2872.0M(2872.0M)->0.0B(216.0M) Survivors: 32.0M->40.0M Heap: 4570.1M(5120.0M)->1706.8M(5120.0M)]

We mitigated this issue by setting G1NewSizePercent to a minimum of 40 percent. With this setting, if reference processing takes longer than normal, the Eden size won’t fall below the specified threshold, avoiding premature promotions. We also added the ParallelRefProcEnabled flag to enable multi-threaded reference processing during the remark phase.
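Put together, the canary’s G1 options looked roughly like the following sketch. It is not our exact production command line: the jar name is illustrative, other options are omitted, and on JDK 8 G1NewSizePercent is an experimental flag, so experimental options have to be unlocked first.

# Illustrative sketch; jar name and omitted options are assumptions.
java -Xms5g -Xmx5g -XX:+UseG1GC \
     -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=40 \
     -XX:+ParallelRefProcEnabled \
     -jar espresso-router.jar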

2) Lazy cleanup with ThreadLocal variables

In the ChannelSink implementation, a ThreadLocal object would be dynamically created that stored a Boolean object. The problem was that many entries would accumulate on the Map within ThreadLocal, one for each thread that attempted to transmit a packet out on that ChannelSink, each storing a Boolean.FALSE value. The purpose of that ThreadLocal was to check if the channel I/O was attempting reentrancy.
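A simplified, hypothetical sketch of that pattern is shown below (the class and method names are illustrative, not the actual router code). Every thread that writes through the sink leaves behind its own Boolean entry, so the bookkeeping grows with the number of writer threads.

// Hypothetical illustration of the reentrancy check, not the actual Espresso code.
final class ChannelSinkReentrancyCheck {
    // Each thread that calls enter() ends up with its own Boolean entry,
    // so a single ThreadLocal accumulates one entry per writer thread.
    private final ThreadLocal<Boolean> inWrite =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    boolean enter() {
        if (inWrite.get()) {
            return false;            // channel I/O is re-entering itself
        }
        inWrite.set(Boolean.TRUE);   // mark this thread as inside the write path
        return true;
    }

    void exit() {
        inWrite.set(Boolean.FALSE);
    }
}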

In the pre-async router application, there were 1000 threads, so each ChannelSink’s ThreadLocal would end up with 1000 entries (one per thread). On a system with up to 1000 Channels, there would be a million Map.Entry objects. The solution was a new Netty handler that hands off further execution of the outbound message to the thread associated with the Netty Channel. This reduced the overall number of entries for that ThreadLocal to (typically) one.

Netty 3’s implementation of ChannelLocal had each ThreadLocal object contain a WeakMap keyed by a weak reference to the Channel. Instead, we put a HashMap inside the Channel, keyed by the EspressoChannelLocal instance itself (a strong reference), which eliminated the weak references. This change reduced the overall number of weak references within the router application. An additional benefit is that when a Channel is garbage collected, all data associated with its EspressoChannelLocal instances is garbage collected as well.
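The idea can be sketched roughly as follows, with hypothetical names (the real implementation hangs off the router’s Netty Channel rather than a standalone class): the per-channel map holds strong references, and everything in it becomes unreachable together with the channel.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual EspressoChannelLocal integration.
final class ChannelAttachments {
    // Keyed by the ChannelLocal instance itself (a strong reference).
    // Access is assumed to happen on the channel's single I/O thread,
    // so no synchronization is shown here.
    private final Map<Object, Object> locals = new HashMap<>();

    Object get(Object channelLocal) {
        return locals.get(channelLocal);
    }

    void set(Object channelLocal, Object value) {
        locals.put(channelLocal, value);
    }
}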

3) G1 Region Size

G1 was rolled out with the default region size. Shortly after, we started seeing humongous allocations! In G1 GC, objects are considered “humongous” if they span 50 percent or more of a region’s size. A humongous allocation needs a contiguous set of regions and is allocated directly into ‘humongous regions’ residing in the old generation’s space. Before JDK 1.8.0_40, the cleanup of humongous objects happened only during the concurrent cycle, so we upgraded to JDK 1.8.0_40, where humongous objects can also be reclaimed eagerly during evacuation pauses.

cleanup of humongous objects

The default region size was 2MB, and our application’s maximum HTTP response size was roughly 2MB. Since an object becomes humongous once it reaches half a region, we avoided humongous allocations by choosing a region size of more than twice our maximum response size. We used a value of 8MB for G1HeapRegionSize to account for rare cases where larger-than-usual objects are allocated. After rolling these changes out, GCs triggered by humongous allocations disappeared and remark phase times dropped dramatically.
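The region size itself is set with a single flag, which would be appended to the G1 options sketched earlier (exactly how it was passed in production is an assumption; only the flag and value come from the text above).

# 8MB regions raise the humongous threshold to 4MB, so 2MB responses
# are allocated as ordinary objects. The value must be a power of two.
-XX:G1HeapRegionSize=8m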

CMS vs G1 metrics

Below are graphs that show (in this order) the JVM’s heap usage, CPU time spent, and GC pause times across a window of data points scraped from the GC logs for CMS and G1.

Overall CMS stats (collected from the data points shown in the graphs below):

GC Table
When full CMS runs

In the graph above, we can see full CMS runs occurring at roughly 13:00, 20:20, and 21:30. As expected, CPU time spent during those timeframes spiked up to 115 ms during the remark phase (shown below), with corresponding spikes in GC pause times. There was a traffic shift before the 20:00 mark, causing heap usage to grow faster than usual and eventually forcing another remark roughly an hour later.

heap usage growing faster
GC pause times spike at 53ms

In the graph above, GC pause times spiked to a maximum of 53 ms during remark.

Overall G1 stats
G1 runs are more frequent than CMS

Compared to CMS, G1 young generation collections were much more frequent and consistent, with just a single mixed GC cycle in G1 versus three CMS cycles.

G1 pause times

In comparison, the G1 pause times and CPU usage times in the graphs above were more in line and consistent overall than with CMS. There was just a single remark initiated (at roughly 18:55), with a maximum of 75 ms spent in GC pause time and CPU time.

1) Pause times

a) CMS average GC pause time: ParNew collector pause time varies from an average of 20 to 25 ms.

Pause times for CMS

b) G1 average GC pause time: young generation collection pause times vary from an average of 27 to 30 ms. (Note: although the average GC pause of G1 is slightly greater than that of CMS, GC runs only once every 3 seconds in G1, whereas in CMS it runs every second. If the average pause for CMS is 25ms and for G1 is 30ms, then over 3 seconds CMS will accumulate 75ms of pauses (3 * 25ms) while G1 will accumulate 30ms (1 * 30ms). Thus, overall, there are gains in throughput.)

Pause time for G1

2) Average collections per second
a) The CollectionCount metric indicates the number of times GC has occurred in a second (a sketch of how this counter can be sampled follows the graphs below). In CMS, ParNew cycles were happening every second (max value below).

In CMS ParNew happens every 1 second

b) In G1, collections were happening every 3 seconds (the 300m value in the graph translates to 0.33 collections per second, or roughly one every 3 seconds).

In G1 collections happen every 3 secs
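For context on where this metric comes from: CollectionCount is the cumulative per-collector counter exposed by the JVM’s garbage collector MXBeans, and the per-second rate is the delta between successive samples. Below is a minimal sketch of polling it, assuming a simple one-second loop (this is not how our monitoring pipeline is actually implemented).

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch of sampling per-collector GC counters via JMX.
public class GcRateSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                // CollectionCount and CollectionTime are cumulative; a monitoring
                // agent typically reports the per-second delta between samples.
                System.out.printf("%s: count=%d time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
            Thread.sleep(1000);
        }
    }
}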

3) Concurrent cycle frequency
a) A concurrent cycle runs when the old generation reaches a certain capacity. For Identity, it was running every 6-8 hours in CMS.

Concurrent cycles run 6-8 hours in CMS

b) For G1, the number of cycles didn't change much during peak hours. A cycle occurred every 6-7 hours.

Concurrent cycles in G1 happen every 6-7 hours

4) Average remark pause time: The concurrent mark phase is where the garbage collector traverses object graphs and marks live objects; it runs in parallel with application threads. The remark phase is a ‘stop the world’ phase where application threads are paused and the garbage collector revisits the object graphs from the concurrent mark phase to find modified references. Because the majority of application stall time comes from remarks, it’s crucial to tune remark times to be as low as possible. Average CMS remark pause times for Identity were around 150ms, while G1 proved to be the most performant in this statistic, averaging 50ms per remark pause.

Tuning beyond Identity?

By moving from the CMS to the G1 collector, we reduced our average pause time from 150ms to 50ms and avoided fragmentation issues within the JVM. The number of incidents involving front-end latency spikes has gone down significantly, and we are now able to successfully handle twice as much peak traffic (during traffic shifts).

These tunings have gone a long way toward alleviating the performance issues we saw in our day-to-day operations of Espresso, maximizing throughput while minimizing latency. With these tunings, our read latencies for Identity now average 30 ms at peak traffic. Going forward, we will continue benchmarking these routers with the current settings on new releases, since code changes almost always change GC behavior. We are also in the process of measuring performance metrics for clusters other than Identity to see how their GC patterns would benefit from G1.

We’re currently evaluating G1 performance on Espresso’s storage layer by running similar tests against the storage nodes. There is work underway to create dedicated performance clusters to benchmark G1 GC behavior by testing the storage nodes with simulated production traffic patterns.

Acknowledgements

Many thanks to the performance team and Espresso Development & SRE team for collaborating and helping out with evaluating these performance tweaks, especially Joy and Jemiah!