Voldemort on Solid State Drives

May 9, 2012


Co-authored by Lei Gao, Cuong Tran

Project Voldemort is an open source implementation of Amazon Dynamo. Voldemort handles a big chunk of traffic at LinkedIn, powering applications like Skills, People You May Know, Company Follow, and LinkedIn Share, and serving thousands of requests per second over terabytes of data.

At the beginning of this year, we migrated our Voldemort clusters from SAS (Serial Attached SCSI) disks to SSDs (Solid State Drives) to meet the increasing demand for IOPS from data intensive applications. Immediately, we observed sharp increases in the 95th percentile latency. In the charts below, the green line represents SSD performance and the purple line represents SAS performance, as perceived by a Voldemort client in our performance test infrastructure.

Comparison of 95th percentile latency (in ms), SSD vs SAS, post migration

As described below, we were able to pinpoint the causes of the latency increase to Java garbage collection issues.

Linux stealing JVM pages

By lining up the Linux page scans against the system time component of minor collections (labeled “Stall in Linux”), we discovered that since the JVM heap was not locked down with mlock(), Linux was constantly paging out JVM pages to swap space. As a result, whenever promotion into the old generation happened, it incurred heavy page scans to bring the swapped-out JVM pages back in, resulting in 4-second minor collection pauses.

End-to-end correlation of Linux page scans vs minor collections
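One way to keep the kernel from swapping out the heap, sketched here purely as an illustration rather than as the exact fix we deployed, is to lock the process's pages in RAM via mlockall(2) through JNA (Cassandra takes a similar approach). The class and method names below are hypothetical, and the call only succeeds if the memlock ulimit is high enough.

    import com.sun.jna.Native;

    // Illustrative sketch: pin all current and future pages of the JVM process
    // in physical memory so Linux cannot page the heap out to swap space.
    public class HeapLocker {
        private static final int MCL_CURRENT = 1; // lock pages already mapped
        private static final int MCL_FUTURE  = 2; // lock pages mapped later

        // Bound to libc's int mlockall(int flags) via JNA direct mapping
        private static native int mlockall(int flags);

        static {
            Native.register("c");
        }

        public static void lockProcessMemory() {
            if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
                // Typically fails with EPERM or ENOMEM when RLIMIT_MEMLOCK is too low
                System.err.println("mlockall failed; heap pages may still be swapped out");
            }
        }
    }

A coarser alternative is to lower vm.swappiness so the kernel prefers dropping page cache over swapping out anonymous JVM pages.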

Also, when testing performance in the lab, we realized the importance of using the AlwaysPreTouch JVM flag. In production, where a server runs for many days, the virtual address space of the Voldemort server heap gets mapped to physical pages over time. For the shorter experiments in the lab, however, the heap is not warmed up to the same extent, so we see higher GC pauses unless the JVM touches and commits the entire heap up front at startup.
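For reference, a lab launch with these flags might look roughly like the following; the heap sizes, classpath, and main class are placeholders, not our production settings. -XX:+AlwaysPreTouch makes the JVM touch and commit every heap page during startup, and the GC logging flags are what let us line up GC stall times against the Linux page scan statistics.

    java -server -Xms8g -Xmx8g \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:+AlwaysPreTouch \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime \
         -Xloggc:gc.log \
         -cp <classpath> <server main class>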

Fragmentation

We were also facing promotion failures during our daily data cleanup job, a pattern that typically indicates heavy fragmentation of the old generation. By building a distributed workload tool that can simulate traffic using production datasets, we concluded that the combination of high speed SSDs and multi-tenancy in our deployments was the root cause.

Voldemort cleans up expired key-value pairs once a day, which involves scanning over the entire dataset in the storage engine (Berkeley DB JE). This scan brings objects of varied sizes into our storage engine's LRU cache, displacing the existing cache items, which then become garbage in the old generation. The combination of rapid 200MB promotions and an equal amount of garbage becomes too much for CMS (Concurrent Mark and Sweep) to handle.

Interplay between BDB cache and Java GC

We solved the problem by forcing the storage engine not to cache the objects brought in during the cleanup job, thereby drastically reducing the promotion rate. This experience highlights a general fragmentation pitfall when using SSDs.
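As an illustration of the idea (not our exact patch), Berkeley DB JE lets a scan opt out of polluting the cache by setting a CacheMode on its cursor. Something along these lines keeps the cleanup scan from evicting the hot working set; the class and method names here are hypothetical.

    import com.sleepycat.je.CacheMode;
    import com.sleepycat.je.Cursor;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class CleanupScan {
        // Walk every record without letting the scan take over the JE cache.
        public static void scanWithoutCaching(Database db) {
            Cursor cursor = db.openCursor(null, null);
            // Evict each record from the cache as soon as the cursor moves past it,
            // so the scanned objects stay short-lived and die in the young generation
            // instead of displacing the hot working set.
            cursor.setCacheMode(CacheMode.EVICT_LN);
            try {
                DatabaseEntry key = new DatabaseEntry();
                DatabaseEntry value = new DatabaseEntry();
                while (cursor.getNext(key, value, LockMode.READ_UNCOMMITTED)
                        == OperationStatus.SUCCESS) {
                    // check expiry here and delete the entry if it is past its TTL
                }
            } finally {
                cursor.close();
            }
        }
    }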

Takeaways

With these fixes in place, we are finally able to leverage the full power of solid state drives. Here are the results from our lab tests:

Comparison of avg latency (in ms), SSD (green line) vs SAS (purple line), today
Comparison of 95th percentile latency (in ms), SSD (green line) vs SAS (purple line), today
  • On SSDs, we believe that garbage collection overhead will have a significant performance impact for Java-based systems, since I/O is no longer the bottleneck.
  • Even for non-Java systems, or Java systems that don't rely on GC for memory management, fragmentation risks are higher: fast SSDs let fragmentation build up at a much faster rate, which in turn increases the cost of defragmentation.

We presented some of our findings at WBDB 2012. Our paper and slides can be found on SlideShare.
