Voldemort on Solid State Drives
Project Voldemort is an open source implementation of Amazon Dynamo. Voldemort handles a big chunk of traffic at Linkedin, including applications like Skills, People You May Know, Company Follow, Linkedin Share, serving thousands of requests per second over terabytes of data.
At the beginning of this year, we migrated our Voldemort clusters to SSD (Solid State Drives) from SAS (Serial Attached SCSI) disks, to meet increasing demand for IOPS from data intensive applications. Immediately, we observed sharp increases in the 95th percentile latency. In the charts below, the green line represents SSD performance and purple line represents SAS performance, as perceived by a Voldemort Client in our performance test infrastructure.
As described below, we were able to pinpoint the causes of the latency increase to Java Garbage collection issues.
Linux stealing JVM pages
By lining up the Linux page scans against the system time component of minor collections (labeled “Stall in Linux”), we were able to discover that since the JVM heap was not locked down with
mlock(), Linux was constantly paging out JVM pages to swap space. As a result, whenever the promotion into old generation happened, it incurred the cost of heavy page scans to bring back the swapped out JVM pages, resulting in 4 second minor pauses.
Also, when testing performance in the lab, we realized the importance of using the
AlwaysPreTouch JVM flag. In a production environment, which runs for many days, the virtual address space of a production Voldemort server heap gets mapped to physical addresses over time. However for the shorter experiments in the lab, the server heap is not warmed up to the same extent, so we see higher GC pauses.
We were also facing promotion failures during our daily data cleanup job, which typically indicates heavy fragmentation of the old generation. By building a distributed workload tool that can simulate traffic using production datasets, we concluded that high speed SSDs combined with multi-tenancy in our deployments was the root cause.
Voldemort cleans up expired key-value pairs once a day, which involves scanning over the entire dataset in the storage engine (Berkeley DB JE). This scan brings objects of varied sizes into our storage engine's LRU cache, marking the existing cache items as garbage in the old generation. The combination of rapid 200MB promotions and an equal amount of garbage generation becomes too much for CMS (Concurrent Mark and Sweep) to handle.
We solved the problem by forcing the storage engine to not cache the objects brought in during the cleanup job, thereby drastically reducing the promotion rate. This experience brings out a general fragmentation pitfall when using SSD.
With these fixes in place, we are finally able to leverage the full power of Solid State drives. Here are the results from our lab tests:
- On SSDs, we believe that the garbage collection overhead will have a significant performance impact for Java based systems, since IO is no longer the bottleneck.
- Even for non Java systems or Java systems that don't rely on GC for memory management, fragmentation risks are higher. Fast SSDs cause fragmentation to be created at a much faster rate, subsequently increasing cost of defragmentation.