Taming memory fragmentation in Venice with Jemalloc
January 28, 2021
Sometimes, an engineering problem arises that might make us feel like maybe we don't know what we're doing, or at the very least, forces us out of the comfort zone of our area of expertise. That day came for the Venice team at Linkedin when we began to notice that some Venice processes would consume all available memory and crash if given enough time to run. This ended up being due to memory fragmentation. In this blog post, we'll explain the different ways we tried to diagnose the problem, and how we solved it.
What is Venice?
Venice is LinkedIn's platform for serving derived data. Venice was designed for very low latency lookups while also supporting high throughput ingestion of data. Reads are served directly from Venice, while writes enter the system asynchronously, either pushed as full versioned data sets, or as incremental updates coming from either Hadoop or stream sources like Samza. At its core, Venice is a sharded multi-tenant clustering software that leverages Kafka and RocksDB to power high throughput lookups for LinkedIn features (like People You May Know).
You can learn more about Venice here.
The symptom of the problem
As mentioned earlier, we noticed that if we left a server node alone for long enough, it would end up crashing. We generally try to release Venice to production on a weekly cadence to align with our development cycle, but we began to notice that if we slipped and left the software running for longer (a few weeks), things started to get scary. Nodes would go offline across clusters in batches. OS system metrics and logs seemed to point to a growing lack of available memory. Moreover, we had error messages in backtraces from generated core dumps and hs_err_pid logs that all seemed to indicate failures on allocations, as well as sometimes having lines from dmesg telling us that OOM killer had stepped in. With some added metrics, we could see that, over time, resident memory (RSS) would trend steadily upward along with virtual memory (VSS).
A memory leak!
A lot of software engineers have played this game before. There are plenty of tools and strategies out there for puzzling memory leaks, and many an article offering succor for such maladies. I am a Java developer, and I'm used to working with tools like visualvm and OQL. This influenced our initial track of investigation. We grabbed some heap dumps and got to digging.
visualvm has a nice feature where you can compare two heap dumps and see what the difference is in memory footprint and object counts. If hooking up a profiling tool to a production service is problematic, a handy alternative can be to start the service, grab a heap dump, let it run for a while until you see that the process has hit its “degraded” state, and grab another heap dump at that point. You can then plug both dumps into visualvm and get a report on the difference of objects.
But we didn’t find anything. Unfortunately though, this process didn't tell us very much. We were not seeing an accumulation of objects in the heap that would indicate that we were leaving around an increasing volume of objects. Moreover, the heap sizes weren't actually getting bigger over time.
Could it be something off-heap?
jcmd is another great tool that comes bundled with Java. It has utilities for tracking native memory allocations (i.e., off-heap memory allocations made by the JVM). You can execute it on your process with the VM.native_memory command, and get a report on the memory usage on your process. You can also add the baseline argument and you can set a starting point on what the memory usage is at time of execution, so you can come back later and see what the difference is. You'll get a nice report that looks like the following:
jcmd <pid> VM.native_memory
jcmd was more succinct than the initial heap dump approach in telling us that there wasn't anything there to help us find our memory leak, but it did get us incrementally closer. We at this point knew two things:
The process memory footprint is growing.
The memory growth source is not from something tracked by the JVM.
This is where things got uncomfortable. As the investigation progressed, I had joked to a colleague that being a professional Java developer should have meant that I shouldn't need to go this deep. He responded with, “That’s like saying Java developers are only meant to write programs, and whether those programs actually run is someone else's problem.”
We knew we had to start looking at allocations related to libraries that used native code. In Java, any function labeled with “native” can have a link to code that is executed outside the bounds of the JVM as “native” code. Any resources consumed by such code are not tracked by the JVM, nor are they garbage collected the way JVM objects are, whenever a garbage collection occurs. Perhaps even more troubling, an object on the JVM heap that references an object in native code might be easy to miss if you're using things like heap dumps or JVM profilers. As an example, in RocksDB an LRUCache object in Java might look to be sized on the order of bytes, but in fact can be measured on the order of gigabytes in total size outside the JVM.
RocksDB is the most prominent native library used in Venice—Venice is a storage service, and RocksDB handles the storage part. When we budget resources for our server, a majority of it goes to RocksDB block cache. Moreover, we have configured RocksDB (based on existing documentation) to help us be sure that significant memory allocations are billed to the memory usage limit we have specified for the block cache.
RocksDB is an open sourced software with a fairly dense code base. At this juncture, we began deep-diving and trying to figure out if anyone else in the RocksDB community had any similar trouble. We ended up finding a few tickets in the RocksDB community that talked about resource over-usage of one kind or another. Many of them seemed to converge on leveraging jemalloc to get to the bottom of the issue.
Jemalloc and Plumber
Plumber, written by Greg Banks (who happens to be the colleague who mocked me earlier), is a tool written in Perl that can look at process core dumps and help categorize allocated blocks with glibc's malloc. Plumber walks the data structures in glibc and categorizes memory allocations as being one of the following: “Free” (unused), “Leaked” (no pointers to this assigned block), “Maybe Leaked” (pointers point to some part of the assigned block, but not to the beginning), or “Reached” (the block is addressable).
Jemalloc is a malloc implementation developed by Jason Evans (the “JE” part of the jemalloc name). It comes with an impressive set of bells and whistles out of the box; most importantly for our purposes, it includes a set of tools for profiling memory allocations through the call stack. You can configure jemalloc to dump stats at intervals based on time, intervals based on allocations, or whenever a new high watermark of memory has been breached. You can search and categorize these stats on the commandline with jeprof, or have jeprof paint you a nice picture to show the call stacks of where memory is going. An abbreviated example looks like:
But we didn’t find anything. There's a Medium article in which the author describes having a few weeks of “existential crisis” followed by a “satisfying conclusion” when they managed to use jemalloc to get to the bottom of their issue. But this sweet relief eluded us. The trials we performed did not seem to indicate any obvious code leaks in our usage of RocksDB or in RocksDB's internals. In fact, the amount of memory purported to be being used was well within parameters. So what was the source of our problem!?
Maybe a breakthrough?
We did find something though. When we used jemalloc, the problem went away. Resident Memory was no longer perpetually increasing and remained stable. This is why the profiling tools made available through jemalloc didn't find anything—there was no longer any problem to find. Resident Memory had stabilized when we used jemalloc, and when we canary tested it in production, we saw that nodes which had started using jemalloc were doing significantly better then their cousins which had not.
While we could all exchange high-fives and go home, that didn’t seem like it would fully solve our problem. Why was it that using jemalloc made this problem go away? To understand why it helped, we needed to look at and understand what changed.
Our servers are using a Linux distribution that has glibc as the default allocator. This isn't out of the ordinary. In fact, glibc is probably the most pervasive malloc implementation out there, and there are some great resources for understanding how it works. What makes glibc and jemalloc different, though, are their respective design goals. Most importantly, jemalloc tries very hard to both put an upper bound on memory fragmentation, and to also return memory back to the operating system.
Returning memory to the OS
This was a new notion to me. Surely a good piece of code, which behaves itself and puts away its toys when it's done, will keep its resource usage to a minimum? As it turns out, that isn't exactly the case. It really comes down to the behavior of the allocator, and most allocators won't necessarily return memory right away to the OS. Jemalloc has a few ways this can be tuned, but by default it uses a time-based strategy for returning memory back to the OS. Memory allocations are serviced first by memory already held by the process that is free for reuse, and then failing that, new memory is acquired from the OS. Given enough time, should a chunk of memory not have anything assigned to it, it will be returned (and if returning resources quickly is important to you, you can configure jemalloc to return the memory immediately).
Now let’s look at how glibc malloc works. It uses mmap() and brk()/sbrk() calls in order to get more memory from the OS. mmap() will provision a chunk of memory of a given size and return a starting address. brk() (and sbrk()) will change the location of the program break, which defines the end of the process's data segment. Memory assigned to a process in this way can only be returned back to the OS if sbrk() is called in such a way as to rein in the end of the data segment (shrink it), or if madvise(MADV_DONTNEED) or munmap is called on an mmap()'d range. This is illustrated in the below diagram.
These can only be safely used so long as, in that address range which would be returned, there are no longer any active allocations being used by the process. If you run pmap on a process, you can see where these two commands come into play in the process's virtual address space. The first portion starts at a low address, and then after a few lines, it'll suddenly go to a high address. High addresses are mmap'd and brk()/sbrk() control the lower range.
Allocations are done via brk() and others via mmap(). Smaller, manageable allocations are done via brk(), while larger ones are handled via an mmap() call. Knowing how this worked, I created a bit of simple C code with a pathological pattern. It would allocate objects in two waves, and then it would delete the earlier allocations while keeping around the ones that came later. This renders the address range where they would be allocated only half-utilized by the process, but still uses the memory of all allocations made thus far.
You can watch this in action with the following code and top (though the behavior may differ slightly depending on your system’s allocator):
Running the above and charting the memory usage produces the following graph (where each data point is a stop point in the above code).
Memory usage with default system allocator (glibc)
The second and third points are after allocating some objects. The last data point is after freeing the first set of allocated objects.
What we found was that, as allocations went up, memory would also go up. However, as objects were deleted, memory would not go back down unless all objects created at the top of the address range were also removed, exposing the stack-like behavior of the glibc allocator. In order to avoid this, you would need to make sure that any allocations that you expected to stick around would not be assigned to a high order address space. If it were, that would mean that memory could not be reclaimed and would stick around. Running the same program with jemalloc shows more predictable results. As objects are allocated, memory goes up, but as objects are deallocated, memory goes down. If you download jemalloc or have it installed on your system, you can try it out with the following command (and the above code):
Memory usage with the jemalloc allocator
The dip is after deallocating the first batch of objects.
Memory fragmentation is a bit of a loaded term and can mean a lot of different things depending on the context (you can have fragmentation of physical memory (RSS), virtual memory (RSS), or of memory managed by user code. First, let's parse what fragmentation is.
Here’s an allegory for the situation. Let's say you're a city planner and you are trying to divide up a plot of land into houses with addresses. As families show up, you ask Mayor Kernel for some land, which they give to you in a single, large, contiguous block with addresses for the land plots. You start assigning plots from the lowest numbered address. Depending on the size of the family you have to accomodate, you combine adjacent plots for bigger houses for bigger families. Initially this works fine—folks are showing up and everything is filled. At some point, you've assigned all the land, so you ask Mayor Kernel for more. But now, people start leaving even as more come in. Smaller families move out, and if you have a family coming in of the same size, that plot can be reused. However, if a large family arrives, you need adjacent addresses to combine into a house to fit them. The only place this is likely to be available is at the higher number addresses (newer land plots). After a while, you might begin to notice that even though you have a large number of empty houses/addresses in total, you need more land to build new houses because you need enough space to build a house large enough to fit the incoming families. Families cannot be split up (as allocated memory must be in a contiguous address space, also that would be cruel).
This describes “external” fragmentation, but there is another flavor called “internal” fragmentation. Internal fragmentation can take another form where space that was assigned in a block is underutilized. Extending the neighborhood allegory, it would be as if we started putting small families in larger, previously-allocated houses. A house that was sized for six originally gets used for a family of four, meaning we’re now underutilizing the space.
This is what is happening when we talk about fragmentation of memory. Memory fragmentation is when your memory is allocated in a large number of non-sequential blocks with gaps that can't be used for new allocations due to size differences. The effect is that your process will be assigned an amount of memory where a good percentage is unusable.
A major difference between glibc malloc and jemalloc is that jemalloc is designed to give an "upper bound" fragmentation rate. In jemalloc, allocations are categorized by size, and bins are assigned to allocations based on what fits the size requirement the best; it guarantees that internal fragmentation will never be more than 20% for a size range, as memory bins are selected by the size which best matches the allocation. In glibc there are configurations that can be used to mitigate this (mmap() threshold, for example), but they do require that you know the size ranges of allocations going on in your process in order to be used effectively.
So between what memory is being held onto, and what is getting fragmented, how can we figure out what's getting left on the floor in our process?
gdb-heap is a fantastic resource. It leverages gdb's Python shell for executing macros that know how to parse and walk the structures of glibc for either a running process or a core dump. With it you can get all kinds of insight into what's happening in glibc's malloc for a given process, including things like stats on all the individual arenas and their heaps, as well as allocated chunks.
With some slight tweaks, we can create a simple function borrowing the existing code that prints out all the free and available chunks across all arenas. By writing a small amount of aggregation code, we can build a simple report like the following:
This was from a lab node. We found that with glibc, the total amount of unused memory could go as high as 30% in some cases. This was where our memory growth was coming from!
Conclusion and future work
We had originally picked jemalloc in order to leverage its profiling capabilities, but we ended up finding it to be a complete solution. Following this analysis, we now have installed jemalloc across Venice servers at LinkedIn. Since picking it up, we no longer see the steady RSS growth in our servers. This post only encompasses part of what we ended up tuning and analyzing to get the most out of our systems resources, though. It would be remiss to not have a follow up talking about how we configured Linux, RocksDB, and the JVM to get the most bang for our buck, and other efforts to tame and track other resource hogs in the system. Keep an eye out for more posts to follow!
This took an immense amount of work across a number of teams at LinkedIn. Specific shout outs to Olivier Lecomte, Greg Banks, and Ashish Singhai for their immense knowledge on this topic and for giving very helpful suggestions for avenues of investigation; Ali Poursamadi, Gaojie Liu, Kian Chung, Zonia Harris, Salil Gohkale, and Vinoth Govindaraj for being all hands on deck for methodically working through this investigation as well as helping to keep the site up and healthy; and also to Yun Sun and again Olivier Lecomte and Ashish Singhai who also helped review this article.