Overcoming challenges with Linux cgroups memory accounting
June 8, 2022
LinkedIn's de facto search solution, Galene, is a Search-as-a-Service infrastructure that powers a multitude of search products at LinkedIn, from member-facing searches (such as searching for jobs or other members) to internal index searches. Galene’s responsiveness and reliability are paramount as it caters to many critical features.
This post discusses debugging an issue where hosts ran out of memory and became inaccessible, even though the applications were limited by cgroups. We’ll cover memory accounting in cgroups and why it is not always straightforward when multiple variables are at play. We will also discuss a case where cgroups may not account for memory the way we expect, which can be disastrous for co-hosted applications or the host itself.
This issue arose from one of the services in the search stack, the searcher-app, which is responsible for querying search indexes. The indexes are stored as flat files in a binary format specific to Galene and loaded into the searcher-app’s memory using mmap() calls. The application also uses the mlockall() call to keep the mapped files in memory and prevent them from being paged out, as paging can cause extremely high tail latencies. When mlockall() is not used, the Linux kernel can page out parts of the index that are not frequently accessed. A query requiring one of those sections will then require disk access, which increases latency. Searcher applications, like a number of other apps, are hosted in containers and use memory and CPU cgroups to limit the resources used by an application or system process on the host.
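To make the mmap() and mlockall() combination concrete, here is a minimal Python sketch. It is only an illustration and not the searcher-app's code: the function names are hypothetical, and the MCL_CURRENT and MCL_FUTURE values are the ones commonly found in <sys/mman.h> on Linux x86 and ARM, which is an assumption that does not hold on every architecture.

```python
import ctypes
import mmap

# Assumed Linux flag values from <sys/mman.h>; they differ on some architectures.
MCL_CURRENT = 1
MCL_FUTURE = 2


def map_index(path):
    """Map a read-only index file into this process's address space."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after the
        # file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lock_all_memory():
    """Ask the kernel to keep current and future mappings resident.

    Returns 0 on success or the errno value on failure, for example when
    RLIMIT_MEMLOCK is too low for an unprivileged process.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return ctypes.get_errno()
    return 0
```

Calling lock_all_memory() after map_index() pins the mapped pages, so they cannot be reclaimed under memory pressure. That is good for tail latency, but it also means the kernel has to reclaim memory from other processes instead.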
Issue 1: Low memory leads to excessive page swapping and high latency
We received an alert notification that one of our search clusters was having issues and noticed that many of our searcher-apps were down. When we tried to restart the apps, we saw that the physical host itself was not responding and needed a power cycle via the console to get any response. Two observations from the debugging stood out: before going into the “unresponsive” state, the system had a memory crunch, and once it had entered that state, no logs of any kind were generated on the host.
Fig 2: Host available memory graph
We noticed that the host was running low on memory and that there was also an increase in disk read times. This observation, along with an increase in page faults, led us to realize that the pages were being swapped too often because the host was low on memory, which led to high disk writes and slowed down read times. The search application was a major contributor to the lack of memory on the host. So, we optimized the searcher-app’s memory utilization and reduced the cgroup memory limit for the app, which in turn reserved more memory for system processes and resolved the issue.
Issue 2: An unknown cause for reserving large amounts of memory, leading to unresponsive hosts
Within six months, we had the same problem on another cluster, and during our debugging this time around, we uncovered something specific: the application tried to reserve a huge chunk of memory right before the system hung and pushed the host into an unreachable state. This led us to suspect Linux’s cgroup memory enforcement as the culprit. We wrote a small C program to try to reproduce the issue by running it inside a cgroup under a few different memory overallocation patterns, but in all cases, the Linux OOMkiller was correctly invoked and killed off the application process. We could not simulate the host-hang situation, so we had to look back at our OS metrics.
Once we established that the issue was a memory crunch, we began investigating the memory usage pattern on the host. Interestingly, we found that the application cgroup showed much less memory usage than expected.
Application cgroup total memory usage graph
The above graph shows memory usage of about 51GB before the node went unreachable. The red circle that marks the point it went unreachable is the point we will use for all of our further graphs. The ideal way to calculate the entire memory usage for the cgroup is to add up the Resident Set Size (RSS) Anonymous, page cache, and swap used by the cgroup. Because we use mlockall(), we don’t use swap, so we don’t need to worry about that here. RSS is how much memory a process currently has in main memory (RAM). The total RSS of a process is the sum of RSS Anonymous, RSS File, and Shared RSS, but the cgroup stat file used for the following cgroup graphs only shows the anonymous part. RSS File (which contains the mmapped files) is accounted for in page cache, and Shared RSS is too small to be of any significance in the calculations.
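The calculation described above can be sketched in a few lines of Python. This assumes the cgroup v1 memory.stat layout, where each line is a key followed by a value in bytes and the relevant keys are rss (anonymous memory), cache, and swap. The post does not say which cgroup version or file was used, so treat the field names as an assumption; the function names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines, the layout assumed for cgroup v1 memory.stat."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cgroup_total_bytes(stats):
    """Anonymous RSS + page cache + swap, as described in the text."""
    # A missing key (for example 'swap' when swap accounting is off) counts as 0.
    return stats.get("rss", 0) + stats.get("cache", 0) + stats.get("swap", 0)
```

On a host, the input would typically come from reading the memory.stat file inside the application's cgroup directory. The total is only as trustworthy as the cache counter it is built from.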
Application cgroup page cache usage graph
Adding up the memory usage from the previous graphs (19GB of RSS and 31GB of page cache) gives 50GB. That’s in line with the “Application cgroup total memory usage graph” shown at the beginning of this section.
Searcher application middle index size graph
From these two graphs, we can see that the base index size is 32GB and the middle index size is 12GB, which brings us to a total size of 44GB—the size of flat index files mmapped into memory. When we add the RSS value of 19GB, we get a total usage of 63GB.
So, the application is using 63GB of memory, based on the above calculation from the actual file size of the indexes and the RSS, which were verified by looking at the process on the host. This means that our cgroup is not reporting the correct amount of memory used for cache: we need 44GB of cache, but cgroup only shows 31GB.
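The arithmetic behind this comparison is easy to restate as code. The sketch below uses only the figures quoted in the text (index sizes of 32GB and 12GB, 19GB of RSS, 31GB of reported cache); the function name is hypothetical.

```python
def accounting_gap_gb(index_sizes_gb, rss_anon_gb, reported_cache_gb):
    """Compare expected usage (index files + RSS) with what the cgroup reports.

    Returns (expected_total, reported_total, unaccounted), all in GB.
    """
    expected_cache = sum(index_sizes_gb)          # mmapped index files
    expected_total = expected_cache + rss_anon_gb
    reported_total = reported_cache_gb + rss_anon_gb
    return expected_total, reported_total, expected_total - reported_total


# Base index 32GB, middle index 12GB, RSS 19GB, reported cache 31GB.
expected, reported, gap = accounting_gap_gb([32, 12], 19, 31)
```

With these inputs, the expected total is 63GB, the reported total is 50GB, and 13GB of page cache is not attributed to the application cgroup.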
The current hierarchy of our cgroups is:
Application parent cgroup
    Application 1 cgroup
    Application 2 cgroup
Now, let’s compare the application cgroup page cache usage with the parent cgroup metrics. We wanted to compare the different cgroups to identify at which level the memory was not being reported as we expected.
Application cgroup page cache usage graph
The dip in cache usage by the application cgroup is due to a restart. After the restart, we see that the application cgroup is reporting the wrong numbers for the cache. We expect around 44GB of cache, but the application cgroup only shows around 10GB just after restart, while the parent cgroup still reports the right amount of cache usage.
OOMkiller will not kick in, even when the application is using more memory than allocated, because the application cgroup is not reporting the correct memory usage. This can cause the search application to hog memory on the box and other services to become starved for memory, which leads to swapping, and eventually the system becomes unreachable.
Understanding page cache accounting in cgroups
Let us first understand how memory is being accounted for in cgroups.
RSS: This one is simple. Just add up the RSS of all the processes under that cgroup.
Cache: Shared pages are accounted for on a first touch basis. This means that any page created by a process inside a cgroup is accounted for by that cgroup. If the page already existed in memory, then the accounting gets complicated. In this case, the page will eventually get accounted to the cgroup after it keeps accessing that page aggressively.
In our stack, restarts or redeploys follow these steps:
Delete application cgroup
Create application cgroup
In our case, we deploy new indexes and then the application’s cgroup reports the correct memory usage. Once the index grows and reaches the application cgroup memory limit, the OOMkiller is invoked and the application is killed. From there, our automation kicks in and starts the application. This leads to the existing application cgroup being deleted and a new one being created. But this time, the application cgroup memory is wrong. This is because the pages for the index are already in memory, but the new application cgroup is not accounting for this. As a result, the index keeps growing and the host faces a memory crunch, which leads to thrashing (Figures 1, 2). The OOMkiller is not invoked by the application cgroup because it reports less memory than is actually being used. Our application uses mlockall() so memory cannot be swapped; this leads to other critical system applications being swapped instead, and causes the host to go into an “unresponsive” state.
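The sequence above can be illustrated with a toy model. This is a deliberate simplification and not the kernel's algorithm: it assumes a page is charged only to the first cgroup that touches it, that charges of a deleted cgroup move to its parent, and it ignores the eventual re-charging after aggressive access mentioned earlier. All class and function names are hypothetical.

```python
class ToyCgroup:
    """A cgroup that only tracks which pages are charged directly to it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.charged = set()


class ToyPageCache:
    """First-touch charging in miniature; not the kernel's real algorithm."""

    def __init__(self):
        self.owner = {}  # page id -> ToyCgroup currently charged for the page

    def touch(self, cgroup, page):
        # Only the first toucher pays; a page that is already resident
        # stays charged to whoever brought it in.
        if page not in self.owner:
            self.owner[page] = cgroup
            cgroup.charged.add(page)

    def delete(self, cgroup):
        # Assumption: charges of a deleted cgroup move up to its parent.
        for page in cgroup.charged:
            self.owner[page] = cgroup.parent
            cgroup.parent.charged.add(page)
        cgroup.charged = set()


def simulate_restart(index_pages):
    """Return (charge before restart, charge after restart, parent's own charge)."""
    cache = ToyPageCache()
    parent = ToyCgroup("parent")
    app = ToyCgroup("app", parent)
    for page in range(index_pages):
        cache.touch(app, page)
    before = len(app.charged)

    cache.delete(app)                     # automation removes the old cgroup
    new_app = ToyCgroup("app", parent)    # and creates a fresh one
    for page in range(index_pages):
        cache.touch(new_app, page)        # same index pages, already resident
    return before, len(new_app.charged), len(parent.charged)
```

In this model, simulate_restart(44) charges all 44 pages to the application cgroup before the restart, none to the new cgroup afterwards, and all 44 to the parent. That matches the pattern we observed, where the parent's numbers stayed correct while the recreated application cgroup under-reported its cache.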
Validating the findings
We did a small experiment to validate our findings. We picked one host showing lower application cgroup memory usage, stopped the application, destroyed the cgroup, and then had the machine drop all of its page cache. After that, we created a new cgroup and started the application inside it.
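A rough sketch of the middle steps of that experiment follows. It assumes a cgroup v1 style layout where a cgroup is a directory that can be removed with rmdir once it has no tasks, and it uses the standard /proc/sys/vm/drop_caches knob, where writing 1 frees clean page cache. The function name and the cgroup path passed to it are hypothetical, running it against the real paths requires root, and stopping and restarting the application is left to the caller.

```python
import os


def recreate_cgroup_with_cold_cache(cgroup_dir,
                                    drop_caches_path="/proc/sys/vm/drop_caches"):
    """Destroy the cgroup, drop the page cache, and recreate the cgroup.

    The application must already be stopped; starting it again inside the
    new cgroup is the caller's job.
    """
    os.rmdir(cgroup_dir)                  # only succeeds once the cgroup has no tasks
    with open(drop_caches_path, "w") as f:
        f.write("1\n")                    # 1 = free clean page cache
    os.mkdir(cgroup_dir)                  # fresh cgroup with no inherited charges
```

Because the page cache is empty when the application starts again, every index page is touched for the first time inside the new cgroup and is charged to it.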
Application cgroup page cache usage graph
The application cgroup showed the right amount of memory after the above steps. This verified that the issue was caused by a new application cgroup not charging pages to itself, even if the application inside it is the only one using those pages.
First, we wanted to set up proper monitoring to catch the growth of indexes and avoid running out of memory. We used metrics emitted by the application to monitor the index size and tracked the RSS memory used by the cgroup to set up an alert that would let us know when a certain threshold had been exceeded. This gave us enough time to mitigate the issue before we ran out of memory, but a sudden increase in memory could still happen in some cases, so we needed a failsafe to ensure that the host doesn’t go into an unresponsive state.
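One way such an alert could be expressed is sketched below. It sums the index size reported by the application and the cgroup's anonymous RSS, instead of relying on the cgroup's cache counter, which is the number that proved unreliable. The 90% default threshold and the function name are assumptions for illustration, not the values or tooling we used.

```python
def memory_alert(index_bytes, rss_bytes, limit_bytes, threshold=0.9):
    """Return True when index size plus anonymous RSS crosses the threshold.

    threshold is a fraction of the cgroup memory limit (hypothetical default).
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return (index_bytes + rss_bytes) >= threshold * limit_bytes
```

For example, with a 44GB index and 19GB of RSS, the check fires against a hypothetical 64GB limit but stays quiet against a 100GB one.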
The total memory used shown in the parent cgroup is still correct, as previously discussed. When the old cgroup is destroyed, the parent still retains the total memory usage numbers, which include the page cache. To ensure that the OOMkiller is invoked when the parent is breaching its limits, we are planning to put a memory limit on the parent cgroup. Doing so can cause a noisy neighbor situation, where a different co-hosted application is killed rather than the one abusing the memory, but considering that the host will go unreachable and both applications will suffer if the memory situation becomes too overloaded, this is the best current solution to the issue.
While we did consider a few other solutions (listed below), we determined that they didn’t fit our needs.
Adding a cache flush each time a cgroup is created: this would unnecessarily affect other applications running on the host because of disk I/O using up CPU cycles.
Leveraging tmpfs to host indexes: this would require changes on the application side and a different configuration for searcher hosts.
Creating a parent cgroup with limits per application cgroup: this would require extensive changes to the current provisioning and deployment tooling.
After evaluating all these approaches, we decided to go with setting a cgroup limit on the parent cgroup.
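As a minimal sketch of what setting that limit involves, the snippet below writes a byte value to memory.limit_in_bytes, the hard-limit file of the cgroup v1 memory controller (cgroup v2 uses memory.max instead). The cgroup version, the function name, and the directory passed in are assumptions; our provisioning tooling is not shown here.

```python
import os


def set_memory_limit(cgroup_dir, limit_bytes):
    """Write a hard memory limit into a cgroup v1 memory controller directory.

    Returns the path of the file that was written.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    path = os.path.join(cgroup_dir, "memory.limit_in_bytes")
    with open(path, "w") as f:
        f.write(str(limit_bytes))
    return path
```

Applied to the parent cgroup, the limit covers page cache that is charged to the parent after a child cgroup is destroyed, so the OOMkiller has a chance to act before the host runs out of memory.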
Debugging an issue is always filled with surprises and learnings. From this issue, we realized that memory accounting in cgroups can be complicated when page cache is involved. Using mlockall() can lead to critical services being swapped out when the application starts hogging memory. But most importantly, this process was a good reminder of the importance of challenging the assumptions we make during debugging—for instance, if we had questioned cgroup’s memory reporting during the initial issue, we would have had one less issue in production. After adding monitoring to detect the issue, we figured out that there were other clusters affected by this and we could fix it before it caused any production impact.
I would like to thank Kalyan Somasundaram and Mike Svoboda for helping me during the triaging. Also, a big thanks again to Kalyan for reviewing this blog post. Finally, I would like to acknowledge the constant encouragement and support from my manager, Venu Ryali.