Making LinkedIn's Organic Feed Handle Peak Traffic
May 17, 2018
LinkedIn has recently experienced issues with handling peak load in FollowFeed, our custom timeline service that powers the organic updates members see in their activity feed. Examples of organic updates are text/article/image shares from members you are connected to, updates about comments or likes from your connections etc. In other words, the interactions from people in your social network at LinkedIn. Organic updates comprise the vast majority of all updates seen in the feed.
In this post, we’ll be talking about issues we’ve seen with handling load tests of site traffic in FollowFeed.
How this all got started
On Monday, Feb. 5, 2018, there was a scheduled load test of one of LinkedIn’s data centers. We regularly perform such load tests to ensure that our systems can handle the traffic growth we foresee coming in the next few months.
During this load test, FollowFeed had issues serving the site traffic; our main metrics for “quality of responses” took a nosedive.
A rough understanding of FollowFeed’s architecture is necessary. When you load the LinkedIn homepage, services “above” FollowFeed receive your request and send another request to followfeed-query, our broker layer. That service then creates multiple parallel requests to followfeed-storage (which stores member activity timelines). Followfeed-query then collects all the responses, merges them together, and returns the top results back.
We have two metrics we track for QoS (Quality of Service):
- Full QoS. The percentage of queries that received data from all the member timelines that were queried comprises Full QoS. So, if 100 timelines were requested, and those timelines are on 10 followfeed-storage machines, and all 10 fetches (including retries) were successful, then the Full QoS for this request was 1. If 9 out of 10 fetches succeeded, then Full QoS is 0. The top graph below shows the ratio of queries with Full QoS versus total queries received.
- Partial QoS. For the above example, Partial QoS is 1 for both 10/10 successful fetches and 9/10, all the way to 1/10. As long as we received some data back, Partial QoS is happy.
This distinction between Full and Partial QoS is important: if Full QoS takes a dip, the relevance of the returned results might suffer, but the service is still returning responses to members. But if Partial QoS takes a dip, that means we returned no results, which means the members see only non-organic updates in their feeds.
The graphs above are pretty bad; 10% of all calls got no results and a full 40% got a degraded feed experience.
We identified two high-level problems during this load test:
We didn’t appear to have the capacity to handle the planned increase in load.
We weren’t gracefully handling the extra load. If we can normally take X amount of traffic without issues and we start seeing X + 5% of traffic, we want only the 5% of excess traffic to see errors, not more than that.
Both of these are problems and both needed to be addressed. But first, a root cause analysis.
We noticed that our JVM OldGen Used Memory had spiked up on some machines the previous Friday, after which the JVM started doing back-to-back GC collections, which didn’t reduce the used memory at all.
This was an unhealthy state, but because of redundancy in our service design, other machines hosting the same partitions managed to prevent this issue from affecting site traffic until Monday, when we did a scheduled load test. At that point, the problem blew up.
We looked at the history of these graphs and saw similar spikes in the past, but they would not last for two full days (or anywhere near that).
Heap dump analysis
The obvious next step was analyzing heap dumps to identify the source of this increase in RAM usage. We tried using jhat, but it was failing to open our heap dumps (which are sizeable, at 60+ GB). During an unrelated chat with Ben Goldsbury, one of our SREs, the topic of problems with analyzing our heap dump came about; Ben mentioned Eclipse MAT. He had used it in the past with great success. He ended up helping us get up to speed with MAT by running MAT reports on the file, answering tons of our questions, etc.
We took it from there and quickly found the culprit:
The cache objects are normal, but the Jetty ServerConnector using ~7.5 GB of RAM wasn’t.
An aside about how Eclipse MAT works: it can do many things and generate lots of reports, but one of the most useful tools MAT offers is the Dominator Tree. This tool sorts all objects by their retained heap. Note that this is different than an object’s shallow heap.
An object’s shallow heap is the size (in bytes) of just that object in memory, not counting the size of any objects that this object might be referencing. An object’s retained set (not heap) is the set of objects which would be removed by the GC if this object were removed (in other words, other objects exclusively reachable through this object). The object’s retained heap is the total shallow heap of all the objects in its retained set.
To illustrate this, imagine a Java List<Integer> with a million entries in it. The List object itself occupies only a handful of bytes, mostly counts, and, critically, a reference to an array of objects where the data is stored. It is that array that is the large object, not the List itself.
So the List has a tiny shallow heap, but a large retained heap.
In the above picture we can see that the ServerConnector is holding onto a very large tree of objects. Here’s what those objects are:
It’s 500,000 SelectChannelEndPoint objects in a ConcurrentHashMap used as a Set. The biggest contributor to the size of that object is an HttpConnection object which stores our full request bytes.
So Jetty appears to be holding on to 500K requests. This did look like a bug in Jetty, but the next step was to look at our custom Jetty configuration to verify if we were doing something silly to cause this behavior.
And silliness we did find, though not enough to explain what we see here. But before we talk about what we found, we need to talk about queues.
Queues in systems at scale
It’s important to understand how queues between producers and consumers behave at scale. We won’t go into much detail (there’s literally an entire field of math devoted to this topic), but we will cover the parts relevant to our discussion.
Imagine we have the following setup in a server application:
The producer is adding items to the queue (from multiple threads) and the consumer is taking items from the queue (also from multiple threads).
Now assume that the producer is producing at a constant rate of 900 Queries Per Second (QPS). The capacity of the consumer is 1,000 QPS; that’s how much it can process. How many items are in the queue? Usually about a couple of items; one or two, maybe up to five. It depends on how threads get scheduled.
In such a setup, the maximum size of the queue (its configured capacity) doesn’t need to be big; the queue is acting as a hand-off point.
Now imagine that the producer starts producing at a faster rate than the capacity of the consumer. Say at 1,100 QPS, while the consumer can still only consume 1,000 QPS. How many items are in the queue? If the size of the queue was 10, then the number of items in it is 10. If it’s 1,000, then the number of items grows by 100 every second until it gets full. The queue very quickly becomes full, regardless of size.
If the queue is configured to reject items when full, the producer will receive feedback from the consumer that it can’t process items at the rate they are being produced. This feedback is called “backpressure.”
But in a production system, there is rarely just a single producer, queue, and consumer; there’s almost certainly a chain of these, where a consumer is also a producer for the next link in the chain. In these situations, we want the backpressure to propagate throughout the chain quickly.
Correct management of backpressure is critical to the health of systems at peak load. If little thought has been put into backpressure, then once the system hits its capacity limit, it won’t degrade gracefully even if theoretically it could have handled the peak load. In fact, it might just enter a death spiral.
The most common example would be a queue with an inappropriately large maximum size (or worse, an unbounded queue). The queue can start taking up more and more RAM, triggering GC more frequently (in a system with a GC) and thus reducing the throughput (and thus capacity of the system). The queue then becomes even fuller, the GC triggers even more frequently, and down the spiral of death we go.
The problem isn’t even limited to languages with a GC. Similar issues arise with threadpools with unbounded (or very high) max active thread limits. If thousands of threads are all trying to make progress (and few, if any, are blocked on IO), the capacity of the system plummets due to the cost of context switching (and the side effects of that, like CPU cache thrashing, etc.). Naturally, this further increases the processing backlog.
This doesn’t even mention possible failure to handle the backpressure feedback on the producer side, which brings a separate set of problems.
Auditing our Jetty configuration
Cognizant of the above, and with a hunch that some Jetty-related queues were too big, we started looking into every single Jetty config value we were setting. Some of those configs did stand out as wrong.
First, our Jetty request queue max size was set to 5,000. Experience and intuition told us this was likely wrong. Looking around, Jetty documentation recommends a value between 50 and 500. The documentation makes an important point: make sure the queue is large enough to handle a traffic spike. Unlike our simple example in the backpressure section, producer load in reality can often be uneven.
Now let’s imagine a worst-case scenario: suddenly, 500 queries arrive in the queue at the exact same moment. We know that the client (followfeed-query) has a total request timeout of 400 ms. We also know that our peak capacity for a followfeed-storage machine at this time is about 1,200 QPS. So in that 400 ms budget, we’ll be able to respond to 1,200 queries/s * 0.4 s = 480 queries. The other 20 queries we will still process and respond to, but the client won’t care (it has already moved on because of the timeout).
So that’s roughly the number of queries we can be usefully buffering up. With a request queue size of 5,000, it would take us ~4.17 seconds to respond to all of them (assuming the machine isn’t melting!), and ~4,520 responses would be wasted work.
Thus, we reduced the request queue size to 500. This is still actually too high because followfeed-query won’t sit idly by while 400 ms pass; it will start sending an identical request ~150 ms after the first one (to a separate followfeed-storage replica) if the request hasn’t arrived by that point. Those other replicas are likely to have responded to a large fraction of those 480 queries before the machine with the full queue does. We’ll be further tuning the request queue size in the near future, but a 10x reduction should be immediately helpful.
The second queue that appeared to be too big was Jetty’s acceptQueueSize parameter. We had it overridden to 2,000. Here’s what the Jetty docs have to say about it:
“The size of the pending connection backlog. The exact interpretation is JVM and operating system specific and you can ignore it. Higher values allow more connections to wait pending an acceptor thread. Because the exact interpretation is deployment dependent, it is best to keep this value as the default unless there is a specific connection issue for a specific OS that you need to address.”
You’ll notice that this has an admonition to not touch the default without a very good reason. This piled yet more suspicion on that 2,000 size, because the default is 0. Now 0 clearly looked like a sentinel value, so the question became, “What is the default?” Digging through code, we can see that Jetty passes the provided acceptQueueSize value to ServerSocket.bind() as the value of a backlog parameter. ServerSocket is a standard class in the Java Runtime Environment (JRE), so we can look into its source to see what it does when the provided value is 0 (the JavaDocs state the value is “implementation dependent”).
The source code shows that the actual default value is 50 and is provided to the listen(2) syscall. (For those that want to learn more about this TCP connection queue, here’s a great article.) For our purposes, knowing that the JRE default is 50 (with JRE defaults usually being sensible), and that we were using 2,000 for no obvious reason, was enough to experiment with using the default.
So we packaged up this config change along with the request queue size change and benchmarked them with a copy of production traffic on an isolated machine. This experiment was a success.
Deploying this change to production helped us handle excess load gracefully; we now reject requests before they eat up RAM, and the smaller request queue means it takes us less time to recover. The smaller TCP connection accept queue means we start rejecting connections when under load, which works well because those requests will be retried on other replicas.
But these changes did something else, too: they increased our capacity. We now see far fewer retries attempted from followfeed-query for the same amount of site traffic. The request queue size reduction lowered GC pressure when the machines are under heavy load. The TCP accept queue size reduction provided backpressure to followfeed-query machines that are trying to reconnect to the followfeed-storage machine that was returning errors (and closing connections on the client) when under load.
Over-exuberant retry strategy
As is common with wide-ranging investigations that look for possible causes of production issues, we’ve found plenty of problems that needed addressing besides those listed above.
One of the biggest issues was with our retry strategy in followfeed-query (the broker service in front of followfeed-storage). When a followfeed-query machine sends a request to a followfeed-storage machine hosting a particular partition, it waits for about 150 ms before concluding that something is very wrong (median latency for followfeed-storage is ~8 ms and 99th percentile latency is ~100 ms). At that point, it makes the same call to a separate replica of followfeed-storage hosting the same partition (while still waiting for the older call to complete, just in case it comes back). There’s a third fetch attempt after the second one if that too is still pending after an additional 150 ms.
The reason why this is needed is because of GC pauses; if a followfeed-storage machine hits a long GC pause, the calls that it is working on need to be retried for the sake of reducing user-visible latency.
Now imagine what will happen when a followfeed-storage machine spikes up in latency for a temporary reason (GC pause, loading data from disk into caches, etc.). A part of the call volume it was getting will be redirected (temporarily) to other machines hosting the same partitions. Since we have 4 replicas hosting an identical set of partitions, 3 machines now start taking more traffic. If those machines were near peak operating load, they start having issues as well, causing more latency increases, which causes more retries from followfeed-query, which causes yet more latency increases…this becomes a death spiral.
This is fairly obvious, so naturally, followfeed-query had a system in place to prevent this. There were (well-tuned!) rate-limiters in place for the second and third fetch attempts to stop such a situation for happening. Unfortunately, we realized that the rate-limiters were global, not per-host. In other words, the total number of retries across all the followfeed-storage machines was rate-limited in followfeed-query (and the max retry call volume was capped at ~2.5% of peak non-retry volume). Unfortunately, this meant that if a single set of machines hosting a set of partitions started having issues, there was more than enough “retry call budget” to drown those machines in 2x more requests, effectively DDOS-ing the system.
So, we made the rate-limiters per-host.
After thorough testing of this change, we deployed it everywhere. One of our data centers didn’t have the Jetty configuration changes deployed by that point and one did, so we could make direct comparisons between them (at comparable levels of traffic), since they used the same hardware. For the data center with only the per-host rate-limiter changes, we could see a sizable number of retries getting blocked by the new rate-limiters (and the death spiral was being prevented, as expected). For the data center with both the rate-limiters and the Jetty changes, almost no retries were being blocked. Thus we know the Jetty changes increased capacity.
Other ideas we tried
For context, the original configuration values were set several years ago. Knowledge of why those values were chosen is lost to time. They probably seemed like good choices back then, but the system had never really seen load that exceeded capacity until now; thus, there wasn’t a pressing need to tune those parameters.
We found several other parameters that needed tuning and other places where we could add graceful degradation as a result of our investigation. Some of these issues were hurting peak load behavior, and some these we tuned out of caution.
TCP connection limits in followfeed-query were wrong. We had min/max TCP connections per followfeed-storage host set to 10/100. We only really need 0.4 TCP connections per (followfeed-query, followfeed-storage) machine pair, even for peak load. Reducing the min/max to 1/4 completely removed TCP connection spikes during load tests. The spikes had been 7x normal TCP established connections on each followfeed-storage machine; after the change, the TCP connection count remains flat during peak load.
HTTP request timeouts at the library level were not tuned to match our application-level timeouts. App-level timeouts were 400 ms, while HTTP request timeouts were 3.5 seconds. This means that after a long GC pause, followfeed-storage could still try to process some requests that the client has long since stopped caring about.
We had no request rate-limiting inside followfeed-storage. We know the QPS at which we start failing, so putting a rate-limiter on the server end is useful for graceful degradation and providing backpressure by returning 429 Too Many Requests responses.
Out-of-date GC configuration. We haven’t needed to tune our GC configuration for a long time, and the configuration we had in production wasn’t a great match for today’s workloads. Tuning these parameters helped address GC pauses.
Another critical issue we identified was incorrect metric computation. We discovered that the QoS metrics we showed at the start of the blog post were overcounting errors by 300%; several years ago when they were created, the data produced was correct, but because of architectural changes made since then, the algorithms computing the metrics were out of date.
We’ve rebuilt our QoS metrics computation from scratch so that we account for these changes, add additional accuracy, and include new QoS metrics for an even clearer picture of production behavior.
We’ve also identified future work that will improve our capacity, graceful degradation, and ability to debug issues like this in the future.
Our request orchestration code in followfeed-query has accumulated technical debt over time that makes it very hard to change and reason about. We’ve also learned a lot about the behavior of our system since it was launched several years ago; that knowledge can be put to good use in request orchestration. We expect to be able to improve the reliability of the system here.
The way we distribute data partitions across followfeed-storage machines isn’t ideal for peak load. Today, followfeed-storage machine replicas host the same 20 data partitions (with 4 replicas). Thus if a followfeed-storage machine is having issues, the request volume gets distributed across 3 other machines. If we assigned partitions heterogeneously (where every machine would effectively have a random set of 20 partitions) while still assuring at least 4 machines host each partition, then when a machine starts having issues, the call volume would be distributed to ~80 other machines instead of 3. This would massively reduce the extra load per-machine.
Here’s why it would be 80 machines: one machine is hosting 20 partitions. There are 4 machines hosting each partition. With effectively random partition allocation, the total number of machines involved that host at least one of the 20 affected partitions is 4 * 20 = 80 machines. (Technically 79, because one machine is having problems.) Note: This calculation assumes that no machine is hosting two or more partitions from the affected set.
Missing service-level continuous headroom testing. Our issues with reaching capacity limits were only discovered during site-wide load tests. Ideally, we’d be able to identify these limits well before that point.
We’re still not sure what was the real underlying cause of that blow-up in Jetty RAM usage; we know it’s related to the number of in-flight requests and connections, and working to reduce those did help. We have pulled in the Jetty experts at the company to look at our issue.
But that issue was only one of many problems we needed to address to increase capacity and improve our ability to gracefully degrade at peak load.
Meanwhile, the changes we’ve made have stabilized the system and we’ve successfully passed several load tests, doing great not only at our original data center traffic target but at a target 10% higher as well. We have plenty of follow-up work to improve capacity and pay down accumulated technical debt in our serving path, but that work has already begun.