Infrastructure

Hodor: Overload scenarios and the evolution of their detection and handling

Co-Authors - Abhishek Gilra, Nizar Mankulangara, Salil Kanitkar, and Vivek Deshpande

Introduction

To connect professionals and make them more productive, it is crucial that LinkedIn is available at all times. For us, downtime means that our members and customers don't have access to the conversations, connections, and knowledge that are essential to achieving their objectives. For linkedin.com to achieve our goal of 99.9% availability, there are ~1000 carefully orchestrated services working in tandem. This means that all of these underlying services need to be highly available and reliable.

Offering such high availability to deliver the desired member and customer experience necessitates being able to, in real-time, identify and remediate different types of overloads that can plague a system. Left to their own devices, overloads can prove critical, eventually bringing down the entire site.

In our previous blog post, Hodor: Detecting and addressing overload in LinkedIn microservices, we introduced the Holistic Overload Detection and Overload Remediation (HODOR) framework. That blog post introduced the first detector we had designed for Hodor, the heartbeat detector, and discussed our concurrency based load-shedding strategy to remediate overloads. We also discussed how member impact is minimized by safely retrying dropped requests.

In this post, we’ll discuss overload scenarios that we have observed at LinkedIn, and revisit the goals of Hodor. We’ll also introduce new tools: our garbage collection (GC) overload detector and our application threadpool detector. Finally, we’ll talk about the concept of traffic tiering, which we introduced to minimize member impact while shedding load.

Goals of Hodor

Hodor serves to sense danger and protect against that danger (similar to the beloved fictional character). In this case, the danger is overloads to microservices. Here are the four main goals of Hodor:

  1. Detect different types of physical and virtual overloads in real-time with minimal overhead. This is important because Hodor runs on all services at all times with the goal of improving performance and availability. So any overhead because of Hodor is highly undesirable (and somewhat ironic).

  2. Take evasive action to mitigate overloads to improve resilience and availability, which currently includes load-shedding.

  3. Be an out-of-the-box solution that works for all LinkedIn services without service owners or SREs needing to tune the algorithm.

  4. Result in a net positive member impact. This is the most important goal – we built this solution to improve member experience so a net negative member experience because of Hodor is unacceptable. 

Holistic overload detection

Hodor detects different types of overloads using three detectors: 

  1. The heartbeat detector for CPU overloads (covered in our previous blog post)

  2. The GC detector

  3. An application threadpool detector

Additionally, Hodor uses a latency confirmation filter to prevent false positive detections. 

As the CPU heartbeat detector was covered in our previous post, we will focus on the GC detector, the application threadpool detector, and the latency confirmation filter.

Garbage Collection (GC) detector

LinkedIn production is primarily composed of Java microservices running on the Java Virtual Machine (JVM), so we needed a mechanism to quickly and accurately detect increased garbage collection activity from the JVM's automatic memory management. The idea is to observe the overhead of GC activity to detect overloads in real-time.

On each GC event, an amortized percentage of time spent in GC over a given timeframe is calculated; this is called the GC overhead. A schedule is then applied on top of this value: the GC overhead percentage range is divided into tiers called GC overhead tiers. If the duration spent in a GC overhead tier exceeds the violation period for that tier, a GC overload is signaled. The violation period is smaller for higher GC overhead tiers because a higher tier indicates more severe GC activity.
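To make this concrete, here is a minimal sketch of the tiering idea in Java. The class name, tier boundaries, and violation periods are illustrative rather than Hodor's actual values, and it simply polls the JVM's GarbageCollectorMXBeans instead of hooking individual GC events:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative sketch: compute GC overhead and signal an overload when the
// time spent in a GC overhead tier exceeds that tier's violation period.
public class GcOverheadDetector {
    // Hypothetical tier boundaries (% of wall-clock time spent in GC) and the
    // violation period (ms) each tier must persist for before signaling.
    private static final double[] TIER_BOUNDS = {10.0, 25.0, 50.0};
    private static final long[] VIOLATION_MS = {30_000, 10_000, 2_000};

    private long lastGcTimeMs = totalGcTimeMs();
    private long lastSampleNanos = System.nanoTime();
    private long tierEnteredAtMs = -1;
    private int currentTier = -1;

    /** Call periodically (e.g. once per second); returns true when a GC overload is detected. */
    public boolean sample() {
        long gcTimeMs = totalGcTimeMs();
        long nowNanos = System.nanoTime();
        long elapsedMs = (nowNanos - lastSampleNanos) / 1_000_000;
        if (elapsedMs <= 0) {
            return false;
        }
        double overheadPct = 100.0 * (gcTimeMs - lastGcTimeMs) / elapsedMs;
        lastGcTimeMs = gcTimeMs;
        lastSampleNanos = nowNanos;

        int tier = tierFor(overheadPct);
        long nowMs = System.currentTimeMillis();
        if (tier != currentTier) {   // entered a different tier: restart its clock
            currentTier = tier;
            tierEnteredAtMs = nowMs;
        }
        // More severe tiers tolerate shorter violation periods before signaling.
        return tier >= 0 && (nowMs - tierEnteredAtMs) >= VIOLATION_MS[tier];
    }

    private static int tierFor(double overheadPct) {
        int tier = -1;
        for (int i = 0; i < TIER_BOUNDS.length; i++) {
            if (overheadPct >= TIER_BOUNDS[i]) {
                tier = i;
            }
        }
        return tier;
    }

    private static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());  // -1 means "undefined" for this collector
        }
        return total;
    }
}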

Application threadpool detector

A downstream service might become slow to respond, causing requests to accumulate upstream. This is known as backpressure. For example, in Figure 1, traffic moving from server A to server B to server C flows smoothly before server C starts to slow down. This results in server B experiencing backpressure from server C. This backpressure can percolate all the way up the call tree to server A if left unchecked.

Graphic of formation of server backpressure

Figure 1: Formation of server backpressure

Backpressure typically results in a virtual resource overload (e.g., threadpool exhaustion) before a physical resource limitation is hit.

This is currently Hodor's only virtual resource overload detector. When developing the threadpool detector, the first metric we considered was queue length, because it is bound to grow when there is backpressure. The idea was to monitor the queue length to infer the overload state of a service. However, this was quickly dismissed because queue length is difficult to correlate back to a business impact indicator, such as a client-perceived metric. Also, the queue length at which a service can be considered overloaded may vary based on service design and business logic characteristics.

Queue wait times addressed both of these concerns. Through experimentation, we found that there is an inflection point beyond which queue wait times reach a point of no return for a service. Identifying this allowed us to design a detector that casts a wide net over service overloads, helping us avoid increases in client-perceived latency, which can have a very real business impact.

Graphic of threadpool overload detection using queue wait times

Figure 2: Threadpool overload detection using queue wait times

Consider a synchronous service: requests spend more time waiting in the queue when request processing time increases, whether due to an issue in the current service or in one of its downstreams. The capacity of a service can also be reached when the latencies of downstream calls increase, which causes the number of concurrent in-process requests in the current service to grow with no change to the incoming request rate. Without knowing anything about the downstream service, the upstream service can monitor its own threadpool queue wait time to detect a downstream service's distress, and that distress can then be alleviated by dropping traffic at the upstream itself. The concept of observing wait times to detect virtual resource overloads can be generalized and widely applied.
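As an illustration of this idea, the sketch below wraps a standard Java ThreadPoolExecutor to measure how long each task sat in the queue before it started executing. The wait-time threshold and violation counter are hypothetical; the actual detector derives its inflection point empirically and feeds its signal into Hodor's overload decision:

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative: an executor that records how long tasks wait in its queue.
public class WaitTimeTrackingExecutor extends ThreadPoolExecutor {
    // Hypothetical threshold; the real detector learns where the inflection point lies.
    private static final long WAIT_THRESHOLD_NANOS = TimeUnit.MILLISECONDS.toNanos(50);
    private final AtomicLong violations = new AtomicLong();

    public WaitTimeTrackingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        long enqueuedAtNanos = System.nanoTime();                  // timestamp at submission
        super.execute(() -> {
            long waitNanos = System.nanoTime() - enqueuedAtNanos;  // time spent queued
            if (waitNanos > WAIT_THRESHOLD_NANOS) {
                violations.incrementAndGet();                      // fed into the overload decision
            }
            task.run();
        });
    }

    public long violationCount() {
        return violations.get();
    }
}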

Overload detection case study

The following three graphs show distinct examples of overload in production systems. Notice that each of the three detector firings correlates with regressing business metrics, demonstrating successful overload detection.


Graphic of heartbeat detection with corresponding latency increases in production

Figure 3a: Heartbeat detection with corresponding latency increases in production

Graphic of GC overload detection with corresponding latency increases in production

Figure 3b: GC overload detection with corresponding latency increases in production

Graphic of threadpool overload detection with corresponding latency increases in production

Figure 3c: Threadpool overload detection with corresponding latency increases in production

Latency confirmation filter

As mentioned previously, Hodor’s overload detectors are designed to optimize for precision over recall. However, after Hodor was rolled out to all LinkedIn services in production, we found that in a small subset of services (about 1%), suboptimally-tuned garbage collection activity could result in micro stop-the-world GC pauses, causing the background heartbeat thread to experience CPU starvation. This would cause the heartbeat detector to fire a CPU overload even though there was no discernible impact on business metrics such as service latency.

To account for business impact, we designed the latency confirmation filter, which uses a modified moving average crossover algorithm. Two sliding windows of significantly different sizes are used to compute a slow moving average and a fast moving average. When the fast moving average crosses over the slow moving average, shifted higher by a certain factor, a latency confirmation is flagged. This algorithm is far from perfect for standalone anomaly detection; however, it works well with the heartbeat and GC detectors as a confirmation filter for sudden and significant increases in latency.
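A minimal sketch of such a crossover check is shown below; the window sizes and shift factor are illustrative placeholders rather than the values Hodor uses:

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative moving-average crossover check over per-request latencies.
public class LatencyCrossoverFilter {
    private static final int FAST_WINDOW = 10;        // hypothetical window sizes
    private static final int SLOW_WINDOW = 200;
    private static final double SHIFT_FACTOR = 1.5;   // fast average must exceed slow average * factor

    private final Deque<Double> fast = new ArrayDeque<>();
    private final Deque<Double> slow = new ArrayDeque<>();
    private double fastSum;
    private double slowSum;

    /** Returns true when the latest sample confirms a sudden, significant latency increase. */
    public boolean confirm(double latencyMs) {
        fastSum += push(fast, FAST_WINDOW, latencyMs);
        slowSum += push(slow, SLOW_WINDOW, latencyMs);
        if (slow.size() < SLOW_WINDOW) {
            return false;                              // not enough history yet
        }
        double fastAvg = fastSum / fast.size();
        double slowAvg = slowSum / slow.size();
        return fastAvg > slowAvg * SHIFT_FACTOR;
    }

    /** Adds a sample, evicting the oldest if the window is full; returns the net change to the sum. */
    private static double push(Deque<Double> window, int capacity, double sample) {
        double delta = sample;
        window.addLast(sample);
        if (window.size() > capacity) {
            delta -= window.removeFirst();
        }
        return delta;
    }
}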

Using latency to build a primary detector was never a viable option, because not every latency increase is an overload. However, every physical resource overload eventually results in a latency increase, which is what makes latency effective as a confirmation signal.

Latency confirmation filter case study

In Figure 4, the top row contains graphs showing the pure heartbeat overload signal firing even though there is no impact to latency. The confirmation filter successfully suppresses this false firing. The bottom row in this diagram contains graphs showing the pure heartbeat overload being confirmed by the latency filter because there is indeed a latency impact.

Graphic of latency confirmation filter suppressing possible false positives in heartbeat detection

Figure 4: Latency confirmation filter suppressing possible false positives in heartbeat detection

Overload remediation

LinkedIn microservices receive requests from members actively using the platform and from offline systems that need additional data from services. Needless to say, requests from members are more important than requests from offline jobs. Therefore, to prioritize dropping lower priority requests first, we introduced the concept of traffic tiering.

Tier assignments

At LinkedIn, there are three traffic tiers: optional, degradable, and non-degradable. 

Optional requests are those that can be dropped with no user impact at all. Most of the traffic from offline/nearline systems falls into this tier.

Degradable requests have some impact on the user experience, but if dropped, the impact is minimal. The primary use case for this tier is for product owners to mark requests that load degradable features on a page, which helps protect the backend during an overload. In the offline/nearline world, this tier is also used for traffic that is time-sensitive or critical due to other business requirements, and therefore considered more important than generic offline/nearline requests.

The non-degradable tier is used for requests from online members which cannot be degraded.

To support future evolution of tiers, a numeric tier ID is used to allow relative comparison and further prioritization of requests if needed. The optional tier is represented by tier ID 1000, the degradable tier by ID 5000, and the non-degradable tier by ID 10000. The larger the number, the more important the traffic, and assigning larger numbers to more important traffic leaves us room to go higher.

The traffic tier of a request is passed along the distributed call tree. Irrespective of where the request originated, the first service/application in our data center that handles the request sets a tier in the distributed call tree if one is not already set; once set, the tier is used for the rest of the call tree.
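A simplified sketch of the tier IDs and the "set once, then propagate" behavior is shown below. The header name and helper method are hypothetical, since the actual propagation rides on our internal RPC framework:

import java.util.Map;

// Illustrative tier IDs and "set the tier only if not already set" propagation.
public final class TrafficTier {
    public static final int OPTIONAL = 1_000;
    public static final int DEGRADABLE = 5_000;
    public static final int NON_DEGRADABLE = 10_000;

    private static final String TIER_HEADER = "x-traffic-tier";  // hypothetical header name

    private TrafficTier() {}

    /**
     * The first service in the data center to handle a request sets a tier if the
     * caller has not already set one; downstream hops simply forward the value.
     */
    public static int resolveTier(Map<String, String> requestHeaders, int defaultTier) {
        String existing = requestHeaders.get(TIER_HEADER);
        if (existing != null) {
            return Integer.parseInt(existing);   // tier already set upstream: keep it
        }
        requestHeaders.put(TIER_HEADER, Integer.toString(defaultTier));
        return defaultTier;
    }
}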

Load shedder integration

Requests are prioritized for dropping based on these traffic tiers. Optional requests are dropped in favor of degradable requests, whereas degradable requests are dropped in favor of non-degradable requests. When there are enough optional/degradable requests present, we avoid dropping non-degradable requests to minimize user impact of load-shedding.

Within a tier, users are placed into different buckets by simply hashing member IDs. We drop requests from selected buckets to ensure a consistent experience and minimize the number of impacted call trees, and therefore members and customers. If the service is unable to recover after dropping a small number of buckets or lower tier requests, the threshold can be raised to shed more buckets or move to a higher tier. This process can continue until the system enters a stable state, or is able to recover fully and again begin accepting all traffic.
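As a rough illustration of the bucketing and cutoff logic described above, the sketch below hashes a member ID into a stable bucket and decides whether a request should be dropped given the current shedding cutoff. The bucket count and exact cutoff semantics are assumptions for the example:

// Illustrative bucketing and drop decision; bucket count and cutoff semantics are assumed.
public final class ShedDecision {
    private static final int NUM_BUCKETS = 100;   // hypothetical bucket count

    private ShedDecision() {}

    /** Stable bucket assignment so a given member sees a consistent experience while shedding. */
    public static int bucketOf(long memberId) {
        return Math.floorMod(Long.hashCode(memberId), NUM_BUCKETS);
    }

    /**
     * Drop everything below the cutoff tier and, within the cutoff tier, only the buckets
     * below the cutoff bucket. Raising either value sheds more load; lowering it restores traffic.
     */
    public static boolean shouldDrop(int requestTier, int bucket, int cutoffTier, int cutoffBucket) {
        if (requestTier < cutoffTier) {
            return true;
        }
        return requestTier == cutoffTier && bucket < cutoffBucket;
    }
}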

In our previous blog post, we discussed a concurrency-based load-shedding algorithm, which employed concurrency as the normalization metric to represent the load on a system. In the concurrency-based load shedder, the concurrency value was used to modulate the load on the system by deciding how much traffic to shed. During the integration of traffic tiering with the concurrency-based load shedder, we found that concurrency-based histograms do not work as expected because the requests dropped do not contribute anything meaningful to the concurrency histogram.

Due to this limitation, Hodor introduced a rate-based load-shedding strategy and a corresponding rate-based histogram per tier. The rate-based algorithm works the same as the percentage-based and concurrency-based algorithms described earlier, except that it uses throughput as the load metric. The load shedder now uses a throughput-based histogram to track traffic tiers based on the incoming traffic patterns.

At runtime, for each endpoint, a histogram is built for the distribution of traffic tiers and user buckets of the incoming requests, which allows for more effective load-shedding. For instance, if a service only receives degradable traffic, then the algorithm knows that there is no lower priority than degradable to shed.

The traffic-tiering load shedder queries the histogram to identify the right traffic tier and bucket to drop. On overload, any traffic below the identified tier and bucket is dropped by the load shedder.
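As a rough sketch of how such a histogram might be queried, the example below records the tiers of incoming requests for an endpoint and returns the lowest populated tier as the place to start shedding. A real implementation would also track user buckets and throughput over a sliding time window:

import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative per-endpoint tier histogram: record which tiers actually arrive,
// and ask for the lowest populated tier when shedding must begin.
public class TierHistogram {
    private final ConcurrentSkipListMap<Integer, LongAdder> countsByTier = new ConcurrentSkipListMap<>();

    /** Record an incoming request's tier (called on every request for this endpoint). */
    public void record(int tierId) {
        countsByTier.computeIfAbsent(tierId, id -> new LongAdder()).increment();
    }

    /**
     * The lowest tier that is actually present in traffic; if an endpoint only
     * receives degradable requests, there is nothing cheaper than degradable to shed.
     */
    public int lowestPopulatedTier() {
        return countsByTier.isEmpty() ? -1 : countsByTier.firstKey();
    }
}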

Load shedding to prevent overload case study

The following graphs (Figure 5) are from an A/B experiment we conducted. The red host was not protected by load shedding, whereas the blue host had load shedding enabled. Observe that as the overall QPS increases, the protected host is forced to increase the number of requests it drops. Also observe that the overall tail latency is lower on the protected blue host; the few spikes occur where the load shedder is probing to increase the concurrency limit. In contrast, the tail latencies on the unprotected red host are much worse.

Graphic of load shedding and corresponding improvement in business metrics as a result

Figure 5: Load shedding and corresponding improvement in business metrics as a result

Conclusion

Hodor is currently running on 1000+ microservices at LinkedIn and has prevented hundreds of overloads in LinkedIn production systems. Due to its success and accuracy, there is widespread interest across LinkedIn to expand the use of Hodor’s signals for autoscaling and safe load testing in production. New detectors are also being considered for other types of virtual overloads and overloads in data services. 

Under the watchful eye of Hodor, we hope to keep our teams safe from overloads and our members and customers safe from downtime.

Acknowledgements

This project would not have been possible without the dedication and hard work for multiple quarters by a wide range of talented people. A huge thank you to the core team: Winston Zhang, Sean Sheng, Bryan Barkley, Bingfeng Xia, Supriya Madhuram, Dmitry Shestoperov, Chris Stufflebeam, Bill Chen, Chris Zhang, Rick Zhou, Ron Meng, John Stewart, Yaojia Lyu, Neha Surendranath, and Deepak Bhatnagar. Additionally, a ton of gratitude goes to Ash Mishra, Goksel Genc, Ritesh Maheshwari, Shiva Raman, Eddie Bernard, and Heather McKelvey for their support and encouragement of the project. Special thanks also to Scott Meyer, Diego Buthay, and Manish Dubey for their guidance and feedback. Finally, thank you to Kayla Gugliemo, Nishan Weragama, Greg Earl, Benito Levya, and Katherine Vaiente for helping to review and refine this post.