Hodor: Detecting and addressing overload in LinkedIn microservices
February 18, 2022
LinkedIn launched in its initial form over 18 years ago, which is an eternity in the technology world. The early site was a single monolithic Java web application, and as it gained in popularity and the user base grew, the underlying technology had to adapt in order to support our ever-growing scale. We now operate well over 1,000 separate microservices running on the JVM, and each has its own set of operational challenges when it comes to operating at scale. One of the common issues that services face is becoming overloaded to the point where they are unable to serve traffic with reasonable latency.
There are a number of different ways to mitigate this; for stateless services that scale well horizontally, one of the easiest approaches is simply to overprovision the service so that it is able to handle the peak level of traffic expected. Other options are setting up circuit breakers to avoid backpressure from overloaded downstream services, or approximating a service’s capacity by pushing it to a bending point, then using that calculated capacity as a provisioning guide, or a threshold to start rejecting requests. Each of these has a cost however; overprovisioning increases hardware costs, circuit breakers only protect the client and not the service under load, and capacity approximations are only rough guidelines that don’t account for unexpected failures. At LinkedIn, we have used all of these approaches and more over the years, but until now we have not had a standard server side solution for ensuring good quality of service (QoS) for clients.
To address this shortcoming, we have developed a standard framework for our Java-based services that provides Holistic Overload Detection and Overload Remediation, aka “Hodor.” It is designed to detect service overload caused by multiple root causes, and to automatically remediate the problem by dropping just enough traffic to allow the service to recover, and then maintaining an optimal traffic level to prevent reentering overload. Since our services have disparate workloads, execution models, and traffic patterns, we needed to design for a wide range of different types of overload. The most obvious ones revolve around physical resource limits such as CPU and memory exhaustion, and I/O limits for network and disk access. There are also virtual resource limits such as execution threads, pooled DB connections, or semaphore permits. These limits may be exceeded due to increases in traffic to the service, though they can also be reached when latencies of downstream traffic increase, which can cause the number of concurrent requests being handled in the local service to increase with no change to the incoming request rate.
In this blog post, we’ll introduce Hodor, including an overview of the framework, an explanation of how it detects and mitigates overloads, and an account of how we deployed it safely to hundreds of existing services at LinkedIn.
The majority of LinkedIn’s services use our own Rest.li framework for communication, which is what we targeted first for adoption. There are a handful of other frameworks used, however, including the Play Framework and Netty, so we wanted to create something that was platform agnostic and not specific to Rest.li. There are three primary components in Hodor: overload detectors, a load shedder, and a platform-specific adapter to combine the two.
The overload detectors are responsible for determining when a service is suffering from overload and reporting the situation. Multiple detectors can be registered (including any that are application specific), and each is queried for every inbound request. This allows for detectors to operate on a request path or context-specific level if needed, and optionally shed traffic for a targeted subset of requests. Most detectors operate at a global level and so don’t do any request-specific processing, however, instead just fetching an asynchronously calculated state.
The load shedder is responsible for deciding which traffic should be dropped when a detector has determined that there is an overload (more on what happens to dropped traffic below). The load shedder needs to be intelligent about which requests should be rejected in order to ensure that too much traffic isn’t dropped, but that enough is dropped in order to provide relief to the service. This component takes as input the signal from the detectors indicating whether the system is overloaded, as well as some contextual information about the request, and returns a decision as to whether the request should be dropped. If needed, applications are able to supply a custom load shedder.
Finally, there is a platform-specific component, which converts any request-specific data into a platform agnostic format that the detectors and shedder understand. For most platforms, this can be implemented as part of the standard filter chain or a request interceptor. It queries the detectors, passes the result to the load shedder, and takes the recommendation from the shedder as to whether the request should be rejected or not. The request may then be rejected in a platform-specific way (for Rest.li we return 503s). These requests are safe for the client to retry, since they have not been handled by any application logic; the Rest.li retry behavior is discussed below.
Detecting CPU overload
When developing Hodor, we wanted to target the most common set of overload scenarios to start with, and then iteratively increase the scope to different potential causes. While we have encountered issues in the past due to network or disk I/O, they are not very common and are often due to bugs or poorly tuned service settings. Since Java’s memory management is handled by the garbage collector, we found that memory-related overload situations would often manifest as CPU exhaustion. For this reason, we decided to start with an overload detector for the CPU.
Determining when the processing power allocated to the application is exhausted seems like it should be a straightforward problem. If CPU usage is at 100%, shouldn’t that be the end of the story? After conducting a lengthy set of synthetic experiments as well as reviewing data gathered from our production services, we found the problem to be more subtle and complex than we initially anticipated. Instead of monitoring CPU usage directly at the OS/VM/container level, we found that measuring the ability to get CPU time from within the JVM provided a strong correlation with key performance indicators such as latency increases, CPU saturation, and GC activity. Measuring from within the live process provides a few benefits as well: it is architecture agnostic, so we don’t need to change anything moving from bare metal to containers to VMs, and it reflects the actual performance of the JVM, so we are aware of any safepoint pauses such as GC activity, programmatic thread dumps, or revoking biased locks. Safepoint pauses often affect application performance, but may not be reflected in traditional CPU measurements.
The approach we took to measure CPU availability is fairly straightforward conceptually. We run a background daemon thread in a loop and schedule it to sleep for an interval. When it wakes up, the actual amount of time slept is recorded and compared to the expected sleep time. The overall amount of work here is trivial, and adds no measurable impact to application performance. We collect these samples over a window, and then look at a certain percentile from the data to see if it crosses a threshold that is considered out of bounds. If too many of these windows are in violation over the given threshold of windows, we consider the service to be CPU overloaded. To provide a concrete example, we may have the thread sleep for 10ms each time, and if the 99th percentile in a second’s worth of data is over 55ms, that window is in violation. If 8 consecutive windows are in violation, the service is considered overloaded. Values for these thresholds that we use are determined by synthetic testing, as well as by sourcing data from production services and comparing it with performance metrics when the services were considered to be overloaded.
The description of the CPU overload detection given here, and the testing methodology to arrive at thresholds that are widely applicable, have been greatly simplified to fit into this overview blog entry. We plan on subsequent postings which will go into more detail.
Shedding requests when overloaded
Once we’ve determined that the service is overloaded, we need to provide relief by rejecting some requests. The “load shedder” referred to here is really a strategy that returns a recommendation to reject a request, which is handled by the platform-specific code. After experimenting with a few different approaches, the strategy that we found to work best in a broad array of traffic patterns and overload scenarios is based on limiting the number of concurrent requests that are handled.
We developed an adaptive algorithm that determines the concurrency limit based on feedback from the overload detectors, increasing or decreasing the concurrency threshold as needed. The initial concurrency limit is derived from a set of data tracking the past concurrency levels seen during periods leading up to the overload. Once an equilibrium point is reached, where the service is able to handle the given load, we periodically probe to see if we can handle additional traffic without reentering overload. If the detectors fire after the probe, we increase an exponential backoff threshold and revert back to the previous concurrency limit. We found this approach worked better than simply throttling a given percentage of traffic during our dark canary testing with fluctuating traffic patterns. With a concurrency-based approach, if traffic increases, the concurrency limit does not need to be changed; additional traffic is automatically dropped. Similarly, if traffic decreases below the level where it exceeds the determined concurrent limit, no requests will be dropped. With a percentage-based strategy, these situations are not handled well and lead to under-dropping or over-dropping requests respectively.
One thing to note here is that the threshold for the number of concurrent requests to allow is recalculated when an overload scenario is seen again. This may seem counterintuitive; if we’ve calculated the capacity of the service, why not reuse that threshold for subsequent overloads? There are two reasons for this, and both revolve around the dynamic nature of our services and the lack of assumptions that we can make about them. First, the traffic pattern to the service may change from one overload to another. For example, we may enter overload at one point due to a surge in organic traffic, but later may be pushed into overload because an offline or nearline job started running and is accessing resources different from regular site traffic. Second, the nature of the overload may be different—the effect on the service’s capacity from additional organic traffic may be much different from the effect caused by additional backpressure from downstream services leading to more inflight requests than normal.
Similar to the overload detection section above, this has been a very high-level overview of our experience developing and testing load shedding strategies. We plan on following up in more detail with a subsequent posting.
Obviously, there is a consequence to rejecting requests: someone is not getting what they expected. During periods of overload, it is necessary to deny some traffic to maintain the service’s health and liveliness, but we want to reduce the impact of shedding requests as much as possible to minimize the impact on site visitors. To facilitate this, the clients are instructed to retry a rejected request on another instance. This is safe to do independent of the request’s idempotency because it is handled in a filter chain or similar component, before any application logic has executed.
Retrying requests blindly can be problematic and lead to retry storms if an entire cluster of services is overloaded. To prevent this, we have several retry budgets in place, on both the client and server sides, using an approach heavily influenced by Google’s excellent SRE Book. Exceeding any of the budgets will prevent subsequent requests from being retried if they are dropped. If the server side budget is exhausted, it is because clients have signaled they have been retrying a lot of requests, which indicates a widespread issue that isn’t isolated to a small number of hosts. In this case, the server will stop instructing the client to retry until it appears the wider situation has been resolved. When this happens, clients will end up with failed requests, but we are prioritizing the goodput for the service. Dropping some requests to protect the service and handle the remainder is preferable to complete overload. Automatic retries have proved to be very effective for mitigating overload situations that affect a subset of hosts in a cluster.
Testing and rollout
As mentioned above, we made extensive use of synthetic testing to handle scenarios that were known to be problematic cases (such as increased traffic, added backpressure from downstream latency increases, different types of simulated application workloads, etc.), and we used dark canaries to see how things would perform with organic site traffic. These gave us confidence in our implementation, but we needed to be sure that Hodor would protect our services while not causing any harm by unnecessarily shedding load when it wasn’t warranted. The dark canary framework is extremely valuable because it uses regular site traffic to test the canaries, but it can’t be scaled out to every one of our services for testing. Our main concern for the initial rollout was that we do no harm by dropping traffic unless it was clearly necessary. To ensure this was the case, we deployed the overload detectors in monitoring mode to gather data about when they were firing, and if those firings were false positives.
With data flowing through our metrics pipeline, we were able to analyze the behavior for hundreds of services instead of the handful that we were able to test using dark canaries. We found a few interesting patterns, including that different application platforms had different levels of sensitivity, and that different versions of the JRE behaved differently. The two primary application platforms used for Rest.li are Jetty and the Play Framework, and it was interesting to see the differences in the effect they had on how the CPU detector was able to get processing time. In general, we found that in Play applications, overload would manifest as larger spikes of the detector’s thread not being able to get time on the CPU. In contrast, for Jetty applications, there were more subtle signs, and the larger spikes were not as common. This made it slightly easier to recognize overloads in Play applications, though this was not universally true, and depended somewhat on the nature of the workload. Similarly, we ran and evaluated different JREs, including those available from Oracle, Microsoft, and Azul. We found that Azul’s Zing in particular minimized scheduling delays across threads, so that the background thread was often not able to detect that the service was overloaded. We saw similar differences between OpenJDK builds from Oracle and Microsoft between Java 8 and Java 11, but to a lesser extent. Fortunately, we were able to come up with suitable threshold values that worked across different platforms and JREs, and minimized false positive signals.
To start onboarding applications, we began with those that have experienced overloads in the past. “Voyager” is the internal name of our flagship app and the associated services, and the team supporting it has seen a number of overload scenarios over the years. They built a similar QoS framework that operated using a fixed concurrency limit, and were interested in moving to something that could automatically adapt and didn’t need manual tuning. After testing high traffic rates and simulating downstream backpressure using LinkedOut with dark canaries, we were confident that Hodor would be able to provide improved protection over the existing, manually configured concurrency limits.
After onboarding Voyager, we found Hodor was providing value fairly quickly by reacting to overload situations that were not triggering the manually-set concurrency thresholds. Armed with these successes, we were able to gain interest from other teams for adoption and onboard a majority of our microservices, and are continuing rollout until all services are protected.
As mentioned above, one of the mantras for the project has been to do no harm and not cause unnecessary traffic drops. To this end, we have made service analysis and evaluation a core component of our rollout process. For services operating in monitoring mode that have not yet been onboarded, we correlate firing data from detectors with performance metrics to determine if there have been any false positives indicating that the service may not be a good candidate for adoption yet. The analysis is run over at least a week’s worth of data, which includes load testing scenarios. We have found that some services are not good candidates with our default configuration thresholds. These are almost always due to GC issues, some of which can be fixed with proper GC tuning, while others may require code changes due to inefficient memory allocation and usage patterns. Making these changes ends up being a significant win for the service, since their GC patterns improve, which leads to less CPU overhead, and they get Hodor to protect them as a side benefit.
Instances of our onboarding analysis discovering runtime issues that application owners are unaware of aren’t isolated to just poor GC activity. Adding overload detectors to our applications has surfaced unexpected behavior that owners are generally not familiar with. In one interesting example, the CPU detector was firing at regular intervals on a particular service with no obvious correlating metrics being impacted. Upon further investigation, we found that safepoint pause times were lasting several seconds, which is usually an indication that stop-the-world GC is running, though GC metrics seemed normal. Digging further, we found that the safepoint pauses were due to periodic thread dumps that were being triggered from application code. The team was unaware of this overhead, which was clearly impacting request processing at regular (though infrequent) intervals. In another example, we found that the JRE writing to GC logs caused large pauses due to excessive IO activity from other applications. This was not an unknown problem for our services, though surfacing it through Hodor was a passive way of discovering the issue, whereas normally this behavior would be more subtle and only uncovered after significant investigation.
Conclusion: Thanks Hodor!
Overload detection and remediation is a problem common to most services serving high traffic loads. We have chosen to tackle this in a holistic manner, providing an adaptive solution that works out of the box with no configuration for LinkedIn services. We hope some of the ideas shared here prove useful for teams facing similar challenges.
This has admittedly been a lengthy posting, and if you’ve made it to the end, congratulations. As mentioned above, it really has only been a high-level overview of the work done so far, and we plan to go into more detail about many of the subjects covered here in the future. There are also related areas of work that we weren’t able to even touch on, such as detectors for non-CPU overloads and traffic prioritization, which deserve a dedicated posting of their own.
This project would not have been possible without the dedication and hard work for multiple quarters by a wide range of talented people. A huge thank you to the core team: Winston Zhang, Sean Sheng, Abhishek Gilra, Salil Kanitkar, Bingfeng Xia, Nizar Mankulangara, Vivek Deshpande, Dmitry Shestoperov, Chris Stufflebeam, Bill Chen, Chris Zhang, Rick Zhou, and Deepak Bhatnagar. Additionally, a ton of gratitude goes to Ash Mishra, Goksel Genc, Ritesh Maheshwari, Eddie Bernard, and Heather McKelvey for their support and encouragement of the project. Special thanks also to Scott Meyer, Diego Buthay, and Manish Dubey for their guidance and feedback. Finally, thank you to Nina Mushiana and Rupesh Gupta for helping to review and refine this post.