Rethinking site capacity projections with Capacity Analyzer
March 16, 2021
While site outages are inevitable, it’s our job to minimize both the duration of outages and the likelihood that one occurs. One of our preemptive measures is the way we determine overall site capacity and health on an everyday basis: we load-test in production. An elegant system buckets and routes members to specific data centers from our edge nodes, and it also provides an override capability in case we need to fail out of a data center (or, in our case for testing, fail out parts of almost all data centers and direct the desired traffic to the target data center instead).
All of this is transparent to the member: all data is replicated and sessions are persisted. We have sophisticated analytics systems in place that take various inputs to determine site queries per second (QPS) and the overall member traffic served out of all LinkedIn data centers, and that can also provide projections for a quarter. Based on these factors, we set our load test targets to stay ahead of the curve.
However, a few years ago, we had unprecedented growth in traffic that broke our load testing models, and we struggled to pass load tests across our production data centers. The site was stable and healthy for the immediate peak traffic it had to serve, but we were uncertain it could take the projected future load. We needed to rethink our capacity projections, and fast.
Where we started
This presented a unique challenge: how do we capture service regressions that don’t immediately cause outages, but could in the future? We set out on a long, strange trip to look for any signals that would tell us exactly this. Our solution needed to build a site-traffic-versus-performance profile for services and their dependencies, and to reasonably project regressions at increased loads. The most immediate metrics we could think of with a direct impact on member experience were latency and errors.
We started with Little’s law and correlated overall LinkedIn traffic, also known as the Site QPS, and service traffic with service latency and errors. If the correlation was linear or sub-linear for a given Site QPS, we could be confident the service wouldn’t break during the load test. However, if the correlation was super-linear, the service was at risk of breaking.
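As an illustration of this correlation check, the sketch below (an assumption for clarity, not LinkedIn's actual model) fits latency against QPS in log-log space: an exponent near or below 1 indicates linear or sub-linear scaling, while an exponent above 1 flags super-linear growth.

```python
import numpy as np

def growth_exponent(qps, latency):
    """Fit latency ~ c * qps**b in log-log space and return the exponent b.

    b <= 1 suggests linear/sub-linear scaling (safe under increased load);
    b > 1 suggests super-linear scaling (likely to break at higher QPS).
    """
    b, _ = np.polyfit(np.log(qps), np.log(latency), 1)
    return b

# Synthetic example: one service scales linearly, another quadratically.
qps = np.array([100.0, 200.0, 400.0, 800.0])
lin = 0.5 * qps            # latency grows linearly with traffic
sup = 0.001 * qps ** 2     # latency grows super-linearly

print(round(growth_exponent(qps, lin), 2))  # 1.0
print(round(growth_exponent(qps, sup), 2))  # 2.0
```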
We could detect anomalies on services during load tests using moving averages and simple deviations; however, these were very raw and noisy signals. We tried correlating these per-service anomalies along the service interdependencies, which can be expressed as a graph (hence appropriately named the “CallGraph” at LinkedIn), to bubble up the backends that actually had issues (the cause) versus the frontends and mid-tiers that presented as issues but were really just symptoms of the problem. We had relatively good success with these combined approaches, iterating over them quite a bit, until the law of diminishing returns started making itself very evident. It was time to explore the metrics from another angle.
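A minimal version of such a moving-average detector might look like the following; the window size and deviation threshold are illustrative, and as noted above this signal is cheap but noisy on its own:

```python
import numpy as np

def moving_average_anomalies(series, window=5, k=3.0):
    """Flag points deviating more than k standard deviations from the
    trailing moving average of the previous `window` samples."""
    series = np.asarray(series, dtype=float)
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean, std = history.mean(), history.std()
        if std > 0 and abs(series[i] - mean) > k * std:
            anomalies.append(i)
    return anomalies

latency = [100, 102, 99, 101, 100, 100, 350, 101, 100]  # spike at index 6
print(moving_average_anomalies(latency))  # [6]
```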
Searching for the right signals
After promising initial returns from statistical analysis of service metrics, we started investing in machine learning to build more intelligent tooling. We took the historical performance of applications and correlated it with the site and service QPS; our MAP and recall numbers immediately soared, which helped further bring down false positives. But the solution still wasn’t perfect. Although we had a great generic anomaly detection and correlation framework, we weren’t analyzing all the signals that would be both early and right most of the time. For example, almost all our production services are written in Java or similar JVM-based languages, and an anomaly in garbage collection time or count is highly predictive of an impending service regression.
We moved our focus beyond latency and started looking at other service metrics like threadpool utilization, CPU, and Garbage Collection (GC) count and time. We needed to correlate all these anomalies on a service to determine, with reasonable certainty, whether the service was presenting symptoms of a larger problem or was itself the root cause. To make sure we had the right test data for evaluating the performance of each change, we created a golden dataset: a dataset with over three months of validated capacity regressions during load tests.
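One simple way to fold several per-metric anomaly signals into a single service-level score is a weighted combination; the metric names and weights below are purely illustrative, not LinkedIn's actual scheme:

```python
def service_anomaly_score(metric_anomalies, weights=None):
    """Combine per-metric anomaly indicators (0..1) for one service into
    a single score. Names and weights here are hypothetical examples."""
    default = {"latency": 0.3, "errors": 0.3, "gc_time": 0.2,
               "threadpool_util": 0.1, "cpu": 0.1}
    weights = weights or default
    return sum(weights[m] * metric_anomalies.get(m, 0.0) for m in weights)

# A service showing GC and latency anomalies, but healthy errors and CPU.
score = service_anomaly_score({"latency": 1.0, "gc_time": 1.0})
print(round(score, 2))  # 0.5
```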
We also invested in other techniques to greatly augment the service anomalies, analyzing their severity and assigning them scores based on their place in the CallGraph. For example, a regression in a backend service that five other services depend on and that powers critical member flows is more critical than one in an isolated service powering a non-critical function. The techniques that primarily worked for us and ended up powering the tool are discussed in brief below.
Figure: Severity scores, with and without call path grouping
Latency Severity Scoring
Latency Severity Scoring is an ML-based method to calculate severity scores based on latency; with this approach, endpoints are weighted by criticality learned from the CallGraph. The baseline latency as a function of Site QPS is calculated for every service endpoint. The observed latency is then compared against the latency projected at the observed Site QPS, and the severity is determined from the regression against that baseline to arrive at the Latency Severity Score for the service endpoint.
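A toy sketch of this idea is below, with a plain linear baseline fit and an externally supplied criticality weight standing in for the CallGraph-learned weighting; both are simplifications of the ML method described above:

```python
import numpy as np

def latency_severity(baseline_qps, baseline_latency,
                     observed_qps, observed_latency, criticality=1.0):
    """Project baseline latency at the observed Site QPS via a linear fit,
    then score severity as the relative regression from that projection,
    weighted by an endpoint criticality (a plain input in this sketch)."""
    slope, intercept = np.polyfit(baseline_qps, baseline_latency, 1)
    projected = slope * observed_qps + intercept
    regression = max(0.0, (observed_latency - projected) / projected)
    return criticality * regression

# Baseline: endpoint latency grows ~linearly with Site QPS when healthy.
base_qps = np.array([1000.0, 2000.0, 3000.0, 4000.0])
base_lat = np.array([50.0, 60.0, 70.0, 80.0])

# At 5000 QPS the baseline projects 90 ms; observing 180 ms doubles it.
score = latency_severity(base_qps, base_lat, 5000.0, 180.0, criticality=2.0)
print(round(score, 2))  # 2.0
```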
Call path grouping with leaf node detection
We knew that the services and endpoints weren’t isolated, but rather correlated via call paths. A call graph includes caller service, caller endpoint, callee service, and the callee endpoint—also called the call path. We started with the assumption that capacity issues of a service or endpoint may be caused by itself or its downstream services/endpoints; hence, it might be possible that the root causes of a set of service regressions could be identified by analyzing the call paths.
Briefly speaking, our approach had the following three steps:
Extract a sub call graph of the significant service endpoints that showed regressions, which tells us the call paths between pairs of endpoints.
Find call paths with the largest severity scores.
Find the most significant endpoint of each call path and return them to members.
From analyzing various service regressions and looking at the golden dataset, we realized the probability of a service regression being a root cause when it’s a leaf node in the call graph is significantly higher. We factored this in while weighing the service regression severity based on call path grouping.
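The leaf-node heuristic above can be sketched roughly as follows; the graph layout, scores, and boost factor are all illustrative:

```python
def rank_root_causes(regressed, call_graph, leaf_boost=2.0):
    """Rank regressed endpoints, boosting leaf nodes of the regressed
    sub call graph: a regressed endpoint with no regressed downstream
    is more likely the root cause than one whose callees also regressed.

    regressed:  {endpoint: severity score}
    call_graph: {caller endpoint: [callee endpoints]}  (the call paths)
    """
    ranked = {}
    for ep, severity in regressed.items():
        callees = call_graph.get(ep, [])
        is_leaf = not any(c in regressed for c in callees)
        ranked[ep] = severity * (leaf_boost if is_leaf else 1.0)
    return sorted(ranked, key=ranked.get, reverse=True)

# Frontend -> midtier -> backend; all regressed, backend is the leaf.
graph = {"frontend": ["midtier"], "midtier": ["backend"], "backend": []}
scores = {"frontend": 0.9, "midtier": 0.8, "backend": 0.6}
print(rank_root_causes(scores, graph))  # ['backend', 'frontend', 'midtier']
```

Even though the backend has the lowest raw severity, the leaf boost ranks it first, matching the intuition that it is the cause rather than a symptom.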
The final regression severity of various services and their endpoints was weighted based on the service criticality (i.e., the impact of a service in the overall LinkedIn services ecosystem) and its Latency Severity Score to rank and filter only the most critical service regressions that were immediately actionable. With this change, we were able to improve our top 10 service regressions by 71%.
Slope Change Filter
The Slope Change Filter is a technique to flag service regressions based on the rate of change of latency or other service metrics with respect to the increase in load applied to the service, i.e., the traffic it serves.
Figure: Calltime 95 (latency) vs Service QPS slope, before and in the anomaly window
For each load test and service endpoint, we fit one linear regression model to data from the load test window and another to data from slightly before it. We define the slope change (SC) as the difference between the two fitted slopes, and measure its significance with a two-sample t-test on the slopes, from which we calculate p-values.
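The SC formula itself was originally published as an image and isn't reproduced here, so the sketch below assumes a relative slope difference and compares the two fitted slopes via a t-statistic built from their standard errors (with a normal approximation for the p-value); all of these choices are assumptions:

```python
import math
import numpy as np

def fit_slope(x, y):
    """OLS slope of y ~ a + b*x, with its standard error."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    dof = len(x) - 2
    se = math.sqrt(resid @ resid / dof / ((x - x.mean()) ** 2).sum())
    return b, se

def slope_change(x_pre, y_pre, x_test, y_test):
    """Relative slope change between the pre-test and in-test windows,
    plus a two-sided p-value from a t-statistic on the fitted slopes."""
    b_pre, se_pre = fit_slope(x_pre, y_pre)
    b_test, se_test = fit_slope(x_test, y_test)
    sc = (b_test - b_pre) / abs(b_pre)
    t = (b_test - b_pre) / math.sqrt(se_pre ** 2 + se_test ** 2)
    p = math.erfc(abs(t) / math.sqrt(2.0))  # normal approx, fine for large dof
    return sc, p

# Synthetic endpoint: latency-per-QPS slope steepens 5x during the test.
rng = np.random.default_rng(0)
qps = np.linspace(1000, 2000, 30)
pre_latency = 0.01 * qps + rng.normal(0, 1, 30)
test_latency = 0.05 * qps + rng.normal(0, 1, 30)
sc, p = slope_change(qps, pre_latency, qps, test_latency)
print(sc > 2, p < 0.01)  # True True
```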
Generalized Severity Scoring
With Generalized Severity Scoring, we aimed to answer how metrics such as latency, error count, and endpoint QPS, along with other service metrics, could be utilized simultaneously to define the severity of a regression of a service endpoint.
We experimented with two sequential deep models for high-dimensional time series anomaly detection: the LSTM autoencoder (LSTM-AE) model and the deep structured energy-based model. We ended up using the LSTM, combined with other deep models, to detect anomalies.
The Generalized Severity Scoring, when applied with the Slope Change Filter, helped us improve our top 10 service regressions by almost 80%.
MetaRank: learning to rank root causes of capacity constraints from limited user feedback
The goal of designing a MetaRanker was to treat the ranking lists from all the above scoring and filtering methodologies as inputs and generate a new ranking list that would rank the root causes higher. The MetaRanker is supervised in the sense that it is trained using user-provided labels of root causes.
We collected data from 34 load tests that had at least one service regression labeled as a root cause. We used XGBoost to train the classifier, which consists of 100 trees, each with a max depth of 3. With the MetaRanker, the top 5 service regressions improved by more than 150%, while the top 10 improved by almost 140%.
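A rough sketch of such a meta-ranker follows, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost with the stated hyperparameters; the feature layout and labels are entirely synthetic assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature layout per candidate regression, built from the
# outputs of the earlier stages: [latency severity, slope change,
# generalized severity, is_leaf]. Labels mark user-confirmed root causes.
rng = np.random.default_rng(42)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)  # toy labeling rule

# Stand-in for the XGBoost model described above: 100 trees, max depth 3.
meta_ranker = GradientBoostingClassifier(n_estimators=100, max_depth=3)
meta_ranker.fit(X, y)

# Rank new candidate regressions by predicted root-cause probability.
candidates = rng.random((5, 4))
probs = meta_ranker.predict_proba(candidates)[:, 1]
ranking = np.argsort(-probs)  # indices of candidates, most likely first
```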
Figure: MAP and recall percentage for each scoring/filtering technique
With all said and done, we were able to bring our load test pass rate up to over 95%. Now, with every load test, we can identify the services we determine will break at the next load test target QPS. We also brought down outages due to capacity constraints by 73%, and more than halved the ratio of capacity outages to load shifts.
While we are in a comfortable place right now with regard to detecting service regressions, we identified potential ways to increase the scope of this work by analyzing the impact of downstreams on service regressions for a more robust automated Root Cause Analysis (RCA), based on the service regressions caught by the Capacity Analyzer.
In the proof of concept development, we analyzed data from multiple load tests to attribute root causes and found the downstream impact analysis to detect root causes almost impeccably in more than 77% of load tests; we had acceptable success in about 15%, and we could not satisfactorily detect a root cause in only 8% of the load tests. The RCAs ranged from a service’s internal GC pressure to a faulty downstream to even single node failures.
Figure: Host-wise latency to detect outliers and single node failures; this graph shows four outliers from three hosts
The work so far has been promising enough for us to continue investing effort into the downstream impact analysis development to enhance Capacity Analyzer and make major strides towards automatic service regression detection and root cause attribution.
Like anything complex and high quality, this project was built from the ground up by an amazing team of engineers in the Capacity Engineering and Machine Learning teams who were constantly supported by the previous and current managers. I would like to thank Anoop Nayak, Bhuvaneswaran Mohan, Binish Rathnapalan, Dong Wang, Mike Snow, Pranay Kanwar, Sanket Patel, Varun Sharma, and Vishnu C N from the Capacity Engineering team and Yang Yang, Yi Zhen, and Yungyu Chung from the Machine Learning team for their awesome work on this. I’d also like to thank the previous managers for Capacity Engineering, Jason Johnson and Rajneesh Singh, as well as the current managers Amal Abdul Majeed and Jacob Davida, for their continuous support. Finally, I’d like to thank Cyrus Dasdia for reviewing multiple iterations of this blog.