Spike detection in Alert Correlation
December 22, 2021
LinkedIn’s stack consists of thousands of different microservices and the associated complex dependencies among them. When a production outage happens due to an issue with misbehaving services, finding the exact service responsible for the outage is challenging and time-consuming. Although each service has multiple alerts configured in a distributed infrastructure, finding the real root cause of the issue during an outage is like finding a needle in a haystack, even with all the right instrumentation. This is because every service in the critical path of a client request could have multiple active alerts. The lack of a proper mechanism to derive meaningful information from these disjointed alerts often leads to false escalations, causing increased issue resolution time. On top of all this, imagine being woken up in the middle of the night by a NOC engineer about a site outage that they believe is caused by your service, only to realize that it’s a false escalation that is not caused by your service.
To overcome this problem, we developed Alert Correlation (AC), which aims to improve the Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR) of an incident. Our goal was to find the root cause of a service breakdown within a given period and proactively notify service owners about ongoing issues, reducing the overall MTTD/MTTR while improving the on-call experience. Alert Correlation is primarily based on alerts and metrics collected from our monitoring system, which give us a strong signal of service health. By building on the monitoring system, we can leverage the already-existing alerts and derive further alerts from them, with a strong signal-to-noise ratio.
Alert Correlation also utilizes another important service, the Callgraph, which is responsible for understanding the dependencies of a service. A callgraph is built using metrics that are already standardized at LinkedIn: for every downstream and upstream dependency of a service, the same set of metrics is available and is used to map the dependencies. The Callgraph is responsible for scraping the metric list from each service, identifying the key dependencies of each service, and building a map of those dependencies. It also collects and stores data such as the call count, errors, and latency where applicable. By utilizing the Callgraph, we can map the dependencies, identify high-value dependencies (e.g., service A calling service B at a rate of 1,000 queries per second (qps) is a high-value dependency, versus service A calling service B at 2 qps), and retrieve the associated metrics. We then use near-time analytics to find similar trending alerts between service metrics. During the time window of an issue, we can query the dependencies of a service, which yields a “confidence score” representing how confident we are that a particular dependency is the problem.
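A minimal sketch of how high-value dependencies might be picked out of callgraph data. The data shape, the service names, and the 100 qps cutoff are illustrative assumptions, not the Callgraph's actual schema:

```python
# Hypothetical callgraph data: for each service, its downstream
# dependencies and the observed call rate (qps) to each of them.
callgraph = {
    "service-A": {"service-B": 1000, "service-C": 2},  # callee -> qps
}

def high_value_dependencies(service, min_qps=100):
    """Return downstream dependencies called at or above min_qps."""
    return [dep for dep, qps in callgraph.get(service, {}).items()
            if qps >= min_qps]
```

With the data above, `high_value_dependencies("service-A")` keeps only `service-B`, mirroring the 1,000 qps vs. 2 qps example in the text.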
Figure 1: Alert Correlation high-level architecture
The Alert Correlation service periodically polls our alerting database, called “Autoalerts” (Autoalerts is LinkedIn’s alerting system for user-defined alerts), to check for active alerts in our infrastructure. From the Callgraph and the alerts data, we build a graph of unhealthy services and their dependencies, including the active alerts (metrics exceeding their set thresholds) firing for individual services in the graph. The metric data points are compared with both upstream and downstream dependencies to derive a confidence score and a severity score. The confidence score denotes the probability that a particular service is the root cause. The severity score denotes the magnitude of the adverse impact on upstream services caused by the identified root cause. These scores are algorithmically derived, though the details of that implementation are beyond the scope of this post. A module in Alert Correlation groups upstreams affected by a common root cause and generates correlation results, also referred to as recommendations, which are shared with users via different interfaces such as Slack, a web UI, and Iris (LinkedIn’s internal notification system).
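The first step described above, joining active alerts with callgraph edges to get a graph of unhealthy services, can be sketched as follows. The names and data shapes are hypothetical; the scoring itself is out of scope, as noted in the post:

```python
# Hypothetical inputs: services with firing alerts, and caller -> callee
# edges scraped from the callgraph.
active_alerts = {"service-A", "service-B"}
callgraph_edges = [
    ("service-A", "service-B"),
    ("service-B", "service-C"),
    ("service-D", "service-E"),
]

def unhealthy_subgraph(alerts, edges):
    """Keep only edges whose caller and callee both have active alerts."""
    return [(u, v) for u, v in edges if u in alerts and v in alerts]
```

Here only the `service-A -> service-B` edge survives, since `service-C`, `service-D`, and `service-E` have no active alerts; the confidence and severity scores would then be computed over this reduced graph.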
Figure 2: A Slack recommendation being posted for a service issue and its cause
Figure 2 above represents a notification shared with the service owner via Slack. This notification includes the following information:
The data center where the issue is observed, i.e., “Data center A”
The root cause service, along with the misbehaving endpoint, i.e., “Service-A” and its endpoint notifier_api
The confidence score, i.e., the probability of 0.81 for Service-A being the root cause
The severity score, measuring the magnitude of impact, i.e., 0.61
A list of impacted upstreams and the endpoints affected
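The fields of the notification listed above map naturally onto a small record type. This is an illustrative shape only (including the placeholder upstream names), not the internal schema of a recommendation:

```python
from dataclasses import dataclass

@dataclass
class Recommendation:
    """One correlation result, as surfaced in the Slack message.
    Field names here are illustrative, not the production schema."""
    data_center: str
    root_cause_service: str
    endpoint: str
    confidence: float          # probability this service is the root cause
    severity: float            # magnitude of impact on upstreams
    impacted_upstreams: list   # upstream services and affected endpoints

# The example from Figure 2 (upstream names are hypothetical placeholders):
rec = Recommendation("Data center A", "Service-A", "notifier_api",
                     0.81, 0.61, ["Upstream-1", "Upstream-2"])
```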
Spike detection in Alert Correlation
LinkedIn’s services have evolved over time, and will continue to grow and become more complex, with additional infrastructure required to back them up. Alert Correlation does an amazing job of pointing to a potential root cause when we run into a production incident.
Our alerts are generated by looking at metric trends over the last 15 days and deriving the standard deviation, which is commonly broad; in addition, some teams configure their alert thresholds quite high to avoid false positives. These historically configured alerts often lead to false-positive recommendations from the Alert Correlation engine, because the engine is sensitive to alert data affected by anomalies or spikes, as shown below.
Figure 3. A spike in a service graph.
The spike above comes from a metric affected by an anomaly; in a production scenario, multiple metrics for a service can be affected by such anomalies. Spikes are generally short-lived anomalies with a variety of possible causes, which may or may not be significant enough to be raised as an alert. They indirectly cause teams to look into the recommendations posted for a downstream or upstream service and then invest time determining whether there is a real problem or a false positive. This also increases alert fatigue and overall toil for the on-call engineer, who must figure out whether an alert is worth investigating. Hence, we wanted a way to detect these spikes in real time and classify each one as either a real alert or just a spike. To get more accurate alert recommendations, we also adopted dynamic alert thresholds that are adjusted regularly based on each alert’s past trend, giving us more adaptive thresholds.
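A dynamic threshold of the kind mentioned above could, for instance, be derived from the recent trend as mean plus a few standard deviations. The post does not describe the exact formula our alerting system uses, so this is an illustrative stand-in:

```python
import statistics

def dynamic_threshold(recent_values, k=3.0):
    """Adaptive alert threshold from the recent trend: mean + k * stdev.
    The formula and k=3.0 are illustrative assumptions."""
    return statistics.mean(recent_values) + k * statistics.stdev(recent_values)

# A metric hovering around 10 gets a threshold just above its normal range,
# instead of a fixed, hand-tuned value.
threshold = dynamic_threshold([8, 10, 12, 10, 10])
```

Recomputing this regularly lets the threshold track each metric’s own behavior, rather than relying on a static value configured long ago.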
So, we needed an anomaly detection method that was real-time, computationally cheap, and stable enough to detect sharp spikes while ensuring minimal false-negative alerts. We chose median-based estimation to detect outliers: the median is a robust estimator because it does not get skewed by the presence of a big outlier. We compute the median of the past 30 minutes of alert data together with a robust dispersion estimate called the median absolute deviation (MAD). The median absolute deviation of a set of quantitative observations is a measure of dispersion, i.e., how spread out the dataset is: the MAD is the median of the absolute deviations of the observations from their median.
Figure 4. A simple example to find MAD on a set of data
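The MAD computation illustrated above takes only a few lines; the sample data here is our own, for illustration:

```python
import statistics

def mad(values):
    """Median absolute deviation: the median of |x - median(values)|."""
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

# A single large outlier barely moves the MAD:
# median([1, 2, 3, 4, 100]) = 3, deviations = [2, 1, 0, 1, 97], MAD = 1
```

Compare this with the standard deviation of the same data, which the 100 would inflate enormously; this robustness is exactly why the MAD suits spike-laden alert data.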
We then plug the MAD and the median into the modified z-score proposed by Iglewicz and Hoaglin; any data point whose modified z-score has an absolute value greater than 3.5 is labeled a potential outlier. The modified z-score is a standardized score that measures outlier strength, i.e., how much a particular observation differs from the typical one.
Figure 5. Modified z-score, with x̃ denoting the median and 0.6745 the 0.75 quantile of the standard normal distribution, to which the MAD converges (as a fraction of the standard deviation) for normally distributed data
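A direct sketch of the Iglewicz–Hoaglin formula and the 3.5 cutoff described above, with illustrative sample data:

```python
import statistics

def modified_z_scores(values):
    """Iglewicz-Hoaglin modified z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return [0.0] * len(values)   # degenerate case: no dispersion at all
    return [0.6745 * (x - med) / mad for x in values]

def is_outlier(values, cutoff=3.5):
    """Flag points whose |modified z-score| exceeds the 3.5 cutoff."""
    return [abs(z) > cutoff for z in modified_z_scores(values)]

# For [10, 11, 10, 12, 100]: median = 11, MAD = 1, so the score of 100 is
# 0.6745 * 89, far beyond 3.5, while the other points stay well under it.
```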
Having settled on the modified z-score method for outlier detection, which is not skewed by sample size, we fetch the metrics with active alerts for an impacted service. For each metric, we fetch the data points for the 30 minutes prior to when the root cause was identified, using our metrics framework (AMF, the Auto Metrics Framework). Once we have the correct dataset, i.e., the metric data, we apply the modified z-score algorithm to each metric dataset, since a service has multiple metrics. We then clean, segregate, and group the classified data from each service metric (which holds the outlier details) based on conditions like the threshold and runs of consecutive outliers to determine whether we are seeing a real alert or just a spike.
Figure 6. A real alert being identified by the spike detection algorithm
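The per-metric step just described, applying the modified z-score over each alerting metric’s last 30 minutes of data, can be sketched as below. The metric names and data points are illustrative, and fetching from AMF is replaced by an inline dictionary:

```python
import statistics

def outlier_flags(points, cutoff=3.5):
    """Flag each data point whose |modified z-score| exceeds the cutoff."""
    med = statistics.median(points)
    mad = statistics.median(abs(x - med) for x in points)
    if mad == 0:
        return [False] * len(points)   # no dispersion, nothing stands out
    return [abs(0.6745 * (x - med) / mad) > cutoff for x in points]

# Stand-in for the metric datasets fetched from AMF for one service.
service_metrics = {
    "errors_per_sec": [2, 3, 2, 90, 2, 3],            # one sharp spike
    "latency_p99_ms": [120, 118, 121, 119, 122, 120], # steady, no anomaly
}
flags_by_metric = {name: outlier_flags(pts)
                   for name, pts in service_metrics.items()}
```

Each metric now has a boolean series of outlier flags, which the grouping step can aggregate (e.g., over five-minute windows) across all of the service’s metrics.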
A spike or anomaly is essentially an outlier in the dataset, whereas a real alert does not stand out from the pattern of the alert metric dataset. We consider an alert to be a spike if it matches the anomaly pattern along with some extra classification factors that we have identified for our use case, such as how long the alert lasted (i.e., the duration of the alert), the number of graphs to process for a service (including downstream and upstream), the confidence score, etc. All of these factors, along with our user-defined pre-filters applied to services, help us reduce the number of false negatives.
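The classification step could then combine the outlier flags with factors like duration and the confidence score. This is a deliberately simplified sketch; the run-length and confidence thresholds here are invented for illustration, not our production values:

```python
def classify_alert(outlier_flags, confidence_score,
                   max_spike_run=2, min_confidence=0.5):
    """Label an alerting metric as a 'spike' or a 'real alert'.
    max_spike_run and min_confidence are illustrative thresholds."""
    # Find the longest run of consecutive outlier points (the "duration").
    longest = run = 0
    for flag in outlier_flags:
        run = run + 1 if flag else 0
        longest = max(longest, run)
    if longest == 0:
        return "no anomaly"
    # A short-lived outlier run with low correlation confidence looks
    # like a spike; anything sustained or high-confidence is escalated.
    if longest <= max_spike_run and confidence_score < min_confidence:
        return "spike"
    return "real alert"
```

For example, a single isolated outlier point with a 0.2 confidence score would be labeled a spike, while six consecutive outlier points with a 0.9 confidence score would be labeled a real alert.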
After applying spike detection to the individual metrics of a service (i.e., to its related graphs), and aggregating and grouping the results over a five-minute window to identify real alerts, we significantly reduced the number of recommendations posted to our Slack channel: over an average week, up to 36% of alert recommendations are classified as spikes. This simple approach made anomaly classification predictable, without heavy computational requirements, running in real time, all while keeping the codebase simple to maintain. Moreover, we also reduced false negatives in the overall recommendation quality, with an accuracy of 99%. We have now integrated this feature not only into our own Slack recommendations, but also for our downstream clients, which consume Alert Correlation data via the API endpoints.
This small yet significant integration was made possible by everyone on the Alert Correlation team, from the inception of the idea to the final implementation. I would like to thank Sumit Sulakhe for his insightful knowledge during our initial design phase and code reviews, along with Vadim Nosovsky and Alex Phonpradith for carefully validating and verifying the accuracy of the results. Lastly, I would like to acknowledge the constant encouragement and support of my manager, Amal Abdul Majeed, in ensuring this endeavor concluded successfully.