Smart alerts in ThirdEye, LinkedIn’s real-time monitoring platform
June 3, 2019
Earlier this year, we published a blog post sharing details on ThirdEye, LinkedIn’s comprehensive platform for real-time monitoring and root cause analysis. LinkedIn relies on ThirdEye to monitor a wide spectrum of metrics, ranging from product health to high-level business metrics, as well as machine learning model performance.
While monitoring is a crucial aspect of this system, setting up the right alerts is critical for receiving meaningful notifications. However, it is not easy to set up alerts with both high precision and good coverage.
First, metrics are very different from each other. Different metrics have different time granularities, ranging from month-level to minute-level. They have different patterns: some exhibit trends and seasonality changes, while others are stable. Some have many dimensions, while others have no dimensions at all. There is no one-size-fits-all solution for these heterogeneous metrics.
Also, what constitutes an anomaly is quite subjective: some users are sensitive to small fluctuations, while others only care about big changes. All of these characteristics necessitate that the alert system be flexible enough to meet a variety of user requirements.
In this blog post, we describe how we tackled the challenges described above and used a customized detection and notification flow to reduce noise and to get meaningful alerts from ThirdEye.
Introducing Smart Alert
We redesigned the alert pipeline in ThirdEye and called it “Smart Alert.” Now, the anomaly detection system is smart enough to work end-to-end from loading data to detecting anomalies to sending notifications. It mainly includes the following two parts:
Detect the right anomalies: Only detect anomalies users care about.
Select the right data: Filter out data that’s not relevant or that is too small (and therefore very sensitive to changes). Handle data with different time granularities and multiple dimensions.
Choose the right algorithm: Have the flexibility to choose from a variety of detection algorithms. The algorithm should be smart enough to learn underlying data patterns and decide the best parameters for alerts.
Reduce noise: Avoid over-alerting. For example, if anomalies with short durations keep happening, they should be merged into a single, large alert.
Send the right notifications: Send notifications to the right people at the right time.
Customize recipients: Send to different groups of people based on different anomalies.
Customize channel: Send notifications through multiple channels like email, SMS, or phone call according to configurations.
Customize frequency: Have the flexibility to configure the notification frequency. Be able to temporarily suppress alerts to avoid sending too many notifications, e.g., only alert on big changes during holidays because the metrics are more likely to fluctuate at that time.
In ThirdEye, we have separated the alert system into two flows: anomaly detection flow and notification flow. Each component of the flow is designed to be configurable and pluggable to achieve the most flexibility.
For the rest of this blog post, we will introduce the flows in detail and demonstrate how users can configure the flows using YAML configuration and set up smart alerts. We are also building a UI to make setting the configurations easier.
Anomaly detection flow
The detection workflow is responsible for detecting anomalies in time series data, which comes from different data sources. The main data sources ThirdEye uses are Pinot (a real-time OLAP datastore) and inGraphs (a time series database). Data sources like MySQL and Presto are also supported, and users can build connectors to other data sources as well.
Filter is the first step of the detection flow. Each filter rule contains a dimension name and a list of dimension values, and multiple filters can be combined; e.g., a user can set two filters as “is_valid_traffic=true” and “browser in [chrome, firefox, edge].”
Dimension Exploration is a more powerful filter. Unlike the static filter above, it is a dynamic filter: it first pulls the data, then applies the filter dynamically. It is useful for multi-dimensional metrics where the dimension distribution may change over time or where it is hard to enumerate all possible values. For instance, users might want to monitor traffic across different countries and platforms, but the full set of combinations may be too large and noisy. Instead, they can choose to check only the top k combinations by traffic size. Below are the rules users can set:
minContribution: Only monitor the dimension combinations that contribute more than X percent of the overall metric.
k: Only monitor the top k dimension combinations that contribute to the overall metric.
minValue: The aggregate value of this dimension combination must be larger than the threshold.
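The three Dimension Exploration rules above can be combined in a few lines. The following is an illustrative Python sketch, not ThirdEye's implementation; the function and parameter names (`explore_dimensions`, `min_contribution`, and so on) are hypothetical.

```python
def explore_dimensions(traffic, min_contribution=0.05, k=10, min_value=0):
    """Pick which dimension combinations to monitor (hypothetical sketch).

    `traffic` maps a (country, platform) tuple to its aggregate metric value.
    Keeps only combinations that clear `min_value`, contribute at least
    `min_contribution` of the total, and rank in the top `k` by size.
    """
    total = sum(traffic.values())
    candidates = [
        (combo, value) for combo, value in traffic.items()
        if value >= min_value and value / total >= min_contribution
    ]
    # Sort largest-first so we keep the heaviest combinations.
    candidates.sort(key=lambda cv: cv[1], reverse=True)
    return [combo for combo, _ in candidates[:k]]

traffic = {
    ("US", "ios"): 5000, ("US", "android"): 4000,
    ("DE", "ios"): 700,  ("DE", "android"): 300,
}
# ("DE", "android") is dropped: 300 / 10000 = 3% < the 5% minContribution.
print(explore_dimensions(traffic, min_contribution=0.05, k=3))
```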
Detection Rules are where anomaly detection actually happens. Multiple rules can be combined to form a more powerful rule, if desired. Rules are combined with “OR” semantics, which means an anomaly is detected if any of the rules is met.
PercentageRule: Check the metric’s week-over-x-week change. The baseline can be the value from x weeks ago, or the average/median of the last x weeks.
ThresholdRule: Detect an anomaly if the metric value is below/above thresholds.
HoltWintersRule: Use Holt-Winters algorithm to forecast and detect anomalies.
AlgorithmRule: A set of sophisticated machine learning based algorithms that can catch traffic patterns, like trends and seasonality, automatically.
In addition to these built-in rules in ThirdEye, users can also define their own rules and then plug them into the detection flow.
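As a concrete illustration of what a simple rule looks like, here is a sketch of a percentage-change rule in Python. The function name and signature are hypothetical and do not reflect ThirdEye's plugin API.

```python
def percentage_rule(current, baseline, up=0.1, down=0.1):
    """Flag an anomaly when the change versus a baseline exceeds a threshold.

    `baseline` is the comparison value (e.g. the same hour one week ago,
    or a median over the last few weeks). Returns "UP", "DOWN", or None.
    """
    change = (current - baseline) / baseline
    if change > up:
        return "UP"
    if change < -down:
        return "DOWN"
    return None

# A 30% jump against a 20% threshold is flagged as an "UP" anomaly.
print(percentage_rule(current=1300, baseline=1000, up=0.2, down=0.2))
```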
Anomaly Filters are another set of filters that work in tandem with Detection Rules. Since what constitutes an anomaly is quite subjective, users may want to filter out small fluctuations even if they are marked as anomalies by Detection Rules.
PercentageChangeFilter: Filter out anomalies with a change of less than X percent.
ThresholdRuleFilter: Filter out anomalies with current value below/above threshold.
AbsoluteChangeFilter: Filter out anomalies whose absolute change is below/above a threshold. Compared to ThresholdRuleFilter, it focuses on the “absolute change” rather than the actual “current value.”
DurationFilter: Filter out anomalies with a duration less than X. It is useful to filter out small spikes.
SiteWideImpactFilter: Filter anomalies based on their impact on the global metric. E.g., if Firefox’s traffic is 10% of global traffic, then a 5% change in Firefox is only a 0.5% change globally.
The Merger combines small anomalies to avoid over-alerting. In the real world, anomaly detection happens continuously in mini-batch style, and anomalies from consecutive batches need to be merged to avoid generating too many alerts. Two parameters can be set to control the merger:
maxGap: The maximum gap between two anomalies for them to be merged. If the gap between two anomalies is less than this value, they will be merged.
maxDuration: The maximum allowed duration of a merged anomaly.
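The two merger parameters can be illustrated with a small interval-merging sketch. This is hypothetical code, not ThirdEye's merger implementation.

```python
def merge_anomalies(anomalies, max_gap, max_duration):
    """Merge consecutive anomalies whose gap is below `max_gap`,
    without letting a merged anomaly exceed `max_duration`.

    `anomalies` is a list of (start, end) tuples sorted by start time.
    """
    merged = []
    for start, end in anomalies:
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end
            # Merge only if the gap is small and the result stays short enough.
            if gap < max_gap and (end - prev_start) <= max_duration:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged

# Three short spikes five minutes apart collapse into one anomaly.
print(merge_anomalies([(0, 10), (15, 25), (30, 40)], max_gap=10, max_duration=60))
```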
Grouper is a more advanced merger where a user can define customized logic on how to group multiple alerts. When an anomaly occurs, there may be hundreds of metrics affected. Grouper allows a user to cluster the anomalies and only send the most important alerts.
Notification flow
The notification workflow takes the anomalies detected in the anomaly detection flow and sends them out in batches. The purpose of separating the notification flow from the detection flow is to decouple the components for maximum flexibility. For example, users can set up a notification flow that triggers every five minutes to catch anomalies as soon as possible, and then create another notification flow that sends a summary report every day.
The Suppressor is useful for reducing noise during maintenance windows, deployments, or holidays. During a suppression period, users can opt to receive only the most severe alerts by setting the following parameters:
expectedChange: The expected change of the metrics.
acceptableDeviation: Combined with expectedChange, this defines the threshold for sending alerts. For example, you could set a 30% expected change and a 10% acceptable deviation. Under this rule, alerts will be suppressed when the metric change is between 20% and 40%.
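The suppression band arithmetic can be sketched with a hypothetical helper (not ThirdEye's Suppressor):

```python
def should_suppress(change, expected_change=0.30, acceptable_deviation=0.10):
    """Suppress the alert when the observed change falls inside the expected band.

    With a 30% expected change and a 10% acceptable deviation, changes
    between 20% and 40% are suppressed.
    """
    low = expected_change - acceptable_deviation
    high = expected_change + acceptable_deviation
    return low <= change <= high

print(should_suppress(0.35))  # inside [0.20, 0.40] -> suppressed
print(should_suppress(0.50))  # outside the band -> alert goes out
```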
Dimension recipients provide a flexible way to send notifications based on dimension values. As an example, you may want to send alerts to the Android on-call team if an anomaly happens on the Android platform, but send alerts to the iOS team if the anomaly only affects the iOS platform.
Alerting schemes define the notification channel. ThirdEye supports email alerts and Iris (service for paging and messaging) alerts for SMS or phone calls. More complex rules could be configured in Iris.
Let’s walk through an example of how to set up alerts on a metric. Below is a multi-dimensional metric we want to monitor:
There are five dimensions: “date,” “isValid,” “browser,” “country,” and “platform.”
There is only one metric, called “traffic,” which tracks the traffic count on the dimensions.
Here are the example alert configurations that meet the requirements below:
Filter out invalid traffic, like from bot users, with “isValid” = “true.”
Only monitor traffic change on “chrome”, “firefox” and “safari” browsers.
Monitor all countries and all platforms. But since the number of combinations is large, we want to monitor only the combinations that contribute at least 5% of the total traffic.
Use the Holt-Winters algorithm to detect anomalies, and get alerted when traffic changes in either direction.
Filter out small changes (less than 5%) even if they are flagged by the algorithm.
Send alert emails to on-call teams according to platforms.
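Putting these requirements together, an alert configuration might look roughly like the following YAML. This is a sketch only: the field names and rule type strings are illustrative approximations and may not match ThirdEye's actual schema, and the notification (subscription) configuration is omitted.

```yaml
# Illustrative sketch -- field names approximate ThirdEye's YAML
# configuration and may differ from the actual keys.
detectionName: traffic_smart_alert
metric: traffic
dataset: traffic_dataset
filters:
  isValid:
    - "true"
  browser:
    - chrome
    - firefox
    - safari
dimensionExploration:
  dimensions:
    - country
    - platform
  minContribution: 0.05        # keep combinations with >= 5% of total traffic
rules:
  - detection:
      - type: HOLT_WINTERS_RULE     # forecast-based detection, both directions
        name: detection_holt_winters
    filter:
      - type: PERCENTAGE_CHANGE_FILTER
        name: filter_small_changes
        params:
          threshold: 0.05           # drop anomalies with < 5% change
```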
Preview alert performance
Even with all the comprehensive configuration options in hand, it is not always intuitive to understand how anomaly detection will behave on real data. With preview, it is easier to understand the detection performance.
In ThirdEye, users can choose a time window and run the detection flow to see the anomalies detected during that window. They can try different settings and preview the results until they are satisfied with the performance.
In this blog post, we showed how to set up smart alerts in ThirdEye: specifically, how to customize the anomaly detection flow and the notification flow. With all the customized modules, a user can set up end-to-end smart alerts to monitor their metrics. Preview gives users confidence when defining rules. Giving control back to users in this way improves transparency and helps build trust.
We are working actively to improve detection algorithms and scale the platform to make ThirdEye smarter and more powerful. In addition to the YAML configuration, which is tailored for advanced users, we are also working on a UI to make it easier to change the settings.
We would like to thank our awesome engineers who are working hard to make ThirdEye better: Akshay Rai, Jihao Zhang, Kexin Nie, Yung-Yu Chung, Rouying Wang, Yen-Jung Chang, and Harley Jackson, as well as our wonderful technical program manager Madhumita Mantri and UX designer Oscar Bejarano. Also we would like to thank Tie Wang, Ravi Aringunram, and Yang Yang for their leadership and guidance. Finally thanks to Bo Long, Kapil Surlaker, Deepak Agarwal, and Igor Perisic for their continued support.