Analyzing anomalies with ThirdEye

Yen-Jung Chang

ML Research Scientist at Facebook

February 20, 2020

Co-authors: Yen-Jung Chang, Yang Yang, Xiaohui Sun, and Tie Wang

At LinkedIn, ThirdEye is the backbone of our monitoring toolkit. We use it to keep track of a variety of metrics, whether they relate to production infrastructure, AI model performance, or business impact, such as page views or click counts. It’s a key quality assurance system because it provides rules-based or model-based anomaly detection to reduce false alarms, and multiple interactive root cause analysis tools to help our engineers narrow down the cause of an anomaly. In fact, it has successfully detected several anomalies that could have otherwise slipped through the cracks and significantly impacted the member experience.

In previous blog posts, we focused on the early steps of anomaly detection with ThirdEye: real-time alerting and collaborative analysis, and creating smart alerts. However, receiving an alert is only the first step for a user. In this blog post, we focus on the behind-the-scenes functionality of ThirdEye that analyzes multi-dimensional time series data and, through a dimension heatmap, helps our engineers understand why an anomaly happened.

Data cube

In modern systems, data is usually aggregated or summarized along multiple dimensions so that users can understand the specific impact on different subpopulations. For example, when looking at total page views, business analysts typically want to know how page views change across different countries, platforms, etc. The table below shows a hypothetical example of such breakdowns. This is known as a data cube, which enables users to slice data and gain a better understanding of its variations. At LinkedIn, such data cubes are pre-aggregated and stored in a real-time OLAP engine called Pinot.

Date	Version	Browser	Country	Platform	Member Page Views
2020-01-01	1.0.1	chrome	canada	iOS	100
2020-01-01	1.0.1	firefox	canada	Android	200
2020-01-01	1.0.1	safari	mexico	iOS	100
2020-01-01	1.0.1	safari	mexico	Android	300
2020-01-01	1.0.1	chrome	united states	iOS	600
2020-01-01	1.0.1	firefox	united states	Android	400
2020-01-01	1.0.1	firefox	united states	iOS	400
...	...	...	...	...	...

The table representation of a data cube
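To make the idea of slicing concrete, here is a minimal sketch in plain Python that sums page views per dimension value, optionally under a filter. It uses only the rows visible in the table above (the elided rows are not included), and the names `ROWS`, `COLUMNS`, and `slice_by` are hypothetical helpers for illustration, not part of ThirdEye or Pinot.

```python
from collections import defaultdict

# Rows copied from the table above; the last field is Member Page Views.
ROWS = [
    ("2020-01-01", "1.0.1", "chrome", "canada", "iOS", 100),
    ("2020-01-01", "1.0.1", "firefox", "canada", "Android", 200),
    ("2020-01-01", "1.0.1", "safari", "mexico", "iOS", 100),
    ("2020-01-01", "1.0.1", "safari", "mexico", "Android", 300),
    ("2020-01-01", "1.0.1", "chrome", "united states", "iOS", 600),
    ("2020-01-01", "1.0.1", "firefox", "united states", "Android", 400),
    ("2020-01-01", "1.0.1", "firefox", "united states", "iOS", 400),
]
COLUMNS = ("date", "version", "browser", "country", "platform")


def slice_by(rows, dimension, filters=None):
    """Sum page views per value of `dimension`, keeping only rows that match `filters`."""
    filters = filters or {}
    dim_index = COLUMNS.index(dimension)
    totals = defaultdict(int)
    for row in rows:
        if all(row[COLUMNS.index(d)] == v for d, v in filters.items()):
            totals[row[dim_index]] += row[-1]
    return dict(totals)


# Break down total page views by country.
print(slice_by(ROWS, "country"))
# Break down by platform, restricted to one country.
print(slice_by(ROWS, "platform", {"country": "united states"}))
```

In production this aggregation would be pushed down to the OLAP store rather than done in application code; the sketch only shows what a "slice" computes.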

Dimension heatmap

The dimension heatmap provides a visualization of how the multi-dimensional metric changes when compared to a baseline and is one of the most popular root cause investigation modules used in ThirdEye. The baseline is usually selected before the anomaly period. By default, it is the same time period from the week prior.

The figure below shows an example of dimension heatmaps. The metric has 12 dimensions. Each row represents one dimension, and each cell within the row is a dimension value. The size of a cell is determined by the share of the current total traffic (e.g., page views) that its dimension value accounts for. If there are too many small values, they are grouped into “OTHER.”

A dimension heatmap with the filter, country="united states"

The color is decided by the difference between the current value and the baseline value for the same dimension value. If the difference is positive, then the color is blue; if negative, it is red. There are multiple ways to define what a “change” is and the impact it has. For example, if the metric value corresponding to some dimension value changes from 100 to 200, this is a large change relative to the value itself, but if the total value is 1 million, then the change is small in absolute terms and will not be highlighted.
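The sizing and coloring rules for one heatmap row can be sketched as follows. This is an illustration under stated assumptions, not ThirdEye's rendering code: the function name `heatmap_cells`, the `min_share` threshold for grouping small values into "OTHER", and the sample numbers are all hypothetical, and only the sign of the difference is used for the color.

```python
def heatmap_cells(current, baseline, min_share=0.05):
    """Return (dimension value, share of current total, color) for one heatmap row.

    `current` and `baseline` map dimension values to metric values.
    `min_share` is a hypothetical cutoff below which values are merged into "OTHER".
    """
    total = sum(current.values())
    grouped = {}
    for value, cur in current.items():
        key = value if cur / total >= min_share else "OTHER"
        prev_cur, prev_base = grouped.get(key, (0, 0))
        grouped[key] = (prev_cur + cur, prev_base + baseline.get(value, 0))

    cells = []
    for key, (cur, base) in grouped.items():
        diff = cur - base
        # Blue for an increase over baseline, red for a decrease.
        color = "blue" if diff > 0 else "red" if diff < 0 else "neutral"
        cells.append((key, cur / total, color))
    return cells


# Hypothetical browser breakdown for the current and baseline periods.
current = {"chrome": 600, "firefox": 380, "opera": 20}
baseline = {"chrome": 500, "firefox": 450, "opera": 30}
for cell in heatmap_cells(current, baseline):
    print(cell)
```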

There are three ways to measure changes in ThirdEye:

Percentage change: The change in the cell's metric value relative to the cell's own baseline value.
Change in contribution: The change in the cell's share of the total, i.e., its share of the current total minus its share of the baseline total.
Relative contribution to overall change: The change in the cell's metric value relative to the size of the change in the overall total.

The tables below show an example calculation for each of the change metrics:

value	current	baseline
cell	5	10
total	50	120

Table 1.1: The current and baseline values of a cell and total

Percentage change	(5 - 10) / 10 = -50%
Change in contribution	5/50 - 10/120 = +1.7%
Contribution to overall change	(5 - 10) / (120 - 50) = -7.1%

Table 1.2: The metrics to measure the changes in the cell in 1.1
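The three calculations in Table 1.2 can be reproduced with a few lines of Python. The function names are hypothetical. One assumption is worth flagging: the table divides by (120 - 50), so this sketch takes the denominator of the third metric to be the magnitude of the overall change, which reproduces the table's -7.1%; the sketch also does no zero-division handling.

```python
def percentage_change(cell_cur, cell_base):
    """Change of the cell relative to its own baseline."""
    return (cell_cur - cell_base) / cell_base


def change_in_contribution(cell_cur, cell_base, total_cur, total_base):
    """Current share of the total minus baseline share of the total."""
    return cell_cur / total_cur - cell_base / total_base


def contribution_to_overall_change(cell_cur, cell_base, total_cur, total_base):
    """Cell change divided by the size of the overall change.

    Assumption: the denominator is |total_cur - total_base|, matching the
    (120 - 50) used in Table 1.2.
    """
    return (cell_cur - cell_base) / abs(total_cur - total_base)


# Values from Table 1.1.
cell_cur, cell_base, total_cur, total_base = 5, 10, 50, 120
print(f"{percentage_change(cell_cur, cell_base):+.1%}")
print(f"{change_in_contribution(cell_cur, cell_base, total_cur, total_base):+.1%}")
print(f"{contribution_to_overall_change(cell_cur, cell_base, total_cur, total_base):+.1%}")
```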

Rendering the whole heatmap takes less than 1 second, even for a very complex cube. We achieved this by leveraging the extremely low-latency computation of the underlying Pinot store. The heatmap is also interactive: users can click on any dimension value to update the dimension heatmap. The selected dimension value is then used as a filter to slice and dice the data in real time. For example, the heatmap figure above was rendered with the filter (country = "united states").

Data cube algorithm

The dimension heatmap is extremely helpful for exploring data and tracing an underlying issue to a change in a specific dimension. However, challenges remain when a metric has many dimensions.

A data cube may contain 10 to 20 dimensions, which translates into millions of data segments. It is very hard to manually explore the heatmap to find out which dimension change contributes the most, because the heatmap can only drill down one dimension at a time. If there are 10 dimensions and each has 5 dimension values, there are up to 5^10 (nearly 10 million) possible leaf segments, so the manual process quickly becomes tedious and error-prone.

To counter this, we developed an algorithm in ThirdEye to identify which data cube change is the most anomalous and rank the cubes according to their contributions. The figure below shows an example summary. By looking at the summary, we immediately know that (dimension1="Other", dimension2="All", dimension3="All") contributes most to the overall change.

An example summary generated by data cube algorithm

Selection of nodes from dimension hierarchies
In this section, we describe how to select nodes from dimension hierarchies, which are used to generate the above summary table. As an example, we will use an additive metric with three dimensions: continent, country, and state. The image below shows the dimension breakdown of these cubes, which forms a tree structure. Note that this figure omits some nodes due to space constraints. For instance, we only show FR as a child of Europe, with the children of FR omitted. In the full tree, every leaf is located at the third level.

The metric breakdown of baseline and current data cubes with continent, country, and state dimensions

In the above tree structure, the name near the circle is a dimension value for the corresponding level. When we drill down to the deeper levels of the tree, the dimension values are appended to the previous ones. For example, the node "North America" represents the data segments whose (continent = "North America"); the node US represents the data segments whose (continent = "North America", country = "US"). The root node does not have any dimension value because it is the aggregation of all data segments. Similarly, we can break down the metric by dimensions. The two trees show the metric breakdown of the baseline and current data cubes, respectively. One problem remains: how do we find the nodes from the tree that have “significant” changes?

Defining change significance score
Let’s dive into how we evaluate the significance of the change for one node. In ThirdEye, we calculate the impact of the change with three factors: change ratio, change difference, and segment contribution. Intuitively, change ratio measures how big the change is. Change difference measures the unexpected change compared to the node's parent; in other words, how surprising the change is given its parent’s change. The segment contribution measures the physical impact of the node.

More formally, given the baseline and current values of a node n and its parent, the change significance score is calculated as:

where v_B and v_C are the baseline and current node values, respectively; r is the expected change ratio between the baseline and current values, taken from the parent node and defined as r = (v_parent_C) / (v_parent_B); contribution_C is the contribution of the current node; and contribution_all is the overall contribution. For additive metrics, the contribution can simply be calculated as (v_B + v_C). For ratio metrics, it could simply be (Denominator_B + Denominator_C + Numerator_B + Numerator_C), or it could be calculated from an additional additive metric.
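The quantities named in this definition can be computed directly. The sketch below covers only those inputs and does not attempt to combine them into the final score. Several points are assumptions for illustration: the function name `score_inputs`, reading "change ratio" as v_C / v_B, reading "change difference" as the gap between v_C and the value expected from the parent (v_B * r), and taking contribution_all to be the root's (v_B + v_C). The sample numbers are the US and California rows of the summary table later in this post, with the root totals obtained by summing that table's rows.

```python
def score_inputs(v_b, v_c, parent_b, parent_c, contribution_all):
    """Compute the inputs to the change significance score for an additive metric."""
    r = parent_c / parent_b      # expected change ratio, taken from the parent
    expected_c = v_b * r         # current value if the node had moved with its parent
    return {
        "expected_change_ratio": r,
        "change_ratio": v_c / v_b,                                # assumption: size of the node's own change
        "change_difference": v_c - expected_c,                    # assumption: surprise relative to the parent
        "segment_contribution": (v_b + v_c) / contribution_all,   # share of overall contribution
    }


# Node (N. America, US, California): 15 -> 25; parent (N. America, US): 30 -> 57.
# Root: 120 -> 200, so contribution_all = 120 + 200 under the additive assumption.
inputs = score_inputs(v_b=15, v_c=25, parent_b=30, parent_c=57, contribution_all=320)
for name, value in inputs.items():
    print(f"{name}: {value:.3f}")
```

In this example California grew, but by less than its parent did, so the change difference is negative even though the change ratio is above 1.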

Roll up significance scores
The importance of a dimension d (e.g., country) is defined as the sum of the significance scores of all its children (e.g., (country="US"), (country="FR"), and so on). Formally, the dimension importance of d is:

$$\mathrm{importance}(d) = \sum_{i=1}^{m} \mathrm{score}(n_i)$$

where m is the number of children of dimension d and n_i is its i-th child. Afterward, the dimension importance is used to determine the tree structure in the figure below, i.e., the root level is the most important dimension. Finally, each parent node picks its top k (i.e., the summary size) children nodes, and we merge the results of the parent nodes from the bottom to the top; each merge operation keeps only the top k nodes.
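The roll-up described above can be sketched in two steps: rank dimensions by the sum of their children's scores, then merge candidate nodes while keeping only the top k. This is a simplified illustration, not ThirdEye's implementation; the function names, the flat (label, score) node representation, and all scores are hypothetical.

```python
import heapq


def dimension_importance(child_scores):
    """Importance of a dimension: the sum of its children's significance scores."""
    return sum(child_scores)


def order_dimensions(scores_by_dimension):
    """Order dimensions from most to least important (root level first)."""
    return sorted(
        scores_by_dimension,
        key=lambda d: dimension_importance(scores_by_dimension[d]),
        reverse=True,
    )


def merge_top_k(candidate_lists, k):
    """Merge the candidates of several parents, keeping only the k highest-scoring nodes."""
    merged = [node for nodes in candidate_lists for node in nodes]
    return heapq.nlargest(k, merged, key=lambda node: node[1])


# Hypothetical significance scores of each dimension's children.
scores_by_dimension = {
    "state": [0.5, 0.2, 0.1],
    "continent": [4.0, 2.5, 1.0],
    "country": [3.0, 1.0],
}
print(order_dimensions(scores_by_dimension))

# Hypothetical (node label, score) candidates already picked by two parent nodes.
us_children = [("US/California", 3.0), ("US/OTHER", 2.0)]
ca_children = [("CA/(ALL)", 2.5)]
print(merge_top_k([us_children, ca_children], k=2))
```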

Present data cube changes
Finally, with the nodes selected, we can summarize the change using a table. The input to our original problem is two data cubes, in which each row is a node at the third level of the tree.

The change delta and selected nodes between the data cubes

Suppose that the above diagram shows the rolled-up results of our baseline and current data cubes, in which the bolded nodes are selected during the rolling up. The following shows the table to represent the summary:

Continent	Country	State (Province)	Baseline traffic	Current traffic
OTHER	(ALL)	(ALL)	55	77
Europe	(ALL)	(ALL)	25	43
N. America	CA	(ALL)	10	23
N. America	US	OTHER	15	32
N. America	US	California	15	25

The table representation of the difference summary

A node of the tree in the above diagram is either a data segment (leaf node) or grouped data segments (non-leaf node). In the table, the row of a leaf node contains all three dimensions. The row of a non-leaf node contains only one or two dimensions and the remaining dimensions are either (ALL) or OTHER. (ALL) indicates all its children nodes are grouped together, while OTHER means that some children are not included. For example, the node (N. America, US, OTHER) does not include the child (N. America, US, California).
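The OTHER row can be derived as the parent's total minus the children that are listed separately. Here is a minimal sketch; `other_row` is a hypothetical helper, and the US totals (30 baseline, 57 current) are obtained by adding the two US rows of the table above.

```python
def other_row(parent_totals, selected_children):
    """Return (baseline, current) for the OTHER row of a parent node.

    `parent_totals` is the parent's (baseline, current) pair, and
    `selected_children` lists the (baseline, current) pairs of the children
    that already have their own rows in the summary.
    """
    baseline = parent_totals[0] - sum(b for b, _ in selected_children)
    current = parent_totals[1] - sum(c for _, c in selected_children)
    return baseline, current


us_totals = (30, 57)     # (N. America, US, (ALL)), summed from the table
california = (15, 25)    # (N. America, US, California)
print(other_row(us_totals, [california]))  # values for (N. America, US, OTHER)
```

When no child is listed separately, the same call with an empty list returns the parent's own totals, which corresponds to an (ALL) row.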

Success stories with the data cube algorithm

Explaining a drop in business metrics
LinkedIn's data science teams have adopted ThirdEye to analyze changes in business metrics, such as member page views, job views, etc. Depicted below is the summary of an example incident in which we saw traffic increase by about, say, 3 percent from the previous week. The algorithm observed an unusual traffic shift from {UserInterface = "ios"} to other user interfaces. After investigation, we found that a tag for the traffic was missing from the new version of the data manager inside the iOS app, which caused the traffic to shift away from "ios".

A mock up for the difference summary of an online serving issue

Explaining online model issues
As a unified anomaly detection platform, ThirdEye is also used for AI model monitoring. It provides the health assurance pillar for LinkedIn’s machine learning infrastructure. Many consumer-facing recommendation systems, such as News Feed and People You May Know on LinkedIn, must be closely monitored so that our engineers can be informed of any model issues occurring in production.

For a recommendation system, the O/E metric (i.e., observed business objectives/expected scores from models) is used to measure the model performance. As an example, if the objective is the click-through rate on the recommendation, it can be defined as follows:

Any deviation of an O/E metric from its baseline would indicate a model accuracy issue, and an automatic drill-down into the dimensions can help AI engineers quickly identify the most degraded subpopulation and speed up the investigation process. The tool has successfully identified several production issues, including model issues, feature issues, and even upstream model issues.
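As a rough illustration of an O/E metric for click-through rate, the sketch below divides the observed click rate by the mean predicted click probability over the same impressions. This particular form, the function name, and the sample data are assumptions for illustration and may differ from the exact definition used in production.

```python
def observed_over_expected(clicks, predicted_ctr):
    """O/E for click-through rate over a set of impressions.

    `clicks` holds 0/1 outcomes per impression and `predicted_ctr` holds the
    model's predicted click probability for the same impressions.
    """
    observed = sum(clicks) / len(clicks)
    expected = sum(predicted_ctr) / len(predicted_ctr)
    return observed / expected


clicks = [1, 0, 0, 1]  # hypothetical outcomes: 2 clicks in 4 impressions
print(observed_over_expected(clicks, [0.5, 0.5, 0.5, 0.5]))      # predictions match the observed rate
print(observed_over_expected(clicks, [0.25, 0.25, 0.25, 0.25]))  # model under-predicts clicks
```

A value near 1 means predictions match outcomes on average; computing the ratio per dimension value (e.g., per model or activity type) is what allows the drill-down to single out a degraded subpopulation.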

A mock up of the diff summary of an online serving issue

This table shows the summary of a recent online model serving issue that affected multiple models, as seen in the first level. We can also see that a certain value of the "Activity_type" dimension kept being called out across different "Ranking_model_id" values in the second level. The algorithm could not find any useful information after drilling down into the third dimension; thus, it rolled up to the second dimension automatically. From this summary, the production team quickly determined that the issue was related to a particular "Activity_type", e.g., like. Finally, they found that an activity tag, which is used to classify the traffic, had been wrongly assigned a different value.

Conclusion

In this blog post, we introduced the root cause analysis tools that can be used to explain data cube changes in ThirdEye. The heatmap provides a visualization of how a multi-dimensional metric changes, while the data cube algorithm automatically explores the data and identifies the dimensional changes behind an issue. The data cube algorithm has successfully helped business analyst, operations, and production teams across LinkedIn narrow down the root cause of the issues they are investigating. The code is also available on GitHub.

We would like to acknowledge the support and contributions from Kishore Gopalakrishna, Ravi Aringunram, Jiashan Wang, Sabeer Thajudeen, and Kexin Nie in developing the data cube algorithm on ThirdEye. Finally, thanks to Bo Long, Kapil Surlaker, Deepak Agarwal, and Igor Perisic for their continued support.
