Introducing ThirdEye: LinkedIn’s Business-Wide Monitoring Platform
January 9, 2019
At LinkedIn, we have many different monitoring systems, each with its own role and granularity, ranging from quarterly reports about the business as a whole to the lowest levels of system-specific latency and availability. However, these systems don’t operate in a vacuum: sometimes, issues or changes that are flagged by one system go on to cause problems in another area. While system-specific monitoring is valuable and necessary, we also need an overall platform that allows us to see how the whole LinkedIn ecosystem is working in concert. Additionally, when issues arise, we need an integrated solution for real-time alerting and collaborative analysis. To solve this problem, we created ThirdEye.
ThirdEye is a comprehensive platform for real-time monitoring of metrics that covers a wide variety of use-cases. LinkedIn relies on ThirdEye to monitor site performance, track member growth, understand adoption of new features, and flag sustained attempts to circumvent system security, among many other uses. ThirdEye provides a shared infrastructure for outlier detection and user-interactive data analysis of various system and business metrics. ThirdEye connects to a large number of data sources to gather information and learns over time to generate more relevant detection and analysis results through user interaction.
ThirdEye builds upon Apache Pinot, which recently entered incubation with Apache. For ThirdEye, we leverage Pinot’s awesome slice-and-dice capabilities to analyze high-dimensional data on demand and provide real-time insights into the vast data sets of business metrics generated at LinkedIn.
A typical use-case for ThirdEye at LinkedIn is answering questions from executives across the company about deviations in growth metrics, such as member signups and page views. Small changes in growth can be attributed to everything from regional holidays, to minor configuration issues, to outages of entire data centers, so it’s important to have a single platform that provides visibility into all possible causes. Thanks to ThirdEye, we are able to supply potential root causes on demand without needing to farm these questions out to a large number of specialized analysts and coordinate between multiple responses. Additionally, we’re able to bring attention to relevant outliers hidden underneath the surface, such as ongoing changes in the attention of LinkedIn’s members towards different sub-products that might cancel each other out in aggregate.
Over time, ThirdEye improves its automated detection and analysis capabilities through incremental user feedback and the addition of domain knowledge and data sources. It provides common components out-of-the-box, and becomes more and more effective as different teams integrate their data and expand ThirdEye’s knowledge graph of system and metric dependencies.
Why another monitoring system?
Anomaly detection and root cause analysis are common problems for data science, site reliability, and engineering teams. One way or another, teams create monitoring solutions for their area of responsibility. These solutions are usually application- or domain-specific libraries, and don't generalize to other use-cases; they’re built around a very specific set of business rules for detection and analysis. Different teams sometimes redundantly spend large amounts of time to develop a particular solution. Each of these monitoring systems typically comes with its own data ingestion pipeline, limitations on processing capabilities and, of course, ongoing maintenance requirements. There might be a desire to consolidate multiple systems, but original systems don't scale to new use cases because they were never designed to process large amounts of data efficiently or support different detection methods.
An example of this was one of our teams taking several weeks to notice and diagnose a drop in page impressions in a specific part of LinkedIn’s feed. This issue was ultimately linked to a new security feature that was interfering with the timely serving of recommendations. The existing monitoring infrastructure was integrated with site performance tracking and, separately, the site performance team had integrated their monitoring with teams responsible for feature rollouts. However, these integrations used aggregated, top-level metrics and did not have an in-depth, end-to-end view of the system. Ironically, independent of the team’s ongoing investigation, one of their engineers was evaluating the costs and benefits of connecting their data feeds to ThirdEye. In the course of the evaluation, ThirdEye revealed the problem.
Architecture and design
We built ThirdEye from the ground up as a monitoring and analysis platform with a robust foundation in federated data processing across numerous batch and streaming data sources (Figure 1). ThirdEye leverages high-dimensional time series data from systems such as Apache Pinot and RocksDB for quantitative analysis, and integrates with numerous event data sources for correlation analysis and root-cause inference. All of this is done at user-interactive speeds and is suitable for real-time monitoring. Teams can easily connect their own data sources and then immediately leverage the entire pool of operational metrics at LinkedIn for detection, analysis, and implementation of team-specific business logic. This way, ThirdEye provides value to many different areas of business and responsibility, while centralizing the underlying knowledge base, infrastructure, and operations.
Another critical design aspect of ThirdEye is the tight integration of online monitoring and offline analysis capabilities. ThirdEye has real-time analysis features similar to MacroBase, and allows our users to investigate anomalies via dashboarding utilities comparable to Adobe Analysis Workspace, Google Stackdriver, and Amazon CloudWatch. Rather than treating analysis as an isolated feature, ThirdEye integrates analysis and detection as an iterative process. Our users explore data interactively while ThirdEye dynamically adapts to the user’s current focus to generate context-sensitive recommendations and detect outliers in potentially-related metrics and events on the fly.
Figure 1: ThirdEye’s four-layer architecture enables extensibility for data sources and algorithms
Collaborative analysis dashboards
From our experience operating ThirdEye, we know that the trickiest problems arise at interfaces between different teams or systems at LinkedIn. A long-established approach to addressing such cross-domain issues in software companies is the formation of "war rooms," where knowledgeable engineers, operators, and managers are brought together to solve the problem at hand as quickly as possible. Besides the obvious time pressure and stress this brings, communication is still surprisingly inefficient because different teams use separate, incompatible data sources and systems to look at data.
ThirdEye decreases this fragmentation of valuable data by providing a collaborative analysis dashboard. This root cause analysis dashboard provides tools for charting and visualization that integrate data from different sources and enable our users to dynamically compose, edit, and visually correlate time series and events data. ThirdEye federates processing across different external systems while pulling together the most relevant results on a single platform. This creates a shared view for all stakeholders for efficient communication and analysis, while still maintaining detailed references to any external sources should the need arise to go back and dig further later. Finally, we automatically archive analysis results and generate post-mortem reports to minimize documentation overheads.
Over the past few months we also enhanced ThirdEye to provide dashboards for alert tracking and triage. Site reliability teams use this functionality to monitor the state of entire applications, which may be captured across many different metrics and alerts. If multiple alerts trigger simultaneously, alert dashboards allow users to maintain a structured overview of ongoing issues, investigations, and intermediate results. This also includes mechanisms to set up new alerts, manage existing alerts, and plug in business logic for team-specific workflows directly from the user interface.
Interactive root-cause analysis
At an organizational level, businesses rely on human judgement and accountability, and LinkedIn is no exception. ThirdEye implements both detection tuning and root cause analysis as an iterative feedback cycle between human and machine. The system repeatedly extracts “interesting” features and presents them to the user, who then chooses which aspects to investigate more closely. For example, when our users onboard new metrics for detection, they define the scope of metrics they are interested in monitoring, such as new signups coming from a particular continent. ThirdEye then extracts an initial set of anomalous (“interesting”) data ranges and presents them to the users for feedback. As our users mark some of these anomalies as relevant or point out omissions, ThirdEye interactively adjusts decision boundaries of detection algorithms. This improves detection precision and recall, while users can add and re-adjust feedback over time. Of course, ThirdEye also supports pluggable logic to seamlessly integrate with established procedures and allow our users to input their own domain knowledge and business rules.
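To make the feedback cycle a little more concrete, here is a minimal sketch of how user labels could move a decision boundary. This is not ThirdEye’s actual tuning logic. It assumes a simple threshold detector over precomputed anomaly scores, and the function names (`detect`, `tune_threshold`), the scores, and the feedback labels are all hypothetical.

```python
def detect(scores, threshold):
    """Return the indices whose anomaly score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def tune_threshold(scores, feedback, candidates):
    """Pick the candidate threshold with the best F1 score on labeled points.

    feedback maps a point index to True (user confirmed the anomaly)
    or False (user marked it as a false alarm).
    """
    def f1(threshold):
        flagged = set(detect(scores, threshold))
        tp = sum(1 for i, real in feedback.items() if real and i in flagged)
        fp = sum(1 for i, real in feedback.items() if not real and i in flagged)
        fn = sum(1 for i, real in feedback.items() if real and i not in flagged)
        return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

    # max() keeps the first candidate on ties, so list candidates in preferred order.
    return max(candidates, key=f1)


# Hypothetical anomaly scores (for example, absolute z-scores) for five data points.
scores = [0.2, 1.4, 2.6, 3.1, 0.9]
# Hypothetical feedback: point 1 was a false alarm, points 2 and 3 were real anomalies.
feedback = {1: False, 2: True, 3: True}

threshold = tune_threshold(scores, feedback, candidates=[1.0, 2.0, 3.0])
print(threshold, detect(scores, threshold))  # prints: 2.0 [2, 3]
```

With this feedback, the loosest threshold still flags the false alarm and the strictest one misses a confirmed anomaly, so the middle candidate wins. Each new label changes the F1 scores and can shift the boundary again.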
ThirdEye’s root cause analysis similarly improves with increasing use (Figure 2). In addition to collaboration and dashboarding features, ThirdEye collects information about relationships between metrics, systems, events, and other entities while users work on their analyses and store results. As the library of investigation results grows, ThirdEye leverages past results to improve correlation and inference quality for future users. Additionally, we extract entity relationships from metadata available in various connected data sources to accelerate this learning process. For example, a drop in a business metric, such as page views, may be tied back to a specific continent based on a contribution analysis of dimensional time series data. This geographic information may then be used to determine a set of ongoing A/B tests in this region to identify probable causes. Going one step further, ThirdEye may then leverage information about recent software deployments to pinpoint faulty software versions or use holiday information to ignore the metric impact from specific countries. As more teams connect their systems to ThirdEye over time, the quality of root-cause inferences improves dramatically due to a network effect.
ThirdEye enables users to interactively explore data and incrementally refine analysis. In our earlier anecdote about feed impressions, ThirdEye helped engineers to work backwards from a drop in feed impressions to a correlated increase in drop rates, which in turn aligned with a spike in warnings generated by a specific security component. ThirdEye’s integration with LinkedIn’s deployment system immediately showed that new code had been rolled out just before the warnings started—and reverting this deployment indeed fixed the problem. Seamless support for this incremental and interactive analysis process was crucial to scale ThirdEye’s root-cause analysis to support the full scope of LinkedIn’s business and infrastructure.
Figure 2: ThirdEye incrementally acquires domain knowledge from user feedback which improves recommendations over time
Root cause analysis by example
The easiest way to showcase ThirdEye’s analysis capabilities is a visual demonstration. In our specific example, we find ourselves with a drop in the number of page views for a web application around Oct. 31 (Figure 3). We can manually adjust the time range of interest as we begin our investigation. ThirdEye possesses several automatic capabilities to suggest potential causes right from the start, already spoiling the surprise by hinting at a recent holiday in North America. Nevertheless, we can walk through ThirdEye’s various analysis tools to narrow the results down further and prepare a shareable report.
Figure 3: ThirdEye typically starts an analysis from a time series with a temporary outlier
We can leverage the heatmap view to determine which sub-dimension of our page view metric experiences the biggest impact. In our case, we find that views coming from the U.S. are down compared to a multi-week baseline. Clicking on the heatmap, we can select and filter sub-dimensions to add them to our time series view (Figure 4). Here, Pinot’s real-time aggregation powers really shine, with sub-second response times even for datasets at the scale of terabytes.
Figure 4: The dimension breakdown enables deep insights into business metrics on demand
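The arithmetic behind a dimension breakdown like this one can be sketched in a few lines. The example below is an illustration only, not the heatmap’s real implementation. It assumes we already have per-country page view totals for a baseline window and the current window, and the function name `contribution_breakdown` and all numbers are made up.

```python
def contribution_breakdown(baseline, current):
    """Return (value, delta, share_of_total_change) rows, largest impact first."""
    total_delta = sum(current.values()) - sum(baseline.values())
    rows = []
    for key in sorted(set(baseline) | set(current)):
        delta = current.get(key, 0) - baseline.get(key, 0)
        # Share of the overall change explained by this dimension value.
        share = delta / total_delta if total_delta else 0.0
        rows.append((key, delta, share))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Hypothetical page views per country: multi-week baseline vs. the day under investigation.
baseline = {"US": 1000, "CA": 200, "DE": 300, "IN": 500}
current = {"US": 700, "CA": 180, "DE": 310, "IN": 510}

for country, delta, share in contribution_breakdown(baseline, current):
    print(f"{country}: {delta:+d} ({share:+.0%} of total change)")
```

In this made-up data the U.S. alone accounts for the full net drop of 300 views, while the smaller decline in Canada is offset by gains in Germany and India. This is the kind of effect that stays hidden when looking only at the aggregate.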
Additionally, by changing the specific sub-dimension, or even metric, under investigation, we communicate to ThirdEye the shifting focus of our efforts. ThirdEye adapts scoring and recommendations dynamically, which is most visible for related events, such as holidays and deployments, but also applies to related metrics and datasets. In our example, we can view the list of related holiday events for our time frame, which squarely puts “Halloween” on top of the list (Figure 5). ThirdEye’s ranking algorithms are aware of proximity in time as well as other dimensional information, such as the geography associated with a significant outlier. It is these context-aware recommendation algorithms that extract information from the knowledge graph and enable both user-interactive and automated analysis.
Figure 5: ThirdEye ranks events surrounding the outlier automatically using dimensional information
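As a rough illustration of context-aware ranking, the sketch below scores events by how close they are in time to the outlier and by how many of the outlier’s dimensions they share. ThirdEye’s real ranking algorithms are not described in this post. The scoring formula, the 24-hour half-life, the `Event` class, and the sample events (including the year) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    name: str
    time: datetime
    dimensions: dict = field(default_factory=dict)


def score_event(event, anomaly_time, anomaly_dims, half_life_hours=24.0):
    """Combine time proximity (exponential decay) with dimensional overlap."""
    hours_apart = abs((event.time - anomaly_time).total_seconds()) / 3600.0
    time_score = 0.5 ** (hours_apart / half_life_hours)
    if anomaly_dims:
        matches = sum(1 for k, v in anomaly_dims.items() if event.dimensions.get(k) == v)
        dim_score = matches / len(anomaly_dims)
    else:
        dim_score = 0.0
    # Events that match the outlier's dimensions get up to double weight.
    return time_score * (1.0 + dim_score)


def rank_events(events, anomaly_time, anomaly_dims):
    """Return events sorted from most to least likely related."""
    return sorted(events, key=lambda e: score_event(e, anomaly_time, anomaly_dims), reverse=True)


# Hypothetical outlier: U.S. page views drop around midday on Oct. 31.
anomaly_time = datetime(2018, 10, 31, 12, 0)
anomaly_dims = {"country": "US"}
events = [
    Event("Other-region holiday", datetime(2018, 11, 7), {"country": "IN"}),
    Event("deploy-1234", datetime(2018, 10, 30, 18, 0)),
    Event("Halloween", datetime(2018, 10, 31), {"country": "US"}),
]

for event in rank_events(events, anomaly_time, anomaly_dims):
    print(event.name)
```

Here the holiday that matches both the time frame and the country of the outlier ranks first, followed by the nearby deployment, with the distant holiday in another region last. Changing `anomaly_dims` re-ranks the list, which mirrors how shifting the focus of an investigation changes the recommendations.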
Finally, we can polish our results and combine them into a shareable report with additional comments and references. Anyone can use this report to reference back to the events and data sources involved or even continue and refine the investigation. Our users inside of LinkedIn commonly use this functionality to build issue-specific dashboards that first help to analyze an issue, then monitor the recovery, and ultimately serve as archived post-mortems for future reference—and as a data source for ThirdEye’s ever-growing knowledge graph (Figure 6).
Figure 6: ThirdEye’s shareable dashboards also serve as post-mortems with user comments and documentation
Simplified platform adoption
Two central qualifications for the adoption of a monitoring platform are trust in its capabilities and integration with existing processes. During the development of ThirdEye, we were repeatedly reminded that, ultimately, humans bear the responsibility and have to answer questions about the correct functioning of a business unit or computer system. Therefore, transparency and support for pluggable business rules are paramount, even when the system operates perfectly.
When new teams evaluate the suitability of ThirdEye as a monitoring platform, just presenting great performance numbers on black-box algorithms isn’t good enough. Should ThirdEye fail to detect an important anomaly, it would become the new stakeholders’ responsibility to explain why this happened. This need for transparency has spawned multiple efforts in ThirdEye to make the behavior of detection and analysis algorithms as transparent and predictable as possible, without major sacrifices in result quality. An example of this is the parallel use of detection algorithms and user-defined rules as a fallback. While algorithms operate reliably and improve with incremental user feedback, the fallback rules serve as an "insurance policy" to the user.
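One simple way to picture the parallel use of algorithms and fallback rules is a detector that runs both and reports which one fired. The sketch below is only an illustration of that idea, not ThirdEye’s plug-in interface. The `combined_detector` helper, the three-sigma check, and the `min_signups` rule are hypothetical.

```python
def combined_detector(algorithm, fallback_rules):
    """Run an algorithm and user-defined rules side by side.

    Returns a function that maps a data point to the list of reasons it was
    flagged, so users can see exactly why an alert fired (or did not).
    """
    def detect(point):
        reasons = []
        if algorithm(point):
            reasons.append("algorithm")
        reasons.extend(f"rule:{name}" for name, rule in fallback_rules.items() if rule(point))
        return reasons

    return detect


# Hypothetical algorithm: flag values more than three standard deviations from expected.
def three_sigma(point):
    return abs(point["value"] - point["expected"]) > 3 * point["stddev"]


# Hypothetical business rule supplied by the owning team.
rules = {"min_signups": lambda point: point["value"] < 100}

detect = combined_detector(three_sigma, rules)
print(detect({"value": 90, "expected": 110, "stddev": 10}))   # prints: ['rule:min_signups']
print(detect({"value": 150, "expected": 110, "stddev": 10}))  # prints: ['algorithm']
print(detect({"value": 105, "expected": 110, "stddev": 10}))  # prints: []
```

In the first case the algorithm stays silent because the deviation is within its tolerance, but the team’s own rule still raises the alert. Returning the reasons instead of a bare yes or no keeps the behavior explainable to the stakeholders who own the metric.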
LinkedIn uses ThirdEye for monitoring numerous aspects of its platform of more than 500M members. We spent tens of thousands of hours iterating on use cases, designs, implementation, and algorithms to battle-harden the platform and make it suitable for monitoring and collaborative analysis of a very large distributed system.
As always with efforts of this size, it took a village to get to this point. We would like to thank our active developer team, Long Huynh, Harley Jackson, Akshay Rai, Xiaohui Sun, and Jihao Zhang, as well as our marvelous UX designer Selene Chew and our relevance engineers Yung-Yu Chung, Kexin Nie, and Rouying Wang. We also thank Ravi Aringunram, Kishore Gopalakrishna, Shilpa Gupta, Madhumita Mantri, and Yang Yang for their leadership. We appreciate very much the contributions of our ThirdEye alumni—Thao Bach, Yen-Jung Chang, Steve McClung, Neha Pawar, and Yves Yuen—and our numerous collaborators inside and outside of LinkedIn over the past years; without your work and dedication, this would not have been possible. Finally, we would like to thank Kapil Surlaker, Deepak Agarwal, and Igor Perisic for their consistent support for ThirdEye’s vision from the executive level.