Monitoring the Pulse of LinkedIn

Jimmy Zhang

November 30, 2015

Picture this: an engineer sits in front of a computer. Multiple terminal windows are open, actively monitoring a recent deployment. One window prints system stats in a continuous loop. Another runs a command that scans application logs for new exceptions. To the side, a large monitor displays a full page of graphs. The engineer, with eyes glued to the screens, is thinking…“there has to be a better way”.

In the medical world, an EKG test detects heart problems by measuring the heart’s electrical activity. LinkedIn has its own EKG, an internal tool that automates the analysis of more than ten thousand code and environment changes a month. To understand EKG’s role at LinkedIn, it’s useful to start with its origins.

LinkedIn uses the canary deployment model, an incremental deployment technique used to minimize the negative effects of new code changes. New code is first deployed to a single machine. This machine, labeled a canary, serves live, production traffic alongside other machines running the previous version. Once it’s been deployed, the canary machine is evaluated. If no issues are found, the new code version is fully deployed. Otherwise, the canary is rolled back to the previous version.
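
The flow described above can be outlined in a few lines of Python. This is an illustrative sketch only: `deploy` and `evaluate` are hypothetical callables standing in for whatever deployment and evaluation tooling is in use, not LinkedIn’s actual interfaces.

```python
from typing import Callable, Sequence


def canary_rollout(
    hosts: Sequence[str],
    new_version: str,
    old_version: str,
    deploy: Callable[[str, str], None],
    evaluate: Callable[[str], bool],
) -> bool:
    """Deploy to one canary host first; continue only if it looks healthy.

    `deploy(host, version)` and `evaluate(host)` are hypothetical stand-ins.
    Returns True if the new version was fully deployed.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    canary, rest = hosts[0], hosts[1:]

    deploy(canary, new_version)      # canary serves live traffic on new code
    if not evaluate(canary):         # issues found: restore the old version
        deploy(canary, old_version)
        return False

    for host in rest:                # no issues: roll out to the other hosts
        deploy(host, new_version)
    return True
```

The interesting part is `evaluate`: the rest of this post is about replacing a manual judgment call at that step with an automated one.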

As the LinkedIn site uses a Service Oriented Architecture, service owners and SREs (site reliability engineers) share the responsibility of evaluating their service’s canary deployment. Not too long ago, this involved the scene described previously: sitting at a computer, manually gathering information. After a stressful and time-consuming process, a judgment call would be made: either proceed with full deployment or roll back. Meanwhile, LinkedIn continued to grow, adding new members, features, and markets.

EKG began with the simple goal of automating canary evaluation. The first version was a command-line interface that pulled application logs from canary machines and searched for new exceptions. Today, it is a full-featured web application, built with Python, Flask, and Ember.js, that analyzes every canary deployment done at LinkedIn.
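
To give a feel for what “searching for new exceptions” can look like, here is a minimal sketch. It assumes Java-style exception class names appear in the log lines, and it treats an exception as new when it shows up in the canary’s log but not in a baseline log; both the log format and the baseline comparison are assumptions made for illustration, not a description of the original CLI.

```python
import re
from typing import Iterable, Set

# Assumed log shape: a fully qualified Java-style class name ending in
# "Exception" or "Error", e.g. "java.lang.IllegalStateException: bad state".
# This pattern is deliberately simple and will miss unqualified names.
EXCEPTION_RE = re.compile(r"\b((?:[a-z_]\w*\.)+\w*(?:Exception|Error))\b")


def exception_types(lines: Iterable[str]) -> Set[str]:
    """Collect every exception class name mentioned in the given log lines."""
    found: Set[str] = set()
    for line in lines:
        found.update(EXCEPTION_RE.findall(line))
    return found


def new_exceptions(canary_lines: Iterable[str],
                   baseline_lines: Iterable[str]) -> Set[str]:
    """Exception types seen on the canary that the baseline never logged."""
    return exception_types(canary_lines) - exception_types(baseline_lines)
```

For example, if the canary logs `com.example.FooException` and `java.lang.IllegalStateException` while the baseline only logs the latter, `new_exceptions` returns just `com.example.FooException` (a made-up class name).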

EKG increases developer productivity by replacing:

Manual processes with automated reporting
Inspection of graphs with objective measurements
Human judgment with explicit rules

Canary deployments at LinkedIn automatically trigger an EKG canary analysis. The analysis compares the canary to a control (a machine running the older code version and handling a similar load) by collecting metrics from the two machines for the 30-minute time window after deployment. These metrics, collected through Autometrics, Kafka, and other internal services, include:

System-level metrics used to evaluate the health of the machine (e.g., CPU utilization)
Application-level metrics used to evaluate the health of the service (e.g., HTTP error codes, exception rates, and counts)
Metrics to evaluate the health of the JVM (e.g., GC data)

The collected metrics are organized into categories, which are represented in the resulting report as tabs. Each tab contains charts, tables, and a set of rules that are run against the collected data. Rules compare the canary with the control and are designed to detect performance regressions.

An average analysis runs 8 tabs, evaluates just under 30 rules, and collects up to hundreds of metrics. As the rules pass, so does the analysis.
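
As a rough illustration of a rule that compares the canary with the control, consider the sketch below. The `Rule` shape, the metric names, and the thresholds are all hypothetical; EKG’s real rules are not shown in this post. The sketch flags a regression when the canary’s average for a metric exceeds the control’s by more than an allowed fraction.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Mapping, Sequence

Metrics = Mapping[str, Sequence[float]]  # metric name -> samples in the window


@dataclass(frozen=True)
class Rule:
    metric: str                   # hypothetical metric name, e.g. "cpu_percent"
    max_relative_increase: float  # 0.10 lets the canary run up to 10% higher


def rule_passes(rule: Rule, canary: Metrics, control: Metrics) -> bool:
    """True when the canary's average stays within the allowed increase."""
    canary_avg = mean(canary[rule.metric])
    control_avg = mean(control[rule.metric])
    if control_avg == 0:
        # No baseline to scale against; only an equally quiet canary passes.
        return canary_avg == 0
    increase = (canary_avg - control_avg) / control_avg
    return increase <= rule.max_relative_increase


def analyze(rules: Sequence[Rule], canary: Metrics,
            control: Metrics) -> Dict[str, bool]:
    """Evaluate every rule and report a per-metric pass/fail result."""
    return {r.metric: rule_passes(r, canary, control) for r in rules}
```

Under this sketch, an overall result could be computed as `all(analyze(...).values())`, which is one simple reading of how rule outcomes might roll up into a single verdict.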

When an analysis completes, EKG sends an email to service owners with a PASS/FAIL result and a link to the full report.

At its core, a canary analysis compares the performance of two machines in a given time range. As the impact of automated canary analyses spread, we realized that this comparison could be extended to cover the other code and environment changes that happen daily.

For example, hundreds of A/B tests (known within LinkedIn as LiX experiments) are ramped every day, each of which plays a direct role in shaping the experience for our 400 million worldwide members. During a LiX test, two versions of a feature are deployed simultaneously, and a metric is evaluated to determine which version is more successful.

LiX notifies EKG about experiments and includes a list of impacted services. EKG then analyzes each service by comparing a typical machine to itself before and after the ramp. After the analyses complete, an email is sent to experiment activators and watchers, referencing the relevant analyses and experiments.
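
A before-and-after comparison of one machine can be sketched as follows. The sample format and the 30-minute default window are assumptions for illustration (the window length is borrowed from the canary analysis; the post does not state what is used for LiX ramps).

```python
from statistics import mean
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def split_around(samples: Sequence[Sample], change_time: float,
                 window_seconds: float) -> Tuple[List[float], List[float]]:
    """Split samples into equal-length windows before and after a change."""
    before = [v for t, v in samples
              if change_time - window_seconds <= t < change_time]
    after = [v for t, v in samples
             if change_time <= t < change_time + window_seconds]
    return before, after


def relative_change(samples: Sequence[Sample], change_time: float,
                    window_seconds: float = 1800.0) -> float:
    """Relative change of the metric's mean from before to after the change."""
    before, after = split_around(samples, change_time, window_seconds)
    if not before or not after:
        raise ValueError("need samples on both sides of the change")
    baseline = mean(before)
    if baseline == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(after) - baseline) / baseline
```

A rule could then compare the returned fraction against a threshold, in the same way a canary is compared against its control.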

Today, EKG provides change-aware monitoring in support of LinkedIn’s commitment to experimentation and making data-driven decisions. With the help of automated reporting, developers can act with increased confidence, as the overhead to detect potential issues is greatly reduced.

Of the ten thousand plus analyses performed last month, EKG marked over 10% as possible failures. These changes are the heart of LinkedIn, and EKG is the system constantly analyzing its pulse.

EKG also provides continuous monitoring of services, as issues can arise after full exposure to production traffic. Once an hour, EKG examines each service at LinkedIn for new exceptions. If any are found, tickets are automatically created, and the appropriate owners alerted.
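
The hourly check can be pictured as a small piece of state that remembers which exceptions each service has already raised. The sketch below is hypothetical: the `ExceptionWatcher` class and the `open_ticket` callback are invented names, and a real system would seed its known set from history and persist it between runs.

```python
from typing import Callable, Dict, Iterable, Set


class ExceptionWatcher:
    """Remembers exceptions seen per service and tickets only the new ones."""

    def __init__(self, open_ticket: Callable[[str, str], None]) -> None:
        # open_ticket(service, exception_type) is a hypothetical hook into
        # whatever ticketing system is in use.
        self._open_ticket = open_ticket
        self._seen: Dict[str, Set[str]] = {}

    def scan(self, service: str, observed: Iterable[str]) -> Set[str]:
        """Process one scan's exception types; return the ones that are new."""
        known = self._seen.setdefault(service, set())
        fresh = set(observed) - known
        for exception_type in sorted(fresh):  # sorted for a stable order
            self._open_ticket(service, exception_type)
        known |= fresh                        # never ticket the same one twice
        return fresh
```

Calling `scan` once an hour per service means an exception produces one ticket when it first appears, rather than one per hour for as long as it persists.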

In the last month, EKG opened more than 3,000 such tickets.

Imagine that EKG has identified a new exception that your service is raising. How would you get more information? To help answer this question, we built the exceptions dashboard.

The report contains a chart graphing exception counts for your service against a time range, providing a clear view of the rate at which exceptions are being raised. Scrolling down, you see a table of different exception types, broken down by error level.

After locating the new exception, you can easily click to get more details.

The report contains information such as a sample stack trace, counts over the last day, and when the exception was first introduced into production. By using the report, you would have utilized the full capability of EKG, which first identified the new exception, alerted owners, and then provided actionable details about how to resolve the issue.
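
The three pieces of information named above (a sample stack trace, counts over the last day, and first appearance) can be derived from a list of raw occurrences. The record types below are hypothetical and exist only to show the aggregation; they are not EKG’s data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class Occurrence:
    exception_type: str
    seen_at: datetime
    stack_trace: str


@dataclass(frozen=True)
class ExceptionSummary:
    exception_type: str
    first_seen: datetime
    count_last_day: int
    sample_stack_trace: str


def summarize(occurrences: Iterable[Occurrence], exception_type: str,
              now: datetime) -> Optional[ExceptionSummary]:
    """Summarize one exception type, or return None if it never occurred."""
    matches = sorted(
        (o for o in occurrences if o.exception_type == exception_type),
        key=lambda o: o.seen_at,
    )
    if not matches:
        return None
    day_ago = now - timedelta(days=1)
    return ExceptionSummary(
        exception_type=exception_type,
        first_seen=matches[0].seen_at,
        count_last_day=sum(1 for o in matches if day_ago <= o.seen_at <= now),
        sample_stack_trace=matches[-1].stack_trace,  # most recent as sample
    )
```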

Providing change-aware monitoring is an ongoing process, and the EKG team is focusing on improving on two fronts:

Customization
Understandability

LinkedIn is composed of hundreds of services. Because the rules that run during EKG analyses apply to every service, there is some degree of overgeneralization. Services have different usage patterns and characteristics, and it is difficult to satisfy every service’s individual needs. The EKG and SRE teams are spearheading an initiative that enables service owners and SREs to define custom rules for their services.
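
One simple way to picture custom rules is a set of global defaults that individual services can override. The sketch below is purely illustrative: the metric names, threshold values, and service name are invented, and it does not describe how the initiative is actually implemented.

```python
from typing import Dict, Mapping

# Hypothetical global defaults: the allowed relative increase per metric.
DEFAULT_THRESHOLDS: Dict[str, float] = {
    "cpu_percent": 0.10,
    "error_rate": 0.05,
}


def thresholds_for(service: str,
                   overrides: Mapping[str, Mapping[str, float]]) -> Dict[str, float]:
    """Start from the defaults, then apply the service's own overrides."""
    merged = dict(DEFAULT_THRESHOLDS)        # copy so defaults stay untouched
    merged.update(overrides.get(service, {}))
    return merged
```

A batch-style service with naturally spiky CPU usage could, for instance, relax `cpu_percent` to `0.30` while keeping the default `error_rate` threshold.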

EKG currently alerts end users that something is wrong; however, knowing that a particular outbound endpoint is responding 50% slower isn’t enough. Service owners need to know where to look next, what to fix, and how to verify solutions. To this end, the EKG UI is being improved to include actionable deep-dive links alongside rule failures.

Furthermore, EKG is in the position to extend the analysis of changes beyond performance metrics. Opportunities exist to assess changes in new dimensions, such as introducing unknowns to test resiliency, or to objectively measure resource utilization with cost analyses.

The Tools team at LinkedIn is dedicated to increasing developer productivity through automation. If solving these problems sounds interesting to you, come join us!

Deepak Kumar and Michael Olivier were instrumental in establishing the vision and building the first versions of EKG. Badri Sridharan, Nick Baggott, Chris Coleman, Toon Sripatanaskul, Michael Chang, Arman Boehm and Melvin Du have all contributed to its development in recent years.

Special thanks to Hans Granvqist, Ritesh Maheshwari, and Zhenyun Zhuang for their help with this blog post.