Inception: How LinkedIn Deals with Exception Logs

Toon Sripatanaskul

Senior Engineering Manager @ Agoda | Fraud Detection and Prevention

December 16, 2016

Coauthors: Toon Sripatanaskul and Zhengyu Cai

In early 2012, the LinkedIn Performance team was trying to build a tool to validate the health of a service after code changes (a project that led us to build EKG, our canary-monitoring system). I was assigned to look into ways to use logs to analyze a service’s health. Back then, we had a script that copied log files from different machines, ran regular expressions over them, and then provided log reports. That system worked great at the time. However, LinkedIn was growing at a very rapid rate and the script was running into scaling issues.

Log messages, especially exceptions, are good service health indicators. One very simple way to validate health of newly-deployed services is to check if there is any new exception that shows up in a machine that is using the new code base, but not in the machines using the prior one. This sounds pretty straightforward to implement, but one of the biggest challenges for a company like LinkedIn is to scale this system for our multiple, large data centers. In this blog post, I will walk you through our journey dealing with this scaling challenge.

Inception, in the beginning

In late 2012, we created a system to process all log messages called Inception (LinkedIn Exception). We gathered log information by having services in machines across our data centers emit log messages into Apache Kafka. Inception used Kafka consumers to process each event and then put data in the following database tables.

Table name	Description
Unique Exceptions (What?)	For each log message, we extract part of the data—such as the stacktrace—then hash it. The generated MD5 hash represents a signature of the log message. In this way, we can quickly identify a new exception by just comparing the hashcode. It also serves as a way to de-dupe exceptions. Dynamic data, such as line number and timestamp, are excluded from calculating the hash. This table keeps log message samples and their corresponding hashcode.
Instances (Where?)	The table keeps information about where exceptions occur. It includes the hostname, service name, and version of the code.
Time Series (When?)	We cannot afford to keep track of every single occurrence of the log message. We instead put them in a one-minute time bucket. We keep track of the number of occurrences of a “Unique Exception” in each time bucket. This table also contains references back to the “Unique exceptions” and “Instances” tables.

Therefore, if we join the three database tables together, we should have enough information to generate log reports showing unique log messages and their corresponding count broken down by time and location.

Flow chart of Inception consumer

Inception: features and performance

At this point, it’s worth discussing alternative approaches to the problem of analyzing logs, which will illustrate the benefits of Inception. We could definitely store every log message in a log management system like Elastic Stack (ELK). ELK is a great tool that stores and indexes data and we use ELK here at LinkedIn. The biggest advantage of ELK is that you will never lose any information, as all the timestamp, actual occurrences, and actual exceptions’ stacktraces are preserved. However, this approach requires a much larger storage footprint. Inception, on the other hand, keeps only some representation of the log message and counts the number of occurrences. To give a rough estimate, keeping every service log for LinkedIn today would require over 50 petabytes of data. Inception only needs around 30 gigabytes of space. Creating log and exception summary report can be processed more quickly with a much smaller footprint.

Over the years, the Inception architecture has been refined several times. To keep up with the large scale, we’ve used profiling tools to identify performance bottlenecks. We also optimized databases again and again, and ended up creating Inception using mainly Python and MySQL. Back in 2012, excluding Kafka machines, we could fit the whole system for all service logs and exceptions at LinkedIn in a single machine.

An example of a log report of one of our services. It shows the timeseries of exception counts, unique errors and warnings, and their corresponding occurrence count.

Since then, Inception has been modified and extended in many ways. We hit several scaling problems and eventually decided to change the architecture to use Apache Samza. Zhengyu Cai and I started exploring ways to use Samza to process logs. We found that Samza is amazingly fast and highly scalable. Inception currently can handle over 1.1 million events per second. Furthermore, we also expanded Inception to a few more data sources.

Simplified diagram of log message data flow into Inception.

Server-side exceptions: We extended Inception to handle more languages for server-side exceptions. Inception now supports Java, Python, Scala, and C++ exceptions.
Javascript exceptions: We use a data pipe that collects JavaScript exceptions from members’ web browsers. Client-side exceptions are sent to our REST API server, where they get processed and converted to a Kafka event.
Mobile crashes: Mobile crashes have stacktraces just like backend services do. However, creating mobile crash reports is particularly challenging because we need to collect additional information, like mobile carrier and device information. There is also additional processing needed, such as symbolication. This is still a work in progress that needs to be fine-tuned.
Testing framework: We integrate Inception with our Selenium testing framework. For every test case, we have a unique test identification that propagates down to every downstream backend service. Inception can tell for any particular test case which downstream service(s) broke.
JIRA integration: We have a system to automatically create JIRA tickets when Inception detects a new exception. This is a proactive approach to alert our engineers about possible bugs. The system helps detect many problems early and prevents several major service failures. At the same time, we’ve also learned a painful lesson that the system can create too much noise when each unique exception is manifested as a JIRA ticket. We are working on a way to perfect the solution.

An example of a mobile crash report with an automatic JIRA ticket on the right-hand side.

This has been an incredible journey for us to be solving many large scale problems. As discussed above, Inception is a simple solution, but there is still room for improvement. We are looking forward to many more challenges to come. If you think these kind of problems interest you, join us!

Acknowledgements

We would like to thank the many engineers who helped build Inception over the years. We’d like to specifically call out Nick Baggott, Jimmy Zhang, Jeff Roger, Michael Chang, Arman Boehm, Ramanathan Muthukaruppan, Steven Pham, Jerry Weng, and Sreedhar Veeravalli for their contributions to the project and the managers like Brandon Duncan, Badri Sridharan, Hans Granqvist, David He and Thomas Goetze who oversaw this work. We’d also like to thank to Rajul Jain and John Bernado for creating the data pipe that collects JavaScript exceptions from members’ web browsers. Finally, a very special thanks to the MySQL-DBA team, Kafka team, SI Team, Samza team, and the Tools team—without them this work would not have been possible.

Topics: Optimization Open Source