Grokking Usage of API Data: Augmenting LinkedIn User and Partner Experience

October 14, 2015

LinkedIn’s API is the gateway used by numerous applications to access LinkedIn data, from simple third-party apps to large-scale strategic partners. It plays an important role in opening LinkedIn content to the outside world and is closely tied with numerous external and internal applications. Given the API’s importance, we keep a close eye on its overall health, to make sure we catch any potential issues before they impact our partners. We do this by monitoring a number of metrics around our API – including errors, increased latencies, abnormal call counts and bad response codes – to make sure it functions properly.

While our normal monitoring has enabled us to catch large issues relatively quickly, it still has several limitations. Our goal is to catch issues before our partners do, and the monitoring framework does not allow us to do that for every partner, because per-partner information is aggregated into general service metrics. If an issue is limited, affecting only a specific API call, application or partner, we would still like to detect and resolve it as soon as possible to ensure the best possible customer experience. In addition, when an application’s call count reaches a specified threshold, LinkedIn’s throttling mechanism can be triggered, blocking a legitimate increase in call count. Such an increase could be due to organic growth or a new product release, and we would like to know when these thresholds are reached, especially for our partners. Dropping legitimate requests is a terrible user experience, and we want to do everything we can to avoid it.

API-Analyzer was developed in response to these needs. It gives us near real-time monitoring, lets us slice and dice API data during troubleshooting, and lets us run queries on API data that we could not previously run in a timely manner. To understand how API-Analyzer has changed both the way we troubleshoot API issues and the way we monitor partners, we need to start with the history of API monitoring at LinkedIn.

We understand the importance of monitoring the health of our LinkedIn API on a per-application and per-API call level. Having more granularity enables us to monitor at the application level in addition to providing an easier debugging mechanism for a reported problem. During troubleshooting, we should be able to do much deeper analysis to locate an issue.

To achieve this, Bryce Jasmer, one of our Site Reliability Engineers (SREs), wrote a tool called Partner Access Layer Tracker (PAL-Tracker) back in 2011, which laid the foundation for application-level monitoring. The tool listened to API-related Kafka [1, 2] events and emitted the processed data back to LinkedIn’s monitoring framework [3, 4], enabling us to set up inGraphs [5] and alerting.

We had several different clusters serving the API, and some of these clusters were only used by LinkedIn’s mobile applications. Despite that, all our clusters emitted Kafka events under the same topic name. At the time, our API usage was at a very early stage and LinkedIn only had one data center, which is why the original tool was sufficient for our needs. With our ever-growing API traffic, not to mention an increased number of data centers, the tool quickly ran into severe performance and scalability issues and was unable to keep up. As a result, data from the tool lagged drastically, affecting our ability to detect issues in near real-time. The inability to filter Kafka events by API cluster made the issue far worse, and the data coming from the tool became obsolete.

We had another tool at our disposal called Fuse-tracker, which tracked application-level throttling. Fuse-tracker allowed us to monitor the state of throttling and prevent unnecessary partner blocks. As with PAL-Tracker, API growth combined with the tool's limited scalability made Fuse-tracker difficult to maintain, which hindered its usage.

Furthermore, the two tools had different data sources, and adding a new partner required changes to multiple files. Adding alerting was a tedious process that required considerable manual intervention from our SREs. Adding a new partner to just these two tools, in addition to creating alerts, took about 30 to 45 minutes of an SRE's time.

Considering all the limitations and challenges we faced with PAL-Tracker and Fuse-tracker, specifically with regard to scalability, usability and maintainability, we decided it was time to create a new, simpler tool containing all the existing functionality along with some additional features.

Our goals were as follows:

Consolidate PAL-Tracker and Fuse-tracker functionality into a single tool.
Offer more granularity of Kafka events coming from specific API clusters.
Make the tool independent of the number of data centers, partners, or alerts within LinkedIn.
Store the consumed Kafka data in a cache for future processing.
Provide a web UI to perform predefined queries, obtain results in near real-time, and provide a graphical representation for queried data.
Create a single source of truth when managing partner monitoring.
Emit partner-specific metrics to the Autometrics framework [3], allowing data to be retained for an extended period and alerts to be set up.
Automate monitoring and alerting, reducing turnaround time.
Integrate into LinkedIn's standardized deployment framework to ease management, development and scalability while allowing others to contribute and expand the tool.

To achieve these goals, we began developing API-Analyzer near the end of Q3 2014 and started to use it in production in early Q1 2015. We have since progressed to the current state of API-Analyzer, whose architecture is outlined below.

Here is the process flow:

Our API and Fuse-throttling mechanism emit Kafka events using a specific Kafka topic.
API-Analyzer spawns multiple Kafka consumers and each consumer writes to a queue.
Events in each queue are parsed by multiple Python processes. During parsing we extract predefined fields (application ID, response code, call type, IP address, etc.) and create multiple combinations of them, along with a timestamp for the request.
Each combination, together with its timestamp, is then used as a key to increment a counter value in our backend Redis cluster. We buffer the increments in Python before writing to Redis in order to decrease CPU utilization in the Redis backend.
API-Analyzer has several modules that consume data stored in Redis based on various needs:
1. Dynamically obtain a list of call types for a specific application
2. Map an IP to an application ID
3. Process stored data for partners and emit to LinkedIn's monitoring framework.
API-Analyzer incorporates a Flask-based web application running predefined queries to provide a graphical representation of the data in addition to resetting any throttling alerts.
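
The parsing and counting steps above can be illustrated with a minimal sketch. It is not API-Analyzer's actual code: the field names, the key layout, the one-minute time bucket and the `flush_every` setting are all assumptions made for illustration, and an in-memory class stands in for the Redis cluster so the example runs without a server. Only the overall technique (field combinations used as counter keys, buffered in Python before being written out) comes from the description above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical field names; the article lists application ID, response code,
# call type and IP address as examples of the extracted fields.
FIELDS = ("app_id", "response_code", "call_type", "ip")


def minute_bucket(epoch_seconds):
    """Round a request timestamp down to the start of its minute (assumed granularity)."""
    return int(epoch_seconds) // 60 * 60


def counter_keys(event):
    """Yield one counter key per non-empty combination of the extracted fields."""
    bucket = minute_bucket(event["timestamp"])
    present = [(name, event[name]) for name in FIELDS if name in event]
    for size in range(1, len(present) + 1):
        for combo in combinations(present, size):
            parts = [f"{name}={value}" for name, value in combo]
            yield f"{bucket}|" + "|".join(parts)


class BufferedCounter:
    """Accumulate increments in memory and flush them to the store in batches."""

    def __init__(self, store, flush_every=1000):
        self.store = store              # anything with an incrby(key, amount) method
        self.flush_every = flush_every  # number of events between flushes
        self.pending = Counter()
        self.events_seen = 0

    def add(self, event):
        self.pending.update(counter_keys(event))
        self.events_seen += 1
        if self.events_seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        # One write per distinct key, however many events touched that key.
        for key, amount in self.pending.items():
            self.store.incrby(key, amount)
        self.pending.clear()


class InMemoryStore:
    """Stand-in for the Redis backend so the sketch runs without a server."""

    def __init__(self):
        self.data = Counter()
        self.writes = 0

    def incrby(self, key, amount):
        self.data[key] += amount
        self.writes += 1


if __name__ == "__main__":
    store = InMemoryStore()
    buffered = BufferedCounter(store, flush_every=2)
    event = {"timestamp": 1444780805, "app_id": "1234",
             "response_code": 200, "call_type": "people", "ip": "192.0.2.10"}
    buffered.add(event)
    buffered.add(event)   # second event triggers a flush
    print(store.writes)   # 15 writes for two events, instead of 30 unbuffered
```

The buffer trades a short delay in visibility for fewer writes: identical keys seen between flushes collapse into a single increment, which is one plausible way to get the reduction in Redis CPU load mentioned above.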

API-Analyzer was launched at the end of 2014 and has since become integral to maintaining LinkedIn's API. Network Operations Center (NOC) staff, SREs, and the API engineering and developer relations teams all currently use the application, and it saves everyone time when troubleshooting. Below are a few examples of how:

We were able to move the alerting and monitoring portion of the partner onboarding process to a self-service model, replacing the old multi-day turnaround. Setting up monitoring and alerting used to take about 30 to 45 minutes of SRE time; SREs can now spend that time on other projects.
We used to have to run a Hadoop query whenever we wanted to look up an IP to see which apps were calling from it. This was a slow process that usually took 10 to 20 minutes. By using API-Analyzer, we are now able to match application IDs to IP addresses dynamically and get the information from Redis within milliseconds. This especially helps during a production issue, when we need to find this information as fast as possible to resolve the problem.
Similarly, finding API calls made by applications used to be very time consuming. Now, we are able to dynamically locate this information, and reduce the lookup time to the order of milliseconds.
Some partners have dynamic IPs from which they call the LinkedIn API and we need a systematic way to track and ensure we are not blocking these addresses. API-Analyzer is currently being used to confirm whether or not traffic from an IP is partner traffic; we will soon be opening up this information via an API for internal applications to use.
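
The IP lookups described in these examples amount to a reverse index from IP address to the application IDs seen calling from it. The sketch below is a simplified, in-memory illustration under stated assumptions: the class and method names are hypothetical, the partner list is passed in by hand, and a plain dictionary of sets stands in for whatever structure API-Analyzer actually keeps in Redis.

```python
from collections import defaultdict


class IpIndex:
    """Reverse index from caller IP to the application IDs seen from it."""

    def __init__(self, partner_app_ids):
        # Hypothetical: the set of application IDs that belong to partners.
        self.partner_app_ids = set(partner_app_ids)
        self.apps_by_ip = defaultdict(set)

    def record(self, ip, app_id):
        """Register that app_id made a call from ip (called once per parsed event)."""
        self.apps_by_ip[ip].add(app_id)

    def apps_for(self, ip):
        """Return the application IDs seen from ip, sorted for stable output."""
        return sorted(self.apps_by_ip.get(ip, ()))

    def is_partner_traffic(self, ip):
        """True if at least one partner application has called from ip."""
        return any(app in self.partner_app_ids
                   for app in self.apps_by_ip.get(ip, ()))


if __name__ == "__main__":
    index = IpIndex(partner_app_ids={"partner-app"})
    index.record("192.0.2.10", "partner-app")
    index.record("192.0.2.10", "hobby-app")
    index.record("198.51.100.7", "hobby-app")
    print(index.apps_for("192.0.2.10"))
    print(index.is_partner_traffic("198.51.100.7"))
```

Because the index is updated as events are consumed, a lookup is a single set read rather than a batch query over historical logs, which is the kind of difference that turns a lookup of minutes into one of milliseconds.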

API-Analyzer has given us a faster and more flexible way to access API-related data and has already started to play an important role in monitoring and troubleshooting LinkedIn's API framework. Our long-term goal is to make API-Analyzer the go-to tool for any API-related issue, presenting a unified framework to both API SRE and the developer relations team when maintaining our API.

Our long-term goals currently include:

Add modules, such as one to dynamically correlate call types with the IP addresses hitting our API.
List statistical information on specific applications and IPs, with graphical representations showing trends.
Revamp the user interface to improve the overall user experience.
Develop API end-points allowing other internal applications to access API-Analyzer data.

By cutting down task execution time and improving overall API efficiency, API-Analyzer is already saving valuable time for NOC staff, SREs, and the API and partner engineering teams. We look forward to watching it grow over the coming months and years.

Special thanks to the following for their contributions along the way:

Bhaskaran Devaraj, Xiao Li and Tao Cai who helped formulate the idea and assisted in implementation.
Bryce Jasmer on the initial tooling of both PAL-Tracker and Fuse-tracker.
Chris Carini and Anurag Bhatt for their feedback and reviewing.

[1] http://kafka.apache.org/

[2] https://engineering.linkedin.com/kafka/kafka-linkedin-current-and-future

[3] https://engineering.linkedin.com/52/autometrics-self-service-metrics-collection

[4] https://engineering.linkedin.com/metrics/scaling-collection-self-service-metrics

[5] https://engineering.linkedin.com/32/eric-intern-origin-ingraphs
