Visualizing LinkedIn's Site Performance

June 13, 2011

At LinkedIn, we put a strong emphasis on making sure the site is up and our members have access to complete site functionality at all times.
Fulfilling this commitment requires that we detect and respond to failures and bottlenecks as they start happening.
That's why we use time-series graphs for site monitoring to detect and react to incidents within minutes, and we literally put them everywhere.
Visitors to LinkedIn’s engineering offices cannot miss the ubiquitous array of wall-mounted displays showing real-time graphs. 

It’s obvious why discovering a problem only after a server has gone down is not ideal.

But there is more to the story than early detection; historical comparisons are an important aspect of monitoring. Knowing a sub-system’s historical behavior makes it easy to eliminate red herrings and concentrate on real problems. Charts and visualizations are also great for interpreting and understanding complex data fast, especially when the data is generated by a multi-tiered, inter-dependent array of services where a problem surfacing at one end of the stack might have its roots at the other end.

But before I dive into what we have now, let me show you where we started. 

  • To begin with, we already had network and site health-monitoring tools in place, but there was no efficient way to discover and alert on broken site functionality.
  • We would be alerted in real time if a server went down but wouldn’t know if a particular service started misbehaving.
  • Data-warehouse-based reports lagged by 24 hours, so the product team would inform us of suspicious metrics indicating issues a day after something went amiss.

An obvious solution to this problem was to put a log analyzer in place and set up alerts on error rates in the logs. We did this successfully with Splunk, but the following gaps remained:

  • The amount of log data we needed to persist in order to provide useful historical baselines was getting unmanageable.
  • It was hard to correlate problems happening in different parts of the system since log files are largely disjoint.
  • Monitoring logs was a manual process that needed constant engagement.
  • Alert-triggered emails quickly became overwhelming, and engineers started tuning them out.

Clearly we needed a supplemental approach.

Firstly, to maximize our return on investment, we concentrated on repurposing existing systems and technology and adding some in-house tools to extend functionality. These are the technologies that we decided to harness:

  • LinkedIn’s stack, which primarily uses Java and Spring.
  • JMX instrumentation for health-checks used by load-balancers to route traffic.
  • Zenoss, used for real-time network monitoring, which stores its data in round-robin database (RRD) files in the background.
  • inGraphs, an internal web application as the front end for displaying the data from RRD files in the form of time-series graphs (implemented by Eric, one of our summer interns last year). 
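
To illustrate why a round-robin store suits long-lived baselines, here is a minimal Python sketch of the general idea: a fixed number of slots, each holding a consolidated (averaged) value, with the oldest slot overwritten once the archive is full. This is a toy model only. The class name `RoundRobinArchive` and its parameters are hypothetical, and it does not reflect the actual RRD file format or the way Zenoss writes to it.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive (hypothetical, not the real RRD format)."""

    def __init__(self, slots, samples_per_slot):
        self.samples_per_slot = samples_per_slot
        # A bounded deque drops its oldest entry automatically when full,
        # so storage never grows beyond `slots` consolidated points.
        self.points = deque(maxlen=slots)
        self._pending = []

    def update(self, value):
        """Buffer a raw sample; store one average per full batch."""
        self._pending.append(value)
        if len(self._pending) == self.samples_per_slot:
            self.points.append(sum(self._pending) / len(self._pending))
            self._pending = []


archive = RoundRobinArchive(slots=3, samples_per_slot=2)
for sample in [10, 20, 30, 50, 5, 15, 100, 200]:
    archive.update(sample)

# Four averages were produced, but only the newest three are retained.
print(list(archive.points))  # [40.0, 10.0, 150.0]
```

Because the size is fixed up front, keeping a longer history means choosing a coarser consolidation interval rather than storing more raw data.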

Secondly, since these graphs monitor product metrics rather than network health, there was no canned or standard template we could plug into all services. Each graph monitors a set of carefully chosen product metrics for a given service, and these needed to be an absolute measure of performance for that service. So we had to identify which metrics to capture for each of our services.

As a result, our inGraphs solution takes the following design:

  1. Code is instrumented with JMX to expose metrics, and these beans are wired into a Spring BeanPostProcessor.
  2. A collector listens to the bean updates and dumps the data to RRD files. 
  3. inGraphs picks up the generated RRD files and renders the graphs. 
  4. The collector signals to Zenoss to send out alerts if any thresholds are crossed. We are careful here to make sure we do this after multiple samples to avoid being flooded by alerts from singular events.
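
The "multiple samples" rule in step 4 can be sketched in a few lines of Python. This is one plausible reading, assuming the samples must breach the threshold consecutively; the class name `ConsecutiveBreachAlert`, the threshold, and the sample values are all made up for illustration and are not taken from the actual collector. The detector fires once, on the sample that completes the streak, so a single spike never triggers an alert and a sustained breach does not keep re-sending.

```python
class ConsecutiveBreachAlert:
    """Fire only after `required_samples` consecutive breaches (hypothetical)."""

    def __init__(self, threshold, required_samples):
        self.threshold = threshold
        self.required_samples = required_samples
        self._streak = 0

    def observe(self, value):
        """Return True exactly when the streak first reaches the limit."""
        if value > self.threshold:
            self._streak += 1
        else:
            self._streak = 0  # any healthy sample resets the count
        return self._streak == self.required_samples


detector = ConsecutiveBreachAlert(threshold=100, required_samples=3)
error_counts = [20, 150, 30, 120, 130, 140, 160, 10]
fired = [detector.observe(count) for count in error_counts]

# The lone spike (150) is ignored; the alert fires on the third
# consecutive breach (140) and is not repeated for 160.
print(fired)
```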

inGraphs also adds the ability to design custom dashboards in YAML, which lets us aggregate graphs from related services on a single screen. For example, this is a simple YAML definition:

  title: Basic Info Edited
    - label: updates
      service: profile-services
      rrd: profile-services ProfileServiceInfo  BasicInfoEdited_BasicInfoEdited.rrd
    - expression: updates,60,*
      label: updates_min
    consolidate: Aggregate
    overlay: weeks
    vlabel: updates / min
    height: 100
    width: 500
    overlayTime: 1
    noLegend: 0
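
The `expression` value `updates,60,*` reads like the comma-separated reverse Polish notation that RRDtool uses for derived series: push `updates`, push `60`, multiply. Assuming the stored value is a per-second rate (an assumption suggested by the `updates / min` label, not something stated here), the result is updates per minute. The helper `eval_rpn` below is a hypothetical Python sketch that handles only the four basic arithmetic operators; it is not inGraphs' or RRDtool's evaluator.

```python
import operator

# Only basic arithmetic is supported in this sketch.
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression, variables):
    """Evaluate a comma-separated RPN expression against named values."""
    stack = []
    for token in expression.split(","):
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPERATORS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))  # numeric literal
    if len(stack) != 1:
        raise ValueError("malformed expression: " + expression)
    return stack[0]


# 2.5 updates per second becomes 150 updates per minute.
per_minute = eval_rpn("updates,60,*", {"updates": 2.5})
print(per_minute)  # 150.0
```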

The above definition produces the following graph on the dashboard: 

This monitoring technique has proven to be a great tool for engineers. It lets us move fast and buys us time to detect, triage and fix problems. 
The effectiveness of our monitoring system was highlighted in an instance where an inGraphs graph monitoring functionality tied to a major web-mail provider started trending downwards, and the provider realized they had a problem in their system only after we reached out to them!

I strongly believe inGraphs is a great idea executed well, and I am glad that I get to look at graphs instead of mining log files when things go bump in the night.