Reducing the MTTD and MTTR of LinkedIn’s Private Cloud

Gustaf Helgesson

Staff Software Engineer, Media Infrastructure at LinkedIn

October 14, 2016

Nuage (French for “cloud”) is what we call LinkedIn’s internal cloud management portal. It allows LinkedIn developers to quickly create new datastores like Kafka topics, Voldemort stores, and Espresso databases, to name just a few in the LinkedIn data centers. The product consists of an HTTP frontend available to LinkedIn developers and a rest.li backend service, which talks to the underlying systems, such as Kafka.

Similar to the underlying storage and streaming systems at LinkedIn, Nuage is itself also a distributed system, with several frontends and backends distributed across multiple data centers. As anyone working on distributed systems can attest, debugging these systems can be both difficult and time consuming. For a given incident, one often has to find the relevant machines for each service involved and get the logs from each of these machines for the relevant timeframe to then digest and analyze. In the case of Nuage, this translates into finding the relevant data center, the relevant frontend and backend machines within that data center, and getting the logs from these for the time of an incident.

To assist with debugging and reduce the mean time to recovery (MTTR) of the Nuage system, we perform log aggregation, request tagging, and user experienced exception reporting. We will cover our log aggregation setup later in this post, but first let us cover the request tagging and user experienced exception reporting, as this is then leveraged in our log aggregation setup.

Request tagging and exception reporting

As outlined above, Nuage consists of frontend and backend services running on different sets of hosts. To simplify debugging, we want to be able to easily correlate events in the frontend with events in the backend. To solve this problem, we generate a random request ID in the frontend that is passed on as an HTTP header to the Nuage backend. The frontend logs the request sent to the backend (including the request ID) along with the response it receives. Upon receiving an incoming request, the backend looks for the presence of the request ID in the HTTP header and adds the request ID to every associated log line. This latter part is accomplished using log4j’s Mapped Diagnostic Context (MDC). MDC is effectively a per-thread map that can be referenced in the log output formatter. This allows us to output the request ID on every line related to the request, which solves a couple of problems:

When multiple threads are emitting logs at the same time, it is now easy to figure out which log lines relate to which request.
Grepping the log for the request ID will now give all related log lines.
It makes it easy for us to parse the request ID and insert it into a separate column in Elasticsearch, as described in the next section.

In addition to having this valuable information, we also have the frontend send us an email whenever an unexpected exception has happened in either the frontend or backend that includes associated details for us to begin investigating. Email was picked as the medium because these exceptions on their own do not warrant immediate attention, and other alerting mechanisms exist to cover urgent issues. As visible in the below sample email, we can now see the immediate stack traces from both the frontend and the backend. The email also contains the request ID of the request that triggered the exception, which makes it much easier to find all the related lines in the log.

This setup has proven to be very useful for reducing our mean time to detect (MTTD). It allows us to avoid relying on users to report issues they see in the frontend, and we can begin debugging as soon as an issue has occurred. Given the information in the email, we can quickly dig further using our Elasticsearch, Logstash, and Kibana (ELK) setup described below.

Log aggregation using Elasticsearch, Logstash, and Kibana (ELK)

Our ELK stack allows us to aggregate and search logs across multiple products and hosts in a single place. This allows for many use cases; for example, it lets you answer questions like “How often and where does the error X occur?” or “How have our exception rates changed over the past week?” The responsibilities of the various components in ELK are:

Elasticsearch: the underlying database;
Logstash: parses the logs into a structured format and emits these logs into Elasticsearch;
Kibana: a web UI for visualising data from Elasticsearch.

It’s possible to either run Logstash on application servers themselves or to read the logs from other sources. At LinkedIn, we normally send the logs using a Kafka producer on the application boxes, and the logs are then read by Logstash using a Kafka consumer, parsed, and sent to an Elasticsearch cluster. This has the benefit of reducing impact to the running server, since we can do the log processing on a different machine. The diagram below outlines the ELK workflow in the case of Nuage.

As part of our Logstash configuration, we parse the logs to find the request ID and emit it as a separate field in the rows we send to Elasticsearch.

Through this setup, we are able to enter the simple query “request_id: <req_id>” into our Kibana dashboard, which gets us all the relevant logs from the frontend and backend boxes for that request—we don’t even need to know on which machines the request was served. Since we no longer need to figure out where the request was running, grep logs from individual machines, or figure out which interleaved log lines are relevant to a given request, this has greatly simplified our debugging process and helped reduce our MTTR.

Future extensions

Right now, we only associate a request ID with a Nuage frontend request and a Nuage backend processing and response. Given the nature of the product, it is not uncommon for exceptions to bubble up from underlying systems like Espresso or Voldemort. A simple improvement would be for the Nuage backend to forward the request ID it receives and pass this along when making requests to the underlying system. The response from this system could then be attached to the debug email and, using ELK, we could find all the logs associated with the request ID for the frontend, backend, and underlying system, giving us an almost complete view of the logs for debugging purposes. This would require the logs of all these systems to be either sent to the same Elasticsearch cluster or to be able to talk to multiple Elasticsearch clusters through a tribe node in order to have the logs from all systems visible in the same place. If this kind of problem interests you, we are hiring!

Conclusion

Through our ELK setup and request tagging, we can now easily find relevant logs for an observed issue. This, together with our user experienced exception emails that contain the request ID that triggered the exception, has resulted in vast reductions in both MTTD and MTTR for our private cloud administration system, Nuage. If you want more information about Nuage and its goals, you may be interested to read Alex Vauthey’s post on invisible infrastructure.

Topics: Optimization Open Source Natural Language Processing