
ODP: An Infrastructure for On-Demand Service Profiling

Coauthors: Tao Feng, John Nicol, Chen Li, Peinan Chen, Hari Ramachandra


LinkedIn has built hundreds of application services, with thousands of instances running in data centers. Optimizing the performance of these services can dramatically improve user experience and reduce operational costs, and profilers are commonly used to help achieve this. LinkedIn’s On-Demand Profiling infrastructure (“ODP”) is one method we use to identify these optimizations.

Introduction

Profiling is a useful method for improving the performance of services. However, profiling tools have no fixed standards, are often decentralized, can be costly, and, for a company with a server footprint as large as LinkedIn's, are inconvenient to use at best.

For example, unless a profiler is supported internally, users may need to acquire licenses, configure the tool, and request installation on remote hosts before they can profile. In addition, viewing the profiled results often requires manually transferring data or setting up a tunnel from the production environment to the development environment. Lastly, comparing historical data is difficult or even impossible, especially when profiling runs were captured by different users or with different profiling tools.

ODP is our tooling infrastructure to address these pain points. It allows users to debug service performance issues with little manual effort. It also centralizes profiling data so the data can be shared, archived, and compared with other profiling events; this data sharing also allows known issues to be automatically identified. Moreover, this profiling can be scaled for thousands of services across LinkedIn’s data centers. Additionally, this is a plugin-based infrastructure, which can be extended to include memory allocation, thread status, profilers for other languages, and more.

The generality of this approach is useful, but we’ve found few profiling tools that can effectively be used with it so far. For now, we’ve developed our own JVM CPU sampling profiler and are investigating profilers for other languages. These profilers are secondary to the tooling infrastructure and may be replaced with future industry standards, but for now, they have proven themselves to be quite effective when used with the overall framework.

In this post, we describe the overall architecture of ODP and how ODP helps find performance issues with LinkedIn services.

On-demand profiler architecture

The following diagram shows the overall architecture of ODP.

[Diagram: Overall architecture of ODP]

At a high level, here’s what happens:

  1. A user or a scheduled job requests a service be profiled on a specific host.

  2. This request is passed to a REST-based API server. That server deploys the profiler if necessary (if another service on the host has already used the profiler, we reuse it rather than re-deploying) and then signals the profiler to attach to the specified service.

  3. The profiler sends its data through a scalable pipeline. After post-processing, it can be publicly viewed on a web-based GUI.

Profiling requests
Profiling requests can come from both users and approved services (e.g., automated testing).

In addition to profiling a service on demand, users can also schedule profiling around regular events, such as traffic shifts. For flexibility, the framework supports these requests coming in from anywhere, as long as they're authenticated.

REST-based API server
Our REST-based API server handles the scheduled and on-demand start/stop profiling requests. The server checks whether the profiler is already deployed and, if not, deploys it. It then tells the profiler to attach to the specified service via a Kafka message.
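
For illustration, here's a minimal sketch of how the server might signal the profiler over Kafka. The topic name, message fields, and host names are hypothetical, not ODP's actual wire format:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProfilerCommandSender {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // A hypothetical start-profiling command for one service on one host.
        String command = "{\"action\":\"START\",\"host\":\"app-host-01\","
                + "\"service\":\"example-service\",\"durationSec\":60}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by host keeps all commands for a host in one partition,
            // so the profiler sees them in order.
            producer.send(new ProducerRecord<>("profiler-commands", "app-host-01", command));
        }
    }
}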

JVM profiler
Our current profiler is a sampling profiler based upon JMX. It can connect to and profile any JVM-based application on the same host with no disruption.

The general workflow of the profiler is to collect stack traces and elapsed CPU time for each Java thread via MXBeans at regular intervals, and then to post-process the data in a separate thread. The data is aggregated and periodically sent to Kafka.
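
As a minimal in-process sketch of that loop, the snippet below samples stacks and CPU time with ThreadMXBean. The real profiler attaches to a separate JVM over JMX; the stack-key format and sampling depth here are our own simplifications:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

public class SamplingLoop {
    // Aggregated sample counts, keyed by semicolon-joined stack traces.
    private final Map<String, Long> samples = new HashMap<>();
    // Latest elapsed CPU time per thread, in nanoseconds.
    private final Map<Long, Long> cpuNanosByThread = new HashMap<>();
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    public void sampleOnce() {
        // Snapshot every thread's stack in one call; depth capped for brevity.
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds(), 64)) {
            if (info == null) continue; // the thread may have exited mid-sample
            StackTraceElement[] frames = info.getStackTrace();
            StringBuilder key = new StringBuilder();
            for (int i = frames.length - 1; i >= 0; i--) { // root frame first
                if (key.length() > 0) key.append(';');
                key.append(frames[i].getClassName()).append('.')
                   .append(frames[i].getMethodName());
            }
            if (key.length() > 0) samples.merge(key.toString(), 1L, Long::sum);
            cpuNanosByThread.put(info.getThreadId(),
                    threads.getThreadCpuTime(info.getThreadId()));
        }
    }
}

In practice, the loop runs on a timer, and a separate thread periodically drains the aggregated map and ships it to Kafka.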

Samza
Samza is the scalable stream-processing platform used within LinkedIn. We use it to pull the profiling data from Kafka and push it to a MySQL database. Even when multiple profilers send data to the Kafka topic simultaneously, Samza keeps up with the produced messages, and it protects against data loss.
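
As a rough sketch, a Samza task for this job might look like the following, assuming a string serde for the messages. The table schema is hypothetical, and production code would pool connections and batch writes rather than opening a connection per message:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Moves profiling messages from the Kafka topic into MySQL.
public class ProfileWriterTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String payload = (String) envelope.getMessage(); // one batch of samples
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/odp", "odp", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO profile_samples (payload) VALUES (?)")) {
            stmt.setString(1, payload);
            stmt.executeUpdate();
        } catch (SQLException e) {
            throw new RuntimeException("Failed to persist profiling batch", e);
        }
    }
}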

GUI/web application
The profiling data is visualized through a web application built with Ember.js, Flask, and flamegraphs. The flamegraph, a visualization technique for CPU stack traces, was created by Brendan Gregg of Netflix. The flamegraph is rendered as an interactive SVG, which allows the user to zoom in and out of stack traces easily. To read a flamegraph, treat each cell in a given layer as a method call and the cells in the layer above it as its child method calls; thus, the highest cells are the deepest method calls.
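
We won't detail ODP's internal data format here, but flamegraph tooling commonly consumes "folded" stacks: one line per unique stack, with frames joined by semicolons and followed by a sample count. A small sketch of emitting that format from the aggregated samples:

import java.util.Map;

public class FoldedStacks {
    // Emit samples in the folded format that flamegraph tooling consumes:
    //   root_frame;child_frame;leaf_frame count
    public static void print(Map<String, Long> samples) {
        samples.forEach((stack, count) -> System.out.println(stack + " " + count));
    }

    public static void main(String[] args) {
        print(Map.of(
                "main;Server.handle;Json.parse", 120L,
                "main;Server.handle;Db.query", 45L));
    }
}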

Performance debugging portal

[Screenshot: A profiling event page in the performance debugging portal]

For each profiling request, the user gets a unique page with:

  • Information about the profiled service, plus optional comments;
  • Different display modes, such as sample counts or CPU time;
  • Widgets to help the user debug performance issues, including top hot leaf methods, leaf-first view, highlight, filter, known issues, thread status, and others;
  • A flamegraph for the sampled stack traces.

Highlight or filter

[Screenshot: Highlighting and filtering stack traces in the flamegraph]

The stack traces can be overwhelming to the user. We need to make them manageable and allow the user to focus on specific sections. To do this, we introduced highlighting and filtering functions. The user can highlight or filter on any searchable string (for example, a method name, package name, line number, or a regex combining those). The flamegraph then re-renders to show only the stack traces containing a match, with the specific matches highlighted.
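
Under the hood, a filter of this kind amounts to a regex match over each frame. A minimal sketch, assuming stacks are stored as semicolon-joined frame names as in the earlier sampling sketch:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class StackFilter {
    // Keep only the stacks with at least one frame matching the pattern.
    public static List<String> filter(List<String> stacks, String regex) {
        Pattern p = Pattern.compile(regex);
        return stacks.stream()
                .filter(s -> Arrays.stream(s.split(";"))
                        .anyMatch(frame -> p.matcher(frame).find()))
                .collect(Collectors.toList());
    }
}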

Leaf-first mode

[Screenshot: Leaf-first mode]

The above diagram shows the leaf-first mode, which reverses the stack traces (callee listed at bottom, caller at top). This helps developers spot the code paths where the majority of time is spent. For example, this service spends a significant portion of time in the leftmost stack traces; this may indicate a bottleneck.
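
Conceptually, producing the leaf-first view just reverses each stack before the flamegraph is built, so identical leaves merge at the bottom. A minimal sketch:

import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LeafFirst {
    // Reverse a semicolon-joined stack so the leaf (callee) comes first.
    public static String reverse(String stack) {
        List<String> frames = Arrays.asList(stack.split(";"));
        Collections.reverse(frames);
        return String.join(";", frames);
    }
}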

Top hot leaf methods

[Screenshot: Top hot leaf methods]

The GUI lists the top hot leaf methods for the profiling event shown above.

[Screenshot: Flamegraph filtered to a selected leaf method]

If the user is interested in a certain leaf method, clicking the method name automatically filters the flamegraph to the stack traces containing that method, as shown above.
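
Ranking hot leaf methods boils down to attributing each stack's samples to its leaf frame and sorting. A sketch of that aggregation, again assuming semicolon-joined stacks:

import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class TopLeaves {
    // Sum each stack's sample count into its leaf frame, then rank the leaves.
    public static Map<String, Long> top(Map<String, Long> samples, int n) {
        Map<String, Long> byLeaf = new HashMap<>();
        samples.forEach((stack, count) -> {
            String leaf = stack.substring(stack.lastIndexOf(';') + 1);
            byLeaf.merge(leaf, count, Long::sum);
        });
        return byLeaf.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }
}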

Thread states visualization

[Screenshot: Thread states at each sampling point]

We provide a visualization of thread states at each sampling point. Users can easily spot thread contention issues and see which method is blocking.
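
As an illustration (not ODP's actual implementation), the JDK exposes enough through ThreadMXBean to detect this kind of contention at a single sampling point:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ContentionCheck {
    // Report threads blocked on a monitor, along with the lock's owner.
    public static void report() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s BLOCKED on %s held by %s%n",
                        info.getThreadName(),
                        info.getLockName(),
                        info.getLockOwnerName());
            }
        }
    }
}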

Comparison

[Screenshot: Comparison of two profiling events]

Users can also compare profiles. In the figure above, red shows an increase in CPU samples relative to the baseline, while blue shows a decrease.
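
A comparison like this can be computed as a per-stack delta of sample counts between two profiles; whether and how ODP normalizes the counts is beyond this sketch:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ProfileDiff {
    // Positive delta: more samples than the baseline (rendered red).
    // Negative delta: fewer samples than the baseline (rendered blue).
    public static Map<String, Long> diff(Map<String, Long> baseline,
                                         Map<String, Long> current) {
        Map<String, Long> delta = new HashMap<>();
        Set<String> stacks = new HashSet<>(baseline.keySet());
        stacks.addAll(current.keySet());
        for (String stack : stacks) {
            delta.put(stack, current.getOrDefault(stack, 0L)
                    - baseline.getOrDefault(stack, 0L));
        }
        return delta;
    }
}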

Automatic integration with performance test frameworks
The API server provides endpoints to allow trusted services to start/stop profiling events automatically. Some services have test frameworks to catch performance regression issues; we’ve integrated with one such internal system already. The internal framework makes profiling requests to ODP during its performance test runs, providing profiling data that helps find performance regressions.
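
For a sense of what such an integration looks like, here's a hypothetical client that starts a profiling event over HTTP. The endpoint path, JSON fields, and auth header are all illustrative, not ODP's actual API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProfilingClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest start = HttpRequest.newBuilder()
                .uri(URI.create("https://odp.example.com/api/profilings"))
                .header("Authorization", "Bearer <token>")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"host\":\"app-host-01\",\"service\":\"example-service\"}"))
                .build();
        HttpResponse<String> resp = client.send(start, HttpResponse.BodyHandlers.ofString());
        System.out.println(resp.statusCode() + " " + resp.body());
    }
}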

Performance improvements

Over the last few months, many performance improvements and fixes have been made across the LinkedIn stack through the use of ODP. Detecting bad apples, especially in commonly used library code, has been a huge win, reducing latency and/or CPU usage in many services. Although the issues encountered are varied, some common patterns have emerged:

Exception handling:
JVM exception handling can be slow (on occasion, orders of magnitude slower than normal control flow), but in other cases the effect is negligible.

Reflection:
Using reflection in the JVM can be slow. This slowness can show up in surprising places; even getting a class name can have an effect.

Logging:
Logging is a very common operation in services, and it's expected to be cheap. However, old logging frameworks, short-lived logger objects, and function evaluations during logging have all been found to hurt performance in some cases.
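
To make the logging pattern concrete, here's a sketch using SLF4J as an illustrative framework. The first call pays for string concatenation and the expensive argument even when DEBUG is disabled; the second defers formatting and skips the expensive call entirely:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingCost {
    private static final Logger LOG = LoggerFactory.getLogger(LoggingCost.class);

    void handle(Request req) {
        // Bad: expensiveSummary() and the concatenation always run.
        LOG.debug("handled request: " + expensiveSummary(req));

        // Better: a guard plus parameterized logging.
        if (LOG.isDebugEnabled()) {
            LOG.debug("handled request: {}", expensiveSummary(req));
        }
    }

    private String expensiveSummary(Request req) { return req.toString(); }

    static class Request {}
}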

Summary

In this post, we’ve given an overview of ODP, our infrastructure for on-demand service profiling. We’ve described its architecture, its features, and some of the wins that LinkedIn has already experienced through the use of this framework. We hope that you see the benefit of such a framework, and can apply something similar for your own systems.

Acknowledgments

The development and use of ODP at LinkedIn has been a significant cross-team effort. We wish to thank Brandon Duncan, Josh Hartman, Haiying Wang, Kumar Pasumarthy and Jason Johnson (and their respective teams)... and of course, all the users of the framework.