Common Issue Detection for CPU Profiling
September 5, 2017
Co-authors: John Nicol, Chen Li, Peinan Chen, Tao Feng, and Hari Ramachandran
LinkedIn has a centralized approach for profiling services that has helped identify many performance issues. However, many of those issues are common across multiple services. In this blog post, we will discuss how we have enhanced our approach to also detect and report common performance issues across the hundreds of application services and thousands of instances running in our data centers.
Profiling of services is a useful method to find optimizations to improve service performance; the ODP (“On-Demand Profiling”) framework has helped identify many performance problems at LinkedIn. However, as these analyses and subsequent optimizations are done for individual services, common performance problems (patterns) are often rediscovered. This results in wasted effort for teams and doesn’t help with the global uptake of important fixes.
We’ll first discuss how to detect these repeated patterns, then how to alert teams to known improvements, and finally how to crowdsource to find new patterns. Note that, although the examples described below are JVM-specific, this is incidental—the same method is applicable to other languages or frameworks that have stack traces, such as .NET, C++, and Python.
Top “hot” methods
Most repeated issue patterns have been discovered by looking at the methods in which the most time is spent (the “top-down” or “leaf-first” approach). Generally, we don’t dive fully into our Flame Graphs when doing analyses, but rather simply list the top “hot” methods.
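As a rough illustration of the leaf-first listing described above, the sketch below ranks methods by how often they are the leaf frame of a sampled stack. The class name HotMethods, the method topHotMethods, and the input shape (one list of frame names per sample, leaf frame first) are assumptions made for this example, not the ODP framework's actual API.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks methods by how often they are the leaf frame of a sample. */
final class HotMethods {
    private HotMethods() {}

    /**
     * @param samples one entry per sampled stack, leaf frame first (assumed layout)
     * @param limit   maximum number of methods to return
     * @return methods paired with their share of all non-empty samples, hottest first
     */
    static List<Map.Entry<String, Double>> topHotMethods(List<List<String>> samples, int limit) {
        Map<String, Integer> leafCounts = new HashMap<>();
        int counted = 0;
        for (List<String> stack : samples) {
            if (stack.isEmpty()) {
                continue; // nothing to attribute for an empty sample
            }
            leafCounts.merge(stack.get(0), 1, Integer::sum);
            counted++;
        }

        List<Map.Entry<String, Double>> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> e : leafCounts.entrySet()) {
            double share = e.getValue() / (double) counted;
            ranked.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), share));
        }
        // Highest share first; ties are broken by method name so the order is stable.
        ranked.sort((a, b) -> {
            int byShare = Double.compare(b.getValue(), a.getValue());
            return byShare != 0 ? byShare : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

A caller would pass in the sampled stacks from one profile and display the returned entries as the “hot” method list.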
Context-sensitive code searching
Once a candidate “hot” method is noticed, investigation is needed to discover the underlying reason for the time spent in it, and to see if there is an issue and a potential fix. However, the source code for each project is not tied to the UI, so finding the method and line number associated with an issue can be difficult. To speed up this investigation, we’ve added a context-sensitive code search capability; this queries a code search tool that’s internal to LinkedIn, but most companies and projects have similar functionality.
Automatic detection of known issues
In a given profile, we automatically search for known issues. This is done by applying a series of regular expression patterns against each stack trace; if a pattern matches, and the aggregate percentage of time spent in that issue is above a configurable threshold, then the issue is flagged and reported. For example, if significant time is spent in exception handling (specifically in the class java.lang.Throwable), this is detected as an issue.
The issue-detection search uses a priority order for known issues, since in most cases, you would only want one issue to be flagged for each stack. For example, a specific version of a library may raise many Exceptions of type java.lang.NumberFormatException. This issue will be flagged as specific to the library method, rather than as a generic “fix slow Exceptions” issue. Alternatively, a hierarchical order can be used, which allows flagging multiple issues for one stack, and allows short-circuiting for searches with no result.
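The matching scheme described above (regular expressions applied to each stack, first match in priority order wins, and an aggregate threshold before flagging) can be sketched as follows. This is a minimal illustration under stated assumptions: the class KnownIssueDetector, the Rule type, the issue names, and the input shape (stack trace text mapped to a sample count) are all hypothetical and not taken from the ODP framework.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical detector: one issue per stack, chosen by rule priority. */
final class KnownIssueDetector {
    /** A known issue: a name plus a regex searched for within the stack trace text. */
    static final class Rule {
        final String name;
        final Pattern pattern;

        Rule(String name, String regex) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rulesInPriorityOrder;
    private final double threshold; // minimum fraction of samples, e.g. 0.10 for 10%

    KnownIssueDetector(List<Rule> rulesInPriorityOrder, double threshold) {
        this.rulesInPriorityOrder = rulesInPriorityOrder;
        this.threshold = threshold;
    }

    /**
     * @param stackSamples stack trace text (frames joined by newlines) mapped to sample count
     * @return issue name mapped to its fraction of all samples, for issues at or above threshold
     */
    Map<String, Double> detect(Map<String, Integer> stackSamples) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        long total = 0;
        for (Map.Entry<String, Integer> sample : stackSamples.entrySet()) {
            total += sample.getValue();
            for (Rule rule : rulesInPriorityOrder) {
                if (rule.pattern.matcher(sample.getKey()).find()) {
                    hits.merge(rule.name, sample.getValue(), Integer::sum);
                    break; // the highest-priority match claims this stack
                }
            }
        }

        Map<String, Double> flagged = new LinkedHashMap<>();
        if (total == 0) {
            return flagged;
        }
        for (Map.Entry<String, Integer> hit : hits.entrySet()) {
            double share = hit.getValue() / (double) total;
            if (share >= threshold) {
                flagged.put(hit.getKey(), share);
            }
        }
        return flagged;
    }
}
```

Listing a library-specific NumberFormatException rule ahead of a generic java.lang.Throwable rule means a stack that matches both is attributed only to the more specific issue.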
We can also use additional metadata, such as the versions of a service and its dependencies, and command-line arguments, such as JRE version and flags, to determine whether to raise specific issues. For example, some issues are partially addressed with newer libraries or JRE versions—see SecureRandom and Logging discussions below.
Finding new issues
We have two methods to find new issues:
Crowdsourcing: In our profile results, we have a button that allows users to report methods with performance issues, link to a known bug, or write additional details.
Search over all profiled services for common “top” stack traces: Effectively, we can merge our profiles into a single “virtual profile” to find hotspots.
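One way to picture the “virtual profile” is as a merge of per-service hot-method shares. The sketch below is only an illustration: the class VirtualProfile, the input shape (service name mapped to leaf-method sample counts), and the choice to weight every service equally regardless of how many samples it has are assumptions of this example, not a description of the actual implementation.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical merge of per-service profiles into one "virtual profile". */
final class VirtualProfile {
    private VirtualProfile() {}

    /**
     * @param leafCountsByService service name mapped to (method mapped to leaf sample count)
     * @return method mapped to its average share of samples across services with data
     */
    static Map<String, Double> merge(Map<String, Map<String, Integer>> leafCountsByService) {
        Map<String, Double> merged = new TreeMap<>();
        int servicesWithData = 0;
        for (Map<String, Integer> counts : leafCountsByService.values()) {
            long total = 0;
            for (int count : counts.values()) {
                total += count;
            }
            if (total == 0) {
                continue; // skip services without samples
            }
            servicesWithData++;
            // Normalize per service so a heavily sampled service does not dominate.
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                merged.merge(e.getKey(), e.getValue() / (double) total, Double::sum);
            }
        }
        final int n = servicesWithData;
        merged.replaceAll((method, shareSum) -> shareSum / n);
        return merged;
    }
}
```

Methods with a high average share in the merged result are candidates for a new common issue pattern.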
Some users associated with a service may not be aware of a profile result. Also, new common issues may be reported or detected well after a profile is complete.
We address these by inverting our detection process. When a new issue is reported or detected, we rescan the latest profile results for all services to determine which profiles are affected by this issue. This gives us the ability to inform all service owners of all issues, even newly-reported ones.
Performance issue patterns
Many performance improvements and fixes have been made across the LinkedIn stack through this process. Below is an image capture of a partial list of patterns we identify. Our full list of patterns is much longer, although many of them are specific to LinkedIn libraries.
The issues encountered are varied, but some common patterns have emerged:
Logging is very common in services, and is expected to be cheap. However, older logging frameworks, synchronized loggers, short-lived logger objects, and function evaluations during logging can all have performance effects. Upgrading from Apache’s log4j1 library to log4j2 with asynchronous logging can result in dramatically better performance—even a 4x throughput improvement in some cases.
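To make the “function evaluations during logging” point concrete, here is a small sketch. It uses java.util.logging from the JDK as a stand-in so that it has no dependencies; it is not log4j code, and the class LazyLoggingDemo and its methods are hypothetical names for this example. It contrasts an eagerly built message with one passed as a Supplier, which is only evaluated if the level is enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical demo: eager versus lazy construction of a debug message. */
final class LazyLoggingDemo {
    // One long-lived logger per class, rather than a short-lived logger per call.
    private static final Logger LOG = Logger.getLogger(LazyLoggingDemo.class.getName());

    /** Counts how often the (pretend) expensive message is actually built. */
    static final AtomicInteger BUILDS = new AtomicInteger();

    private LazyLoggingDemo() {}

    static String expensiveDescription() {
        BUILDS.incrementAndGet();
        return "state=" + System.nanoTime();
    }

    /** Logs twice at FINE while FINE is disabled and reports how many messages were built. */
    static int buildsWhenFineDisabled() {
        LOG.setLevel(Level.INFO); // FINE is below INFO, so both calls are discarded
        BUILDS.set(0);
        // Eager: the argument is evaluated before fine() can check the level.
        LOG.fine("eager: " + expensiveDescription());
        // Lazy: the supplier is not invoked because FINE is not loggable.
        LOG.fine(() -> "lazy: " + expensiveDescription());
        return BUILDS.get();
    }
}
```

In this sketch only the eager call pays for building the message, even though neither message is written.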
java.security.SecureRandom methods are potentially very slow because they can block while gathering entropy, and they may also be synchronized: a double whammy. Perhaps surprisingly, SecureRandom is used by java.util.UUID.randomUUID.
There are two ways to tackle this (there’s a slight synergistic effect when doing both):
Upgrade to a recent JRE, which removes the synchronization in SecureRandom.nextBytes.
Change the underlying logic that generates entropy, or use a different default entropy source. This is a complex discussion; this Synopsys blog entry has some of the details.
For one service, this resulted in a roughly 40% throughput improvement.
JVM Exceptions can be slow (orders of magnitude slower than non-exceptional cases, on occasion), but in other cases, the effect is negligible. This blog post has an enlightening discussion on the issue.
Generally the best fix is to not throw the Exception; however, when that is not possible, there are workarounds, such as caching the Exception or reducing its stack trace.
One service at LinkedIn improved its throughput by 35% simply by not raising NumberFormatExceptions during String parsing. A similar optimization is available in the Google Guava tryParse methods.
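The two ideas above (avoid throwing during parsing, or make the Exception cheap when it must be thrown) can be sketched as follows. The class SafeParsing and its members are hypothetical names for this example. The parser only accepts ASCII digits, which is narrower than Integer.parseInt, and the exception type relies on the standard Throwable constructor flag that disables stack trace capture.

```java
import java.util.OptionalInt;

/** Hypothetical helpers for exception-free parsing and cheap exceptions. */
final class SafeParsing {
    private SafeParsing() {}

    /** Parses a base-10 int without throwing; empty on malformed or out-of-range input. */
    static OptionalInt tryParseInt(String s) {
        if (s == null || s.isEmpty()) {
            return OptionalInt.empty();
        }
        int i = 0;
        boolean negative = false;
        char first = s.charAt(0);
        if (first == '-' || first == '+') {
            if (s.length() == 1) {
                return OptionalInt.empty(); // a sign with no digits
            }
            negative = first == '-';
            i = 1;
        }
        long value = 0;
        for (; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalInt.empty();
            }
            value = value * 10 + (c - '0');
            if (value > 1L + Integer.MAX_VALUE) {
                return OptionalInt.empty(); // bail out early so the long cannot overflow
            }
        }
        long signed = negative ? -value : value;
        if (signed < Integer.MIN_VALUE || signed > Integer.MAX_VALUE) {
            return OptionalInt.empty();
        }
        return OptionalInt.of((int) signed);
    }

    /** An exception that skips stack trace capture, for hot paths that must still throw. */
    static final class CheapException extends RuntimeException {
        private static final long serialVersionUID = 1L;

        CheapException(String message) {
            // enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }
}
```

The trade-off with an exception that has no stack trace is that it is harder to debug, so this kind of workaround is probably best kept to well-understood hot paths.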
ForkJoinPool is a concurrency framework for parallel processing. It appears that java.util.concurrent.ForkJoinPool.awaitWork suffered a performance regression that causes CPU spinning, per JDK-8080623. One service at LinkedIn gained a 25% throughput improvement by refactoring; alternatively, a JRE upgrade should resolve this as well.
Other classes affected by contention
We have additional rules to help improve multithreaded performance: prefer java.lang.StringBuilder to synchronized java.lang.StringBuffer, prefer java.util.concurrent.ThreadLocalRandom to java.util.Random, prefer unsynchronized Maps to java.util.Hashtable, and others. Some of these are subtle concerns, but in multithreaded services with contention, they can affect 99th percentile latency, for example.
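Two of the substitutions named above can be shown in a few lines. The class ContentionFreeHelpers and its methods are hypothetical names for this example. Note that ThreadLocalRandom is not cryptographically secure, so it is not a replacement for SecureRandom.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helpers that avoid shared, synchronized state on hot paths. */
final class ContentionFreeHelpers {
    private ContentionFreeHelpers() {}

    /** Joins ids with commas using an unsynchronized, method-local StringBuilder. */
    static String joinIds(int[] ids) {
        StringBuilder sb = new StringBuilder(ids.length * 4);
        for (int k = 0; k < ids.length; k++) {
            if (k > 0) {
                sb.append(',');
            }
            sb.append(ids[k]);
        }
        return sb.toString();
    }

    /** Picks a bucket in [0, bucketCount) from the calling thread's own generator. */
    static int pickBucket(int bucketCount) {
        // Each thread has its own generator, so threads do not contend on a shared seed.
        return ThreadLocalRandom.current().nextInt(bucketCount);
    }
}
```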
We’ve created additional rules, including some for Java Reflection (sometimes slow: cache results when possible) and regular expression matches (hand-parsed routines can be faster). These are straightforward and generally minor concerns and so have higher thresholds before they are flagged, but they can contribute to slow methods if called often.
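For the Reflection rule, “cache results when possible” might look like the sketch below. The class MethodCache is a hypothetical name for this example, and keying the cache on the class name is a simplifying assumption that would not distinguish same-named classes loaded by different class loaders.

```java
import java.lang.reflect.Method;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical cache for reflective lookups of public no-argument methods. */
final class MethodCache {
    private static final ConcurrentMap<String, Method> CACHE = new ConcurrentHashMap<>();

    private MethodCache() {}

    /** Returns the public no-arg method, looking it up reflectively only on first use. */
    static Method noArgMethod(Class<?> type, String name) {
        return CACHE.computeIfAbsent(type.getName() + '#' + name, key -> {
            try {
                return type.getMethod(name);
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException("No public no-arg method " + key, e);
            }
        });
    }
}
```

Repeated calls with the same class and name return the same cached Method object instead of repeating the lookup.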
In this post, we’ve given an overview of our “common issue detection” feature for CPU profiling, and have described some of the improvements we’ve experienced. We expect even more improvements as additional issue patterns are found and added. We hope that you see the benefit of such a feature, and can apply something similar for your own systems.
The development and use of this feature at LinkedIn has been a significant cross-team effort. We wish to thank Brandon Duncan, Josh Hartman, Jason Johnson, Todd Palino, Chris Gomes, Yi Feng, their respective teams, and of course, all the users of the framework.