Fixing the Plumbing: How We Identify and Stop Slow Latency Leaks at LinkedIn

Ritesh Maheshwari

Performance Engineering / Video Engineering

October 31, 2017

At LinkedIn, we pay attention to site speed at every step of the release process, from code development to production ramp. But inevitably, the performance of our pages degrades over time (we use the word “pages” to denote webpages as well as mobile apps). In this post, we go over the tools and processes we use to catch and fix these degradations. While this is a post about automation to identify latency leaks, it is also a tale of how a human touch is still necessary.

Background

Over the past few years, LinkedIn has focused a lot on improving performance. This includes horizontal work by infrastructure owners, as well as page-specific efforts by page owners, to deliver content faster to our members. However, as we looked at the overall metrics of our work, we noticed an odd trend. While we delivered many impactful optimization projects, our page load times year-over-year did not improve, and in some cases even degraded. An example page and its page load time trend over a year is shown below.

Looking at the chart above, where the dotted red line is a reference point to show where we started the year, notice how site speed improvements tend to be significant and noticeable, as they are optimization-driven. Degradations, however, can generally be of any “amount,” as they happen for various reasons. LinkedIn’s page-serving pipeline has many moving parts. We deploy code multiple times per day, operate a micro-service architecture with hundreds of services, and upgrade infrastructure frequently. A slowdown in any of these components can cause degradations.

While large degradations can be caught using A/B testing, canary analysis, or anomaly detection, small ones tend to leak to production. Thus, the performance of a page tends to degrade over time.

This led the centralized Performance Team to focus on identifying these leaks, called “site speed regressions,” and to craft tools and processes to fix them.

Regression rule

We first came up with a rule: “a page should not perform worse than its best performance in the past.”

This rule originates from our guiding principle that site speed is paramount for member experience, and so adding new features to a page should not degrade its performance.

Exceptions
There are some exceptions to the rule. A page can violate this rule if:

It is already faster than LinkedIn’s site speed goals.
A business decision is made to override this, in consultation with the Performance Team.

For the second exception, in most cases, an urgent need to roll out a critical new feature results in superseding the regression rule, but only with a signed-off plan and timeline to bring performance back in line.

Implementation

Data
We use Real User Monitoring (RUM) data and, currently, one specific metric to track site speed: 90th percentile page load time (PLT). We only look at United States data for now, as regressions outside the U.S. are more frequently caused by issues outside the page owner's control, such as undersea cables getting severed, degraded transit links, carriers, etc. Since the regressions we decided to focus on are explicitly targeted towards page owners, most relevant regressions can be observed in the U.S. Also, for regressions, we chose weekly performance data over hourly or daily data, as performance measured over a short time period is inherently noisy and can both hide slow leaks and generate false positives. Each week is represented by one data point.
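To make the "one data point per week" idea concrete, here is a minimal Python sketch that reduces raw RUM samples to a weekly 90th percentile PLT. The input shape (a list of `(day, plt_ms)` pairs), the function names, and the nearest-rank percentile method are all assumptions made for illustration; the post does not describe how the actual pipeline aggregates its data.

```python
from collections import defaultdict
from datetime import date


def p90(values):
    """90th percentile using the nearest-rank method (an assumed choice)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) in integer arithmetic
    return ordered[rank - 1]


def weekly_p90(samples):
    """Map (iso_year, iso_week) -> p90 PLT for (day, plt_ms) samples."""
    by_week = defaultdict(list)
    for day, plt_ms in samples:
        iso_year, iso_week, _ = day.isocalendar()
        by_week[(iso_year, iso_week)].append(plt_ms)
    # One data point per week, in chronological order.
    return {week: p90(values) for week, values in sorted(by_week.items())}
```

Grouping by ISO week keeps week boundaries stable across year ends; any fixed seven-day bucketing would serve the same purpose.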

How do we define a regression?
We define a page to be in regression whenever its current week’s page load time consistently and reasonably exceeds its baseline.

Baseline

Every page has a page load time baseline. In essence, the baseline is the lowest consistent PLT the page has had in the past. Low-traffic holidays like Christmas, Thanksgiving, etc., can artificially improve PLT due to low load on CDNs as well as on our backend servers. That’s why consistency matters; we don’t just pick the lowest PLT, but instead the median of three consecutive weeks.

The following pseudo code explains this logic in more concrete terms.
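A minimal Python sketch of that logic, reconstructed from the description above rather than taken from LinkedIn's code. The function names `update_baseline` and `baseline_history` are hypothetical, and the sketch assumes the candidate each week is the median of the three most recent weekly PLT values.

```python
from statistics import median


def update_baseline(baseline, last_three_weeks):
    """Lower the baseline if the 3-week median beats it; never raise it."""
    candidate = median(last_three_weeks)
    if baseline is None or candidate < baseline:
        return candidate
    return baseline


def baseline_history(weekly_plt):
    """Return the baseline in effect after each week (None until 3 weeks exist)."""
    baseline = None
    history = []
    for i in range(len(weekly_plt)):
        if i >= 2:
            baseline = update_baseline(baseline, weekly_plt[i - 2:i + 1])
        history.append(baseline)
    return history


# Week 4 (700 ms) mimics a holiday dip: it never becomes the baseline,
# because the median of any three consecutive weeks ignores a single outlier.
weeks = [1000, 1010, 990, 700, 1005, 950, 940, 930, 1200, 1300]
history = baseline_history(weeks)
```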

As shown, a page’s baseline decreases as the page performs better, but doesn’t increase if it gets worse.

Next, we’ll discuss two important factors that determine a regression: consistently and reasonably being above baseline.

Consistently above baseline: handling hysteresis
There is inherent noise in Real User Monitoring (RUM) data; this is just how the internet works. We want to catch consistent regressions, not cases where PLT is above baseline one week and under it the next. To avoid such flapping, we employ the commonly used hysteresis technique of high and low water marks. When PLT rises above the high water mark, the page has officially regressed. The regression resolves only when the PLT drops below the low water mark.
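The water-mark behavior can be sketched as a small state machine. This is an illustrative Python sketch, not the production logic: the function name is hypothetical, the baseline is held fixed for simplicity, and the default marks (30% and 5% above baseline) are borrowed from the example thresholds given later in this post.

```python
def regression_states(weekly_plt, baseline, high_pct=30.0, low_pct=5.0):
    """Return one boolean per week: True while the page is in regression."""
    high_mark = baseline * (1 + high_pct / 100)
    low_mark = baseline * (1 + low_pct / 100)
    in_regression = False
    states = []
    for plt in weekly_plt:
        if not in_regression and plt > high_mark:
            in_regression = True   # crossed the high water mark: regress
        elif in_regression and plt < low_mark:
            in_regression = False  # dropped below the low water mark: resolve
        states.append(in_regression)
    return states


# With a 1000 ms baseline the marks sit at 1300 ms and 1050 ms.
states = regression_states([1000, 1350, 1100, 1200, 1040, 1100], baseline=1000)
```

Note that weeks at 1100 ms and 1200 ms keep an open regression open, while the same 1100 ms after resolution does not reopen it. The gap between the two marks is what absorbs week-to-week noise.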

Reasonably above baseline: priorities and SLAs
Not all regressions are created equal. Some need more attention than others. So, we have priorities defined for each regression: P0, P1, and P2. An SLA accompanies each priority—this is the amount of time a page owner has to “act” on the issue. More details about how page owners are expected to act will be described in the following section.

A regression is classified as P0, P1, or P2 based on the severity of the leak (PLT degradation) as well as the daily traffic on the page.

See this table for an example setup:

Priority	Daily Traffic		% Regression*	SLA
P0	High	AND	>30%	2 days
P0	Medium	AND	>50%	2 days
P1	Medium	AND	>20%	1 week
P2	Low	AND	>10%	3 weeks
End of Regression	>0	AND	<5%

So, for example, a regression on a “high” traffic page is classified as P0 if this week’s PLT is more than 30% above the baseline (the high water mark). The regression is resolved only when the PLT is less than 5% above the baseline (the low water mark). Note that the thresholds are revised periodically, so the values shown here are examples to illustrate the overall concept.
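As a rough illustration, the example table can be encoded as an ordered rule list where the first matching row wins. This Python sketch is an assumption-laden reading of the table: the names `RULES`, `percent_above`, and `classify` are hypothetical, the traffic tier is taken as a ready-made label because the post does not say how tiers are derived, and combinations the table does not list simply return `None`.

```python
# (priority, traffic tier, % above baseline that must be exceeded, SLA),
# copied from the example table; real thresholds are revised periodically.
RULES = [
    ("P0", "high", 30.0, "2 days"),
    ("P0", "medium", 50.0, "2 days"),
    ("P1", "medium", 20.0, "1 week"),
    ("P2", "low", 10.0, "3 weeks"),
]


def percent_above(plt, baseline):
    """How far this week's PLT sits above the baseline, in percent."""
    return (plt - baseline) * 100 / baseline


def classify(traffic, plt, baseline):
    """Return (priority, SLA) for the first matching rule, else None."""
    pct = percent_above(plt, baseline)
    for priority, tier, threshold, sla in RULES:
        if traffic == tier and pct > threshold:
            return priority, sla
    return None
```

Ordering the rules from most to least severe means a medium-traffic page that is 60% above baseline is reported as P0 rather than P1.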

Tickets
We open internal tickets for each regression and assign them to page owners. Page ownership is a widely used concept at LinkedIn. Each page has exactly one engineering owner. The engineering owner (usually a manager of a team) can use their team’s on-call process and pass ownership of fixing the regression to the on-call engineer.

Fixing regressions

Did it work?
Although we have had the ticket process in place for a while, initially it was a struggle to get traction. We would have hundreds of regression tickets open for weeks without any resolution, and the project was marked RED or YELLOW for a long time. In short, the process was not working.

Specific problems we observed were:

It was not made clear who owned solving the issue
Page owners, who ideally should own solving the regression, did not always have the right tools or knowledge needed to understand the issue
Intermittent measurement issues would create false positives and erode trust in the process
The Performance Team was acting like an enforcer and not like an owner, going against LinkedIn’s values

Midway through last year, we stepped back and analyzed the situation holistically. Here’s what we realized:

We had taken an extremely engineering-focused approach by automating as much as possible.
We had preemptively solved a scaling issue by distributing all the work to page owners.

Any such large-scale distributed system needs a bake-in period. Instead of rolling out a brand new system and process to all of engineering, we needed to perfect the process by targeting a smaller subset of pages. We had also put too much trust in automation without establishing a proper feedback mechanism from humans.

Making it work by “Acting Like an Owner”
We now have a Performance engineer assigned to all P0 and P1 tickets to triage them first. While the system has improved, we still occasionally find measurement issues (e.g., performance timing markers moved, data processing failed, etc.). Performance engineers triage the ticket to make sure it is not a measurement issue (and thus a false positive). They also triage to see if this is due to either a global degradation or a local, page-specific, but known problem. Finally, they attempt a basic root cause analysis. Then, they comment on the ticket detailing their findings.

We expect the page owner to drive the ticket from this point on. Performance engineers will help them with further investigations of the problem, or potential unrelated optimization opportunities for the page. But it is the page owner’s job to take ownership of the ticket from this point onwards.

Once the Performance engineer transfers ownership of the ticket to the page owner, the page owner is also responsible for satisfying the SLAs. For P0 regressions, e.g., they need to investigate the high PLT and come up with a plan to get the PLT back under baseline within two business days.

Fixing the root cause versus improving the page
It is important to note that since we are talking about slow leaks, root cause analysis in many cases doesn’t catch one single culprit. More often than not, it is a combination of multiple issues that causes the page to degrade. Hence, page owners are always encouraged to think of optimization opportunities outside of the reasons why the page has regressed.

Where are we now?

We see about four P0 and P1 regressions every week right now for all of LinkedIn. This is in contrast to over 20 P0 and P1 regressions per week last year. As we get better at resolving P1 and P2 regressions, we hope to see P0 regressions less often.

We also noticed that the regression mechanism has become a catch-all for site speed problems. While we have world-class monitoring systems like EKG, XLNT (A/B testing), and Luminol, they all tend to have some false negatives in order to reduce false positives. Anything these systems miss tends to be caught by our regression process.

Next steps

We want to retain the human touch, but simplify the work that Performance engineers and page owners do to get to the root cause (if any). So we are developing more automation around better root cause detection using multiple data sources to point the investigator in the right direction:

What other known events happened when the page went into regression? E.g., deployments, A/B test ramps, global issues, etc.
Tie together different pieces of data. E.g., make use of server side call fanout data, A/B test analysis, etc.
Correlate different metrics within Real User Monitoring (RUM) to understand root cause better. E.g., page load time increase due to increase in TCP connection time could be due to an issue with a PoP.
Find out which combination of dimensions in the data set might have caused the regression. E.g., a bad browser release may have caused a regression.
Better integration of regression management with how bugs are handled by each team. E.g., assign regressions directly to the on-call engineer on the page owner’s team.

Finally, some teams are instituting their own processes to reduce load further on the Performance engineers.

Conclusion

Slow leaks are a menace. LinkedIn has come up with tools and processes to catch them automatically. Fixing these regressions still needs human involvement, and so it cannot be done by a central team alone. Our process scales because the centralized Performance Team uses its domain knowledge and expertise with site speed and RUM data to triage regressions, while resolving them is distributed across engineering to page owners and their teams.

Acknowledgements

I want to acknowledge Sanjay Dubey, Badri Sridharan, David He, Anant Rao, Haricharan Ramachandra, and Vasanthi Renganathan, who initiated this work with me almost three years back. Also thanks to Brandon Duncan, who helped the Performance Team realize the importance of acting like an owner. And thanks to all the product engineering VPs and Directors who championed site speed along the way. A good culture starts at the top.

This work would not have been possible without significant design and dev work by Steven Pham, Dylan Harris, Ruixuan Hou, Sreedhar Veeravelli, and many others on the Performance Team. Also, thanks to all the performance engineers for putting hours into these regression tickets to help triage them. And finally, thanks to all the page owners for their feedback along the way and for keeping the bar high for site speed at LinkedIn!
