Eliminating toil with fully automated load testing
December 6, 2019
In 2013, when LinkedIn moved to multiple data centers across the globe, we needed a way to redirect traffic from one data center to another in order to mitigate potential member impact in the event of a disturbance to our services. This need led to the birth of one of the most important pieces of engineering at LinkedIn, called TrafficShift. It provides the ability to move live production traffic from one data center to another in an effortless manner.
As we evolved with new services and saw exponential growth in traffic, keeping the site up remained critical in order to serve our members. It is our job as SREs to ensure the member experience is consistent and reliable. To do that, we need to make certain that our data centers can handle the growing demand, while simultaneously being prepared to deal with an unexpected disaster scenario. Therefore, we incorporated load testing into our daily operations work so that we could take a proactive approach to identifying our capacity limits. Load testing not only helps us identify the maximum operating capacity of our services, but also highlights bottlenecks in our process and helps us determine if any services are degrading.
To provide a bit of context, load testing is the practice of targeting a server with simulated HTTP traffic in order to measure capacity and performance of the system. At LinkedIn, we achieve the same effect by targeting our data centers with redirected live user traffic from other LinkedIn data centers to help us identify the maximum queries per second (QPS) a data center can handle. In this blog, we’ll discuss the evolution of our manual load testing process and our journey to fully automating that process. We’ll also review the major challenges we faced while trying to achieve automation, which eventually helped our Site Operations team work more efficiently and save hours from their days.
LinkedIn traffic routing
Whenever a member navigates to https://www.linkedin.com in their browser, they connect to one of our PoPs (Point of Presence) using GeoDNS. These PoPs are the link between members and our data centers. If you’re not familiar with PoPs, think of them as miniature data centers consisting of IP virtual servers (IPVS) and Apache Traffic Servers (ATS) that act as the proxy between our members and data centers.
Figure 1: Stickyrouting and LinkedIn traffic routing architecture
From there, our Stickyrouting service assigns a primary and a secondary data center to each member. This assignment takes place using a Hadoop job that runs at a regular interval and assigns each member a primary and secondary data center based on the geographic distance between the member and each data center, while also considering the capacity constraints of each data center. When a member's primary data center is offline, the secondary data center is used in order to ensure there is no member impact.
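A toy sketch of how such an assignment might work is shown below. The function names, the simplified distance model, and the capacity accounting are all illustrative assumptions, not LinkedIn's actual Hadoop job:

```python
# Hypothetical sketch of a Stickyrouting-style assignment: pick the two
# closest data centers for each member, skipping any that are full.
# Names and the distance model are illustrative, not LinkedIn's code.

def assign_data_centers(member_location, data_centers, capacity):
    """Return (primary, secondary) data center names for a member.

    data_centers: dict of name -> (lat, lon)
    capacity: dict of name -> remaining member slots
    """
    def distance(a, b):
        # Simplified squared Euclidean distance on coordinates; a real
        # job would use geographic (great-circle) distance.
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    # Rank only the data centers that still have spare capacity.
    ranked = sorted(
        (name for name, slots in capacity.items() if slots > 0),
        key=lambda name: distance(member_location, data_centers[name]),
    )
    if len(ranked) < 2:
        raise RuntimeError("not enough data centers with spare capacity")
    primary, secondary = ranked[0], ranked[1]
    capacity[primary] -= 1  # the member consumes a slot in their primary
    return primary, secondary
```

A real batch job would do this for every member at once and balance capacity globally rather than greedily, but the core trade-off (distance versus capacity) is the same.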
A special cookie, set by the Apache Traffic Servers (ATS), is used to route the member to their primary colo. It contains routing information that indicates a "bucket" within a data center; buckets are subpartitions of a Stickyrouting partition, and each member is assigned a bucket in a data center. If the cookie is expired or missing, or if the data center the cookie points to is itself offline, the Stickyrouting plugin queries the backend to determine which data center the member should be redirected to. Stickyrouting is therefore an important service: it manages the mapping between members and data centers.
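The routing decision just described can be sketched roughly as follows. All names here are hypothetical stand-ins for the ATS plugin and the Stickyrouting backend:

```python
# Illustrative sketch of the cookie-based routing decision described
# above. The function and parameter names are hypothetical.

def route_member(cookie, online_data_centers, fetch_assignment):
    """Decide which data center and bucket should serve this request.

    cookie: dict with "data_center" and "bucket" keys, or None.
    online_data_centers: set of data centers currently taking traffic.
    fetch_assignment: callback to the Stickyrouting backend that
        returns a fresh (data_center, bucket) pair for the member.
    """
    if cookie and cookie.get("data_center") in online_data_centers:
        # Fast path: a valid cookie routes the member straight to the
        # bucket inside their primary data center.
        return cookie["data_center"], cookie["bucket"]
    # Cookie missing, expired, or pointing at an offline data center:
    # ask the Stickyrouting backend for the current assignment.
    return fetch_assignment()
```

The important property for load testing is the fallback path: when a bucket's data center is marked offline, members in that bucket are transparently re-routed rather than seeing errors.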
Manual load test approach
As mentioned earlier, load testing for us is a daily practice and, for a long time, was a rather manual process. It required setting a predefined amount of production traffic as the target queries per second (QPS) for the load test and then manually figuring out how much QPS needed to be moved from other data centers to the current data center.
This was done by marking the corresponding Stickyrouting buckets in other data centers as offline and using Stickyrouting in a controlled way to ensure we didn’t cause a disturbance to the member experience.
However, before any of this comes into the picture, we defined a target QPS for the load test based on historic peak traffic trends and business initiatives planned for the future. We also defined another target for the total QPS while the test was running, to account for live traffic that's not part of the load test. To expand on that concept: if x is our final QPS, we generally set the load test target slightly lower, at roughly x - y QPS, where y is a threshold that accommodates increases in live traffic.
We also had a high watermark set, where we encouraged service owners to plan for some extra amount of QPS over the actual load test target. This gave us leverage to anticipate any upticks in site traffic that could happen during the load test.
Furthermore, the engineer defines additional parameters, such as:
Bucket groups that define the number of buckets that will be offlined in one jump to reach target QPS,
Group intervals that define the time to wait after we jump to target QPS,
Bucket intervals that define the time interval between each action being performed on a bucket, and
Duration of the load test, which specifies how long the load test should run.
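Collected together, these parameters might look like the following configuration object. The field names and values are illustrative assumptions, not LinkedIn's actual tooling:

```python
# Hypothetical configuration holding the load test parameters an
# engineer used to choose by hand. Names and values are illustrative.
from dataclasses import dataclass

@dataclass
class LoadTestConfig:
    target_qps: int          # QPS the data center under test should reach
    bucket_group_size: int   # buckets offlined elsewhere in one jump
    group_interval_s: int    # seconds to wait after each jump
    bucket_interval_s: int   # seconds between actions on individual buckets
    duration_s: int          # how long to hold the target QPS

config = LoadTestConfig(
    target_qps=95_000,
    bucket_group_size=10,
    group_interval_s=300,
    bucket_interval_s=30,
    duration_s=3_600,
)
```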
After the target QPS was reached, the engineer manually took control to reach the final QPS and then sustained it for the duration of the load test.
During this time, the engineer would keep an eye on our monitoring dashboards to understand traffic levels and increase or decrease traffic flow to reach the target QPS. The engineer was also responsible for reviewing internal channels in the event that a service owner raised concerns, tracking any error notes, and understanding the latency of the overall site. In the event of an escalation, the engineer was also in charge of connecting with relevant SREs to review a potential issue.
Given all of this was done with live traffic on a daily basis, it could be a nerve-racking experience and required a lot of manual effort. To address this, we explored how we could automate the load test process and save engineers from the 2-3 hours every day that a manual load test requires.
Load test automation and challenges
We decided to tackle this problem in three stages:
Stage one: ramp to 75% of the load test QPS.
Stage two: ramp to 90% of the load test QPS.
Stage three: reach our load test target QPS.
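The stage targets above are simple percentages of the final figure; a tiny illustrative helper (the function name is an assumption, not our actual code):

```python
# The three ramp stages above, expressed as a hypothetical schedule.
def stage_targets(load_test_qps):
    """Return the QPS target for each of the three ramp stages."""
    return [
        round(load_test_qps * 0.75),  # stage one: fast ramp to 75%
        round(load_test_qps * 0.90),  # stage two: fast ramp to 90%
        load_test_qps,                # stage three: careful ramp to target
    ]
```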
The idea was that the first two stages would ramp quickly, and we'd need to spend more time carefully ramping traffic in the last stage to reach the target. Given that our automation calculates the number of buckets to offline in the other data centers based on logic already defined, our assumption that the first two stages would ramp quickly proved to be true.
It’s important to note we placed a sleep interval after the stage one and two ramps so that we could get an accurate report from our monitoring systems. We also considered the primary incoming traffic on the particular data center being load tested at these two stages to inform our decision to increase or decrease traffic.
Once we completed stage two, we started trending in a more measured manner toward the target. In addition to having a high watermark of the target QPS plus the threshold QPS, we introduced a lower watermark, which was the same threshold subtracted from the target QPS. This gave us a window within which traffic had to land in order to call the load test successful for a data center.
Figure 2: A fully-automated load test being performed
Therefore, with our focus on these two watermarks, we came up with two small step functions: one to increase traffic and one to decrease it, both through the movement of buckets. Stage three used these two functions, checking the current traffic against the watermarks to decide which one to apply.
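The stage-three control loop might look like the sketch below. The bucket-movement and monitoring callbacks are hypothetical stand-ins for the real APIs, and the one-bucket step size is an assumption:

```python
# Hypothetical stage-three control loop: hold traffic inside the
# [target - threshold, target + threshold] window by offlining or
# onlining small groups of buckets in the other data centers.
# current_qps / offline_buckets / online_buckets stand in for real APIs.

def hold_target(target, threshold, current_qps, offline_buckets,
                online_buckets, steps):
    """Run `steps` control iterations; return the action taken each time."""
    actions = []
    low, high = target - threshold, target + threshold
    for _ in range(steps):
        qps = current_qps()
        if qps < low:
            offline_buckets(1)      # small step up: pull more traffic in
            actions.append("up")
        elif qps > high:
            online_buckets(1)       # small step down: give traffic back
            actions.append("down")
        else:
            actions.append("hold")  # inside the window: on target
    return actions
```

In practice each iteration would also sleep long enough for the monitoring pipeline to reflect the previous bucket move before reading QPS again.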
While trying to reach stage three, we needed to address the delay in our monitoring pipeline, which caused our engineers to make less accurate decisions while shifting traffic from one data center to another. The delay meant we were relying on outdated traffic metrics, and it couldn't be fixed at the source itself.
To fetch the most accurate traffic QPS possible, we decided to query the pipeline more frequently, and as we homed in on fetching the most accurate data, we eventually achieved precise step algorithms.
It was also important to make sure we could control and intervene at any point in the process. Therefore, we built functionalities such as “pause,” where an engineer could look into an alert or a concern from the SRE service owner, or “terminate,” which we could execute at any point in the load test.
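One simple way to wire up such controls is a pair of flags checked between control-loop iterations. This is an illustrative sketch using Python's `threading.Event`; the class and method names are assumptions, not our actual tooling:

```python
# Illustrative pause/terminate hooks for a load-test runner: the
# control loop calls step_allowed() before each iteration. Names are
# hypothetical; threading.Event is one simple way to wire this up.
import threading

class LoadTestControl:
    def __init__(self):
        self.pause = threading.Event()      # set while an engineer investigates
        self.terminate = threading.Event()  # set to abort the load test

    def step_allowed(self):
        """Check before each control-loop step; blocks while paused."""
        while not self.terminate.is_set():
            if not self.pause.is_set():
                return True  # neither paused nor terminated: proceed
            # Paused: wait briefly, then re-check so that "terminate"
            # can still interrupt a pause.
            self.terminate.wait(timeout=0.1)
        return False  # terminated: stop the load test immediately
```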
Load testing for any organization, especially for those at such a large scale, can be a daunting task for an engineer, but it’s an essential part of our daily routine in assessing our ability to serve our members. As SREs, we strive to eliminate toil as much as we can, which not only helps the organization but also increases the productivity of our engineers and teams. This is why automating the load test process became an essential part of how we operate.
Editor’s note: In case you missed the news, we’ve begun a multi-year journey to the public cloud with Microsoft Azure. Read more about our journey here.