Entity Routing Service (ERS) load test evolution
June 2, 2022
LinkedIn is operating today at a scale of nearly 1.6M graph queries per second (QPS), with 830M+ members. At this kind of scale (and seeing roughly 34% year-over-year growth), it can be hard to predict and manage capacity. Beyond being prepared for everyday operations, we also want to ensure LinkedIn could handle a catastrophic issue, such as one data center experiencing critical issues and going offline, which would in turn affect our customers’ and members' experience on the site. This is called High Availability/Disaster Recovery (HA/DR). To address both these challenges, we do load tests every day by moving end user traffic (referred to as a “TrafficShift”) to a specific target data center and then simulating the load scenarios with a specific QPS target. For context, LinkedIn operates as active-active segmented traffic across multiple data centers, and each member has a “home” in a particular data center. During this kind of TrafficShift, we move those members’ homebase from one data center to another. This is referred to as Member Sticky Routing (MSR), and you can read more about it in this blog post. To maintain member segmentation and session consistency, MSR is used as first-level routing at the Apache Traffic Server (ATS level - L1 proxy).
However, for enterprise businesses, we were not able to follow this pattern. An enterprise business has a concept of a “customer entity,” which requires maintaining strong consistency for all users within the same customer scope. Cross data center replication lag is a problem that leads to members or admins seeing stale, inconsistent data.The impact of this lag is greater for enterprise products because more members are operating on the same data set (high cardinality). To mitigate this, the application does second-level routing using Entity Routing Service (ERS), funneling users from the same contract to a specific data center to achieve consistent data access.
ERS has solved the cross-data center replication problem, but introduced a different set of challenges. During a load test, we failout of data center and now we needed to failout of all users from the same contract at the same time to maintain data consistency. This leads to an unpredictable and skewed load. Load testing was put on hold for some areas of our enterprise business. However, without regular load testing, capacity management becomes very unpredictable and prone to errors. We needed a solution that would allow for load testing on a regular basis for our enterprise products; our answer was a staggered approach that not only enabled load-testing, but also optimized and scaled for other areas. In this blog post, we’ll explore our evolution of enterprise load testing in three phases, broken down as “crawl,” “walk,” and “run.”
Phase One: Crawl (Foundation)
The first problem that we needed to address was availability. During ERS traffic shifts, write transactions were historically blocked for several minutes (in the range of 5-7 minutes) in most cases. This was due to the fact that, prior to switching to the target data center, we paused to let the in-flight transactions complete and catch up with cross data center replication. It helped avoid data corruption, but it caused a disruption to the user experience and availability of some critical features like Recruiter Search or Job Posting. We added a banner to indicate within the product that site maintenance was underway, and might result in a disruption, but it still wasn’t an ideal situation for our customers.
It’s worth noting that while MSR traffic shifts also lead to a temporary interruption in data freshness during cross data center replication, the impact is less because it only affects a single member’s own data for a short period.
To solve this problem, we analyzed the historical replication lag across our datastores (a relational Oracle database and LinkedIn’s name-value pair database, Espresso) and observed P99 lag was less than 20 seconds. We ran through each case where we need strong consistency across the data center, as well as where we could potentially relax the policy. We also reviewed multiple options to handle cross data center read-after-write consistency, but ultimately came to the realization that ERS was the best approach. Based on these findings, we decided to adjust the block from the current range of 5-7 minutes to 30 seconds. While write transactions continue to be blocked during that time, the shorter period greatly lessens the impact, enabling us to remove the product banner alert (although error messages can be raised if needed). With careful observation and monitoring, and after tuning the database replication, we further reduced the blocking window to the 20-second range.
Phase Two: Walk (Automation)
Once the replication lag was reduced to less than 20 seconds, it became more pragmatic to start exploring regular load testing. We started load testing with one enterprise product and a gradual ramp. However, we immediately noticed that this complex operation was causing a partial service outage impacting our customers and members. We started analyzing the situation to find the issues and work out how we could do load testing without causing customer or member impact. After our investigation, three broad issues emerged.
The first area where we saw problems was with the detection aspects of our load testing. Our signals weren’t always accurate: sometimes, customers were experiencing negative impacts but we had no alert; other times, we had false positives. It was paramount for us to understand and optimize for the customer and member experience, so we needed to have a clear line of sight into what that really was.
The reason why sometimes all the dependent services looked healthy and no alerts were fired, but customers were experiencing either an incident or partial outage was that, as the diagram shows, if each layer could be slightly above the threshold, but the top level service could still be below the threshold level in the aggregate. For False positives, the issue was that sometimes services would throw alerts when customers weren’t actually seeing issues, thanks to graceful degradation. In other words, the system was able to absorb the pressure without presenting a negative customer experience. However, the false positive alert could lead us to abort or stop load testing.
To resolve the detection issue, we started looking at customer back metrics, which are instrumented at the feature level using real user monitoring (RUM). Additionally, we started scanning the error rate at the L1 level and doing service health scans. With all these adjustments, it remained important to avoid raising false positives, which would pause load testing, while also retaining accurate sensitivity to avoid negative customer or member impact.
As we improved on the detection mechanism, we realized that there was also room for improvement regarding the time spent making decisions on what to do in the event of issues. There were really only three choices that could be made in these instances:
Hold: do not increase load anymore, keep the load at the current level and observe.
Abort: if the issues start appearing at a moderate level, abort the load test and rebalance the traffic across data centers.
Failout: if the issue is really bad in the target data center, we fail out all traffic from the data center to fix the issue.
Previously, when there were any issues during load testing, teams would gather and discuss options and the best possible path forward. This could take 10 minutes on average, all the while customer issues were ongoing. To resolve this, we found a great opportunity to automate the decision making process and trigger mitigation steps. The key is to be able to detect the problem at the stage when the issue is at the lowest level of severity. For this reason, a static threshold-based alert does not work. Instead, we use an error factor with time series data analysis. This helps us to know if the customer experience is declining rapidly or not.
Since the decision making process was now automated, we turned to see if mitigation could also be automated. Sure enough, everything is available as an API at LinkedIn. We found a way to call the load test API and follow the appropriate mitigation steps. Along the way, we keep monitoring sites and validate if the mitigation is really helping. Our hope is that one day, a super complex process such as load testing can be fully automated and run in autopilot mode.
Phase Three: Run (Optimization)
This phase is really about optimizing the solution even further and rolling it out to other lines of business. During load testing, if an incident is severe, we decide to fail out of the target data center to recover. Historically, with the 5-7 minute ERS block, we would always do the MSR failout first, followed by the ERS one. The MSR failout takes about 20 minutes typically, and during this time, enterprise customers were being affected. We analyzed all dependencies and came up with a plan to simultaneously failout traffic for MSR and ERS in parallel, to solve this problem. This helped our enterprise businesses recover sooner from prolonged impact.
Another optimization opportunity is around the velocity of failout. ERS failout is pretty fast (less than a minute) but for MSR and other lines of business, it may take longer. This is primarily due to the step-by-step nature of these failouts. This can be improved by failing out 50-70% as the first step, and then slowly moving the remaining traffic. Because a data center can hold 50% of load easily, there is little risk in moving big and fast as a first step. We’re still working on testing out this process and model, but are excited to continue working on the idea.
We recognized why we need to do regular load testing for ERS based applications and why this was never possible with the existing setup. In the crawl phase, we focused on the foundation by solving key problems and starting the load testing. This was achieved by reducing the replication lag between data centers while maintaining data consistency. Once we started doing regular load testing, we realized there was too much toil and a lot of manual coordination causing a bigger impact for customers and members. In the walk phase, we invested in automation to reduce toil, automatically make decisions, and focus on running ERS load testing in autopilot mode. Lastly, in the run phase, we looked at optimization. Since load testing pushes the limit, sometimes we run into issues that are expected but the key is to minimize the impact so, we focused on fast failout in case of issues.
As a future extension of the project, we are going to explore how much load the product line can absorb. This would be a paradigm shift; right now, with constant changes and new code being deployed, what we certified yesterday may not be valid today. Being able to predict this would be a great opportunity, enabling us to know in advance how much load the site supports, without needing to do daily load testing.
There are a few takeaways and learnings from this evolution: 1) challenge yourself—do not accept how things are running and believe in ways it could be better, 2) focus on quick wins (foundation) that can help get traction first, and 3) rally the team for betterment through automation and optimization.