Improving Resiliency and Stability of a Large-scale Monolithic API Service
November 28, 2017
How do you increase the resiliency and stability of a monolithic API service that is used by three different platforms, serving 500+ million members, developed by over 400 engineers, deployed three times per day, and consuming almost 300 downstream services?
The API layer service used by LinkedIn.com and the LinkedIn mobile applications is one such service, and in this post, we want to share with you how we increased its resiliency and stability.
At LinkedIn, a new API service was born out of the need to address the new requirements brought in by the company’s new "mobile-first" strategy. We named the new API service Voyager-API. As a replacement for its predecessor, Voyager-API had to be developed rapidly to be able to catch up with the live service and become feature equivalent, and then carry this momentum into becoming the next generation API for LinkedIn.
Two of the design principles we followed for Voyager-API are:
Cross-platform: All platforms, mobile and web, use the same API service. Furthermore, they use the same endpoints in the API for the same features.
All-encompassing: One API service serves all product verticals, where a vertical is a set of related features owned by a dedicated team.
These two principles enabled a high level of reuse both across platforms and across verticals. Reuse of code across verticals reduced the the size of the software artifacts (.jar or .war files) built for the service. Reuse of endpoints and data-schema definitions across platforms improved the engineers' ability to collaborate. Additionally, having an architecturally monolithic service simplified deployment and lifecycle management by making it unnecessary to coordinate versions between multiple platforms and features. Taken together, the benefits allowed more than 400 engineers to keep a fast rate of iteration in this new API service project.
We knew that one day soon Voyager-API’s growing size would start to offset the benefits we reaped from the above design principles. We waited until the problems could be more clearly identified, and then started analyzing alternative solutions.
Problems appeared at scale
Scale creates new engineering problems. Voyager-API grew rapidly in terms of code and traffic. Within one year, hundreds of endpoints were added, supporting a diverse set of features. During Voyager-API’s development, our user base grew by more than 25%, accompanying a comparable increase in daily active users.
Since Voyager-API was a monolithic service, all deployed instances were serving every endpoint it had. That meant that any misbehaving endpoint could increase the load on all instances, thereby increasing latencies for other endpoints or even clogging the whole API service. A bug in the code or increased latency of a downstream service is all it takes for an endpoint to misbehave. In a service constituted by hundreds of endpoints and developed by hundreds of engineers, this becomes a not-so-infrequent event.
This resulted in frequent degraded experiences for our members, including partial outages. Deployments were intermittently put on hold as production issues threatened the stability of the site. Panels of engineers were frequently engaged to triage and fix issues.
As a first response, we tried a couple of approaches.
Horizontal scaling: We tried scaling our service horizontally, in an attempt to be more resilient to downstream backpressure. In fact, we scaled our service to unprecedented numbers. This worked to an extent—with more hosts available, the application was able to survive downstream degradations that would have killed it before. However, this approach consumed large amounts of hardware, and proved to be non-scalable.
Timeouts: We set timeouts for inbound requests to the service. However, downstream services did not have SLAs, and some of the downstreams had big variations in their response times. As a result, we had to raise some of the timeouts to a level that diminished their value. In a sense, this only served as a temporary fix.
Optimizations: We tried to optimize various aspects of the service. We experimented with methods to reduce the memory footprint of the application, and fine-tuned the memory profile in deployment context. We also did our best to fine-tune the garbage collector behavior of our JVM processes. While this helped, it was insufficient on its own to solve the larger problem.
The problem with Voyager-API was that we over-applied the second design principle, which asked for a monolithic application to improve reuse across verticals and increase iteration speed: we extended this principle to deployment architecture.
Breaking Voyager-API down into multiple smaller services would mean giving up on this design principle, and its already-observed benefits. It would have been a very big project, would have required keeping engineering resources away from development of new features, might have decreased developer productivity, and might have required new toolsets to be developed. For these reasons, a search for a less risky solution was well justified.
The approach we ended up taking is what we call "multi-clustering."
What is multi-clustering?
Multi-clustering means partitioning the set of the endpoints of the service without breaking code, and then routing the traffic for each partition to a dedicated set of servers, called a “cluster.” Each server runs an exact copy of the monolithic application.
The below diagram describes multi-clustering in more meaningful way.
How was multi-clustering achieved?
We started by partitioning the endpoints of our service. This was relatively easy to do, because endpoints in the service were already divided into separate ownership groups called "verticals." Then, using the data collected by our monitoring systems, we identified the verticals that had enough traffic to justify separating them.
Then, we laid out the plans. We prepared a runbook that described each step we would take for each vertical. We also prepared a schedule skeleton that we instantiated for each vertical, including tasks for infrastructure team, SREs, and vertical team. These allowed us to inform each vertical team and set a pace with all stakeholders.
For each vertical, we started by modifying our build system to create an additional deployable named after the vertical it would serve. This dedicated deployable also had its own configuration that inherited from and extended the shared configuration for the service.
In parallel, we started examining the traffic being received by the vertical’s endpoints in order to make an estimation of the number of servers that would be needed in the new cluster. Our calculation was simple:
VERTICAL_MAX_QPS: Maximum number of queries served by a vertical during a colo failover
EXTRA_CAP: A multiplier to account for additional capacity needed for downstream anomalies. Taken as 1.20, thereby increasing capacity by 20%.
HOST_MAX_QPS: Maximum QPS per host our service can serve (across all endpoints)
Initial cluster size = VERTICAL_MAX_QPS * EXTRA_CAP / HOST_MAX_QPS
We considered this estimation as an intelligent risk, to be mitigated by our ability to ramp traffic to the new deployables in small steps, and being able to closely observe QPS capacity, latencies, and error rates through our monitoring infrastructure for each vertical. In fact, it later turned out that the max QPS our service can serve for individual verticals varies between 30-300% of this observation! In the end, we were able to account for this variability in our process for determining final cluster size without any service disruptions.
Once we found the estimated size of the cluster, we put in our request for server resources. LinkedIn's infrastructure tooling made it possible for us to quickly set up the new servers for the new deployables. Once the new servers were ready, we were able to move forward with deployment.
After deployment, we were ready to route traffic to the cluster. There are two traffic sources for API layer services at LinkedIn: traffic between our services that uses Rest.li D2 protocol, and HTTP(S) traffic from the internet through our (reverse) proxies in the traffic routing layer. Specific to our service (since it's an API service serving frontends), the D2 traffic was much smaller compared to the HTTP traffic. This was a useful coincidence, because although our traffic systems allowed us to determine which percentage of the traffic would be routed to which deployable, we were unable to do this kind of traffic-shaping using the fully distributed Rest.li D2. Therefore, we first ramped our new deployables to share the D2 load for the vertical endpoints with the existing service, then stopped the existing service from taking D2 traffic, and finally ramped the HTTP traffic in steps.
While ramping the HTTP traffic, we did capacity testing. When we had enough traffic to overload at least three servers, we slowly took down servers, while observing the latencies and error rates. This process told us how many QPS each server was capable of processing without incidents for the set of endpoints we were separating. In turn, we used this observation as a basis for capacity planning, and it allowed us to fine-tune our resource allocation. It was part of our policy not to ramp traffic routing to 100% before this tuning of resources.
Since this is a living system, all of these changes happened in parallel to any development and feature ramps that vertical teams were working on. Hence, every step was carefully communicated to all stakeholders. That allowed vertical teams to be able to take the status of the transition from old service deployables to new ones when investigating problems they faced.
The most immediate impact of multi-clustering has been on resiliency. As we expected, we are now able to limit downstream failures or bugs in one vertical of the product, avoiding having them cascade to other verticals. Hence, we achieved our most important goal for this project. We have observed instances where one cluster has gone down due to backend issues, but linkedin.com was able to stay up.
In addition to improving resiliency, it has become possible to tune each cluster, which only serves the set of endpoints that belong to a single vertical, independently from others. The increased deployment granularity made the following possible:
We are able to do capacity planning much better. Separation of clusters allowed us to stress test each vertical separately, which was not possible before. We discovered considerable variation between per-host response capacity for different verticals. These measurements allowed us to fine-tune the resources for each vertical, and informed the engineers about whether and where they should improve their code.
We can monitor the service in a much better way. The logs for each vertical are separated, making it easier for the vertical teams to locate relevant information in the logs. Our regression monitoring system, called EKG, works by comparing performance between the previous and new deployment versions. EKG can now run in a per-vertical manner. Running EKG in this way increases the intelligibility of its results, and solves the problem of high QPS end-points in some verticals preventing regressions from being noticed in lower-QPS endpoints in other verticals.
We have more control over deployment architecture. The average downstream service fanout for the newly-created clusters is reduced to 35% of the fanout for the single-cluster deployment, improving stability and decreasing the statistical likelihood of service failure. Clustering based on verticals improves our ability to fine-tune security settings in our backend servers. We are also able to improve our backend resiliency by fine-tuning incoming query quotas for our backend servers, taking advantage of higher granularity in the Voyager-API deployment.
Voyager-API was built as a monolithic application to serve all platforms for LinkedIn's flagship product. Voyager-API's monolithic structure improved cross-platform collaboration and cross-vertical reuse, and hence supported faster iterations. However, with growing scale, its deployment as a monolithic application created risks of cascading failures across the different verticals it served.
In order to improve its resiliency and stability, we took an approach we call "multi-clustering": we partitioned the set of endpoints along product verticals, and deployed a separate copy of the whole application for each partition, so as to not forgo the observed benefits of developing a monolithic app. This correlated nicely with the organizational structure as well, as each of these partitions were owned by a separate team.
As a result of our multi-clustering effort, we were able to reach our objective of increasing resiliency and stability. In addition, we improved our capacity planning, deployment fine-tuning, and developer productivity.
Along with the authors of the document, we would like to acknowledge sincere thanks to Diego Buthay, Theodore Ni, Aditya Modi, Felipe Salum, Jingjing Sun, Maheswaran Veluchamy, and Anthony Miller, who also worked on the gigantic multi-clustering effort.