How we reduced latency and cost-to-serve by merging two systems

April 22, 2020

Co-authors: Xiang Zhang, Estella Pham, and Ke Wu

Identity services are critical systems that serve profile and member settings data to help power many other applications at LinkedIn. In this blog post, we’ll share how merging two layers of the identity services, which together handle more than half a million queries per second (QPS), reduced latency by 10% and significantly reduced our annual cost-to-serve. We will also describe the data-driven rationale behind this approach in the context of microservices, the tools we used to make this effort seamless, and the lessons learned from our old design.

Background

LinkedIn uses service-oriented architecture to build systems to deliver various member experiences. Each service abstracts internal domain complexities and exposes functionalities through a well-defined service API. Such abstractions enable evolvability and composability.

Below is a high-level representation of identity services, clients, and downstream services before merging. Client applications call the identity midtier service to get profile and settings data. The midtier service depends on the identity data service, which provides CRUD operations to an Espresso data store. Only a limited logic set is implemented in the identity data service, such as data validation (e.g., data type validation, string length validation, etc.). The midtier invokes other downstream services owned by different teams at LinkedIn to provide important domain-specific data. These downstream services provide spam filtering and blocking, member-to-member networking invitations, member-to-member connections, etc. Based on such information, the midtier implements the business logic to ensure we honor members’ settings, and how they want to interact with LinkedIn and third-party applications.

Architecture of identity services

Motivation

As the footprint of our identity applications expanded, and LinkedIn applications and their features grew, the team started to shift its focus to performance, cost-to-serve, and operational overhead. One example can be found in this article. In addition, we started to re-evaluate some of our assumptions and practices in developing the identity services.

We observed a couple of downsides of keeping the identity data service separate from the identity midtier:

The design of having the data service separate from the midtier turned out to be less valuable than we initially thought. We discovered that most scalability challenges could be addressed at the storage layer, i.e., the Espresso data store. Furthermore, reads and writes to Espresso were in effect passthrough from the identity data service to Espresso.
Maintaining the data service as a standalone service incurred operational overheads and increased code complexity. We provisioned over 1,000 application instances in multiple data centers for it. Furthermore, we had to maintain the API in the data service to provide access to only the midtier. This involved data modeling, evolving the API, and security, to mention a few.
The business logic in the data service was minimal, and most of it involved data validation.
Keeping the midtier and backend services separate also incurred additional network hops for client applications.

With these considerations, we embarked upon the effort to combine the midtier and the data service into a single service while keeping the APIs unchanged. At first glance this is counterintuitive, since we generally follow service-oriented architecture to tackle complexity by breaking big systems into smaller ones. However, we believe there is a right balance to strike in decomposing systems. In the case of identity services, the potential gains in performance, cost-to-serve, and operational overhead outweighed the additional complexity of bundling everything into a single service.

Implementation

Thanks to the microservice architecture we employ at LinkedIn, we were able to merge two services with significant footprints into a single one without disrupting our clients. The plan was to merge the code from the data service into the midtier and enable the midtier to interact directly with the data store, while keeping the midtier’s interface unchanged. One important goal was to maintain feature and performance parity between the new and old architectures. We were also focused on managing the risks that came with merging two critical applications, and on keeping the development cost of the merger to a minimum.

Our implementation was completed in four steps.

Step 1. We considered two approaches to merging the two code bases so that they could run in a single service. The intuitive approach was to copy select code from the data service into the midtier so that the midtier could perform logic such as data validation and interact with the data store directly. While that was the cleanest approach, it required significant upfront development before we could validate the idea. Consequently, we opted for a creative “hack”: using the data service’s REST API as a local library in the midtier. This left us the option to clean up the tech debt once the idea was validated.
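To make the Step 1 idea concrete, here is a minimal sketch of calling a REST-style resource in-process behind the same client interface that previously made a network call. It is an illustration under stated assumptions, not our actual code: every name in it (ProfileResource, RemoteProfileClient, InProcessProfileClient, MidtierService, the headline field, and the length limit) is hypothetical, and a plain dictionary stands in for the data store.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Protocol


class ValidationError(ValueError):
    """Raised when a record fails data validation."""


@dataclass
class ProfileResource:
    """Hypothetical stand-in for the data service resource: validation + storage."""
    store: Dict[int, dict]           # in-memory substitute for the data store
    max_headline_length: int = 120   # illustrative limit only

    def get(self, member_id: int) -> dict:
        return self.store[member_id]

    def update(self, member_id: int, record: dict) -> None:
        headline = record.get("headline")
        if not isinstance(headline, str):
            raise ValidationError("headline must be a string")
        if len(headline) > self.max_headline_length:
            raise ValidationError("headline is too long")
        self.store[member_id] = record


class ProfileClient(Protocol):
    """The interface the midtier codes against in both architectures."""
    def get(self, member_id: int) -> dict: ...
    def update(self, member_id: int, record: dict) -> None: ...


class RemoteProfileClient:
    """Before the merger: every call is a network hop to the data service."""
    def __init__(self, transport: Callable[[str, int, Optional[dict]], Optional[dict]]):
        self._transport = transport  # hypothetical callable(method, member_id, record)

    def get(self, member_id: int) -> dict:
        return self._transport("GET", member_id, None)

    def update(self, member_id: int, record: dict) -> None:
        self._transport("PUT", member_id, record)


class InProcessProfileClient:
    """After Step 1: same interface, but the resource is used as a local library."""
    def __init__(self, resource: ProfileResource):
        self._resource = resource

    def get(self, member_id: int) -> dict:
        return self._resource.get(member_id)

    def update(self, member_id: int, record: dict) -> None:
        self._resource.update(member_id, record)


class MidtierService:
    """The midtier's own interface stays the same; only the injected client changes."""
    def __init__(self, client: ProfileClient):
        self._client = client

    def get_profile(self, member_id: int) -> dict:
        return self._client.get(member_id)

    def update_profile(self, member_id: int, record: dict) -> None:
        self._client.update(member_id, record)
```

Because the midtier depends only on the client interface, swapping RemoteProfileClient for InProcessProfileClient removes the network hop without touching callers, and the validation code runs unchanged.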

Step 2. We gradually ramped the change described in Step 1. At LinkedIn, we have a state-of-the-art A/B testing framework called T-REX. With T-REX, we can create a ramp schedule based on the risk and impact of a change, and generate statistical reports to measure top-tier metrics. This allows us to ramp the change gradually while observing its impact, and gives us a fast rollback capability (within a few minutes) if needed. Since our change to the two critical services was both high-risk and high-impact, we took extra caution with our ramp schedule. We ramped one data center after another, and within each data center, we ramped from small percentages of traffic to larger ones, with enough time in between to generate reports.
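The ramp described in Step 2 can be pictured with a small sketch of percentage-based gating per data center. This is not how T-REX is implemented; it only assumes a common technique (stable hash bucketing), and the function names, the experiment key, and the data center labels are all hypothetical.

```python
import hashlib
from typing import Dict


def bucket(member_id: int, experiment: str) -> int:
    """Map a member to a stable bucket in [0, 100) for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{member_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_ramped(member_id: int, data_center: str, ramp: Dict[str, int],
              experiment: str = "identity-merge") -> bool:
    """Return True if this request should take the new (merged) code path.

    `ramp` maps a data center name to the percentage of traffic ramped there;
    data centers missing from the map are treated as 0% (not ramped).
    """
    return bucket(member_id, experiment) < ramp.get(data_center, 0)


# Hypothetical schedule: one data center at a time, small percentage first.
schedule = [
    {"dc-a": 5},
    {"dc-a": 25},
    {"dc-a": 100},
    {"dc-a": 100, "dc-b": 5},
]

# Rolling back is a configuration change: set every percentage to 0.
rollback = {"dc-a": 0, "dc-b": 0}
```

Because the bucket for a member is stable, a member who is ramped at 5% stays ramped at 25% and 100%, which keeps the treated population consistent between reports.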

Step 3. We decommissioned the data service hosts.

Step 4. Since we took a creative shortcut in Step 1 by embedding the data service code developed for a REST service as a local library, we needed to clean this up because craftsmanship is an important tenet of our culture. We simplified the layers of classes by removing those classes and interfaces that were used to expose Rest.li services, and kept only the essential classes that interact with the data store.

The diagram below shows the difference in the architecture before and after the change.


Architecture of identity services after the merger

Performance analysis

To analyze performance gains and provide an apples-to-apples comparison of the services before and after the merger, we used a mechanism called dark canary. With dark canary, we can copy real read-only production traffic to test hosts and control how and where this happens. For example, we can replicate and multiply read traffic from one production host to a test host. None of this affects the hosts serving production traffic, so we can run performance tests with production traffic without impacting our business. Below is our dark canary setup.

Dark canary setup
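The replication idea can be sketched in a few lines. This is an illustrative model only, not the dark canary implementation: the Request type, the rule that only GET requests count as read-only, and the mirror_traffic function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Request:
    method: str
    path: str


def is_read_only(request: Request) -> bool:
    """Assumption for this sketch: only GET requests are safe to replay."""
    return request.method == "GET"


def mirror_traffic(requests: Iterable[Request],
                   serve_production: Callable[[Request], str],
                   serve_dark: Callable[[Request], str],
                   multiplier: int = 1) -> List[str]:
    """Serve every request from production and replay reads to the dark host.

    Each read-only request is sent to the dark host `multiplier` times. Dark
    responses are discarded and dark failures are swallowed, so the production
    responses returned here never depend on the dark host.
    """
    responses = []
    for request in requests:
        responses.append(serve_production(request))
        if is_read_only(request):
            for _ in range(multiplier):
                try:
                    serve_dark(request)
                except Exception:
                    pass  # a failing test host must not affect production
    return responses
```

In a real deployment the replay would happen asynchronously so that it adds no latency to the production path; the loop above is kept synchronous only for readability.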

Below are two graphs showing the 90th percentile latency of regular production traffic and of dark canary traffic. On average, the 90th percentile latency for calls to fetch profiles dropped from 26.67ms to 24.84ms, a 6.9% reduction.

Latency differences between regular production traffic and dark canary traffic

Tail latency such as p99 is generally very hard to improve, because many factors can affect it. Even so, we saw improvements across all percentiles. In summary, the merger improved p50, p90, and p99 by 14%, 6.9%, and 9.6%, respectively.
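For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute a percentile (the nearest-rank method) and the relative reduction between two latency values. The helper names are ours for illustration, and the nearest-rank definition is an assumption; monitoring systems may use other percentile definitions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# The p90 figures quoted above for calls to fetch profiles.
print(round(relative_reduction(26.67, 24.84), 1))  # 6.9
```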

Memory allocation

To understand the performance characteristics, we analyzed memory allocation rates using data from the GC logs on the three types of hosts involved: the identity midtier dark canary host, the identity midtier production host, and the identity data service hosts. GC logs provide valuable information about object allocation patterns, which usually indicate how performant the application is and how well it uses memory. Below is a diagram showing the memory allocation rate on the production host for the identity midtier, which averages around 350MB/s.

The memory allocation rate in the identity midtier after the merger is about 100MB/s (or 28.6%) lower per host than would have been required to serve the same amount of traffic before the merger. This not only helped improve the performance, but also had a significant impact on the cost-to-serve for identity services, as explained in the next section.
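As a rough illustration of how an allocation rate can be derived from GC data, the sketch below sums the heap growth between consecutive collections and divides by the elapsed time. The GcEvent record is a hypothetical, already-parsed form of a GC log entry; real GC log formats vary by JVM version and collector, and this is not the tooling we used.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GcEvent:
    """Hypothetical parsed GC log entry (heap sizes in MB, time in seconds)."""
    timestamp_s: float
    heap_before_mb: float
    heap_after_mb: float


def allocation_rate_mb_per_s(events: Sequence[GcEvent]) -> float:
    """Average allocation rate across a time-ordered series of GC events.

    Between two collections, the application allocated roughly
    (heap before the later GC) - (heap after the earlier GC).
    """
    if len(events) < 2:
        raise ValueError("need at least two GC events")
    elapsed = events[-1].timestamp_s - events[0].timestamp_s
    if elapsed <= 0:
        raise ValueError("events must span a positive time interval")
    allocated = sum(cur.heap_before_mb - prev.heap_after_mb
                    for prev, cur in zip(events, events[1:]))
    return allocated / elapsed


# The percentage quoted above is consistent with a 100MB/s saving
# on a baseline of about 350MB/s.
print(round(100 / 350 * 100, 1))  # 28.6
```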

Cost-to-serve reduction

Understanding the cost of running each service is a fundamental input to business decisions. At LinkedIn, we use an in-house framework that enables teams to calculate cost-to-serve based on hardware and operational costs. Our team used this framework to measure the impact of the merger. After we decommissioned the entire data service cluster, the physical resources we saved added up to over 12,000 cores and over 13,000 GB of memory, which translated into significant annual savings.

Acknowledgements

Thanks to Josh Abadie and Nick Clifford for supporting the feature ramp on the SRE side; Shun-Xuan Wang for contributing to the implementation of the merger; Winston Zhang for guidance on performance analysis; Szczepan Faber and Priyam Awasthi for their review and feedback on this post; and Sriram Panyam and Bef Ayenew for their support on this project.

Topics: Optimization, Architecture, Open Source, A/B Testing/Experimentation