Rebuilding the Profile Highlights System at LinkedIn
February 27, 2019
LinkedIn’s mission is to connect the world’s professionals to make them more productive and successful. Building an active community of engaged members is a key strategy for achieving our mission.
One of the many ways to drive member conversations and active communities is through profile highlights: insights shared between the viewer of a profile and that profile’s owner. Examples of profile highlights include shared connections, experiences, and education, as well as job postings. Given the importance of this feature, we needed a way to further scale this functionality for engineering teams looking to create new highlights for our members.
In this post, we will touch on why we decided to rebuild the backend system that supports this feature, its technological background, how we re-architected it, and the benefits and successes achieved thus far.
Motivations for rebuilding the system
The initial profile highlights system was built five years ago to support a relatively small number of highlights between two members. Started as a proof of concept, the system was built using a monolithic architecture, where all types of highlights were developed by a single team with limited focus on extensibility and modularity. It worked well for a while. As the company grew to focus on building an active community, new challenges surfaced:
How do we enable fast experimentation and iteration of new types of highlights based on metrics?
How do we enable a distributed development model where each highlight has its own owners and is strictly separated from others? (Such a model would allow other teams to contribute, creating a highlight ecosystem.)
These challenges were not well addressed by the old system. A new highlight developer would have to understand a significant part of the service in order to add a new highlight type. Numerous changes would have to be made to different parts of the system, from the entry point of the service all the way down to the code where downstream service calls are made. These changes include, but are not limited to, adding a method to an interface, adding an implementation of that method, wiring in the new implementation, and creating a LiX control (LinkedIn’s A/B testing infrastructure). All of this applied to test code as well. Consequently, the existing system was prone to bugs, and iteration speed suffered.
The new architecture
Before we dive into the new architecture, we’d like to briefly introduce the relevant technologies.
LinkedIn employs a microservices architecture to deliver most of the member experience. A page view can fan out to a large number of downstream service calls to fetch information, such as profile data, connection information, profile highlights, endorsements, etc., from different services. Each of these calls can further fan out to even more downstream services. For example, the profile highlights service invokes other services to get profile data, shared connections, jobs, etc.
Rest.li is an open source framework for building RESTful services.
ParSeq is an open source library to write asynchronous code in Java. Key features provided by ParSeq include parallelization of asynchronous calls and composition of asynchronous calls.
LoadingCache is a cache implementation provided by Google Guava. It combines caching with the ability to load values when the cache misses. It also handles concurrent cache misses when the loading operation is ongoing.
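The load-on-miss behavior described above can be approximated with the JDK alone. The sketch below is a stdlib analog, not Guava itself: `ConcurrentHashMap.computeIfAbsent` similarly guarantees that the loader runs only once per key, even when concurrent callers miss on the same key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal stdlib analog of Guava's LoadingCache: values are loaded on a
// cache miss, and concurrent misses on the same key run the loader once.
public class LoadOnMissCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    public final AtomicInteger loads = new AtomicInteger();

    public String get(String key) {
        // computeIfAbsent blocks concurrent callers for the same key,
        // mirroring LoadingCache's single-load guarantee.
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return "value-for-" + k; // stand-in for an expensive load
        });
    }
}
```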
The solution we adopted to address the aforementioned challenges was to re-architect the system into a platform using a plug-in architecture, where each type of highlight can be implemented as a plugin that can then be registered and integrated easily and independently, as depicted in the architectural diagram.
The system is divided into three pieces: platform services, highlight plugins, and the plugin interface.
The plugin interface serves as the hooking point between the profile highlights and the core platform services. Implementation-wise, it uses the classic template method design pattern: it provides certain default behaviors while delegating the customizable steps, fetching and assembling individual highlights, to the plugins. Each new type of highlight implements the provided interface to supply its case-specific highlight logic.
Individual highlight plugins
Individual highlight plugins use the plugin interface to integrate with the platform by implementing various methods exposed through the interface. Within the plugin itself, the implementation can vary a lot. Plugins can call different services to fetch different pieces of data and implement complicated logic to compute highlights. They are completely independent of each other by design, and can be under different ownership. The UML diagram below shows the class hierarchy including the plugin interface and example individual plugins.
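In spirit, the hooking point and a plugin might look like the sketch below. All names here are illustrative, not LinkedIn’s actual API: an abstract base class fixes the overall flow (the template method), and each plugin supplies only its own fetch and assembly steps.

```java
import java.util.List;
import java.util.Optional;

// Illustrative sketch (hypothetical names): a template-method plugin
// interface and one example plugin implementing it.
abstract class HighlightPlugin {
    // Template method: the fixed flow shared by every highlight type.
    public final Optional<String> compute(String viewerId, String ownerId) {
        List<String> data = fetchData(viewerId, ownerId); // plugin-specific fetch
        return assemble(data);                            // plugin-specific assembly
    }

    // Steps each highlight type must supply.
    protected abstract List<String> fetchData(String viewerId, String ownerId);
    protected abstract Optional<String> assemble(List<String> data);
}

// Example plugin: shared connections between viewer and profile owner.
class SharedConnectionsPlugin extends HighlightPlugin {
    @Override
    protected List<String> fetchData(String viewerId, String ownerId) {
        // A real plugin would call the connections service here.
        return List.of("member:42", "member:77");
    }

    @Override
    protected Optional<String> assemble(List<String> shared) {
        return shared.isEmpty()
            ? Optional.empty()
            : Optional.of("You share " + shared.size() + " connections");
    }
}
```

Because plugins only ever see the abstract steps, two plugins cannot depend on each other, which is what makes independent ownership possible.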
The platform implements several key components.
Task Manager. The task manager component is responsible for bootstrapping the plugins, creating ParSeq tasks to execute the functionality implemented in individual plugins, and assembling the results. It exempts highlight implementers from dealing with ParSeq task composition.
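The fan-out-and-assemble step the task manager performs can be approximated with the JDK’s `CompletableFuture` standing in for ParSeq tasks. This is a stdlib analog sketch, not the actual ParSeq composition code:

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Stdlib analog of the task manager: run all plugin computations in
// parallel and assemble the non-empty results. (The real system composes
// ParSeq tasks; CompletableFuture stands in here.)
public class TaskManagerSketch {
    public static List<String> runAll(List<Supplier<Optional<String>>> plugins) {
        List<CompletableFuture<Optional<String>>> futures = plugins.stream()
            .map(CompletableFuture::supplyAsync)   // one async task per plugin
            .collect(Collectors.toList());
        return futures.stream()
            .map(CompletableFuture::join)          // wait for all plugins
            .filter(Optional::isPresent)           // drop plugins with no highlight
            .map(Optional::get)
            .collect(Collectors.toList());
    }
}
```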
Decoration Manager. LinkedIn’s internal services are structured based on the microservices architecture. In this architecture, a service is a resource identified by a URN (Uniform Resource Name). Data is normalized such that data provided by one service references data from another service using URNs (similar to foreign keys). The resolution of a URN into the actual data is called "decoration." The decoration manager exposes a service for highlight plugins to decorate URNs by making remote service calls. We will have a dedicated section later to describe this in more detail, particularly from the performance perspective.
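Conceptually, decoration maps a URN to the entity it references. The sketch below is hypothetical (the names and the in-memory map are stand-ins for real Rest.li calls), but it shows the foreign-key-like resolution step:

```java
import java.util.Map;

// Hypothetical sketch of decoration: a URN (a foreign-key-like reference)
// is resolved into the actual entity data owned by another service.
public class DecorationSketch {
    // Stand-in for a remote profiles service keyed by URN.
    private static final Map<String, String> PROFILES_SERVICE =
        Map.of("urn:li:member:42", "{\"firstName\":\"Ada\"}");

    public static String decorate(String urn) {
        // The real decoration manager would issue a remote Rest.li call here.
        return PROFILES_SERVICE.getOrDefault(urn, "{}");
    }
}
```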
Cache Manager. As mentioned earlier, data within LinkedIn is generally identified by a URN. This applies to the profile highlights service as well. When a set of highlights are computed between a profile owner and profile viewer, an ID is created for each highlight and details are stored in a distributed cache, Couchbase, for a configurable period of time.
Exception Manager. The exception manager takes care of any unhandled exceptions in individual highlight plugins. It’s important to isolate failures and handle them gracefully so that one malfunctioning highlight does not bring down the other types of highlights. While highlight developers can handle failures individually, the platform provides another layer of protection. It also imposes an overall timeout so that one highlight plugin with a high timeout setting doesn’t impact the whole request. The exception manager also provides out-of-the-box error metrics for all plugins.
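The isolation described above can be sketched with the JDK alone: a slow or failing plugin degrades to "no highlight" rather than failing the whole request. This is a stdlib analog (using `CompletableFuture.orTimeout`), not the platform’s actual code:

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Stdlib analog of the exception manager: wrap each plugin so that an
// exception or a timeout yields an empty result instead of propagating.
public class ExceptionManagerSketch {
    public static Optional<String> guarded(Supplier<Optional<String>> plugin,
                                           long timeoutMs) {
        return CompletableFuture.supplyAsync(plugin)
            .orTimeout(timeoutMs, TimeUnit.MILLISECONDS) // overall timeout
            .exceptionally(t -> Optional.empty())        // isolate failures
            .join();
    }
}
```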
Performance optimization within Decoration Manager
The Decoration Manager helps highlight plugins invoke remote services to resolve URNs, as discussed earlier. Service calls are expensive, and in a service-oriented world this cost can be exacerbated if a system is not designed with it in mind. The fact that the profile highlights system isolates individual highlight plugins exposes another challenge that must be solved: plugin developers can only optimize performance at the plugin level; there is no way for them to coordinate with other plugins. As a simple example, when a request arrives, the system needs to compute shared educations and shared experiences. For shared educations, it needs to call the profiles service to fetch profile data and the schools service to fetch school information. Meanwhile, shared experiences needs to call the profiles service, too, for the same reason, as well as the organizations service to fetch company information. Both end up calling the same profiles service for the same information, visualized below in red.
This can quickly deteriorate performance and increase the cost to serve as the number of highlights grows. To solve this problem, we built a performance optimization mechanism within the Decoration Manager. The basic idea is to use LoadingCache to cache requests for all downstream calls initiated from a highlight API call. The key of the cache consists of two parts:
The request object itself. The request contains various pieces of information, including the endpoint it’s hitting (/profiles, etc.), the projection (i.e., which fields to retrieve), and the URNs.
A UUID generated when a highlight API call arrives. The idea is to do per-API-call caching instead of cross-API-call caching.
The value of the cache is an asynchronous ParSeq task that encapsulates all the information required to make a service call. The below code snippet shows how a cache like this is defined.
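The sketch below is a stdlib reconstruction of the idea, with `ConcurrentHashMap` and `CompletableFuture` standing in for Guava’s LoadingCache and ParSeq tasks; the "MergeRequestKey" name comes from the description below, and everything else is illustrative rather than the actual definition:

```java
import java.util.Objects;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Stdlib sketch of the per-request task cache. The real system uses a
// Guava LoadingCache whose loader builds ParSeq tasks via
// parSeqRestClient.createTask; this analog keeps only the structure.
public class RequestTaskCache {
    // Cache key: the downstream request plus the UUID of the enclosing
    // highlight API call, so caching is per API call, not across calls.
    static final class MergeRequestKey {
        final String request;   // stand-in for the Rest.li request object
        final UUID apiCallId;
        MergeRequestKey(String request, UUID apiCallId) {
            this.request = request;
            this.apiCallId = apiCallId;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof MergeRequestKey)) return false;
            MergeRequestKey k = (MergeRequestKey) o;
            return request.equals(k.request) && apiCallId.equals(k.apiCallId);
        }
        @Override public int hashCode() { return Objects.hash(request, apiCallId); }
    }

    private final ConcurrentHashMap<MergeRequestKey, CompletableFuture<String>> cache =
        new ConcurrentHashMap<>();
    final AtomicInteger downstreamCalls = new AtomicInteger();

    // Returns the task for this downstream call, creating it at most once
    // per (request, API call) pair; concurrent callers share one task.
    CompletableFuture<String> taskFor(MergeRequestKey key) {
        return cache.computeIfAbsent(key, k -> {
            downstreamCalls.incrementAndGet();
            // Real code: parSeqRestClient.createTask(k.request)
            return CompletableFuture.completedFuture("response-for-" + k.request);
        });
    }
}
```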
In this code snippet, “MergeRequestKey” is the cache key type containing the request object and the UUID. The “createTask” method of “parSeqRestClient” creates the ParSeq tasks that get cached. When individual highlight plugins call downstream services, they always go through this cache to create ParSeq tasks encapsulating the calls. If a task was already created for a particular downstream call, it is reused. When multiple cache requests occur concurrently, LoadingCache handles the waiting gracefully by creating only one ParSeq task. Given that the cache is scoped to a single highlight API request, we set the expiration time to 10 seconds, which is more than enough to finish all the downstream service calls while keeping the cache size under control.
With the new architecture, we achieved better architectural properties:
Understandability. Whoever wants to implement a new type of highlight within the service only needs to understand the interface (hooking point) provided by the core platform services. No other knowledge of the system (such as understanding how the highlights will be instantiated, executed, etc.) is required.
Extensibility. Adding a new highlight will be a confined change without the need to touch any other part of the system. All the developer needs to do is 1) implement the creation of the highlight, and 2) register the highlight using annotations.
Testability. Each highlight becomes individually testable. The developer of a new highlight does not need to modify any existing tests, either.
Reusability. Common functional/non-functional requirements of the system are implemented once and shared by all highlight plugins. Such capabilities include, but are not limited to, timeout management, exception management, default single get implementation, optimization of downstream calls, performance/error metrics, etc.
The performance optimization in the platform has significantly improved latencies and cost to serve compared to the old architecture. The p99, p95, p90, and p50 improved by 29%, 18.6%, 17%, and 7%, respectively, and the cost to serve (including savings from downstream service calls) has been reduced by approximately 50%. We are also set for a much faster iteration of various types of highlights.
This work would not have been possible without the support from Bef Ayenew and Sriram Panyam. Thanks for the early feedback on the design from Vince Liang, Mahesh Vishwanath, Rick Ramirez, and Pratik Daga; contributions to the implementation from Yanhong Yuan and Jingyu Zhu; code reviews from Estella Pham, Kevin Fu, Ke Wu, and Sirish Balaga; and feedback on this blog post from Chris Ng, Anne Trapasso, and Carlos Roman.