How LinkedIn Adopted A GraphQL Architecture for Product Development
April 25, 2023
With the widespread adoption of Rest.li since its inception in 2013, LinkedIn has built thousands of microservices to enable the exchange of data with our engineers and our external partners. Though this microservice architecture has worked out really well for our API engineers, when our clients need to fetch data they find themselves talking to several of these microservices. Over time, this has become a challenge as every client is faced with a few key issues:
Figuring out which microservice serves the right data and making several round trips to fetch the data they need.
Addressing partial failures and resilience issues due to multiple network calls in the distributed microservices architecture.
Dealing with inefficiencies due to duplicated downstream calls while resolving a tree of nodes.
In our previous blog post on GraphQL, we explained how LinkedIn uses GraphQL to expedite the process of onboarding new use-cases for external API partners. In this blog post, we will cover how the GraphQL layer is architected for use by our internal engineers to build member and customer facing applications. Specifically, we will dive into some of the architectural choices that are unique to LinkedIn and why we chose each one of them.
Advent of Deco, An In-house Solution
To address the client-side issues mentioned earlier, LinkedIn created an internal library called Deco, which allowed our client engineers to express the data they wanted using a proprietary query language and let Deco figure out an efficient way to fetch the requested data. Since it was built as a library, several of our mid-tier services adopted Deco to hand off the complex logic of fanning out and fetching data from downstream services. Though Deco addressed the functional gaps mentioned above, its usability problems became pronounced when LinkedIn adopted the technology on our frontend stack (commonly referred to as “backend-for-frontend” or “BFF”), where hundreds of our frontend client engineers started using the proprietary query language to express their application data needs.
We started noticing some real shortcomings of Deco, which had a direct impact on our productivity. Because Deco is schema-less, client queries and response data couldn’t be validated, which caused unexpected issues in production. On the developer experience front, Deco’s cryptic query language made writing, testing, and maintaining queries hard, especially with no developer tools to assist our engineers during development. While Deco was growing in application and use, we saw an alternative, GraphQL, emerging in the industry. GraphQL had already become public and the industry was showing widespread adoption. We had two choices in hand: invest more in Deco to address the pressing issues, or adopt GraphQL and build the necessary infrastructure around it for our needs.
When choosing GraphQL as a technology, we took sufficient time to understand how GraphQL would fit into our current tech stack, and the best way to adopt it without setting aside a lot of infrastructure we had invested in and built over time. We also made sure GraphQL wasn’t solving a specific problem for a specific team, but rather introducing a paradigm shift that would boost productivity across the company. After careful consideration and discussions with several partner teams, we decided to adopt GraphQL at LinkedIn.
Moving to GraphQL was a huge initiative that changed the development workflow for thousands of engineers who work tirelessly building products at LinkedIn. To keep the core team’s focus clear and the logistics under our control, we decided to start with a smaller scope of only enabling read operations through GraphQL and target only our frontend stack. Taking this approach helped us reduce distractions while we built the infrastructure for all of our future needs at LinkedIn.
The GraphQL architecture we use at LinkedIn is unique due to some of the decisions we took during the design phase. Here are some of the primary differences compared to the widely used architecture in the industry:
The GraphQL type system used at LinkedIn is completely autogenerated.
The GraphQL query execution endpoint is distributed and available on each individual frontend microservice.
Only pre-registered queries are allowed for execution on our production servers.
In the next few sections we will go over why we took this approach while also introducing the different components that enabled us to adopt GraphQL with this architecture.
Type System Generator
The GraphQL type system used at LinkedIn is autogenerated by federating individual entity schemas from the Rest.li world. The goal is that, even after the introduction of GraphQL, our API engineers continue building and operating microservices as before. As briefly mentioned before, LinkedIn has a well established process in place to build and operate microservices in an efficient manner, and we didn’t want to introduce another layer of schema (authoring, reviewing, etc.) specifically for GraphQL. Given that our Rest.li entity schemas already go through a review process, a standardized way of federating these Rest.li schemas is sufficient to keep our GraphQL schemas manageable and easy to generate.
Our custom schema generator takes these individual Rest.li resource definitions, their entity schemas and their relationship definitions as input and generates a federated GraphQL type system.
Figure 1: GraphQL schema federation from Rest.li artifacts
Let’s take a very simple example with two resources to demonstrate this federation process:
Figure 2: Code block examples for the GraphQL federation process
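As a rough illustration of the federation step described above, here is a minimal Python sketch. It is not LinkedIn's generator: the `EntitySchema` and `Relationship` classes, the `generate_type_system` function, the field names, and the relationship format are all hypothetical, chosen only to show how two entity schemas plus one relationship definition could be stitched into a single type system.

```python
from dataclasses import dataclass


@dataclass
class EntitySchema:
    name: str               # GraphQL type name, e.g. "Member"
    fields: dict            # field name -> GraphQL type (hypothetical fields)


@dataclass
class Relationship:
    source: str             # type holding the foreign key
    key_field: str          # foreign-key field on the source type
    target: str             # type the key resolves to
    exposed_as: str         # name of the generated object field


def generate_type_system(entities, relationships, finders):
    """Return GraphQL SDL for one federated schema (illustrative only)."""
    blocks = []
    for entity in entities:
        lines = [f"  {name}: {gql_type}" for name, gql_type in entity.fields.items()]
        for rel in relationships:
            if rel.source != entity.name:
                continue
            if rel.key_field not in entity.fields:
                raise ValueError(f"{entity.name} has no field {rel.key_field}")
            # The relationship turns a plain id field into a traversable edge.
            lines.append(f"  {rel.exposed_as}: {rel.target}")
        blocks.append(f"type {entity.name} {{\n" + "\n".join(lines) + "\n}")
    # One top-level lookup per resource, so clients see a single connected graph.
    query_lines = [f"  {op}(id: ID!): {type_name}" for op, type_name in finders.items()]
    blocks.append("type Query {\n" + "\n".join(query_lines) + "\n}")
    return "\n\n".join(blocks)


member = EntitySchema("Member", {"id": "ID!", "firstName": "String", "companyId": "ID"})
company = EntitySchema("Company", {"id": "ID!", "name": "String"})
works_at = Relationship("Member", "companyId", "Company", "company")

sdl = generate_type_system(
    [member, company], [works_at], {"member": "Member", "company": "Company"}
)
print(sdl)
```

In this sketch the `Member` type gains a `company: Company` field purely from the relationship definition, which is the part of the process that lets API engineers keep authoring only their own entity schemas.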
In addition to the current review process we have on individual Rest.li schemas, we built these two additional components to keep the generated GraphQL type system under check:
Backward compatibility checker for the GraphQL type system to prevent unexpected breaking changes from being released to our production servers.
Custom lint rules on Rest.li entity schemas to prevent inconsistencies in the generated GraphQL type system.
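To make the first of these components concrete, here is a minimal sketch of a backward compatibility check, assuming each version of the type system can be reduced to a mapping of type names to field names and field types. The function name and the representation are assumptions for illustration, and the rule is deliberately conservative: it treats every field type change as breaking.

```python
def find_breaking_changes(old_types, new_types):
    """Compare two type systems given as {type_name: {field_name: field_type}}.

    Returns human-readable descriptions of changes that could break
    existing queries. Additions are always allowed.
    """
    problems = []
    for type_name, old_fields in old_types.items():
        new_fields = new_types.get(type_name)
        if new_fields is None:
            problems.append(f"type removed: {type_name}")
            continue
        for field_name, old_type in old_fields.items():
            if field_name not in new_fields:
                problems.append(f"field removed: {type_name}.{field_name}")
            elif new_fields[field_name] != old_type:
                # Conservative assumption: any type change is flagged.
                problems.append(
                    f"field type changed: {type_name}.{field_name} "
                    f"{old_type} -> {new_fields[field_name]}"
                )
    return problems


released = {"Member": {"id": "ID!", "firstName": "String"}}
candidate = {
    "Member": {"id": "ID!"},                      # firstName was dropped
    "Company": {"id": "ID!", "name": "String"},   # new type, not a problem
}
breaking = find_breaking_changes(released, candidate)
print(breaking)
```

A release pipeline could fail the build whenever the returned list is non-empty.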
My colleagues Min Chen and Karthik Balasubramanian gave a talk on how Rest.li’s Entity Relationship modeling framework helps with the federation and generating the GraphQL type system. We highly recommend this talk if you are interested in this space.
Query Execution Endpoint
The GraphQL query execution endpoint is implemented as a Rest.li resource and all our frontend microservices now include this resource by default. The GraphQL query execution endpoint is co-located with other Rest.li resource endpoints in that microservice. Unlike what is typically seen in the industry, we chose to not spin up a new service layer that will solely act as a GraphQL gateway sitting in front of our existing frontend microservices. This was done for two reasons:
Prevent an additional network hop between the client and the frontend servers and all the associated costs that come with it.
Avoid building and operating a ‘central’ gateway service for all frontend GraphQL requests, which creates a potential single point of failure.
This architectural design has both pros and cons and we took them all into consideration during the decision making process.
Our query execution engine is built on top of graphql-java, the open-source Java implementation of the official GraphQL spec. In addition to auto-generating the type system, the field data fetchers used at runtime are written once and re-used by all our GraphQL instances. These data fetchers are written in a way to use the metadata available on our type system to self-configure and wire appropriately during the service startup. The same applies to data loaders which are used for batching and de-duplicating downstream requests from the engine. Both of these features make it extremely simple for our API engineers to expose their existing Rest.li resources through GraphQL.
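The batching and de-duplication performed by data loaders can be sketched in a few lines. This is a simplified Python illustration of the general data loader pattern, not graphql-java's DataLoader API and not LinkedIn's implementation; the `BatchLoader` class and the `fetch_companies` stub are hypothetical.

```python
class BatchLoader:
    """Collects keys during query resolution and fetches them in one batch."""

    def __init__(self, batch_fetch):
        self._batch_fetch = batch_fetch  # callable: list of keys -> {key: value}
        self._pending = []               # unique keys waiting for dispatch
        self._cache = {}                 # key -> value already fetched

    def load(self, key):
        """Queue a key and return a callable that yields its value.

        The returned callable must only be invoked after dispatch().
        """
        if key not in self._cache and key not in self._pending:
            self._pending.append(key)
        return lambda: self._cache[key]

    def dispatch(self):
        """Send all pending keys downstream as a single batched request."""
        if self._pending:
            keys, self._pending = self._pending, []
            self._cache.update(self._batch_fetch(keys))


downstream_calls = []


def fetch_companies(ids):
    # Stand-in for a batched downstream request.
    downstream_calls.append(list(ids))
    return {i: {"id": i, "name": f"Company {i}"} for i in ids}


loader = BatchLoader(fetch_companies)
# Four members in a result tree, three of them at the same company.
handles = [loader.load(company_id) for company_id in (7, 7, 9, 7)]
loader.dispatch()
names = [handle()["name"] for handle in handles]
print(downstream_calls, names)
```

Four field resolutions result in one downstream call carrying two unique keys, which is how a loader addresses the duplicated downstream calls that arise while resolving a tree of nodes.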
Taking the same example resources (members and companies) from above, the resources are exposed through two different microservices (isolated physical clusters) and operated independently by different teams. But from the client's perspective, they are all part of one connected graph and it doesn't matter which microservice exposes which resource(s). To help with this, the GraphQL query execution endpoint on all of our frontend microservices exposes the same federated GraphQL type system, which includes operations from both of the resources. This way, any GraphQL query could be executed by any of these query execution endpoints and the engine will use the appropriate Rest.li downstream to fetch data.
In the next subsections, we will cover some of the components we built to take advantage of the distributed query execution endpoint architecture.
With the query execution endpoint co-located with other frontend Rest.li resource endpoints, we took advantage of that and designed the Rest.li client used on our data fetchers to prevent unnecessary network calls that are meant for fetching data from resources available on the same host. Before making any downstream call, the client inspects whether the target Rest.li resource is available on the same host; if available, it makes an in-process call (like a normal method call) which avoids expensive operations like data serialization/deserialization, input/output validation, and network transport.
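The dispatch decision described here can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the Rest.li client API: the `LocalAwareClient` class, its constructor arguments, and the resource names are hypothetical.

```python
class LocalAwareClient:
    """Chooses between an in-process call and a network call per request."""

    def __init__(self, local_resources, send_over_network):
        # resource name -> handler function co-located in this process
        self._local = local_resources
        # callable(resource, key) used for resources hosted elsewhere
        self._send_over_network = send_over_network

    def get(self, resource, key):
        handler = self._local.get(resource)
        if handler is not None:
            # Same host: a plain method call, with no serialization,
            # validation, or network transport in this sketch.
            return handler(key)
        return self._send_over_network(resource, key)


network_calls = []


def fake_network(resource, key):
    # Stand-in for a real remote request.
    network_calls.append((resource, key))
    return {"id": key, "source": "remote"}


client = LocalAwareClient(
    local_resources={"members": lambda key: {"id": key, "source": "in-process"}},
    send_over_network=fake_network,
)
member = client.get("members", 1)       # resolved locally
company = client.get("companies", 42)   # goes over the network
print(member, company, network_calls)
```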
LinkedIn’s frontend microservice architecture has several isolated clusters where each cluster is responsible for serving requests to a specific microservice. With the distributed GraphQL query execution endpoint setup, any incoming query can be routed to any one of the query execution endpoints. But if we can route the incoming GraphQL request to the cluster which also has the Rest.li resource required for fetching the top-level field in the query, it gives us the following benefits:
Performance: Avoid an additional network hop (leverage in-process resolution) while processing the top-level field and, most importantly, achieve the same performance characteristics as we have when using Rest.li.
Predictable capacity planning: Without this targeted routing, the capacity planning on our frontend clusters would become tricky as every individual microservice would need to be provisioned and operated for the peak load across service boundaries.
Security: Inter-cluster communication is minimized and so access for inter-cluster communication is granted only when there is a need.
We achieved this by including routing metadata with each registered query. This information is used at our traffic tier to route the incoming request to the correct cluster. With this routing logic in place, our client engineers do not need to understand the multi-cluster microservice architecture in the API layer, and can simply use an endpoint like /graphql for all requests, allowing the routing logic to take care of routing the request to the correct cluster.
We can see this approach in action with an example query execution that fetches member information given their id. Based on the query metadata, the external request gets routed to the graphql endpoint on the correct cluster (Identity), uses an in-process request for fetching the member information followed by an intra datacenter request for fetching the company information.
Figure 3: GraphQL query routing on a multi-cluster environment
This distributed query execution endpoint architecture is not without disadvantages. One disadvantage is that our GraphQL queries are restricted to have only one top-level field as having multiple top-level fields can conflict with our routing setup. Right now we don’t have a need for such queries, and we plan to revisit this in the future.
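The relationship between the routing metadata and the single top-level field restriction can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the query identifiers, the shape of the metadata, and the "organization" cluster name (only the Identity cluster appears in the example above).

```python
# Hypothetical mapping from a top-level field to the cluster hosting its resource.
RESOURCE_CLUSTER = {"member": "identity", "company": "organization"}


def routing_metadata_for(top_level_fields):
    """Compute routing metadata at registration time.

    With exactly one top-level field there is exactly one target cluster;
    several top-level fields could point at different clusters.
    """
    if len(top_level_fields) != 1:
        raise ValueError("exactly one top-level field is required for routing")
    (field,) = top_level_fields
    return {"cluster": RESOURCE_CLUSTER[field]}


def route(query_id, registered_queries):
    """Pick the target cluster for a request, as a traffic tier might."""
    return registered_queries[query_id]["cluster"]


registered = {
    "memberById": routing_metadata_for(["member"]),
    "companyById": routing_metadata_for(["company"]),
}
target = route("memberById", registered)
print(target)
```

Because the metadata is computed once when the query is registered, the routing step needs only a lookup by query identifier and never has to parse the query itself.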
Early in the design phase, one of the other decisions we made was to only allow executions of pre-registered queries on our production servers. Queries are typically authored by our client engineers and checked in to client code repositories along with the application code. These queries get registered to a central service called Query Registry Service from the client code release pipeline. Each query has a unique identifier which is generated at build time. Once registered, the query is immediately available for use at runtime. All GraphQL requests from our production clients include a query identifier that is resolved by the query execution endpoint before execution. Since these queries are immutable, once resolved from the Query Registry Service, they are cached on the query execution endpoint for reuse.
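The register-then-resolve flow can be sketched as follows. This is a minimal in-memory Python illustration, not the Query Registry Service: the class names are hypothetical, and deriving the identifier from a content hash is an assumption (the text only says the identifier is unique and generated at build time).

```python
import hashlib


def query_id(query_text):
    # Assumption: a content hash serves as the build-time identifier.
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()[:16]


class QueryRegistry:
    """In-memory stand-in for a central registry of immutable queries."""

    def __init__(self):
        self._queries = {}

    def register(self, query_text):
        qid = query_id(query_text)
        self._queries[qid] = query_text
        return qid

    def fetch(self, qid):
        return self._queries[qid]  # raises KeyError for unregistered ids


class ExecutionEndpoint:
    """Resolves query ids and caches them, since registered queries never change."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}
        self.registry_lookups = 0

    def resolve(self, qid):
        if qid not in self._cache:
            self.registry_lookups += 1
            self._cache[qid] = self._registry.fetch(qid)
        return self._cache[qid]


registry = QueryRegistry()
text = "query { member(id: 1) { firstName company { name } } }"
qid = registry.register(text)        # done from the release pipeline

endpoint = ExecutionEndpoint(registry)
first = endpoint.resolve(qid)        # goes to the registry
second = endpoint.resolve(qid)       # served from the local cache
print(qid, endpoint.registry_lookups)
```

An identifier that was never registered fails at resolution time, before any execution begins.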
Pre-registered queries and a central Query Registry Service provide the following advantages:
Performance and Efficiency: Pre-registered queries allow us to cache processed query metadata on the server side for better performance. Some of our queries are very large, and including them on every request at our scale would adversely affect performance.
Security: Restricting query executions to only pre-registered queries allows us to prevent any unintentional or malicious attempt to execute expensive queries.
Developer Experience: Having a central place to register queries allows our client engineers to author and save their queries in their codebase along with the client code. The registry acts as an intermediary between the client and server and makes client-owned queries available for use at runtime on the server side.
Though this restriction is in place for our production clients, our engineers are free to use raw queries during development, prior to their registration. Here is an illustration of how the end-to-end flow works when using GraphQL during development and from released client applications.
Figure 4: End-to-end GraphQL architecture
Interoperability with Deco
At LinkedIn’s size, migrating every frontend use-case to GraphQL is a major undertaking and it is expected to take a while to complete. Given that we are in a transitory period, the primary requirement is to ensure our client applications continue to operate and evolve without any friction while we slowly migrate existing use-cases to GraphQL. The main challenge was in handling the incompatible differences in data format when using GraphQL vs Deco. To accomplish this, we built a thin layer on the API side to return the response in GraphQL format on-demand, even while using Deco. With this, we did a complete refactor on our existing client codebases to handle responses in GraphQL format irrespective of how the data is fetched. This enabled us to interoperate and let our engineers onboard their use-cases to GraphQL without worrying about any data inconsistencies during migration.
Deco is a homegrown solution that is highly tuned for our needs and ecosystem. Getting GraphQL to perform to our specific needs was one of the other major hurdles we had to overcome before making the new technologies available for adoption. Using pre-registered queries was helpful, because we could compute and cache a lot of query metadata for a registered query that was re-used at runtime. More on this in a separate post later.
GraphQL is now the default for building any new read use-cases on our frontend stack. Some of our critical pillars like the recently re-architected profile framework are powered by GraphQL. In addition to the new use-cases, for the past two quarters, we have been actively migrating existing read use-cases to GraphQL. While the product teams are adopting GraphQL, the infrastructure teams are hard at work building the infrastructure for supporting writes through GraphQL, integrating with more datasources like gRPC services and bringing GraphQL to our backend service-to-service communication.
Rolling out GraphQL at LinkedIn has been a multi-quarter effort which brought together several infrastructure and product teams. I want to extend a huge thanks to our amazing engineers, managers and program managers in the core GraphQL team for making this a reality and helping modernize our technology stack. Specifically, I want to extend my gratitude to Heather McKelvey, Goksel Genc, Maxime Lamure and Qunzeng Liu for taking a bet in sponsoring this initiative from Service Infrastructure, and to Karthik Ramgopal and Min Chen for their continued technical guidance.