Authorization at LinkedIn’s Scale

Michael Leong

Senior Staff Software Engineer at LinkedIn

March 19, 2019

LinkedIn members entrust us with their personal data, and we are committed to working hard every day to maintain that trust within a safe, professional environment. One crucial aspect of earning and maintaining that trust is how well we secure our online systems and protect our data from unauthorized exposure. LinkedIn runs a microservice architecture in which each microservice retrieves data from other sources and serves it to its clients. These calls often retrieve or process data that our members have entrusted to us, which is why authorization controls in these microservices are so important to our mission of handling members’ data securely. A microservice may access data only when it has a valid business use case. This prevents the unnecessary spreading of data and minimizes the damage if one of our internal services were to be compromised.

One common solution for this challenge is to define Access Control Lists (ACLs), which contain a list of entities that are either allowed or denied access to a given resource.

For a sample Rest.li resource, this might look like the following:

greeting:
(client-service, GET, ALLOW)
(client-service, PUT, DENY)
(admin-service, GET, ALLOW)
(admin-service, PUT, ALLOW)

In this case, client-service can read but not write the greeting resource, while admin-service can both read and write it.
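As a minimal sketch of how such a list could be evaluated, the snippet below encodes the sample ACL and checks a request against it. The names `AclEntry` and `is_allowed` are hypothetical, and two behaviors are assumptions not stated in this post: a request with no matching entry is denied, and a DENY entry overrides an ALLOW entry.

```python
from typing import NamedTuple


class AclEntry(NamedTuple):
    principal: str  # calling service
    method: str     # e.g. "GET" or "PUT"
    effect: str     # "ALLOW" or "DENY"


# The sample ACL from the text, keyed by resource name.
ACLS = {
    "greeting": [
        AclEntry("client-service", "GET", "ALLOW"),
        AclEntry("client-service", "PUT", "DENY"),
        AclEntry("admin-service", "GET", "ALLOW"),
        AclEntry("admin-service", "PUT", "ALLOW"),
    ]
}


def is_allowed(acls, resource, principal, method):
    """Return True only if a matching ALLOW exists and no matching DENY does.

    Assumptions: no match means deny, and DENY overrides ALLOW.
    """
    effects = [
        entry.effect
        for entry in acls.get(resource, [])
        if entry.principal == principal and entry.method == method
    ]
    return "ALLOW" in effects and "DENY" not in effects


print(is_allowed(ACLS, "greeting", "client-service", "GET"))  # True
print(is_allowed(ACLS, "greeting", "client-service", "PUT"))  # False
```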

An ACL-based authorization system is simple in concept, but it can be a challenge to maintain at scale. At LinkedIn, we have over 700 microservices, which communicate at a total average rate of tens of millions of calls per second. Our volume of internal traffic is also growing rapidly. New features mean more services collecting and accessing more data, and new services or call paths require new ACLs. As more members join the platform and their usage increases, the overall volume of requests ramps up as well.

Therefore, in order to authorize the volume of requests at LinkedIn’s scale, we need to:

Check authorization quickly
Deliver ACL changes quickly
Track and manage a large number of ACLs

Making authorization checks quick

We provide an authorization client module that runs on every service at LinkedIn. This module is used to decide whether an action should be allowed or denied before proceeding. We have integrated the client into the basic service architecture at LinkedIn so that new services pick up our client by default.

Given the ubiquitous nature of our client and its authorization checks, latency is critical. A network call can’t incur another network call to check authorization. If the check were to take even a full second, every request would see this delay and the delay would compound across the site, preventing any chance at scaling.

Therefore, when a service receives an inbound request, it must already hold the information it needs to make a decision; in other words, all relevant ACL data must be kept in memory by the service. Since any given service only needs to know the rules governing its own resources, the actual amount of data is small enough to be cached in memory.
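The idea of deciding from memory alone can be sketched as follows. This is an illustration under assumptions, not the actual client module: `InMemoryAuthorizer` and its methods are hypothetical names, and the rule that DENY overrides ALLOW is assumed. The point it shows is that `check` is a single set lookup with no I/O.

```python
class InMemoryAuthorizer:
    """Hypothetical client-side authorizer holding only this service's ACLs."""

    def __init__(self, entries):
        self._allowed = frozenset()
        self.load(entries)

    def load(self, entries):
        """Rebuild the lookup set from (resource, principal, method, effect) tuples."""
        entries = list(entries)
        allowed = {(r, p, m) for r, p, m, effect in entries if effect == "ALLOW"}
        denied = {(r, p, m) for r, p, m, effect in entries if effect == "DENY"}
        # One assignment replaces the whole snapshot, so readers never
        # observe a half-updated rule set.
        self._allowed = frozenset(allowed - denied)

    def check(self, resource, principal, method):
        """Constant-time decision made purely from memory: no network call."""
        return (resource, principal, method) in self._allowed


authorizer = InMemoryAuthorizer([
    ("greeting", "client-service", "GET", "ALLOW"),
    ("greeting", "client-service", "PUT", "DENY"),
    ("greeting", "admin-service", "PUT", "ALLOW"),
])
print(authorizer.check("greeting", "client-service", "GET"))  # True
print(authorizer.check("greeting", "client-service", "PUT"))  # False
```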

Delivering ACL changes

Since we can’t retrieve ACL data synchronously, we rely on a periodic refresh in the background. At regular intervals, every client reaches out to the ACL server to update its ACLs. The refresh cadence must be fast enough for changes to take effect promptly, but a faster cadence inevitably produces more load on the server. We return to this trade-off under Future improvements.
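A background refresh loop of this kind might look like the sketch below. Everything named here is an assumption for illustration: `AclRefresher`, the injected `fetch_acls` callable standing in for the call to the ACL server, and the 30-second default interval. Keeping the last good snapshot when a refresh fails is also an assumed policy, not one described in this post.

```python
import threading


class AclRefresher:
    """Hypothetical poller that keeps an in-memory ACL snapshot up to date."""

    def __init__(self, fetch_acls, interval_seconds=30.0):
        self._fetch = fetch_acls          # assumed callable that queries the ACL server
        self._interval = interval_seconds
        self._snapshot = frozenset()      # allowed (resource, principal, method) tuples
        self._stop = threading.Event()
        self._thread = None

    def refresh_once(self):
        """Fetch and swap in a new snapshot; keep the old one on failure."""
        try:
            fresh = frozenset(self._fetch())
        except Exception:
            return False
        self._snapshot = fresh
        return True

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self.refresh_once()

    def start(self):
        self.refresh_once()  # initial load before serving traffic
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def is_allowed(self, resource, principal, method):
        """Request-path check: reads memory only, never the network."""
        return (resource, principal, method) in self._snapshot
```

The request path only ever reads `_snapshot`, so a slow or unavailable ACL server delays updates but does not block authorization checks.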

We have hundreds of thousands of resources that are protected by ACLs, and those ACLs all need to be stored somewhere. We use LinkedIn’s Espresso database, which provides a simple interface with reliability by design, remote replication, and scaling capabilities.

We also have a Couchbase cache in front of Espresso, so even on the server side, most of the data is served from memory. Having a cache helps with our latency and scalability, but it also poses a problem—what if an ACL is edited and the cache goes stale? Our solution is to use a Change Data Capture system based on LinkedIn’s Brooklin to notify the service when an ACL has changed so we can clear the cache.
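The cache-plus-invalidation pattern described here can be sketched with in-process stand-ins. This does not use the Espresso, Couchbase, or Brooklin APIs: a plain dict plays the database, another dict plays the cache, and `on_change_event` is a hypothetical handler that a change-capture consumer would call.

```python
class CachedAclStore:
    """Read-through cache whose entries are evicted by change events."""

    def __init__(self, database):
        self._db = database   # stand-in for the backing store: resource -> ACL
        self._cache = {}      # stand-in for the cache tier
        self.db_reads = 0     # counts cache misses that reached the store

    def get(self, resource):
        if resource in self._cache:
            return self._cache[resource]
        self.db_reads += 1
        acl = self._db.get(resource)
        if acl is not None:
            self._cache[resource] = acl
        return acl

    def on_change_event(self, resource):
        """Drop the stale entry; the next read reloads it from the store."""
        self._cache.pop(resource, None)


db = {"greeting": [("client-service", "GET", "ALLOW")]}
store = CachedAclStore(db)
store.get("greeting")                                  # miss: loads from the store
db["greeting"] = [("client-service", "GET", "DENY")]   # the ACL is edited
stale = store.get("greeting")                          # still the old ALLOW entry
store.on_change_event("greeting")                      # change event arrives
fresh = store.get("greeting")                          # reloaded: the DENY entry
```

Evicting on change, rather than waiting for a time-based expiry, bounds staleness by the delivery delay of the change event.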

Tracking and managing

Every authorization check is logged in the background. This allows us to debug, to analyze traffic in our ecosystem, and to support any necessary audits or investigations. We leverage LinkedIn’s Kafka message queue to allow for asynchronous, high-scale logging. Engineers can gain insight into a service by checking the data we push to our inGraphs monitoring system.
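Logging "in the background" generally means the request thread only enqueues a record and a separate worker ships it. The sketch below shows that shape using the standard library only. `AsyncAuditLogger` is a hypothetical name, the `publish` callable is a stand-in for a message-queue producer (no Kafka client API is used here), and dropping records when the buffer is full is an assumed policy.

```python
import queue
import threading


class AsyncAuditLogger:
    """Hypothetical non-blocking logger for authorization decisions."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, publish, max_pending=10_000):
        self._publish = publish  # assumed callable, e.g. a producer's send function
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0         # approximate count of records shed under load
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, principal, resource, method, allowed):
        """Called on the request path: enqueue and return immediately."""
        record = {
            "principal": principal,
            "resource": resource,
            "method": method,
            "allowed": allowed,
        }
        try:
            self._queue.put_nowait(record)
        except queue.Full:
            self.dropped += 1    # assumption: never block a request on logging

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is self._STOP:
                return
            self._publish(record)

    def close(self):
        """Flush everything already queued, then stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()
```

A caller would invoke `log(...)` right after each authorization decision and call `close()` at shutdown so queued records are flushed.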

While the enforcement is done by each application in a decentralized manner for scalability reasons, we store the ACL data at a central point: the ACL server.

The ACL server publishes a REST API, built with LinkedIn’s Rest.li framework, to create, read, edit, and delete ACLs. End users typically manage ACLs through Nuage, our cloud management interface, or through a command-line tool.

Centralized control of ACLs allows us to nimbly enforce new security policies through a normalization process. For example, in the event of an ongoing production issue, engineers can be granted temporary access to all resources they may require for debugging. Such access grants are subject to approvals and are controlled to expire after a short time.
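A time-limited grant can be modeled as an entry with an expiry timestamp that is checked on every lookup. The sketch below is illustrative only: `TemporaryGrants` is a hypothetical name, the approval workflow is not modeled, and the one-hour lifetime in the usage lines is an arbitrary example value.

```python
import time


class TemporaryGrants:
    """Hypothetical registry of access grants that expire on their own."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested without sleeping
        self._expiry = {}     # (principal, resource) -> expiry time

    def grant(self, principal, resource, ttl_seconds):
        """Record a grant that lapses ttl_seconds from now."""
        self._expiry[(principal, resource)] = self._clock() + ttl_seconds

    def is_active(self, principal, resource):
        key = (principal, resource)
        expires_at = self._expiry.get(key)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expiry[key]  # lazily remove the lapsed grant
            return False
        return True


grants = TemporaryGrants()
grants.grant("oncall-engineer", "greeting", ttl_seconds=3600)
print(grants.is_active("oncall-engineer", "greeting"))  # True while the hour lasts
```

Because expiry is evaluated at check time, a forgotten grant cannot outlive its lifetime even if nobody revokes it.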

Figure: An overview of our authorization architecture

Future improvements

Above, we discussed one challenge of the system: increasing the refresh rate also increases the queries per second (QPS) and throughput that our service and its downstream Couchbase instances must handle. We’re looking to build out a push-based solution that can dramatically decrease the time for an updated ACL to be delivered, while also reducing the total amount of traffic between ACL clients and the server.
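The polling trade-off can be made concrete with a little arithmetic. The helpers below are a back-of-the-envelope model, not measurements: the function names are hypothetical, the instance count and intervals in the usage lines are made-up illustration values, and the model assumes each client instance issues one refresh request per interval with polls spread evenly over time.

```python
def polling_load_qps(client_instances, refresh_interval_seconds):
    """Average refresh requests per second arriving at the ACL server."""
    if refresh_interval_seconds <= 0:
        raise ValueError("refresh interval must be positive")
    return client_instances / refresh_interval_seconds


def propagation_delay_seconds(refresh_interval_seconds):
    """(average, worst-case) wait before a client's next poll picks up a change.

    Ignores network time and any server-side cache delay.
    """
    return refresh_interval_seconds / 2, refresh_interval_seconds


# Hypothetical numbers: 10,000 client instances.
print(polling_load_qps(10_000, 60))   # about 167 requests per second
print(polling_load_qps(10_000, 10))   # 1000.0 requests per second
print(propagation_delay_seconds(10))  # (5.0, 10)
```

In this model, shortening the interval by a factor of six cuts the delay by the same factor but multiplies server load by six, which is the pressure a push-based design aims to relieve.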

Our system tracks and generates a vast amount of data about LinkedIn’s deployments and infrastructure. We aim to leverage this data set to draw novel insights about LinkedIn’s backend.

Acknowledgements

The ACL system would not have been possible without the tireless and dedicated work of our team members: Himanshu Sharma, Alok Yadav, Tanmay Khemka, Hao Zhang, Matej Briskar, Yang Wang, Wenjun Gu, and Ming Liao. Several other engineers have also provided critical improvements: Peter (Yuefeng) Lee, Siddharth Agarwal, Pathik Raval, and Wei Zhang. Finally, we would like to recognize the support of our management: Chad Lake, Bobby Nakamoto, and Chris Peterson.

Topics: Security, Scalability, Data