
Faster and Easier Service Deployment with LPS, Our New Private Cloud

Since 2003, we’ve re-architected our hosting systems four times to meet the needs of our rapidly growing systems and member base. At the same time, we’ve gone from fewer than a hundred services in 2009 to thousands of services and hundreds of thousands of service instances today. However, we achieved that scale at the price of wasted hardware and people hours. Pushing a new service could take days or sometimes weeks and required coordination across multiple teams. To fix this, we needed to automate the process at huge scale.

Today, I’m happy to introduce LinkedIn Platform as a Service (LPS). LPS presents an entire data center as a single resource pool to application developers. It allows them to deploy their own applications in minutes, with zero tickets. This allows developers to focus on building, not wasting their time finding machine resources or waiting to deploy. LPS has also reduced the hardware footprint for some workloads by more than 50 percent. In short, this new internal platform allows our engineers to be more productive, flexible, and innovative while also being cost effective.

Last May, Alex Vauthey best summed up the goal for all of LinkedIn’s infrastructure when he discussed how LinkedIn aspires to create systems that are elegant and refined to the point of being invisible. LinkedIn's Tools, Service and Presentation Infrastructure, and Data teams provide the platform that the rest of the company uses internally to create amazing products and great experiences for our members. So how do we make these backend systems “invisible” to the people who work on them, without adding complexity or impacting performance?

Over the last few years, we’ve been exploring ways to get more resources out of a smaller hardware footprint while at the same time increasing productivity by making our software stack more application-oriented. Our bet was that by abstracting away the problems of deployment, resource provisioning, and dependency management at scale, we’d massively increase productivity for our software engineers and SRE team members.

We asked ourselves what this ideal hosting environment would need to do in practical terms and came up with the following criteria:

  • Enable service owners and engineers to manage the lifecycle of their own services
  • Relentlessly optimize our infrastructure for maximum performance while maintaining robustness and adaptability
  • Automatically compensate for human error and unexpected events by bringing more applications or resources online to maintain high availability
  • Avoid “hacks” or extra technical debt in achieving the above points

When we started working on this new platform, we had already created several advanced systems that made our services much more elegant, automated, and user-friendly for engineers. Services like Nuage, inGraphs, and AutoAlerts provide the functionality to automatically provision data stores, provide operational and performance metrics, and monitor applications to ensure that new application instances are spun up when they are needed.

Deployment with LPS

In other cases, we built on proven technologies to create a new capability that could realize our infrastructure vision. For example, when we considered incorporating containers into our hosting infrastructure early last year, we reviewed emerging and established open source options like Docker and LXC. But after examining many of the tools available for deploying and running containers at the scale we needed, we found that none of the existing solutions was a great fit. In some cases, the solutions available either focused on problems that we’d already solved or required us to jump through too many hoops. Then there were questions about how we could integrate these projects with systems we’d already built for application deployment and management. As a result, we decided to follow in the footsteps of previous projects developed at LinkedIn like Kafka, Rest.Li, and others. We set out to build our own system, using both significant open source components and our own established technologies.

Another key enabler of LPS is the transition to a next generation of data center architecture. The new ultra-low latency data center design we’ve adopted allows service instances in any part of the data center to communicate with each other seamlessly. This is in contrast to the higher latency architectures used by most public cloud providers.

LPS is the next evolution of LinkedIn’s hosting environment. It would take more than one blog post to give an overview that does justice to all of its systems. In future installments, we'll cover the main building blocks of LPS. For now, here’s a quick overview of four parts of LPS that I believe are particularly interesting:

The first, and in some ways the most prominent, net-new technology in LPS is Rain. Rain is LinkedIn’s answer to resource allocation and containerization; it uses Linux cgroups and namespaces directly, and also takes advantage of libcontainer via runC. With Rain, applications no longer need to ask for an entire machine as a unit of resource, but can instead request a specific amount of system resources. These resource requests are used to pack multiple applications onto a single machine. The machine is shared, but resource availability is guaranteed for each application using Linux cgroups. Applications are also deployed with appropriate levels of failure zone diversity. This sharing leads to better overall hardware utilization and better operability through automatic failure zone placement, and it makes it easier for engineers to get new resource requests satisfied. Rain is designed not only to provide resource guarantees and security isolation for applications, but also to integrate seamlessly with our existing infrastructure.
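To make the resource-request model concrete, here is a minimal sketch of how an allocator can enforce a CPU and memory budget using the Linux cgroup v1 controllers that Rain builds on. The group name, limits, and helper function are illustrative only; Rain itself works through libcontainer via runC rather than writing to the cgroup filesystem directly.

```python
import os

CGROUP_ROOT = "/sys/fs/cgroup"  # standard cgroup v1 mount point (requires root)

def allocate(app_name, pid, cpu_cores=2.0, memory_bytes=4 * 1024**3):
    """Confine a running process to a fixed CPU and memory budget.

    This mimics the idea behind Rain's resource requests: an application
    asks for a slice of the machine, and the kernel enforces the limit.
    """
    # Memory controller: hard cap on the group's resident memory.
    mem_dir = os.path.join(CGROUP_ROOT, "memory", app_name)
    os.makedirs(mem_dir, exist_ok=True)
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(memory_bytes))

    # CPU controller: quota/period expresses "cpu_cores" worth of CPU time.
    cpu_dir = os.path.join(CGROUP_ROOT, "cpu", app_name)
    os.makedirs(cpu_dir, exist_ok=True)
    period_us = 100_000
    with open(os.path.join(cpu_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cpu_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(int(cpu_cores * period_us)))

    # Attach the application's process to both groups.
    for group_dir in (mem_dir, cpu_dir):
        with open(os.path.join(group_dir, "cgroup.procs"), "w") as f:
            f.write(str(pid))
```

Because the kernel enforces these limits per group, several applications packed onto one host cannot starve each other, which is what makes the shared-machine model safe.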

Rain is already in use for automated application deployment at LinkedIn in a preproduction environment. Before deploying Rain, it sometimes took more than two days to deploy a service. With Rain, that time has dropped to about 10 minutes, a reduction of more than 99 percent.

As your scaling needs increase, you have to react quickly to changes in application demand not only within a single host, but across the entire fleet of applications deployed in multiple data centers. To automate this process, we’re developing RACE, the Resource Allocation and Control Engine. Like the admiral of a navy, RACE manages our fleet of applications by directing them in response to failures or other unexpected events such as demand surges. For instance, it can scale up application instances in case of sudden spikes in incoming load, or in reaction to an AutoAlert that traffic is being diverted from another data center (a failover scenario). It can also scale down instances when they’re not needed. All completely automatic. All without human intervention.
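As a rough illustration of the kind of demand-driven decision RACE automates, the sketch below computes how many instances an application needs from an observed request rate. The per-instance capacity, fleet bounds, and function name are all hypothetical, not RACE’s actual parameters.

```python
import math

TARGET_QPS_PER_INSTANCE = 500          # illustrative capacity of one instance
MIN_INSTANCES, MAX_INSTANCES = 2, 200  # illustrative fleet bounds

def desired_instances(total_qps):
    """Instance count needed to serve the observed load.

    A controller like RACE would evaluate something like this continuously
    from live metrics and act on the result by spinning instances up or
    down, entirely without human intervention.
    """
    wanted = math.ceil(total_qps / TARGET_QPS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, wanted))

# Example: a failover doubles incoming load from 40,000 to 80,000 QPS,
# so the fleet should grow from 80 to 160 instances.
assert desired_instances(40_000) == 80
assert desired_instances(80_000) == 160
```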

Once RACE scales applications down to free up resources, what do you do with the newly available resources? One capability that engineers have often asked for is the ability to spin up a large number of hosts for short-run experiments and other jobs. As we began to use Rain and RACE, we realized that LPS already includes systems that create and run jobs (LID), allocate resources (Rain), and record the results (using another LinkedIn internal tool). The only missing piece was an orchestrator to tie these systems together. Enter ORCA. ORCA is the next-generation system we’re developing as part of LPS to replace our existing system for provisioning short-run jobs and conducting testing. ORCA is already running 2,000 jobs every day, a number that will likely reach 50,000 jobs by the end of the year.
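The orchestration pattern itself is simple, as the hypothetical sketch below shows: acquire resources, run the job, record the outcome, and always hand the resources back. The function names and interfaces are invented for illustration; they are not ORCA’s actual APIs.

```python
def run_short_lived_job(job_spec, allocate, launch, record, release):
    """Orchestrate one short-run job end to end.

    The four callables stand in for the systems ORCA ties together:
    a resource allocator (Rain), a job runner (LID), a results store,
    and cleanup. All names and interfaces here are purely illustrative.
    """
    hosts = allocate(cpu=job_spec["cpu"],
                     memory=job_spec["memory"],
                     count=job_spec["hosts"])
    try:
        result = launch(job_spec["command"], hosts)
        record(job_spec["name"], result)
        return result
    finally:
        release(hosts)  # always return the hosts to the shared pool
```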

True to its name, Maestro is the “conductor” of the LPS symphony. It provides a global view of the LPS system. With that global view also comes global control, allowing Maestro to manage every aspect of an application’s configuration on our platform. Intended as a “one-stop shop,” Maestro maintains a persistent store of settings and configurations for a platform-enabled application. This persistent store is what we call the “blueprint” for an application. Holding true to the definition of the word, the blueprint provides the data, the plan, and the execution model for deploying applications to LPS. The blueprint defines the aspirational state for an application, and Maestro “conducts” by taking the necessary actions to make reality in the data center match the aspiration.
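To illustrate the blueprint idea, here is a hypothetical sketch of a declarative blueprint alongside a reconciliation step that closes the gap between the aspirational state and what is actually running. The field names, service name, and action interface are invented for this example and are not Maestro’s real schema.

```python
# A hypothetical blueprint: the aspirational state of one application.
blueprint = {
    "app": "profile-service",        # illustrative service name
    "version": "1.42.0",
    "instances": 12,
    "resources": {"cpu": 2, "memory_gb": 4},
    "failure_zones": 3,
}

def conduct(blueprint, observed, actions):
    """Take whatever steps close the gap between aspiration and reality.

    `observed` mirrors the blueprint's fields but describes what is
    actually deployed; `actions` bundles the deploy/scale operations a
    real controller would invoke. Everything here is illustrative.
    """
    if observed["version"] != blueprint["version"]:
        actions.roll_out(blueprint["app"], blueprint["version"])
    gap = blueprint["instances"] - observed["instances"]
    if gap > 0:
        actions.scale_up(blueprint["app"], gap)
    elif gap < 0:
        actions.scale_down(blueprint["app"], -gap)
```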

By building control plane APIs for every one of these systems, we can automate the process of responding to events like sudden spikes in demand or network interruption. By the time all of the systems that make up LPS are complete, our entire hosting environment will exist as a single holistic system where our internal users can bring a new service online with only API calls, not multiple JIRA tickets. It will completely automate the management of every application at LinkedIn, allowing our engineers to spend more time developing new applications and services.

Autoscaling with LPS

Eventually, we expect to have every application at LinkedIn run on LPS. Note that I wrote applications—not everything will go on LPS. For instance, we don’t expect to immediately migrate our Oracle databases to LPS. However, most applications developed by our engineers and associated infrastructure services like Kafka and Pinot will eventually all find a home on this new platform.

For now, LPS is a proprietary system for solving specific inefficiencies that we identified within LinkedIn. This is a conscious decision—we are committed to contributing clean, well-documented code that can have a substantial and lasting impact in the broader open source community. In the future, we will review what parts of the system could and should be released under an open source license.

Much like our open source efforts, part of the motivation behind this post is to share what's going on at LinkedIn Engineering to generate discussion and feedback from our peers in the engineering community. We look forward to hearing from you.

LPS is a collective effort across the engineering and operations organizations and an investment in our future. Although there are many more people who were involved, this project would not have been possible without key contributions from the following people: Stephen Bisordi, Allan Caffee, Goksel Genc, Ibrahim Mohamed, Jason Johnson, Jason Toy, Manish Dubey, Nate Woodhams, Nishanth Shankaran, Sumit Sanghrajka, Tom Quiggle, Vamshi Krishna Thakallapally, and Walter Fender.

Graphics by Shumyla Jan/LinkedIn