Rain: Better Resource Allocation Through Containerization
May 10, 2016
Last month, we announced to the world our vision for application-centric infrastructure at LinkedIn. LPS, or LinkedIn Platform as a Service, will allow our developers to deploy their own applications in minutes, where it used to take hours or even days. This lets them focus on what they do best—building, instead of wasting time finding machine resources or waiting to deploy. In turn, this ability to focus on developing new member-facing apps and features will increase innovation within our company.
Containerization is currently an area of much interest and innovation. That makes Rain, our answer for both resource allocation and containerization, one of the most prominent net-new technologies we’ve unveiled as part of LPS. In this blog post, we will go into detail about what Rain is, how it works, and why we developed it. As with our other blog posts on LPS, we are sharing this information with the community with the hope that the lessons we’ve learned spark further conversation about containerization at scale.
What is Rain? How Does it Work?
As described in Steve Ihde’s previous post on LPS, Rain is a containerization system that uses Linux cgroups and namespaces directly and also takes advantage of libcontainer via code contributed to the runC project. The containers themselves are designed to not only provide a resource guarantee, but also to provide security isolation for different workloads that share the same underlying resources.
Rain goes beyond being a simple container by also allocating resources from the common available pool of resources in LinkedIn’s data centers. It works with LinkedIn’s existing hardware inventory management system (inOps) to monitor the sum total of resources available within the data center. Another part of LPS, the Resource Application Control Engine (RACE), requests resources from Rain. In turn, Rain then uses another existing system, the LinkedIn Deployment System (LiD), to assign available resources and deploy application images. Tight integration with a proven, existing ecosystem allows us to deliver a lot of value quickly and reliably. It also means that we don’t need to make any changes to our existing systems for: cost attributions to a specific application owner, existing spare resource pool management, or forecasting systems. Since Rain works with all of our existing systems, we can gradually increase the use of Rain without disrupting our day-to-day operations. Finally, integrating with inOps ensures that critical, day-to-day machine management activities in the data center—such as upgrades and repairs—can continue without any changes.
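To make the flow concrete, here is a minimal sketch of a single allocation step, assuming a hypothetical request shape and inventory format. The real RACE, Rain, inOps, and LiD APIs are internal to LinkedIn, so everything below is illustrative:

```python
from dataclasses import dataclass

# Hypothetical request shape; the actual RACE/Rain API is internal.
@dataclass
class ResourceRequest:
    app_name: str
    cpu_cores: float
    memory_mb: int

def allocate(request, inventory):
    """Pick the first host with enough spare capacity. In the real system,
    inventory would come from inOps and the subsequent deployment of the
    application image would be handed off to LiD."""
    for host in inventory:
        if (host["free_cores"] >= request.cpu_cores
                and host["free_mb"] >= request.memory_mb):
            host["free_cores"] -= request.cpu_cores
            host["free_mb"] -= request.memory_mb
            return host["name"]
    return None  # no host can satisfy the request

inventory = [
    {"name": "host-a", "free_cores": 2.0, "free_mb": 4096},
    {"name": "host-b", "free_cores": 16.0, "free_mb": 65536},
]
print(allocate(ResourceRequest("profile-service", 4.0, 8192), inventory))
# host-b (host-a lacks the CPU capacity)
```

Because allocation debits the shared inventory, accounting and reclamation fall out naturally: releasing a container simply credits the capacity back.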
With Rain, applications no longer need to ask for an entire machine as a unit of resource. Instead, an application can request a specific amount of system resources that meets its precise needs. Today, applications request CPU and memory; disk space and I/O requests will be supported in the future. These “application resource requests” are then used to pack multiple applications onto a single machine. Having several applications share a machine has several benefits: better overall hardware utilization, better operability (thanks to automatic failure zone placement), faster fulfillment of new resource requests, and lower costs.
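The packing step can be illustrated with a toy first-fit packer. The machine sizes and application requests below are made-up numbers, not LinkedIn's actual configuration:

```python
def pack(apps, machine_cores=32, machine_mb=131072):
    """First-fit packing: place each app on the first machine with room,
    opening a new machine only when none fits."""
    machines = []
    for name, cores, mb in apps:
        for m in machines:
            if m["cores"] + cores <= machine_cores and m["mb"] + mb <= machine_mb:
                m["cores"] += cores
                m["mb"] += mb
                m["apps"].append(name)
                break
        else:
            machines.append({"cores": cores, "mb": mb, "apps": [name]})
    return machines

apps = [("frontend", 8, 16384), ("search", 12, 32768),
        ("metrics", 4, 8192), ("batch", 6, 24576)]
machines = pack(apps)
print(len(machines))  # 1 machine instead of 4 whole-machine allocations
```

A production allocator has to weigh many more constraints (failure zones, I/O contention, headroom), but the utilization win comes from exactly this kind of consolidation.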
There are two general types of applications handled by Rain: those that are short-lived and those that are long-running. Builds, test runs, and data analysis batch jobs are examples of short-lived applications. Applications serving requests to linkedin.com, on the other hand, are long-running. These two kinds of applications request and use resources very differently; Rain, however, allows us to mix them in a single pool of shared resources.
Typically, it is hard to guarantee resources on a single machine, as several different applications compete for CPU cycles and networking bandwidth. But with Rain, resource availability is guaranteed for each application using Linux cgroups.
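As a hedged illustration, the translation from a resource request to per-container cgroup limits might look like the following. This sketch targets the cgroup-v2 `cpu.max` and `memory.max` control files, which is not necessarily the interface Rain uses internally:

```python
def cgroup_settings(cpu_cores, memory_mb, period_us=100_000):
    """Translate a resource request into cgroup-v2 control-file values.
    In a real container these strings would be written to files under
    /sys/fs/cgroup/<container>/ (the path and interface are illustrative)."""
    return {
        # "quota period": the container may use cpu_cores worth of CPU time
        # in each scheduling period.
        "cpu.max": f"{int(cpu_cores * period_us)} {period_us}",
        # hard memory ceiling, in bytes
        "memory.max": str(memory_mb * 1024 * 1024),
    }

print(cgroup_settings(2.5, 4096))
# {'cpu.max': '250000 100000', 'memory.max': '4294967296'}
```

The kernel then enforces these limits per container, which is what turns a shared machine into a set of guaranteed slices.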
In addition to resource guarantees, applications also need security isolation. This is necessary to protect applications from other, potentially compromised applications, and to limit human operators’ access to data. Operators should also be able to manage only the applications they are responsible for, and should not be allowed to manage other applications sharing the same machine. Such security isolation is provided with Rain by using Linux namespaces.
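For readers unfamiliar with namespaces, the isolation described above is requested by passing namespace flags to clone(2) or unshare(2). The flag values below come from the Linux headers; the helper that combines them is only illustrative, not Rain's actual code:

```python
# Linux clone(2) namespace flags (values from <linux/sched.h>).
CLONE_NEWNS  = 0x00020000  # mount namespace: private filesystem view
CLONE_NEWIPC = 0x08000000  # IPC namespace: private SysV IPC / POSIX queues
CLONE_NEWPID = 0x20000000  # PID namespace: can't see or signal neighbors
CLONE_NEWNET = 0x40000000  # network namespace: private interfaces and ports

def isolation_flags(private_network=True):
    """Combine the namespaces a container would be started with. A real
    implementation passes these to clone()/unshare(); this function is
    only the flag arithmetic."""
    flags = CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWPID
    if private_network:
        flags |= CLONE_NEWNET
    return flags

print(hex(isolation_flags()))  # 0x68020000
```

A private PID and mount namespace is what prevents one application (or its operator) from even observing another application's processes and files on the same host.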
Rain also considers which applications land on machines that share the same hardware dependencies. For example, Rain can ensure that multiple instances of a sharded system don’t share the same power supply or top-of-rack switch in a data center. Rain works with inOps to make sure that applications are deployed with an appropriate level of failure zone diversity.
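Here is a greedy sketch of failure-zone-diverse placement, with made-up host and rack names; in practice, the topology data would come from inOps:

```python
def diverse_placement(candidates, instances_needed, zone_key="rack"):
    """Greedily pick hosts so that no two chosen instances share the same
    failure zone (e.g., rack or power feed). Returns None when the pool
    does not have enough zone diversity."""
    chosen, used_zones = [], set()
    for host in candidates:
        if host[zone_key] not in used_zones:
            chosen.append(host["name"])
            used_zones.add(host[zone_key])
        if len(chosen) == instances_needed:
            return chosen
    return None

hosts = [{"name": "h1", "rack": "r1"}, {"name": "h2", "rack": "r1"},
         {"name": "h3", "rack": "r2"}, {"name": "h4", "rack": "r3"}]
print(diverse_placement(hosts, 3))  # ['h1', 'h3', 'h4']
```

Note that h2 is skipped even though it has capacity, because it shares rack r1 with h1; losing that rack would otherwise take out two instances at once.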
One key aspect of LPS is a focus on the applications we want to deploy, rather than the machines needed to run them. This is different from how AWS (EC2) or OpenStack operate by default. The EC2 and OpenStack APIs let users request machines or resources to run whatever they want, without consideration for what is going to use those resources. Rain, on the other hand, offers resources to a particular application through its understanding of that application’s resource requirements; a closer parallel to this service is Google App Engine. This allows Rain to allocate resources only for valid use cases, to account for resource usage, and to reclaim resources when the use case is no longer valid. Given this application-centric mode of operation, Rain can further determine whether a specific resource request is reasonable for an application.
At LinkedIn, applications are built and packaged with full dependency modeling. This includes not only in-house and third-party software dependencies, but also dependencies on any specific Java, Scala, or C runtime libraries. Each packaged application is a self-contained “fat” binary. The only necessary files not packaged with the application are its configuration files (since those differ for each environment) and any system- or member-generated data that the application consumes. Furthermore, all applications assume a uniform OS image on the host they are deployed on, so OS packages, OS-level utilities, daemons, and shared libraries are not included in “fat” binaries. Rain uses a standard deployment system to deploy these “fat” application packages with their corresponding configurations. If and when there is a need for applications to support heterogeneous OS images, the same dependency modeling can be extended to include the parts of OS packages needed by the application.
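A hypothetical manifest makes the idea of a self-contained “fat” package easier to see. The field names below are illustrative, not LinkedIn's actual packaging schema:

```python
# Hypothetical manifest for a "fat" application package.
manifest = {
    "app": "profile-service",
    # everything bundled inside the package itself
    "bundled": ["app.jar", "internal-libs/*", "third-party/*", "jre/"],
    # the only things expected to exist outside the package
    "external": ["config/", "member-data/"],
}

def is_self_contained(manifest, host_provides=("os-packages",)):
    """A package is deployable anywhere the uniform OS image runs if its
    only external needs are configuration, data, and the OS itself."""
    allowed = {"config/", "member-data/"} | set(host_provides)
    return all(dep in allowed for dep in manifest["external"])

print(is_self_contained(manifest))  # True
```

A package that declared, say, a host-installed library in `external` would fail this check, which is exactly the class of dependency that layered-image systems exist to carry along.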
For comparison, Docker and rkt offer a way to create layered images, where the underlying OS is abstracted away and applications provide their own OS images (with all the right OS packages) to deploy. In order to abstract the application from the underlying OS distribution and its changes, the application’s code, OS-level libraries, and tool dependencies must all be packaged into an image. Finally, layered images generally come at a performance cost, or require the very latest kernels with overlay filesystem extensions, which are still under development and will need more time to mature.
Essentially, LinkedIn does not need to solve for this requirement due to the way typical applications are packaged. It’s also worth noting that the current dependency modeling and packaging scheme is a good software development practice, which LinkedIn began following before any of the current image formats were formalized. We will continue using our “fat” application packages for now, keeping an eye on this rapidly maturing ecosystem and evaluating which parts we can adopt without compromising on stability, scalability, or performance.
Once the application has its required resources and those requirements are stable, there is no friction in the application’s day-to-day operations. Our existing systems already allow us to change versions, update configurations, and restart applications in a smooth and efficient way.
However, with Rain we saw opportunities to increase our engineering and operational productivity in several scenarios:
- Failing an application over or moving it around for hardware maintenance.
- Scaling resource footprint to account for seasonal or feature-triggered traffic changes.
- Onboarding a brand new application into the test and production environment.
Each of the above activities requires acquiring resources (on a temporary or permanent basis). Rain’s API and standard out-of-the-box utilities give our engineering and operations staff a single command for acquiring or releasing resources. Since introducing Rain in our environment, we have seen operational productivity improve for these classes of operations as well.
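A single-command interface of this kind might be shaped like the following sketch; the `rain` command name, its flags, and its defaults are all hypothetical, since the actual utilities are internal:

```python
import argparse

# Hypothetical command-line shape for acquiring/releasing resources.
parser = argparse.ArgumentParser(prog="rain")
sub = parser.add_subparsers(dest="action", required=True)
for action in ("acquire", "release"):
    p = sub.add_parser(action)
    p.add_argument("app")                                  # application name
    p.add_argument("--cpu", type=float, default=1.0)       # cores requested
    p.add_argument("--memory-mb", type=int, default=2048)  # memory requested

args = parser.parse_args(["acquire", "profile-service", "--cpu", "4"])
print(args.action, args.app, args.cpu)  # acquire profile-service 4.0
```

The point is less the exact flags than the shape: one verb plus an application name, with the allocator (not the operator) deciding where the resources come from.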
Rain offers us a unique way to use the same underlying hardware at a much higher level of utilization. This has translated directly into cost savings across applications that use Rain for their resource needs. As the above graph shows, our initial findings indicate that resource utilization roughly doubled as we switched to Rain. Not only is this good news for cost savings, but it also gives us significant leverage to do more with the same (dense) data center footprint, with only an incremental increase in power consumption. Data center and power costs get very expensive at or above a certain size, so being able to serve the needs of more applications within the same footprint is a major win as well. Although we don’t have figures to share yet, we believe that using fewer hosts at a higher utilization rate will allow us to realize opex savings too. Finally, it’s worth noting that these new utilization figures are significantly higher than the current industry average.
The majority of applications that we run are stateless applications—ones that do not use local storage to record session data. These applications are easier to provision with Rain. With more complex systems—such as Galene, Kafka, or Voldemort—there is usually a local state that needs to be established on any newly-provisioned host. These applications also require insight into rack locality so that they can do shard distribution and failure zone placement efficiently. Rain will tackle this class of application in the future.
Finally, when it comes to mixed workloads, Rain will need to model quality-of-service guarantees and offer a way to schedule high- and low-priority workloads within the available resources. This idea may also extend to modeling preemptable and non-preemptable workloads.
Allan Caffe, Sumit Sanghrajka, Pankit Thapar, Bryan Ngo, Daniel Sully, Dmitry Pugachevich, Stephen Bisordi, Jason Johnson, Sergiy Zhuk, Mike Svoboda, Zaid Ali, Steven Ihde, Ibrahim Mohamed, Nishanth Shankaran