Automating Large-Scale Application Build
May 11, 2016
Birth of DCManager (or DCM – Data Center Manager)
Around the end of 2012, LinkedIn decided to move away from retail data centers to wholesale ones that are built and maintained by LinkedIn. With exponential growth prospects we were already going to build multiple data centers in the coming few years. One major and time-consuming task of this operation is bringing up the application stack—which consists of over 13,000 servers running about 450 applications—from the bottom up.
Building an application stack for a complex and tightly integrated system such as the one at LinkedIn—which serves over 420 million members—requires large-scale cooperation and coordination across various teams. Before DCM, this was a fairly manual operation. Our goal at the start was to stay true to the SRE motto by automating all the repetitive tasks that end up eating the majority of SRE time and focus on what matters most: to “give our members a world class experience.”
What is DCM?
With LinkedIn’s dynamic and ever-evolving infrastructure, we decided that DCM should be a modular application that tightly revolves around InMapper analytics, which we will talk about in the next section. The philosophy we adopted was simple: build independent modules that each complete one task at a time from end to end and report their status back to a central module that can, at any point in time, reflect the status of the buildout.
Since our app stack is a near-identical copy of an existing setup, configuration settings, application logic, and capacity all remain virtually the same in the new data center. InMapper uses this logic to generate a mapping: the profile of a host in the existing data center (its hardware specs, the applications installed on it, and whether it needs a virtual IP or ACLs to be opened) is mapped to a host in the new data center for our new application stack.
Even before we have any applications installed, we can generate all of their configurations, create the necessary VIPs, and deploy access rules, which saves a lot of time and helps us to be ready for the fun stuff of application deployment.
While it seems like a good idea to have a labor-saving tool for generating mappings, there are some questions that we need to consider for such a tool, which take into account differences between data centers, dependencies between applications, and distributing applications on racks. Take the following questions:
- What if the new data center is not a replica of the old data center? If the new data center is a scaled-down version, then a 1:1 mapping of hosts will not make sense. This was one of the big challenges we solved when generating host mappings. For the 1:1 case, it was easy for InMapper to create a clone of the application stack.
To solve the problem for a scaled-down application stack, we grouped the instances of each unique application together and scaled each group down to our required numbers. This not only solved the problem but also helped us maintain a minimum application redundancy. For example, “Profile” has a total of nine instances running; if we scale down to 50 percent of the existing data center, 4.5 is rounded up and we end up with five instances.
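The scale-down arithmetic can be illustrated with a minimal sketch. This is not InMapper's code: the function name, the `min_instances` redundancy floor, and every application name other than "Profile" are assumptions made for illustration. It rounds up, which matches the nine-to-five example.

```python
import math

def scale_down(instance_counts, factor, min_instances=2):
    """Scale per-application instance counts by `factor`, rounding up.

    `min_instances` is a hypothetical redundancy floor; an application
    never gets more instances than it had in the source data center.
    """
    scaled = {}
    for app, count in instance_counts.items():
        target = math.ceil(count * factor)
        scaled[app] = min(count, max(target, min_instances))
    return scaled

# "profile" comes from the text; "search" and "inbox" are made-up examples.
existing = {"profile": 9, "search": 4, "inbox": 1}
print(scale_down(existing, 0.5))  # {'profile': 5, 'search': 2, 'inbox': 1}
```

Rounding up rather than down is what keeps small applications from being scaled to zero.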
- What if a single application or a subset of applications is not required in the new data center, but the co-shared application is needed?
- What if the entire application is not required in the new data center?
- What if a service requires a 2:1 ratio, or even 4:1, instead of the default 1:1 ratio?
- How do we solve the “dependent services” problem? If A depends on B, then B needs to be ready before A is installed, so ordering becomes a challenge.
- How do we make sure that applications are distributed evenly across racks to optimize the network load and have redundancy in place?
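The “dependent services” question is, at its core, an ordering problem on a dependency graph. As a rough sketch (not how DCM actually does it; the function name and the service names A, B, and C are placeholders), a topological sort from Python's standard library yields an install order in which every service comes after the services it depends on:

```python
from graphlib import TopologicalSorter, CycleError

def install_order(dependencies):
    """Return services ordered so each one follows its dependencies.

    `dependencies` maps a service to the set of services it depends on.
    Raises CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())

# A depends on B, and B depends on C (hypothetical services).
deps = {"A": {"B"}, "B": {"C"}, "C": set()}
print(install_order(deps))  # ['C', 'B', 'A']

try:
    install_order({"A": {"B"}, "B": {"A"}})
except CycleError:
    print("circular dependency: needs manual resolution")
```

A circular dependency has no valid order, so a tool built this way would have to flag it for a human rather than guess.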
InMapper has one governing rule which cannot be compromised:
- Always maintain maximum rack distribution so that we optimize resource usage, such as network I/O and power, and keep enough redundancy to support maintenance. Below is an example of InMapper in our topology:
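Separately from the topology example, the rack-distribution rule itself can be sketched in a few lines. This is a hypothetical illustration, not InMapper's algorithm: the function, rack names, and instance counts are assumptions, and it considers only instance counts, ignoring power, network, and hardware constraints. It hands out instances round-robin, so an application's instances land on as many racks as possible and rack loads differ by at most one.

```python
from collections import defaultdict
from itertools import cycle

def place_on_racks(instance_counts, racks):
    """Assign application instances to racks in round-robin order."""
    placement = defaultdict(list)
    rack_cycle = cycle(racks)
    # Sort for a deterministic result; consecutive instances of the
    # same application always go to consecutive racks.
    for app, count in sorted(instance_counts.items()):
        for i in range(count):
            placement[next(rack_cycle)].append(f"{app}-{i}")
    return dict(placement)

layout = place_on_racks({"profile": 5, "search": 2}, ["rack1", "rack2", "rack3"])
for rack, instances in layout.items():
    print(rack, instances)
```

With this layout, taking any single rack down for maintenance removes at most two of the five "profile" instances.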
What are we trying to solve?
Monotonous work is not why we have SREs at LinkedIn. Building an app stack takes a lot of time and can be a repetitive effort, so we divided it into pre- and post-build tasks. We believe that by automating pre-build tasks like generating application and/or container configurations, creating their respective ACLs, deploying VIPs, and installing the applications, SREs can focus their efforts on solving the bigger and more complex tasks, like the post-build task of bringing up the application stack and making it ready for our members.
Metrics and impact of DCM
Before DCM, our timeline to bring up a new data center was roughly 90 days. With DCM, we have brought it down to 30 days.
The development of DCM has led to multiple benefits, including time savings, labor savings, cost savings, better management, and better visibility of resources. This kind of automation frees us up to solve bigger and more interesting problems, and it is exactly the kind of "leverage" that LinkedIn engineers strive for: doing more with less.
DCM itself is now a complex application with multiple moving parts. It would not have been successful without the efforts of Himanshu Chandwani, Maheswaran Veluchamy, Suku George and the leadership of Viji Nair.