
Operating system upgrades at LinkedIn’s scale

Introduction

Completing recurring operating system (OS) upgrades on time and without impacting users can be challenging. For LinkedIn, completing these upgrades at a massive scale adds its own complexities, as we are often facing multiple upgrades at once. To secure our platform and protect our members' data, we needed a fast and reliable OS upgrade framework that requires little to no human intervention.

In this blog, we’ll introduce a newly developed system, Operating System Upgrade Automation (OSUA), which allows LinkedIn to scale OS upgrades. OSUA has been used for more than 200,000 upgrades on servers that host LinkedIn’s applications.

Key features

Drawing on lessons learned from past upgrades, we built OSUA around the following four features.

Zero impact

One of our key values at LinkedIn is putting our customers and members first. In engineering, this means site-up (linkedin.com can be accessed and served anytime, from anywhere, securely) is always our first priority. OSUA is designed with mechanisms to ensure that no user-facing impact is risked during OS upgrades on servers; zero impact therefore takes precedence over every other feature in our design decisions.

High throughput

LinkedIn has a growing on-prem footprint consisting of hundreds of thousands of physical servers. To complete upgrades in a timely manner, OSUA achieves high throughput by leveraging parallelization and the ephemerality of some applications, without sacrificing site-up or causing performance regressions mid-upgrade. During the most recent fleet-level upgrade, OSUA was able to upgrade more than 10x as many hosts per day as our old mechanisms. We are now working toward doubling what OSUA can do today.

Support for heterogeneous environments

The LinkedIn serving environment is heterogeneous: it includes stateless applications, stateful systems (explained in detail later in this post), infrastructure services, and more. These workloads are hosted in multiple locations, managed by a variety of schedulers ranging from Rain and Kubernetes to YARN, and mostly deployed in a multi-tenant fashion. OSUA currently supports approximately 94% of the LinkedIn footprint made up of these systems, and its coverage continues to increase.

Automation, autonomy, and reduced toil

Over the past years, LinkedIn has undergone a few company-wide server OS upgrades for purposes such as tech refreshes and improving our platform's security. Our previous OS upgrade processes were highly labor intensive, adding a significant amount of toil for the teams involved. To overcome this, OSUA is designed as a hands-free, self-serve service where users only need to click a button (or submit a CLI command). Any failures caused by upgrades are reported back quickly to the corresponding teams. To customize the upgrade process for different teams, OSUA also allows users to set up and manage their own upgrade policies.

Technical approach

At a high level, a server (as an example) needs to go through the following three steps sequentially for an OS upgrade:

Figure 1: General steps of hosts undergoing an OS upgrade

  1. Drain: Gracefully stop or evacuate the applications on the host that are serving traffic, sometimes with extra steps such as initiating data rebalancing for stateful systems.
  2. Upgrade: A server either goes through a full reimage, which wipes the main partition but retains the data partition, or through a yum-style update, which retains both the main and data partitions. At the end of an upgrade, every server must pass machine health checks, which depend on its hardware and system specs, before serving any workload again.
  3. Recover: Applications are redeployed onto the server if they were allocated to it and not moved elsewhere during the drain step, possibly with data rebalancing and other handling for stateful systems. Servers whose ephemeral applications were evacuated become available for scheduling new workloads.

These three steps sound simple, but they are much more complex at scale. To upgrade the entire LinkedIn fleet and provide the features listed above, OSUA is built at the orchestration layer to manage and coordinate upgrades, with the following highlights.
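
To make the sequence concrete, here is a minimal sketch of the per-host flow described above. It is written in Python purely for illustration; the helper functions are stand-in stubs, not OSUA's actual APIs.

    # Minimal sketch of the per-host flow described above. The helper
    # functions are stand-in stubs for illustration, not OSUA's real APIs.

    def drain(host: str) -> None:
        print(f"draining applications serving traffic on {host}")

    def reimage(host: str, full_reimage: bool) -> None:
        # Full reimage wipes the main partition but keeps the data partition;
        # a yum-style update keeps both.
        mode = "full reimage" if full_reimage else "yum update"
        print(f"upgrading {host} via {mode}")

    def health_checks_pass(host: str) -> bool:
        print(f"running machine health checks on {host}")
        return True

    def recover(host: str) -> None:
        print(f"redeploying applications on, or releasing, {host}")

    def upgrade_host(host: str, full_reimage: bool = True) -> None:
        drain(host)                       # 1. Drain
        reimage(host, full_reimage)       # 2. Upgrade
        if not health_checks_pass(host):  # health checks gate serving again
            raise RuntimeError(f"{host} failed post-upgrade health checks")
        recover(host)                     # 3. Recover

    upgrade_host("host001.example.com")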

Unified workflow

To support heterogeneous environments while maintaining a common upgrade process and experience, we developed, after extensive internal research, requirements gathering, and case studies, a single workflow that acts as a one-size-fits-all solution with the portability and flexibility to accommodate the characteristics of various applications and resource schedulers. Having one workflow also helps reduce onboarding and education efforts.

When a host/server/VM is the unit of work in the upgrade process, the workflow can be shown as follows:

Figure 2: Unified workflow of steps of hosts undergoing an OS upgrade with optional pre-/post-steps

Here are some design decisions worth highlighting:

  • Customized handling in the drain and recover phases gives applications the ability to perform necessary tasks before and after upgrades in their own way. This is essential for preparing stateful systems for an upgrade and for recovering them to their pre-upgrade, ready-to-serve condition.
  • The drain and recover phases are abstract. Because they are expressed as jobs encoded in a rest.li schema for multiple consumers (resource schedulers, in this case) to work on, any consumer can be plugged in and execute its kind of tasks in its own way according to its needs, as the sketch below illustrates.
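
As a rough illustration of that plug-in model (the class and method names here are hypothetical; the real jobs are encoded in rest.li schemas):

    # Rough sketch of the plug-in model: drain/recover work is expressed as
    # abstract jobs, and each resource scheduler supplies its own consumer.
    # Names are illustrative only.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class MaintenanceJob:
        host: str
        phase: str       # "DRAIN" or "RECOVER"
        scheduler: str   # e.g. "rain", "kubernetes"

    class JobConsumer(ABC):
        @abstractmethod
        def execute(self, job: MaintenanceJob) -> None: ...

    class KubernetesConsumer(JobConsumer):
        def execute(self, job: MaintenanceJob) -> None:
            # A Kubernetes-managed host might be cordoned and drained here.
            print(f"[k8s] {job.phase} {job.host}")

    class RainConsumer(JobConsumer):
        def execute(self, job: MaintenanceJob) -> None:
            # A Rain-managed host might undeploy/redeploy instances here.
            print(f"[rain] {job.phase} {job.host}")

    CONSUMERS = {"kubernetes": KubernetesConsumer(), "rain": RainConsumer()}

    def dispatch(job: MaintenanceJob) -> None:
        CONSUMERS[job.scheduler].execute(job)

    dispatch(MaintenanceJob(host="host001", phase="DRAIN", scheduler="rain"))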

Impact analysis and batching

At LinkedIn, all OS upgrades are performed while live traffic is being served. Therefore, during the drain phase, OSUA can only take down a computed subset of the hosts submitted to its pipeline, drawing on capacity redundancy and reserves to ensure that linkedin.com always has the capacity it needs.

OSUA leverages an internal, standardized impact approval system (Blessin) that lets application teams specify acceptable impact as a percentage of total instances or as an absolute amount of capacity. Alternatively, OSUA consults custom-built APIs provided by individual service controllers (often cluster management services) to determine whether, and when, a group of instances can be taken down.

While processing each host, OSUA identifies all of the application instances on the host and validates them against the rules configured in Blessin to determine whether they can be taken down. If all of the application instances on a host can be taken down, the host is picked for an OS upgrade. The following figure illustrates a simplified example of determining whether a host can be taken down for upgrade or whether extra coordination, such as waiting, is needed.

Figure 3: Example of impact analysis process
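
The check itself can be pictured with a simplified sketch like the one below. The thresholds, counts, and helper names are hypothetical; in reality the rules come from Blessin or from service controllers' custom APIs.

    # Simplified illustration of the impact-analysis check. Thresholds and
    # data structures are hypothetical.

    def can_take_down(host_instances, allowed_down_pct, currently_down):
        """Return True if every application instance on the host can be
        taken down without exceeding its configured impact allowance."""
        for app, total_instances in host_instances.items():
            already_down = currently_down.get(app, 0)
            # Taking this host adds one more unavailable instance of `app`.
            projected_pct = 100.0 * (already_down + 1) / total_instances
            if projected_pct > allowed_down_pct.get(app, 0):
                return False    # wait and re-evaluate later
        return True

    host_instances = {"app-a": 20, "app-b": 8}       # instances per application
    allowed_down_pct = {"app-a": 10, "app-b": 25}    # acceptable impact (%)
    currently_down = {"app-a": 1, "app-b": 0}

    print(can_take_down(host_instances, allowed_down_pct, currently_down))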

Some stateful systems require that a group of hosts within a fault zone (a logical group in which either all hosts or none can undergo maintenance at once) be upgraded together, so that rebalancing across the overall cluster is kept to a minimum. In that scenario, OSUA drains, upgrades, and recovers those hosts as a single batch, provided that all of the hosts in the batch are approved.

To maximize throughput, the impact analysis and batching mechanism is streamlined and runs in parallel at regular intervals, refreshing data (such as capacity and upgrade status) in a timely manner and continually picking hosts for upgrade, as sketched below.
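
A hedged sketch of that interval-based selection loop, with stub helpers standing in for the real impact analysis and upgrade pipeline:

    # Sketch of the interval-based selection loop described above: refresh
    # state, pick approved fault-zone batches, repeat. Helper names and the
    # refresh interval are illustrative only.

    import time

    def refresh_state():                       # stub: capacity, locks, upgrade status
        return {}

    def batch_approved(batch, state) -> bool:  # stub: impact analysis for every host
        return True

    def start_upgrade(batch) -> None:          # stub: drain/upgrade/recover as one unit
        print(f"upgrading batch {batch}")

    def selection_loop(pending_batches, interval_seconds=60):
        while pending_batches:
            state = refresh_state()
            for batch in [b for b in pending_batches if batch_approved(b, state)]:
                start_upgrade(batch)
                pending_batches.remove(batch)
            time.sleep(interval_seconds)       # re-evaluate on the next interval

    selection_loop([["hostA", "hostB"], ["hostC"]], interval_seconds=0)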

Cross-system operation coordination

OSUA is not the only system performing maintenance on the site. Application instances are constantly affected by other maintenance activities, such as data defragmentation, repartitioning of stateful systems, and network switch upgrades. These activities, along with routine code releases, have to be coordinated so that only one activity takes place on a host at a time. Otherwise, OSUA could pick a host to drain at the same time that a routine code release is happening on it; both affect the health of the application instances' cluster, and the total impact could exceed the configured allowance.

Figure 4: Workflow of OSUA acquiring a lock from Insync while another system tries but fails to get the lock

To avoid this race condition, our SRE teams are working on a centralized locking system (Insync) in which application instances and hosts can be locked for specific maintenance or release activities, ensuring that only one activity takes place at a time on a first-in, first-out (FIFO) basis. A host that is locked successfully is considered down for maintenance when calculating effective availability during impact analysis. OSUA picks a host for maintenance only if the effective availability of each of its application instances stays within the threshold configured by the application's owners, and if the host is not already locked for any other maintenance activity.
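
Conceptually, FIFO locking per host can be sketched as follows. This is not Insync's API (the system is still being built); it only illustrates the "one maintenance activity per host at a time, granted in request order" idea.

    # Conceptual sketch of per-host FIFO locking, not Insync's actual API.

    from collections import defaultdict, deque

    class FifoLockService:
        def __init__(self):
            self.queues = defaultdict(deque)   # host -> queue of requesters

        def request(self, host: str, requester: str) -> bool:
            q = self.queues[host]
            if requester not in q:
                q.append(requester)
            return q[0] == requester           # lock held only by the head

        def release(self, host: str, requester: str) -> None:
            q = self.queues[host]
            if q and q[0] == requester:
                q.popleft()

    locks = FifoLockService()
    print(locks.request("host001", "OSUA"))          # True: OSUA holds the lock
    print(locks.request("host001", "code-release"))  # False: queued behind OSUA
    locks.release("host001", "OSUA")
    print(locks.request("host001", "code-release"))  # True: next in line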

Customized execution handling for stateful systems and more

While the OS upgrade workflow remains unified so that most applications can leverage it, a number of systems need customized handling, in the form of pluggable add-on steps to the workflow, because of their inherent complexities. Stateful systems are one example.

A stateful system is one where the operation of the system depends on a critical internal “state.” This state could be data or metadata that acts as the memory and history of the system at that point in time. The LinkedIn technical ecosystem comprises many stateful applications, especially on the data tier. These systems often have custom workflows that need to be executed before taking a node out of rotation (a.k.a. pre-steps) or bringing them back into the cluster (a.k.a. post-steps). These workflows vary quite a bit across the fleet and pose a bigger challenge for an automated OS upgrade setup.

In the past, engineers needed to run a number of administrative tasks manually, or use scripts on the to-be-upgraded hosts, to ensure all necessary pre-steps were completed. The problem is often compounded by the need to migrate data off the to-be-upgraded host and rebalance it across the rest of the cluster so that a minimum safe number of copies is maintained. OSUA has to solve this diverse set of problems while ensuring that no human toil is involved during the upgrade process.

To address the diverse demands of these systems, OSUA aims for a solution that is uniform in approach yet flexible enough for stateful systems to automate their unique upgrade requirements. To that end, OSUA leverages an in-house platform, STORU. STORU was initially developed to automate large-scale operations such as switch upgrades, but it is extensible and supports customized automation before and after operations.

For pre-steps and post-steps, OSUA leverages a STORU feature called custom hooks, which enables application owners to build custom application logic that is executed before and after the OS upgrade process.

Figure 5: An example of custom hook execution of a pre-step

In this section, we will focus on custom hooks and explore some of their salient features; a simplified sketch of how such hooks might be wired up follows the list.

  • Pre- and post-steps: As discussed earlier, a pre-step of a custom hook allows custom code to run to get hosts ready, which is usually required to take hosts out of rotation safely, with optional customized extra steps. A post-step is the mirror image of the pre-step: it executes after the OS upgrade is complete to revert the effects of the pre-step.
  • Custom hook execution order: OSUA allows custom hooks to be executed in various stages, which are defined relative to the application deployment step during the upgrade process. Both pre- and post-steps can be executed before, after, or both before and after application (un)deployment. This provides flexibility for stateful applications to configure how custom code execution can be invoked.
  • Custom parameters: OSUA also allows application teams to define and pass additional parameters to custom hooks when submitting host(s) for upgrade. This helps custom code handle specific nuanced cases that might apply to a subset of hosts in the fleet when they are submitted for upgrade.
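
Putting those pieces together, a hook registry might look roughly like the sketch below. The decorator-based registration API, stage names, and parameters are hypothetical illustrations, not STORU's actual interface.

    # Hypothetical sketch of hook registration and invocation; not STORU's API.

    from typing import Callable, Dict, List, Tuple

    HOOKS: Dict[Tuple[str, str], List[Callable]] = {}   # (step, stage) -> callbacks

    def register_hook(step: str, stage: str):
        """step: 'pre' or 'post'; stage: e.g. 'before_undeploy', 'after_deploy'."""
        def decorator(fn: Callable) -> Callable:
            HOOKS.setdefault((step, stage), []).append(fn)
            return fn
        return decorator

    @register_hook("pre", "before_undeploy")
    def move_replicas_off_host(host: str, **params) -> None:
        # e.g. a stateful system migrating data before the host is drained
        print(f"rebalancing data off {host} with params {params}")

    def run_hooks(step: str, stage: str, host: str, **params) -> None:
        for fn in HOOKS.get((step, stage), []):
            fn(host, **params)

    # Custom parameters supplied at submission time flow through to the hook.
    run_hooks("pre", "before_undeploy", "host001", min_replicas=3)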

Auto-remediation

At scale, a certain percentage of failures will always occur at any step of the OS upgrade process, ranging from unsuccessful application uninstallation or deployment to outright server breakdown. OSUA is equipped with mechanisms to detect, analyze, triage, and remediate failures automatically, which greatly reduces human toil and facilitates company-wide hardware repair and refresh.
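
OSUA's actual triage rules are not covered here, so the following is only an illustrative sketch of the general detect-triage-remediate pattern (for example, retrying a failed deployment or routing a hardware fault to repair):

    # Illustrative only: a toy triage table for routing failures to a
    # remediation action. OSUA's real classification is more involved.

    def remediate(failure: dict) -> str:
        kind = failure.get("kind")
        if kind == "deploy_failure":
            return "retry deployment"
        if kind == "hardware_fault":
            return "file repair ticket and remove host from rotation"
        return "escalate to owning team"

    print(remediate({"kind": "hardware_fault", "host": "host001"}))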

Self-contained

OSUA is by nature an infrastructure service. To be self-contained and to avoid circular dependencies, which can result in cascading outages, we build OSUA on top of a limited number of internal control plane services and avoid depending on large-scale data plane systems whenever alternative solutions exist. For example, for event messaging, instead of using LinkedIn's readily available Kafka clusters, we implemented a lightweight RESTful pub-sub mechanism within OSUA. This avoids the circular dependency in which Kafka, as an OSUA dependency, would be used to upgrade the host OS of Kafka itself, which could lead to cascading failures if an upgrade were unsuccessful.
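
The core idea of keeping event messaging inside OSUA can be sketched in a few lines. In the real system this sits behind a RESTful interface; the code below is only an in-memory illustration of avoiding the Kafka dependency, not OSUA's implementation.

    # In-memory illustration of a tiny pub-sub bus; OSUA's version is exposed
    # over a RESTful interface rather than in-process.

    from collections import defaultdict
    from typing import Callable, DefaultDict, List

    class TinyPubSub:
        def __init__(self):
            self.subscribers: DefaultDict[str, List[Callable]] = defaultdict(list)

        def subscribe(self, topic: str, handler: Callable) -> None:
            self.subscribers[topic].append(handler)

        def publish(self, topic: str, event: dict) -> None:
            for handler in self.subscribers[topic]:
                handler(event)

    bus = TinyPubSub()
    bus.subscribe("upgrade-status", lambda e: print("status:", e))
    bus.publish("upgrade-status", {"host": "host001", "phase": "DRAIN"})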

A recent LinkedIn OS upgrade

Since its introduction, OSUA has successfully performed more than 200k upgrades at LinkedIn, updating more than seven million system packages and addressing 18 million vulnerabilities on these servers, with no external impact to LinkedIn customers and members from outages rooted in systematic processes. Furthermore, the engineering effort teams spend on OS upgrades has been reduced by 90% compared with previous upgrades, which were even smaller in scale. The peak daily upgrade velocity is a 10x improvement over previous upgrades.

Today, many LinkedIn engineering teams come to this single platform to delegate their OS upgrade operations, worry-free.

Next steps

OSUA has recently shown success in upgrading LinkedIn's on-prem infrastructure. Going forward, our continued focus will be increasing upgrade velocity while lowering failure rates and further reducing human intervention.

Acknowledgements

OSUA could not have been accomplished without the help of many engineers, managers, and TPMs across many teams. The engineers who have made contributions to OSUA are: Anil Alluri, Aman Sharma, Anant Bir Singh, Barak Zhou, Clint Joseph, Hari Prabhakaran, Jose Thomas Kiriyanthan, Junyuan Zeng, Keith Ward, Nikhita Kataria, Parvathy Geetha, Ronak Nathani, Ritu Panjwani, Subhas Sinha, Sagar Ippalpalli, Tim McNally, Vijay Bais, Ying He, Yash Shah, John Sushant Sundharam, and Deepshika. Special thanks to our TPMs Sean Patrick and Soumya Nair who have been steering this project from Day 1. Also, we’d like to thank the engineering leadership, Ashi Sareen, Mir Islam, Samir Tata, Sankar Hariharan, and Senthilkumar Eswaran, who have been providing continuous support to building OSUA. Additionally, we would like to thank Adam Debus, Justin Anderson, and Samir Jafferali for their reviews and valuable feedback.