
Operating System Snapshot Automation

Co-authors: Rohit Jamuar, Tianxin Zhou

Introduction

LinkedIn has a large set of physical servers geographically spread across several locations. Every application is hosted on, and distributed and managed across, these physical servers. With a sizable footprint of servers in data centers, LinkedIn is responsible for ensuring that these hosts are always on an operating system (OS) version deemed the “latest and greatest” for all intents and purposes. The Production Systems Software Engineering (PSSE) organization within LinkedIn has taken on the responsibility of creating timely OS snapshots that are installed on these servers regularly. This blog discusses how this process was implemented and the impetus behind the OS Snapshot Automation (OSSA) project.

Historically, the constraints around building snapshots and refreshing them across our server fleet were less rigid. At LinkedIn, we started pursuing the creation and release of OS snapshots at a defined cadence, as it’s ideal for servers to upgrade to the latest snapshots regularly and for older snapshots (with potential security vulnerabilities) to be retired. With this vision in mind, we wanted newly built OS snapshots validated once per month with due process and released at a tightly managed tempo. The main incentive behind creating a dedicated product for conducting these steps in an automated manner is rooted in improving overall operational excellence: building snapshots automatically at a regular cadence allows timely validation and release of those snapshots to the fleet, which is a necessity for giving customers confidence that their data and private information cannot be exploited through OS-level vulnerabilities on our servers.

Motivation

Pre-OSSA, the OS snapshot process was manual, closely tied to a handful of one-off shell scripts, and the entire ecosystem was tied to a single server. Moreover, metadata about snapshots and their respective lifecycles was stored in an internal wiki document, and there was no way to reference this data programmatically. The existing solution had challenges with maintainability, scalability, and high availability. Another big challenge for this ecosystem was how snapshotting was conducted: creating, promoting, and deprecating a release all required human effort. The infrastructure and processes for getting a snapshot created and boot-tested could, at best, be described as a stopgap solution, meaning that everything was conducted manually and needed dedicated full-time engineer (FTE) time. OSSA was envisioned with the requirement of making this ecosystem highly available, well-monitored, and programmatically configurable by its consumers. Aside from improving the ecosystem’s availability, we also wanted to coalesce the different one-off scripts into a multiproduct to improve the code base’s craftsmanship and maintainability.

Improving data accessibility via RESTful API

Initially, processes and information were tied to asynchronous communications over Slack and Jira tickets, which lacked visibility and made tracking the necessary information cumbersome. The first important step for OSSA was to solve the visibility aspect of this ecosystem by disseminating this information via a RESTful API. This allowed us to bridge the gap between information strewn across internal wiki pages and Jira tickets and to expose OS snapshot data (release_name, kernel_version, base_release, and expiration_date) via an HTTP GET call. Another step we wanted to explore was enabling the partner teams that validate OS snapshots to interact with OSSA so that they could relay their validation results without doing so over mediums that aren’t queryable. With this support enabled, anyone interested can make a simple HTTP GET call and see which teams have validated an OS snapshot and the status of that validation. Additionally, we enabled support for building OS snapshots and managing their lifecycle via POST calls to this API. In the scheme of things, exposing these functionalities over an API made integration with external teams’ products feasible and let us keep working toward a solution where all disparate workflows can be triggered via a one-touch model.
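
As a rough illustration, a consumer could pull this metadata with a plain HTTP GET along the following lines; the endpoint path, query parameter, and response shape are hypothetical stand-ins rather than the actual OSSA API.

    import requests

    # Hypothetical endpoint; the real OSSA API path and schema may differ.
    OSSA_API = "https://ossa.example.linkedin.com/api/v1"

    # Fetch metadata for currently released snapshots.
    resp = requests.get(f"{OSSA_API}/snapshots", params={"state": "release"}, timeout=10)
    resp.raise_for_status()

    for snapshot in resp.json():
        # Fields mirror the metadata called out above.
        print(snapshot["release_name"], snapshot["kernel_version"],
              snapshot["base_release"], snapshot["expiration_date"])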

When talking internally and with partner teams, it was vital that we could authenticate and authorize incoming POST requests to OSSA. We decided to rely on DataVault’s token-based authorization service for this, as DataVault has a well-established ecosystem that drives the majority of authorization requests at LinkedIn and fits our expectations. We created custom ACLs, with access rights per validating team, and ensured that these ACLs are enforced by DataVault when an external user submits a POST request with a token. The requesting individual or automation must first obtain that token by authenticating to the DataVault token service.
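
As a sketch of what a validating team’s automation might send, the request below carries a DataVault-issued token; the token handling, endpoint, and payload fields are illustrative assumptions, since DataVault’s actual client interface is internal.

    import os
    import requests

    def get_datavault_token() -> str:
        # Placeholder: in practice, the caller authenticates to the DataVault token
        # service and receives an ACL-scoped token; here we simply read one from
        # the environment for illustration.
        return os.environ["DATAVAULT_TOKEN"]

    payload = {
        "release_name": "rhel8-2023.05-test",  # hypothetical snapshot name
        "team": "infosec",
        "result": "nomination",                # or "deprecation" on failure
    }

    resp = requests.post(
        "https://ossa.example.linkedin.com/api/v1/validations",  # hypothetical endpoint
        json=payload,
        headers={"Authorization": f"Bearer {get_datavault_token()}"},
        timeout=10,
    )
    resp.raise_for_status()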

High availability of API

With this design in place, the next step for us was to ensure that this API remains highly available, as we already had several stakeholders depending on the metadata it provides.

Figure: High-availability deployment of OSSA across data centers (DC = datacenter)

We decided to have two nodes per site and put all services running on these nodes behind ATS (Apache Traffic Server). Our partner teams needed traffic to be routed from outside the environment where these servers live, and without ATS, every interacting party would have had to open network ACLs. With the service spread geographically, we also had to ensure that OSSA’s API reports the same dataset from every site, so we replicate data between data centers using GoldenGate replication.

Improved visibility into overall processing

While the API paved the way for managing the OS snapshot lifecycle, OSSA also enabled more granular visibility into the overall OS snapshot process by exposing related data via sources like Iris and an internal event bus. We use Iris-based notifications to surface the state of an OS snapshot during its build, testing, and monitoring. We also emit events to the event bus for anyone to consume via an intuitive UI, so external teams are not tied to interacting with the API for this information.

Now that we have discussed OSSA’s API and HA design in detail, we will dive into what constitutes an OS snapshot, how we have been building it, and the essential validation OSSA performs before creating an event.

What is an OS snapshot?

Before we dive into how an OS snapshot is built, it’s good to understand what an OS snapshot is. An OS snapshot is a collection of boot files (initrd, vmlinuz), RPMs, and some additional metadata. The “snapshot” in “OS snapshot” comes from the fact that we take a proverbial snapshot of all of the latest locally available RPMs and bundle them together into an entity that is meant to be immutable. This is deliberate: being able to reliably install the same RPMs across different test environments helps us isolate issues.
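
Conceptually, a snapshot can be pictured as a self-contained, versioned directory; the layout below is a simplified illustration (with a hypothetical snapshot name and file names), not our exact on-disk structure.

    rhel8-2023.05/               # hypothetical snapshot name
        vmlinuz                  # kernel boot file
        initrd.img               # initial ramdisk
        Packages/                # the RPMs captured at snapshot time
            kernel-*.rpm
            openssl-*.rpm
            ...
        repodata/                # yum repository metadata
        digest.json              # per-RPM checksums (discussed under "Monitoring")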

How do we build an OS snapshot?

Our team inherited OS snapshot creation from a partner team within PSSE. At the time we took over this effort, nightly replication of RPMs from upstream sources was already configured using an open-source tool called mrepo. For RHEL packages, we’d interact with the RH7 CDN using certificate-based authorization; for CentOS packages, we’d point to a publicly open mirror (from kernel.org). At snapshot-creation time, we’d rely on open-source tools like createrepo and repomanage to build an OS snapshot. Once an OS snapshot is built, it’s replicated over a highly distributed yum infrastructure, and our internal server lifecycle-management tooling refers to this distributed data when triggering in-place or full reimages of physical servers.
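
A heavily simplified sketch of that snapshot-assembly step, built around the same open-source tools, might look like the following; the paths and the exact invocation are illustrative rather than our production pipeline.

    import pathlib
    import shutil
    import subprocess

    MIRROR = pathlib.Path("/export/mirror/rhel8")                # RPMs replicated nightly (e.g., via mrepo)
    SNAPSHOT = pathlib.Path("/export/snapshots/rhel8-2023.05")   # hypothetical snapshot directory
    SNAPSHOT.joinpath("Packages").mkdir(parents=True, exist_ok=True)

    # repomanage (from yum-utils) prints only the newest version of each package in the mirror.
    newest = subprocess.run(
        ["repomanage", "--new", str(MIRROR)],
        check=True, capture_output=True, text=True,
    ).stdout.splitlines()

    for rpm in newest:
        shutil.copy2(rpm, SNAPSHOT / "Packages")

    # createrepo generates the repodata/ metadata that yum consumes.
    subprocess.run(["createrepo", str(SNAPSHOT)], check=True)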

The ability to build snapshots was also exposed through an ACL-enforced endpoint in the API. This endpoint accepts the necessary metadata from authorized users and relays it to the backend logic, which references it when creating a new test snapshot. This data flow is crucial because we build test snapshots for different distros. For example, to build test snapshots for RHEL7 and RHEL8, we use the RPMs fetched from upstream verbatim, while we use a separate methodology for creating CentOS7 snapshots. The latter is similar to how we perform these steps for RH* distros; the stark difference is the kernel RPMs embedded into CentOS7 snapshots.
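
As a companion sketch, a build request against that endpoint could look like the following; the endpoint, payload fields, and token handling are invented for illustration.

    import os
    import requests

    build_request = {
        "distro": "rhel8",           # which distribution to snapshot (hypothetical field names throughout)
        "type": "test",              # test snapshots are built first, then validated
        "requested_by": "os-upgrade-automation",
    }

    resp = requests.post(
        "https://ossa.example.linkedin.com/api/v1/snapshots/build",            # hypothetical endpoint
        json=build_request,
        headers={"Authorization": f"Bearer {os.environ['DATAVAULT_TOKEN']}"},  # DataVault token, as above
        timeout=10,
    )
    resp.raise_for_status()
    print("test snapshot build queued:", resp.json())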

Once a snapshot is built and replicated, the next important step for OSSA is to validate that the newly minted test OS snapshot can be installed on a server. Beyond the OS install itself, we validate that all of our internal tooling bootstraps the server as expected. Pre-OSSA, a dedicated engineer owned this responsibility, and it was a time sink considering the frequency with which these tests had to be done. This is especially true under the current engagement model, where multiple experimental test snapshots can be built for internal validation. We saw an opportunity to fold automated boot-testing of OS snapshots into OSSA and decided to leverage an existing product, MaaS (Metal as a Service), a self-service API for reimaging servers, to trigger the reimage.
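
A condensed sketch of that boot-test loop follows: submit a reimage through MaaS, wait for it to finish, then confirm that internal tooling bootstrapped the host. The MaaS endpoint, job status values, and health check are assumptions about that internal API, not its real interface.

    import time
    import requests

    MAAS_API = "https://maas.example.linkedin.com/api/v1"   # hypothetical MaaS endpoint

    def internal_tooling_healthy(host: str) -> bool:
        # Placeholder for post-install checks (agents running, configs applied, etc.).
        return True

    def boot_test(host: str, snapshot: str) -> bool:
        # Ask MaaS to reimage a test host with the newly built test snapshot.
        job = requests.post(f"{MAAS_API}/reimage",
                            json={"host": host, "snapshot": snapshot},
                            timeout=10).json()

        # Poll until the reimage completes (status values are illustrative).
        while True:
            status = requests.get(f"{MAAS_API}/jobs/{job['id']}", timeout=10).json()["status"]
            if status in ("succeeded", "failed"):
                break
            time.sleep(60)

        # Beyond a successful install, verify that internal tooling bootstrapped the host.
        return status == "succeeded" and internal_tooling_healthy(host)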

Figure: Boot-testing workflow

Figure: OS snapshot validation workflow

Figure: OS snapshot creation workflow

Before diving into the overall workflow, it’s good to understand how different partner teams pitch in to validate an OS snapshot. The OSSA team creates a new test OS snapshot and boot-tests it. Then the Maize team performs application testing on the test snapshot created by OSSA, and InfoSec performs vulnerability scanning of the test snapshot. Next, the hardware and capacity engineering (HCE) team performs hardware and regression testing on the test snapshot across multiple hardware SKUs, and the PSSE team owns the promotion of a test snapshot and the deprecation of a release snapshot. Lastly, the OS Upgrade Automation team submits imaging requests with the test snapshot under validation.

The Maize, InfoSec, and HCE teams test in parallel and report back to OSSA with their results. A successful validation is relayed back as a “nomination,” and a failure is reported as a “deprecation.” A few members of the PSSE organization have been given access to promote a test snapshot, as promotion makes the test snapshot generally available and builds a corresponding release snapshot that anyone can install on their hosts (since all the necessary validation has been conducted). PSSE also holds the key to deprecating previously released OS snapshots; we may deprecate such snapshots if a new CVE is found or unforeseen behavior is observed during runtime.
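
One way to picture the lifecycle that OSSA enforces is as a small state machine; the state names and transitions below are our own shorthand for the nomination, promotion, and deprecation actions described above, not OSSA’s literal schema.

    # Allowed lifecycle transitions, keyed by current state (illustrative only).
    TRANSITIONS = {
        "test":       {"nominated", "deprecated"},  # validating teams report pass/fail
        "nominated":  {"released", "deprecated"},   # PSSE promotes (or rejects) the snapshot
        "released":   {"deprecated"},               # e.g., a new CVE or unforeseen runtime behavior
        "deprecated": set(),                        # terminal; eligible for purging
    }

    def advance(current: str, target: str) -> str:
        if target not in TRANSITIONS.get(current, set()):
            raise ValueError(f"illegal transition: {current} -> {target}")
        return target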

The following figure describes the general workflow for OSSA’s interactions with different external teams for managing the lifecycle of test and release OS snapshots:

Monitoring

With OSSA, we saw an opportunity to improve the monitoring of snapshots and the RPM-fetch process. Until this point, there wasn’t a reliable way to do this, as there was no source of truth against which issues could be disambiguated or spotted. From the perspective of OS snapshot monitoring, we had to design changes for OSSA to track missing RPM(s), missing or modified metadata, and incorrect checksum(s).

Monitoring these items not only plays a crucial role in enforcing the immutability of OS snapshots but also helps ensure that what was vetted by partner teams remains the same throughout a snapshot’s lifetime. To track any modifications, we started computing a digest at snapshot-creation time. This digest (JSON) tracks the RPMs in a snapshot along with their SHA-256 checksums. The file is distributed with the OS snapshot and uploaded to an Ambry container so that a local modification of such a file can be detected by verifying it against the copy in the Ambry blobstore.
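
A minimal sketch of that digest computation, assuming a snapshot directory of RPMs; the digest filename and JSON shape are illustrative.

    import hashlib
    import json
    import pathlib

    def build_digest(snapshot_dir: str) -> dict:
        # Map each RPM in the snapshot to its SHA-256 checksum.
        return {
            rpm.name: hashlib.sha256(rpm.read_bytes()).hexdigest()
            for rpm in sorted(pathlib.Path(snapshot_dir).rglob("*.rpm"))
        }

    snapshot_dir = "/export/snapshots/rhel8-2023.05"   # hypothetical path
    digest = build_digest(snapshot_dir)

    # The digest ships with the snapshot and a copy is uploaded to Ambry, so any
    # local modification can later be detected by comparing against the blobstore copy.
    pathlib.Path(snapshot_dir, "digest.json").write_text(json.dumps(digest, indent=2))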

During this effort, we ran into a few escalations due to newly built snapshots lacking the latest versions of certain RPMs. This was yet another opportunity for improvement! We added logic to validate whether the last upstream fetch retrieved all of the newly available RPMs; if not, relevant members are notified so the underlying issue can be triaged. A scheduled task drives this check daily and notifies engineers of discrepancies.

As the data stored in and reported by OSSA directly impacts various production services, we also implemented monitoring for possible tampering with items stored in the database. We added an extra column per table that contains an HMAC-SHA256 of the other columns and is recomputed whenever any data in a row is modified. A scheduled task iterates over these columns at a regular cadence and compares the stored value with one computed during execution. If there is a mismatch, it auto-disables those OS snapshots from the list of valid snapshots and notifies the developers of the data-integrity violation. Any data modification can be isolated by this means. Because we use HMAC with a private key persisted in a managed keystore, it is highly improbable that anyone could recompute a valid value after tampering with the dataset.
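
A simplified version of that row-integrity scheme is shown below; the column names are placeholders, and the key would come from the managed keystore rather than being embedded in code.

    import hashlib
    import hmac

    def row_hmac(key: bytes, row: dict) -> str:
        # Serialize the protected columns deterministically, then HMAC the result.
        message = "|".join(f"{col}={row[col]}" for col in sorted(row)).encode()
        return hmac.new(key, message, hashlib.sha256).hexdigest()

    key = b"fetched-from-managed-keystore"   # placeholder; never hard-code a real key

    row = {"release_name": "rhel8-2023.05",  # illustrative columns
           "kernel_version": "4.18.0-372",
           "expiration_date": "2023-11-01"}
    stored_hmac = row_hmac(key, row)         # written to the extra column on every update

    # Scheduled task: recompute and compare; a mismatch disables the snapshot and alerts.
    if not hmac.compare_digest(stored_hmac, row_hmac(key, row)):
        print("data-integrity violation: disabling snapshot and notifying developers")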

Purging redundant/expired snapshots

Before OSSA, we kept creating new OS snapshots, and over time their cumulative size grew to 3TB. Many of these snapshots continued to persist because there was no clear path for retiring them. With OSSA in place and an OSSA-defined workflow for snapshot deprecation, we enabled the purging of older snapshots that are either past their expiration or have been deprecated for a while. In either case, OSSA steps in and purges snapshots that linger around and add no value. In the first iteration of this process, OSSA cleaned up ~500GB of redundant data, and we aim to remove more by fine-tuning our expectations. This is a nudge toward operational excellence: controlling the amount of data we own and the network cost of transferring it for replication.
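
The purge itself can be thought of as a scheduled sweep along these lines; the metadata fields and the grace period after deprecation are illustrative choices, not OSSA’s actual policy.

    import datetime
    import shutil

    PURGE_GRACE = datetime.timedelta(days=30)   # hypothetical grace period after deprecation

    def purge(snapshots: list[dict]) -> None:
        # Each entry carries illustrative fields: path, expiration_date, deprecated_on
        # (timezone-aware datetimes, or None if never deprecated).
        now = datetime.datetime.now(datetime.timezone.utc)
        for snap in snapshots:
            expired = snap["expiration_date"] < now
            long_deprecated = snap["deprecated_on"] is not None and now - snap["deprecated_on"] > PURGE_GRACE
            if expired or long_deprecated:
                # Remove the on-disk payload; metadata stays in OSSA for auditability.
                shutil.rmtree(snap["path"], ignore_errors=True)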

Conclusion and future work

Before OSSA, information about snapshots was tied to another source of truth, which limited the number of OS snapshots that could be concurrently supported per distribution. This limitation was particularly hindering because the number of concurrent snapshots in testing and general availability could exceed one per distro, considering that we were building snapshots at a much higher frequency. Removing the reliance on that source of truth and depending solely on OSSA for retrieving OS snapshot metadata removed this hard dependency; with OSSA, we can have as many snapshots of any type per distribution as we need.

OSSA has emerged as the source of truth for anything and everything related to OS snapshots within LinkedIn. A product that began as a way to improve visibility and operability has grown to a point where multiple critical services depend on data from OSSA being served on demand. It also enables authorized users to trigger OS snapshot builds without explicit intervention from our team and organization. A plethora of checks and guardrails were added to OSSA to ensure that internal processes leave audit trails and actionable HTTP responses, which makes interacting with an inherently complex ecosystem reliable and further reduces the dependency on tribal knowledge for driving this process end to end. While our major deliverables are live, we are still looking to improve the product so that it continues scaling with requirements. Some of the near-term goals are adding support for partially validating and conditionally releasing snapshots, templatizing the snapshot creation process for different distros, containerizing the upstream sync, and improving the ecosystem overall.

Acknowledgements

OSSA has become the product it is today because of continuous feedback and guidance from many engineering leaders, technical program managers, and engineering managers who helped mold design considerations and deliverables. Shoutout to Steve Fantin for driving the work to enable repodB monitoring and to interface OSSA with Ambry, and to Jayita Roy and Khushboo Kuchhal for scaling the service to a new data center and adding a dedicated staging environment. Many thanks to Cynthia Arriaga and Carlton Giles for keeping our deliverables under close watch and helping us unblock issues by effectively liaising with external teams. Thanks to Franck Martin for supporting this initiative and helping us roadshow OSSA into a viable product at the heart of multiple design and development endeavors across LinkedIn. Thanks to Nishan Weragama, Adam Debus, and Sean Patrick for providing valuable feedback during the initial design and helping us stay aligned with the Fleet Compliance initiative. Many thanks to Nitin Sonawane and Milind Talekar for supporting this effort.