Metal as a Service (MaaS): DIY server-management at scale

May 11, 2023

Guaranteeing that our servers are continually upgraded to secure and vetted operating systems is one major step that we take to ensure our members and customers can access LinkedIn to look for new roles, access new learning programs, or exchange knowledge with other professionals. LinkedIn has quite a large fleet of servers on-premise that depend on internal tooling to ensure they stay on the latest operating systems. This post will introduce an internal tool that serves as an interface for managing servers' lifecycles at the LinkedIn scale. We will emphasize the rationale behind this tool's existence, the path to making it available for our major consumers (i.e., site reliability engineers), and how we rearchitected and scaled this service from only being able to accept a maximum of 72,000 server submissions per day to having no limitations on acceptance rate.

The need for a solution

Before Metal as a Service (MaaS), all server-upgrade requests were relayed to engineers (from the Production Systems Software Engineering organization) via Jira tickets. It became their sole responsibility to drive this effort manually. Despite being well-defined, the process required a transfer of control from host/pool owners to an engineer from a different organization. Unexpected delays often crept in because communication happened over a ticket, and some issues required a more hands-on approach, with data-center technicians interceding to take things forward. The biggest challenge thus was aptly delegating server lifecycle management to the corresponding SREs and pool owners; removing an extra layer of coordination would help isolate issues quicker and give the owners a sense of control over the servers they were responsible for.

Background

To set the context for the upcoming discussion, we will define the following terms:

Overlapping hosts

One of the core concerns raised during interactions with partner SRE teams was how this product would detect overlapping requests, i.e., how it could prevent users from attempting different operations on a common set of hosts at the same time. Before the alpha release, we added a check that verifies whether any host in a newly submitted batch is already part of another batch. If so, the new batch is rejected with an appropriate message to the end user.
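The overlap check described above can be illustrated with a minimal sketch. It assumes batches are tracked as sets of hostnames keyed by batch-id; the function names, the data layout, and the message wording are hypothetical and are not taken from MaaS itself.

```python
from typing import Dict, Iterable, Set, Tuple


def find_overlapping_hosts(
    new_batch: Iterable[str],
    existing_batches: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return {batch_id: shared_hosts} for every existing batch that shares hosts."""
    candidate = set(new_batch)
    overlaps = {}
    for batch_id, hosts in existing_batches.items():
        shared = candidate & hosts  # set intersection
        if shared:
            overlaps[batch_id] = shared
    return overlaps


def validate_submission(
    new_batch: Iterable[str],
    existing_batches: Dict[str, Set[str]],
) -> Tuple[bool, str]:
    """Reject the whole batch, with a readable message, if any host overlaps."""
    overlaps = find_overlapping_hosts(new_batch, existing_batches)
    if not overlaps:
        return True, "Accepted"
    details = "; ".join(
        f"{batch_id}: {', '.join(sorted(hosts))}"
        for batch_id, hosts in sorted(overlaps.items())
    )
    return False, f"Rejected: hosts already part of another batch ({details})"


# Hypothetical data: two batches already submitted.
existing = {"batch-1": {"host-a", "host-b"}, "batch-2": {"host-c"}}
print(validate_submission(["host-b", "host-d"], existing))
print(validate_submission(["host-d", "host-e"], existing))
```

Rejecting the whole batch, rather than silently dropping the overlapping hosts, matches the behavior described in the text: the submitter is told which hosts conflict and can resubmit.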

Reimage

An OS upgrade that purges existing data from the HDD and restores the server to a clean state with host-specific configurations bootstrapped.

There was a version of “reimager” before MaaS. While the previous version of “reimager” was an effective tool, it was not queryable and required specific prerequisites to be manually fulfilled. With the overarching theme of enabling site reliability engineers (SREs) to take ownership of this entire process, we had to think beyond the existing solution, which led to designing a tool that would give SREs direct access for managing server lifecycle. A good option for this was exposing different functionalities via an API. Another shift we hoped to bring in was the concept of a batch; this was explicitly investigated to break the hard dependence on Jira tickets as the source of truth. Batching was a relevant concept because it was a common grouping semantic under which host owners could define a standard set of configurations to apply, such as which OS release to upgrade all hosts to, which action to perform on all the hosts, etc. We also wanted to ensure that only the right audience has access rights to interact with MaaS, as the majority of actions performed by this product (like reimage, reboot, decommission, disk-wipe, etc.) can be destructive. The anticipated outcome was a thin layer that would perform data validation, update external sources to reflect that submitted servers are about to be mutated, and submit these hosts to a downstream service for further action.

Aloha MaaS!

With a basic understanding of how the server-upgrade workflow needed to evolve, the PSSEBuild team interacted with various SRE teams to gather requirements that best fit their usability needs. With the completed design, we wrote an API that SRE teams could directly interact with. Metal as a Service (MaaS) is a self-service API that allows end users to upgrade (reimage), reboot, power on/off, wipe attached disks, and decommission servers in batches. At its heart, MaaS was designed as a CRUD, Flask-based application managed by systemd or an equivalent service manager. This application would expose a RESTful API that authorized users could exercise. We chose user association with an internal Active Directory (AD) group to enforce authorization; all authentication requests were also AD-based. With a basic structure in place, we worked on exposing endpoints for:

Submitting new requests for processing; the requested action could be any of “reimage,” “reboot,” etc. On successful submission, MaaS would return a batch-id that users could use as a reference for future interactions.
Querying the status of a batch relative to batch-id and hostnames
Querying batches submitted by AD username
Querying statistics of batch runtime
Canceling batches that were accidentally submitted
Querying backend service to gauge the count of active server upgrades across data centers
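To make the submit, query, and cancel operations listed above more concrete, here is a minimal in-memory sketch of the bookkeeping behind them. It is not the MaaS implementation (which is a Flask application backed by a database); the class name, method names, action strings, and host states are all hypothetical, chosen only to mirror the operations described.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Hypothetical action names, modeled on the operations the post mentions.
ALLOWED_ACTIONS = {"reimage", "reboot", "power-on", "power-off", "disk-wipe", "decommission"}


@dataclass
class Batch:
    batch_id: str
    submitter: str               # AD username of the submitter
    action: str
    host_status: Dict[str, str]  # hostname -> state


class BatchStore:
    def __init__(self) -> None:
        self._batches: Dict[str, Batch] = {}

    def submit(self, submitter: str, action: str, hosts: Iterable[str]) -> str:
        """Validate and record a new batch; return its batch-id."""
        hosts = list(hosts)
        if action not in ALLOWED_ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        if not hosts:
            raise ValueError("a batch needs at least one host")
        batch_id = uuid.uuid4().hex
        status = {host: "queued" for host in hosts}
        self._batches[batch_id] = Batch(batch_id, submitter, action, status)
        return batch_id

    def status(self, batch_id: str, hostname: Optional[str] = None) -> Dict[str, str]:
        """Status of a whole batch, or of a single host within it."""
        batch = self._batches[batch_id]
        if hostname is not None:
            return {hostname: batch.host_status[hostname]}
        return dict(batch.host_status)

    def batches_for(self, submitter: str) -> List[str]:
        """Batch-ids submitted by a given user."""
        return [b.batch_id for b in self._batches.values() if b.submitter == submitter]

    def cancel(self, batch_id: str) -> int:
        """Cancel hosts that are still queued; return how many were canceled."""
        batch = self._batches[batch_id]
        canceled = 0
        for host, state in batch.host_status.items():
            if state == "queued":
                batch.host_status[host] = "canceled"
                canceled += 1
        return canceled


store = BatchStore()
bid = store.submit("alice", "reimage", ["host-a", "host-b"])
print(store.status(bid, "host-a"))
print(store.batches_for("alice") == [bid])
```

Returning a batch-id from `submit` and keying every later call on it mirrors the design choice in the text: the batch, not a Jira ticket, becomes the unit that users reference.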

More than just exposing functionalities over an API, we also aimed at improving the visibility of the entire pipeline. We leveraged Iris-based alerting and an internally available event bus through which SREs could learn the state of their submissions without continually interacting with the API. Iris-based alerting is more granular and synchronous; submitters can be notified via diverse delivery methods (email, Slack, SMS, etc.). Iris would ping them once a batch was successfully accepted and again when the batch completed execution. Often there is a requirement to triage a wedged submission; we enabled MaaS to create tickets routed to an internal team that addresses one-off failures, tagging the submitters on those tickets so that they are aware of the progress.

As MaaS’ adoption grew, its architecture and deployment scheme had to evolve to ensure high availability while reducing the human intervention required for product release. We will now describe the evolution of this tool’s architecture, the challenges faced with growth in its adoption, and share some results of its overall performance over time.

Path to a minimum viable product

Figure 1: Architecture layout of MaaS at GA (hosted out of a single host)

Any software goes through multiple iterations of improvements and releases before being deemed stable. For MaaS, the starting point was co-hosting the web service, relational database (Postgres), and Redis-based caching layer on a single server. At the time of the alpha release, we were mainly focused on getting out a bare-minimum product with which clients could interact and give us actionable feedback, and one that could appropriately forward requests to a downstream service (AutoBuild). At this stage, all the interactions with MaaS were over HTTP (admittedly, this was far from ideal as clients’ credentials were exchanged in plaintext). The application was managed via systemd and required manual intervention for deployments. MaaS interacts with many external systems to validate the state of servers and mutate properties in some of these systems. Wherever a state mutation is expected, MaaS needs to interact using credentials that have been authorized. Because we needed to get the ball rolling, we added the credentials to an internally distributed GPG keystore. Whenever the service restarted, an engineer received a prompt to input their authentication credentials for the GPG keystore. On successful authentication, MaaS would be allowed access to the credentials necessary for interacting with authorized external services.

Another point to highlight here is that the submission pipeline had specific components that could only process one request at a time. In Figure 2, steps 1 through 6 had to conclude before a new request could be processed. During the minimum viable product (MVP) phase, we empirically determined that these steps took approximately two minutes, which implied that MaaS could only process one request every two minutes.
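The daily ceiling implied by this serialization follows directly from the two-minute figure. A small worked calculation (the function name is only illustrative):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_capacity(seconds_per_request: int) -> int:
    """Maximum requests per day when requests are processed strictly one at a time."""
    return SECONDS_PER_DAY // seconds_per_request


# Roughly two minutes per request during the MVP phase.
print(daily_capacity(2 * 60))  # 720
```

One request every two minutes is 30 requests per hour, or 720 per day.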

Figure 2: Request-processing workflow between client’s submission and MaaS

The primary rationale behind this design choice was data consistency. In the extant deployment model, multiple API workers functioned in tandem without sharing common memory or database connectors. The underlying database must be consistent for each worker to compute “overlapping_hosts” correctly. If multiple workers tried to calculate this result in parallel, an incorrect computation could result from dirty or unflushed concurrent database sessions.

Issues with MVP

Lack of service redundancy and a manual failover mechanism: Traffic was served from a single deployment. In case of an irrecoverable failure of the primary deployment node, failover would have to be initiated manually. We realized the importance of MaaS moving to an active-active deployment scheme, where losing one node does not result in a pathological service interruption.

Reliance on non-standard and aged locally managed services (with limited redundancy) and special hardware: At the time of release, as we predominantly intended to have a minimum viable product available, we focused on having locally available external dependencies: Redis for caching and returning API responses and PostgreSQL for containing and managing the bulk of data. This meant relying on an unmanaged data layer in which Redis (for caching) and PostgreSQL (as the primary datastore) served as single points of failure for this product. There was no redundancy for data stored in Redis; any data corruption would halt the cache layer, causing API response times to spike. Data replication for PostgreSQL also needed to be more robust.

Dependence on clunky credential management: MaaS’ deployment had a strict dependence on the presence of a GPG-secured keystore. It required an engineer (with access to this keystore) to be logged in at the time of deployment.
All the interactions were over HTTP: While this was enough for MaaS to be usable, it inherently posed a risk due to an unencrypted transfer of credentials. MaaS’ API should enable all interactions over HTTPS.
Globally enforced two-minute backoff per submission: As traffic to MaaS continued to ramp up, its behavior concerning rate-limiting was leading to a bad user experience. We envisioned that MaaS should allow limitless submissions without forcing end users to write fancy contraptions/wrappers to make MaaS accept their requests; this had become a common pain point for end users and warranted remediation. MaaS needed better means of managing bandwidth while keeping its backend performant.

Decoupling overlapping-hosts computation from submissions

While the existing solution allowed us to compute and deliver overlapping-host isolation confidently, with rising traffic and adoption, throttling became a common pet peeve among our users. We oriented ourselves towards ensuring they could submit to MaaS at a frequency of their preference rather than being hindered by a global backoff. We realized the potential of a design change that could break the dependence of data validation on request submission; in essence, we intended to decouple the client-facing piece from the more compute-intensive verification and processing piece. It was proposed that MaaS be bifurcated into an API (which would accept requests without rate limits) and a backend that would periodically read from a distributed messaging queue and perform the necessary operations before processing a request in its entirety. We decided to leverage Kafka as the distributed messaging queue. The choice of Kafka mainly stemmed from its widespread use within LinkedIn and its dedicated support SLA. After wiring our API and backend with Kafka-REST, MaaS could accept as many user requests as needed. The backend processed them sequentially while maintaining FIFO ordering, which was necessary for computing overlapping hosts for incoming requests. The overall workflow started to resemble the following figure:

Figure 3: Proposed changes to MaaS’ internals for removing global submission backoffs
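The split between a non-blocking API and a sequential backend can be sketched as follows. This is only an illustration of the pattern: the standard-library `queue.Queue` stands in for the Kafka topic, and the function names and message layout are hypothetical rather than taken from MaaS.

```python
import queue
import uuid
from typing import Dict, Iterable, Set

# Stand-in for the Kafka topic; an assumption made to keep the sketch self-contained.
submission_queue: "queue.Queue[dict]" = queue.Queue()


def accept_submission(hosts: Iterable[str]) -> str:
    """API side: enqueue and return a batch-id immediately, with no global backoff."""
    batch_id = uuid.uuid4().hex
    submission_queue.put({"batch_id": batch_id, "hosts": set(hosts)})
    return batch_id


def drain(hosts_in_flight: Set[str]) -> Dict[str, str]:
    """Backend side: process queued submissions one at a time, in arrival order."""
    results = {}
    while True:
        try:
            message = submission_queue.get_nowait()
        except queue.Empty:
            break
        if message["hosts"] & hosts_in_flight:
            # Overlaps with a batch that was accepted earlier in the queue.
            results[message["batch_id"]] = "rejected"
        else:
            hosts_in_flight.update(message["hosts"])
            results[message["batch_id"]] = "accepted"
    return results


first = accept_submission(["host-a", "host-b"])
second = accept_submission(["host-b", "host-c"])  # accepted by the API instantly
outcome = drain(set())
print(outcome[first], outcome[second])  # accepted rejected
```

Because a single consumer handles messages in FIFO order, each overlap check sees the effects of every earlier submission, which is the consistency property the two-minute backoff was originally protecting.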

Architecture 2.0

Figure 4: Proposed architectural changes enabling high availability of MaaS (DC* = datacenter)

We ensured that the new architecture had multiple active deployments spread across data centers that could serve traffic. The catch was that multiple operational deployments could cause issues due to the backend’s asynchronous tasks interacting with external sources. Many of these interactions are not idempotent and could cause race conditions or failures. To ensure correct behavior while maintaining multiple server footprints, we devised a mutex-inspired design that leverages the relational datastore as a source of truth and row-based locks for enforcing isolation. Async tasks (per deployment) would verify whether another “copy” was active before marching ahead. This safety net gave us confidence that the parallel execution of async tasks would be synchronized per type and would not leave either MaaS or external sources in an incorrect state.
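A mutex built on database rows, as described above, might look like the following sketch. It uses SQLite from the standard library purely so the example is runnable; the table name, column names, and the conditional-UPDATE approach are assumptions, and the actual MaaS schema and locking statements are not described in this post.

```python
import sqlite3
from typing import Iterable


def init_locks(conn: sqlite3.Connection, task_types: Iterable[str]) -> None:
    """Create one lock row per async task type (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_locks (task_type TEXT PRIMARY KEY, owner TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO task_locks (task_type, owner) VALUES (?, NULL)",
        [(task_type,) for task_type in task_types],
    )
    conn.commit()


def try_acquire(conn: sqlite3.Connection, task_type: str, deployment: str) -> bool:
    """Claim the row only if nobody holds it; the UPDATE itself is the atomic check."""
    cursor = conn.execute(
        "UPDATE task_locks SET owner = ? WHERE task_type = ? AND owner IS NULL",
        (deployment, task_type),
    )
    conn.commit()
    return cursor.rowcount == 1


def release(conn: sqlite3.Connection, task_type: str, deployment: str) -> None:
    """Release the row, but only if this deployment is the current owner."""
    conn.execute(
        "UPDATE task_locks SET owner = NULL WHERE task_type = ? AND owner = ?",
        (task_type, deployment),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
init_locks(conn, ["sync-external-state"])
print(try_acquire(conn, "sync-external-state", "dc1"))  # True
print(try_acquire(conn, "sync-external-state", "dc2"))  # False
release(conn, "sync-external-state", "dc1")
print(try_acquire(conn, "sync-external-state", "dc2"))  # True
```

Keeping one row per task type gives the "synchronized per type" behavior: different task types can run concurrently on different deployments, while two copies of the same task cannot. A production version would also need a lease or timeout so that a crashed owner does not hold a row forever; that is omitted here.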

We had to figure out how to distribute traffic evenly between active deployments. We considered a few alternatives, like:

Hosting nodes with HAProxy and reverse-proxying incoming requests through them seemed plausible. Still, it would have required us to maintain and monitor another set of hosts to keep such a setup live. While only some critical services within LinkedIn depend heavily on HAProxy, adopting this pattern would have demanded more effort from us than the problem warranted.
Creating a virtual IP and using ucarp for performing an automated failover in case the service on one host was out. While a good alternative, the version of ucarp that we experimented with posed a pretty stringent restriction on its usage; all the hosts behind the virtual IP should be in the same network subnet, which we could not reliably enforce.

In light of our experimentation with different choices, we decided to defer traffic distribution to something commonly used within LinkedIn - DNSDisco. This internal DNS-based proxy service took the onus of performing periodic health checks of active deployment(s) and making routing decisions relative to the outcome.

We mentioned earlier that the locally hosted caching layer and relational database would not scale horizontally; we needed to remove the locally hosted data layer and ensure that the same requests received consistent data across multiple deployments. With a well-understood use case, we leveraged managed data services provided by the Couchbase-as-a-Service and MySQL-as-a-Service teams for provisioning managed data sinks. All interactions with MySQL and Couchbase were based on well-defined authentication, authorization, replication, and automated failover protocols.

We had moved away from a bulk of self-managed components and could comfortably host our service on a standard application node with 64GB of memory. The last bit to cover was our move away from the GPG keystore for managing service credentials. We moved all of our credentials to an internal service (KMS) which would allow access to the same via RESTful calls based on application certificates and ACLs associated with those secrets. When this move was complete, we could restart the service or deploy without worrying about the state of the local GPG keystore or manually managing GPG keys. This move was also a precursor and enabler of one-click deployments for MaaS - the possibility of doing such deployments became reasonable once the mandatory “human involvement” piece was moved out of the frame.

Last but not least, we enabled HTTPS-based communication with MaaS. While we’ve maintained AD-based authentication, we also enabled mTLS for MaaS, so clients can now present verified certificates for exercising authorized endpoints. MaaS was placed behind DataVault, which currently fields all authorization requests. This was the final touch in adopting existing tooling to meet our goals instead of perpetuating tech debt.

A few crucial integrations:

MaaS leverages internal data sources for blocking requests for hosts that are deemed to be “in use”; doing so allows host owners to not accidentally pave over the machine(s) that are actively hosting application(s) or are part of an active allocation.
Hosts being submitted must have a functional IPMI console. MaaS lets users enable this check through the API before a submission is accepted, increasing the overall success rate of their submissions.
MaaS interfaces with a few internal queryable services that give the most reliable information about the hosts. It aggregates these data points and uses them to create the necessary triage tickets with the associated teams up front, rather than accepting those hosts blindly and filing tickets only after the associated batch inevitably fails.
Server-reclaim workflow: MaaS plays a pivotal role in the server-reclaim workflow. This workflow is automatically triggered for defunct and unallocated physical servers. An automated workflow has been defined that isolates such hosts and submits them to MaaS for starting an OS upgrade which, incidentally, reverts servers to a pristine state. Such newly upgraded servers are returned to a pool of hosts from which other users can allocate.
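The first two integrations above amount to screening hosts before a batch is accepted. A minimal sketch of that screening is shown below; the two lookup callables stand in for the internal data sources and the IPMI probe, and their names, the `check_ipmi` flag, and the reason strings are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def screen_hosts(
    hosts: Iterable[str],
    is_in_use: Callable[[str], bool],
    ipmi_is_healthy: Callable[[str], bool],
    check_ipmi: bool = True,
) -> Tuple[List[str], Dict[str, str]]:
    """Split hosts into those safe to accept and those blocked, with a reason each."""
    accepted: List[str] = []
    blocked: Dict[str, str] = {}
    for host in hosts:
        if is_in_use(host):
            # Never pave over a machine that is hosting applications or is allocated.
            blocked[host] = "in use"
        elif check_ipmi and not ipmi_is_healthy(host):
            # Optional check, enabled by the submitter.
            blocked[host] = "IPMI console not functional"
        else:
            accepted.append(host)
    return accepted, blocked


# Hypothetical lookups backed by static data for the example.
allocated = {"host-a"}
broken_ipmi = {"host-c"}
accepted, blocked = screen_hosts(
    ["host-a", "host-b", "host-c"],
    is_in_use=lambda host: host in allocated,
    ipmi_is_healthy=lambda host: host not in broken_ipmi,
)
print(accepted)  # ['host-b']
print(blocked)   # {'host-a': 'in use', 'host-c': 'IPMI console not functional'}
```

Collecting a reason per blocked host is what makes it possible to open triage tickets before the batch runs instead of after it fails.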

Insights from telemetry

At MaaS, we collect a wealth of data per submission and derive meaningful and actionable metrics from it. This is in stark contrast to how this process was conducted previously, when there was no direct way of querying and publishing data that could be leveraged for building a metrics-based dashboard. To actualize this, we teamed up with a few members of another engineering team to define a mechanism for their team to pull database models from MaaS, use formulas created by our team for measuring performance, and publish the requested PowerBI dashboards.

A few metrics:

The overall success rate of “reimage” requests in the last year:

Figure 5: Sample UI representing success percentage and throughput

There are a lot more metrics that further segregate each batch and respective runtimes based on:

Hardware SKUs
Individual runtime, and failures of different subactions for completing an “action” (like reimage, etc.)
Quantifying different lags in between submission to MaaS and external components, etc.

If we were to sift through data persisted within MaaS, this is what the overall performance of MaaS looks like:

Figure 6: Overall YoY throughputs of different actions supported by MaaS

Near-term plans

Some of the work we’re planning is derived from the granular data we gathered from our dashboards.

The main pain point for us is the high number of failures in the case of broken IPMI consoles. We’re exploring options to bypass broken consoles and potentially speed up overall upgrade times by conditionally employing in-place upgrades instead of a complete upgrade (which can be time-consuming due to multiple factors outside the control of MaaS).
We will focus on improving the API’s overall throughput; with time and multiple iterations of API changes, we’re noticing a relative increase in overall batch-acceptance latency. While still within an acceptable range, we’d like to minimize it.

Conclusion

We’ve come a long way from when Jira ticket conversations drove server upgrades. MaaS has been catering to the diverse needs of SRE teams since its MVP days - from server upgrades to reboots and more. MaaS evolved from a point where it could not process more than 720 batches in a day to now, where it can accept and process that many submissions in a few seconds. That said, this is not the end of the road for improvements to MaaS. In the grand scheme, having a performant self-service interface through which SREs can manage their server fleet’s state is a boon that helps the respective teams maintain their applications, eventually leading to a good experience for end users who benefit from a well-oiled internal mechanism.

Acknowledgments

This entire effort has been possible because of the people who have driven various initiatives, prioritized, and committed to perpetuating MaaS’ stability and utility. A big shoutout to Jon Bringhurst for driving the development and release of the Python library and CLI for interacting with MaaS! Thanks to the engineers who have meaningfully contributed to MaaS and upheld craftsmanship - George Tony, Tianxin Zhou, Steve Fantin, and Jayita Roy. Many thanks to Vjiay Rajendrarao for keeping our timelines in check while the project started to grow - it would have been tough to meet our commitments otherwise. Last but not least, thanks to Franck Martin for reviewing and providing valuable inputs, and to Brian Hart, Nitin Sonawane, Nisheed Meethal, and Milind Talekar for continually supporting us and providing guidance and feedback for various improvements.

Topics: Developer Experience/Productivity Infrastructure