Scaling AutoBuild: Our Journey Towards Delivering An Enhanced Customer Experience
March 14, 2023
Ensuring operating system upgrades are done quickly and efficiently is just one way we ensure that the underlying applications and our website are running smoothly for LinkedIn members and customers. AutoBuild is LinkedIn’s end-to-end automated server lifecycle management system. Besides building new servers and reimaging the existing server fleet, AutoBuild supports rebooting, powering on/off, wiping, and decommissioning bare-metal servers. It’s also a critical product because it sits directly behind Metal As A Service (MaaS), a self-service API for performing OS upgrades, server reboots, etc.
While we now feel confident about AutoBuild’s overall health and its ability to handle traffic, it has taken a few iterations of improvements and fine-tuning to ensure that our efforts were directed toward making this product evolve with new expectations around throughput. In this post, we’ll start with how AutoBuild works to appreciate what we have today, highlight the problems we experienced with the existing architecture, and describe the improvements that have turned AutoBuild into a highly available, resilient application for our developers.
How does AutoBuild work?
AutoBuild exposes action types (like build, ipmi_reboot, etc.) to perform on bare-metal servers via a RESTful API, which enforces AD-based authentication and a more granular authorization mechanism for accepting requests only from permitted clients. At a meta-level, AutoBuild is a CRUD application that interacts with a few externally managed components for its daily tasks. AutoBuild’s API is a Flask server managed by Gunicorn and relies on managed MySQL-as-a-service as the relational datastore for persisting all incoming data. Outside the API, many independent processes function in tandem to ensure that AutoBuild delivers as expected. These separate processes are also managed, i.e., if any of them crash, we have a mechanism for automatically booting them back up to an expected count. Let’s take a look at how we got here.
Before we dive into issues the existing architecture was experiencing, it’d be good to define two terms:
“Organic growth” - the newly provisioned server inventory that has yet to be imaged, driven by forecasted YoY hardware growth. This differs from the bulk of reimaging requests, which are limited to existing servers in the fleet.
Reimage - an OS upgrade that purges existing data from the HDD and restores the server to a clean state with bootstrapped host-specific configurations.
Issues with existing architecture
While the scope of v1 architecture met our needs for a long time, we foresaw challenges as we continued to scale:
With "organic growth" intertwined with a barrage of new reimaging requests, sequential processing performed by internal processes could not scale - we frequently noticed message-queue crashes leading to lost messages, which often adversely affected the entire workflow.
Backend processing did not allow end users to specify a priority for a request. A reimage request should ideally take precedence over one that’s part of “organic growth,” simply on the merit of turnaround time - it’s fair to expect a user reimaging a server to get that host back as soon as possible, as opposed to a server that was recently racked. We received multiple requests to revert reimages that couldn’t complete within the end users’ expected SLA because the service at that site was busy processing recently “Discovered” (newly racked) hosts.
The architecture relied heavily on an external messaging solution (RabbitMQ, to be precise). The trouble with this additional dependency was that it was hard to calibrate, hogged system resources, and served as a single point of failure.
Request processing from AutoBuild’s backend often became sequential during traffic spikes because a request’s dependent jobs were interleaved with messages from other requests, all mediated by a single process. While this model worked fine when the expected count of reimages per day was limited, it couldn’t keep up with growing traffic demands as the influx of imaging requests from both “organic growth” and reimages continued to increase quarterly.
An impetus for the v2 architecture
It’s worth noting that jobs from different requests can execute in parallel. With this established, we envisioned another way of request processing that enables true parallelism. The entire workflow could be framed as a “simple” coding problem: given multiple linked lists, define criteria for parallel traversal while ensuring synchronization between competing processes to avoid race conditions. With all of these ideas floating about, we realized that an architectural change relying solely on a relational database to function as a queue would allow us to drop dedicated messaging queues from our architecture entirely. Besides shedding the overhead of managing them, we were moving towards a backend where processes became first-class members - they pick jobs themselves and don’t require a joint mediator for delegation.
Some of the data required to break this workflow into a simple linked-list traversal was already there - we already persisted each job’s status. Now onto the more challenging part: ensuring that multiple processes of the same type do not attempt to process the same job. We did this by implementing row-level locks per invocation of the code that picks jobs for processes to work on, leveraging MySQL’s FOR UPDATE clause with its SKIP LOCKED and NOWAIT options to enforce this behavior.
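A minimal sketch of what such a job-claiming query might look like; the table, columns, and status values are hypothetical, not AutoBuild’s actual schema:

```python
# Sketch of a row-locking job claim against MySQL (8.0+), where the
# jobs table acts as the queue. All names here are illustrative.

CLAIM_SQL = """
SELECT id, host, action
FROM jobs
WHERE status = 'READY' AND job_type = %s
ORDER BY priority DESC, created_at
LIMIT 1
FOR UPDATE SKIP LOCKED
"""

def claim_next_job(conn, job_type):
    """Atomically claim one READY job of the given type.

    FOR UPDATE locks the selected row for this transaction; SKIP LOCKED
    makes competing workers skip rows already locked by someone else
    instead of blocking, so many processes can poll the same table safely.
    """
    with conn.cursor() as cur:
        cur.execute(CLAIM_SQL, (job_type,))
        row = cur.fetchone()
        if row is None:
            conn.rollback()  # nothing claimable; release the transaction
            return None
        job_id, host, action = row
        cur.execute(
            "UPDATE jobs SET status = 'IN_PROGRESS' WHERE id = %s",
            (job_id,),
        )
        conn.commit()  # row lock released; the job is now owned by us
        return {"id": job_id, "host": host, "action": action}
```

Using NOWAIT instead of SKIP LOCKED would make the query fail immediately on a locked row rather than skip it, which suits a “try once, back off” polling style.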
Figure 1: Envisioning parallel request-processing from the proposed backend design
To better understand the above diagram: each request creates all of its dependent jobs at the time the API receives it and sets the status of the first job (in order) to a state that the backend processes constantly poll for. Once a process picks a job, its state changes, and it is no longer available to other processes of the same type. On completion, the current process updates the current job’s state and puts the next job (in order) into the state sought by the corresponding backend processes. In this manner, a request ensures that all of its sequential jobs are performed in the expected order, while backend processes can pick any job available to them (based on state) and proceed with processing. Processes take complete ownership of finding a job to process, unlike our previous design, where everything had to be delegated by a dedicated process. And because all of these interactions happen through the database, there is no additional dependency for managing FIFO ordering while processing requests.
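The chain of state transitions described above can be sketched as a small in-memory model; job names and states are illustrative (in AutoBuild the statuses live in MySQL rows and the “pick” is protected by row locks):

```python
# In-memory sketch of one request's job chain. Only the first job starts
# READY; completing a job promotes the next one in the chain.
PENDING, READY, IN_PROGRESS, DONE = "PENDING", "READY", "IN_PROGRESS", "DONE"

def create_request(job_names):
    """Create all dependent jobs up front; only the first is READY."""
    jobs = [{"name": n, "status": PENDING} for n in job_names]
    jobs[0]["status"] = READY
    return jobs

def pick_job(jobs):
    """A worker claims the first READY job (the DB lock makes this atomic)."""
    for job in jobs:
        if job["status"] == READY:
            job["status"] = IN_PROGRESS
            return job
    return None

def complete_job(jobs, job):
    """Mark the job DONE and promote the next job in the chain to READY."""
    job["status"] = DONE
    idx = jobs.index(job)
    if idx + 1 < len(jobs):
        jobs[idx + 1]["status"] = READY

# Walk one request through its whole chain:
request = create_request(["wipe", "install_os", "configure"])
while (job := pick_job(request)) is not None:
    complete_job(request, job)
assert all(j["status"] == DONE for j in request)
```

Because each request only ever exposes one READY job at a time, ordering within a request is preserved while jobs from different requests run in parallel.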
By the end of this iteration, we gained confidence that our product could scale linearly with several processes of each type. The only caveat here was finding a balance between the number of processes of each type and the maximum number of concurrent connections allowed with the relational database. Besides architecture changes, we also introduced some critical workflows for managing requests:
Enabled worker-pool management: the backend took responsibility for bringing up new instances of processes in case of a crash - this ensured that workers operated at constant bandwidth across deployments.
Ensured all requests from MaaS garnered a higher priority without leaving other requests indefinitely deprioritized: deprioritized requests are processed using age-based criteria, so a deprioritized request’s priority increases with time, making it eligible for processing after a certain “age.”
Added a straggler manager to deal with requests identified as stragglers - no more “zombie” jobs!
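The age-based prioritization above could be sketched roughly as follows; the boost rate, units, and function names are hypothetical, not AutoBuild’s actual tuning:

```python
import time

def effective_priority(base_priority, submitted_at,
                       now=None, age_boost_per_hour=1.0):
    """Age-based priority: a request's effective priority grows with time
    spent in the queue, so low-priority ("organic growth") requests cannot
    starve behind a steady stream of higher-priority MaaS reimages.
    The boost rate here is an illustrative placeholder."""
    now = time.time() if now is None else now
    age_hours = max(0.0, (now - submitted_at) / 3600.0)
    return base_priority + age_boost_per_hour * age_hours
```

For example, an “organic growth” request with base priority 1 submitted ten hours ago would outrank a freshly submitted MaaS request with base priority 10, because its age boost has closed the gap.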
With this being deployed, our architecture started to look like this:
Figure 2: Architecture layout of AutoBuild after the v2 rollout
All AutoBuild server (a.k.a. “AB-server”) requests are routed by an “AB-router,” which makes routing decisions based on where the server being reimaged is physically located. This routing mechanism is necessary due to network-ACL restrictions: a server being reimaged/rebooted/etc. must interact with the corresponding AutoBuild server(s) over an OOB network only accessible from that server.
In this setup, our external dependencies were limited to a Postgres instance running locally on each AutoBuild server deployment. This design is probably already flagging potential issues in readers’ minds: there was only one AB-router globally and one AB-server per data center, with each AB-server depending on its own locally hosted Postgres database.
Avenues for improving v2
The setup described above became a bottleneck for the following reasons:
The number of servers processed correlated directly with system resources - increasing throughput would not be possible beyond a certain point. The recourse was to scale the implementation horizontally.
Single points of failure and the absence of a failover mechanism posed a risk to the overall product. If there were a site-wide outage at the data center hosting the router, AutoBuild would become inaccessible. All AutoBuild servers and local databases were also points of failure with no redundancies - if a hosting machine went down or a database got corrupted, the entire service for that site would be affected.
An AB-server gave equal weightage to all incoming requests. It was impossible to dedicate an AutoBuild server to a particular request type - this became an issue when the overall throughput for reimage requests had to exceed its current value.
Reliance on components outside LinkedIn’s standard/prescribed architecture added to the maintenance burden and raised concerns about their role in the proposed architecture.
AutoBuild’s server identity was heavily tied to static configurations on the router - adding/removing server(s) required changing the static configuration and redeploying the router.
At that point, we had numerous incoming requests from our biggest producers (Metal-as-a-Service, predominantly) alongside imaging requests for servers newly racked at data centers. AutoBuild had to become aware of this and implement some form of request prioritization so that more urgent requests (i.e., those coming from MaaS for reimaging) were processed sooner than “organic builds.” We also saw this as a promising avenue for moving away from static configurations and discovering a healthy AutoBuild server dynamically.
With these disparate expectations on our plate, we began to envision a more robust, resilient architecture directed towards removing single points of failure, adding redundancies, and shifting from hard-coded metadata to a model that can adapt dynamically.
In the proposed design, we introduced clusters of nodes hosting multiple instances of AB-server per cluster. The main incentive behind this structure was to create a resilient ecosystem - if one of the nodes becomes dysfunctional, overall processing is not affected. With this design, API instances also have anycast-like behavior: any API instance can accept requests of any type, but nodes belonging to a cluster can service only the type of requests for which the cluster was allocated, enabling scaling by request type. We also modified AutoBuild’s backend to become cluster-aware - backend code started filtering requests based on how a server identified itself (i.e., which cluster it belonged to).
Figure 3: Multi-cluster setup of AutoBuild for high availability
As seen in Figure 3, AutoBuild’s servers were segregated into three cluster types in the proposed design:
Dealing only with reimaging requests (serviced by Build Node-*),
Dealing only with “organic growth” requests (serviced by Discover Node-*), and
Dealing with requests not belonging to the first two categories (serviced by Others Node-*).
Such a structure allows us to scale processing throughput for different request types up or down by assigning hosts to the clusters that need them. We pushed the mapping of clusters to the request types they can process out to an external source, so servers can be moved between clusters seamlessly through trivial configuration changes. We also replaced the locally hosted Postgres databases with managed MySQL-as-a-Service per data center - this resolved the databases being single points of failure in the previous architecture. MySQL-as-a-Service allowed provisioning database nodes that can handle up to 5K QPS with up to 1TB of storage - more than enough for a multi-node deployment. MySQL-as-a-Service also took on the onus of implementing data replication under the covers.
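Conceptually, the externalized cluster-to-request-type mapping might look like the sketch below; cluster and request-type names mirror Figure 3 but are illustrative, not AutoBuild’s actual configuration:

```python
# Hypothetical externalized mapping of cluster types to the request types
# they may service. In practice this would be loaded from an external
# configuration source so hosts can be moved between clusters without a
# redeploy.
CLUSTER_REQUEST_TYPES = {
    "build":    {"build"},                       # reimaging requests
    "discover": {"discover"},                    # "organic growth"
    "others":   {"ipmi_reboot", "power_off",
                 "wipe", "decommission"},        # everything else
}

def can_service(cluster_type, request_type):
    """Backend filter: a node only picks requests allocated to its cluster."""
    return request_type in CLUSTER_REQUEST_TYPES.get(cluster_type, set())
```

With the filter in the backend’s job-picking path, moving a host from the “others” cluster to the “build” cluster is a pure configuration change that immediately shifts reimage throughput.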
DR design of autobuild-router
With all of these improvements enabling the high availability of AutoBuild, a highly available setup of the AutoBuild-router was the next piece we focused on. We decided to create an active-active deployment of routers scattered across data centers. This was especially important because we didn’t want the health of a single data center, or of a single node within one, to exclusively control traffic.
We rolled out this active-active router setup with the router’s CNAME behind DNSDisco, which essentially proxies incoming requests and distributes them between active deployments. It actively checks whether router instances are live using HTTP-based health checks and forwards requests to whichever router responds “ok.” We also rewrote our router with a more performant API library to ensure we’re extracting as much performance as possible.
Another big puzzle was figuring out how to do away with static configurations within the AutoBuild ecosystem. At this point, we had highly available setups of AutoBuild servers and routers; we now had to stitch the two pieces together so that routers were aware of the servers serving API requests. To do so, we created an auxiliary service that runs on AutoBuild servers and notifies the router(s) of a server’s existence via a “heartbeat” - it checks whether the hosted API can accept requests over HTTPS and, if so, asks the router to consider this host as a possible option for routing requests. At the router level, we evaluate the freshness of these “heartbeats” to ensure that the router always knows the state of the cluster before forwarding a request. We chose a freshness window of three minutes: if an AutoBuild server does not send a “heartbeat” to the router(s) for over three minutes, the router stops considering it a viable option.
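A rough sketch of the router-side freshness check; only the three-minute window comes from the text above, and all names are hypothetical:

```python
import time

# Heartbeats older than this are considered stale (three minutes, per the
# freshness window described above).
FRESHNESS_SECONDS = 3 * 60

def viable_servers(last_heartbeat, now=None):
    """Return the AB-servers the router may still route to.

    `last_heartbeat` maps server name -> epoch seconds of the most recent
    heartbeat received from it; anything staler than the freshness window
    drops out of the routing pool until it heartbeats again.
    """
    now = time.time() if now is None else now
    return {server for server, ts in last_heartbeat.items()
            if now - ts <= FRESHNESS_SECONDS}
```

Because viability is recomputed from heartbeat timestamps on every routing decision, a server that crashes simply ages out of the pool, and a recovered server rejoins as soon as its auxiliary service heartbeats again - no static configuration change required.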
Performance-testing v2 architecture
Once we finalized our design and gained enough confidence in the overall workflow by running integration tests, we rolled out these architecture and code changes across data centers. We were lucky enough to get some hosts for performance testing of the new architecture. We began with two AutoBuild servers and deployed a local router to distribute requests between these nodes. For this testing, we created a shared instance of a locally deployed MySQL server from which both hosts would pick requests. We created requests with batches of hosts and submitted them in parallel, intending to gauge the number of requests AutoBuild can process concurrently. In all fairness, a server’s reimage is contingent on many factors outside of AutoBuild, hence we limited our experimentation to the portion of the workflow where AutoBuild was the sole driver. Here are the results of our performance tests:
Figure 4: Graphical representation of expected reimage count based on internal experimentation with v2 design
With only two active servers, the count of servers successfully processed by AutoBuild came to >60K when the submission size was ~100 hosts (in parallel). Increasing the host count per parallel submission lowered this number to ~40K because the API dropped requests when bombarded in parallel. This was expected behavior, since API threads can only be scaled to a limit beyond which behavior becomes erratic - request acceptance depends on specific backend tasks occupying threads while “in flight.” Therefore, if traffic to the API is unthrottled, no API threads are available to serve it, and we see a spike in HTTP 5xx responses from the server side. As submissions to AutoBuild grew, the number of active concurrent connections to the MySQL database increased in step. This laid the foundation for the overall throughput of the new architecture.
Multiple servers host the API in our production deployment, and requests are scattered among them. Our existing count of processes, and the corresponding concurrent database connections, is manageable for the underlying database. This didn’t come by default; we had to build a few clever database indices to keep the databases performant in the face of traffic. In the future, if we need to scale AutoBuild to handle more daily requests, we expect to increase the process count and the number of allowed concurrent connections to the managed-MySQL instances.
V2’s YoY performance
This new backend design and horizontally-scaled AutoBuild have been live since the start of 2020. Here’s some data reinforcing our vision of AutoBuild concurrently handling many reimage requests and new server builds:
Figure 6: Traffic segregation based on request origin
We looked closely at the maximum number of servers that could be built/reimaged daily. So far, the maximum number of concurrent requests successfully processed is ~3K per day per data center. The current traffic directed at AutoBuild is a small percentage of what our performance testing predicted that this new design could handle.
Insights & Telemetry
Finally, as with any distributed system, growth in AutoBuild’s complexity spurred an increased need for insight into the health of the different components of our infrastructure. To accomplish this, we turned to two open-source technologies, Prometheus and Grafana, as the bedrock of a telemetry platform that allows any of our applications to integrate and start reporting metrics seamlessly. We created a small HTTP layer, deployed on any machine that wants to integrate with our telemetry platform, to serve as an aggregator from which Prometheus scrapes metrics. We then created a small library, imported by products that want to produce data to our telemetry platform, which thinly wraps Prometheus’ client libraries to target the endpoints of our aggregating HTTP layer. With both pieces in place, anyone on the team can emit metrics to Prometheus from any of our server-builds-related applications with just a few lines of code.
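To illustrate the idea, here is a stdlib-only sketch of the kind of aggregation layer described above, rendering counters in Prometheus’ text exposition format; the real layer wraps the official Prometheus client libraries, and all names here are hypothetical:

```python
# Minimal sketch of a metrics aggregator: applications increment named
# counters, and a scrape endpoint would serve render()'s output for
# Prometheus to collect.
from collections import Counter

class MetricsAggregator:
    def __init__(self):
        self._counters = Counter()

    def inc(self, name, labels=None, amount=1):
        """Increment a counter, keyed by metric name plus sorted labels."""
        key = (name, tuple(sorted((labels or {}).items())))
        self._counters[key] += amount

    def render(self):
        """Render all counters in Prometheus' text exposition format,
        e.g. 'reimage_requests_total{site="dc1"} 2'."""
        lines = []
        for (name, labels), value in sorted(self._counters.items()):
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            suffix = f"{{{label_str}}}" if label_str else ""
            lines.append(f"{name}{suffix} {value}")
        return "\n".join(lines) + "\n"
```

An application-side call like `agg.inc("reimage_requests_total", {"site": "dc1"})` is roughly the “few lines of code” experience described above; the heavy lifting of scraping, storage, and querying stays with Prometheus itself.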
This effort has offered us real-time insight into our ecosystem of products that we have never had before. Grafana, the second of the two open-source technologies, is where we bring this information to life by using its visualization capabilities.
Figure 7: Snapshot of UI, which depicts the internal state of multi-cluster deployment
When we began, we had no notion of high availability and faced limitations on the number of servers we could build. We also started from a point where it often took hours to deploy the product reliably across data centers. With a concerted effort, we are now actively prepared for the impact of ever-growing traffic, and we have one-click deployment across data centers. We can also estimate how long a request will take to complete by dynamically computing turnaround times, and the code base now has at least 70% unit-test coverage (compared to 0% initially). It’s been a seminal experience working as a team on the technical challenges that presented themselves along the way and eventually coming to a stage where we feel reasonably convinced about the future of this product. We have a reliable and robust ecosystem that can scale horizontally with traffic and can be used to upgrade servers and image new inventory with high confidence. Such an ecosystem, in turn, allows application owners to upgrade and allocate servers promptly and conveniently, and hence continue delivering an impactful experience to end users. Surely enough, this isn’t the end-all of improvements that can be made, and we are ready to explore more avenues in the future.
This effort was only possible with a team of high-caliber engineers who added momentum to our overarching goals. Thanks to Phincy for helping conduct thorough integration testing of the new codebase and closely working on a well-planned rollout - it took a lot of work to gain confidence; it would have been much harder without this concerted effort. More shoutouts to Steve and Jayita for driving and delivering an insightful telemetry integration into AutoBuild - we can only attempt to fix visible and reproducible issues, and this integration gives us that visibility. Thanks, Phirince and Tony, for contributing and driving improvements that have led to a one-touch deployment model for AutoBuild - we’ve come a long way since our manual deployment days! Last, a big shoutout to Nitin, Nisheed, and Milind for supporting this initiative and believing in us and our vision.