Coding Conversations: Interviews on Replacing Infrastructure Systems at LinkedIn

Adam Heller

Software @ Google

December 18, 2018

LinkedIn’s infrastructure has grown remarkably over the years, and that has kept the Production Infrastructure Engineering teams on their toes. Growing quickly means we’ve had to learn to adapt quickly. While most of our systems are solid and ready to scale, from time to time we find ourselves supporting systems that struggle to meet increasing demands.

It’s common advice to write software that can support 10x the needs you have today. At LinkedIn, that amounts to maybe a few years of growth. With the number of systems we support, planning for scale has become a continuous process. As a result, any engineer working to improve our systems may before long find themselves asking: When is it better to scrap and replace a complex, mission-critical system?

You'll often hear that you’re better off incrementally improving the system you have rather than replacing it, especially if it's complex, and more so if it's mission-critical. And if you don't understand the system well enough to fix it in the first place, you probably don't understand it well enough to replace it.

Yet systems here do get replaced and rewritten. Over the years, significant investments have been made to replace critical systems, and many appear to have worked out well for LinkedIn (and often for the open-source community as well). Survivorship bias aside, this trail of successes seems to contradict the common wisdom to improve rather than replace.

Curious, I seized the opportunity to interview a few of my co-workers at LinkedIn who had rewritten or replaced a few of the less-public, but still crucial chunks of our production infrastructure software over the years. I wanted to find out what motivated their decisions to replace these systems, what kind of environments allowed these choices to be made, and in turn, shed light on some of our organization’s lesser-known projects. Here’s what I learned.

The first LinkedIn-wide internal DNS system

LinkedIn comes from humble beginnings as a startup. And as you’ll find at most startups in the early days, decision-making trade-offs sometimes favor fast delivery over robustness. When the startup grows (as LinkedIn did), this leaves a trail of technical debt that someone eventually needs to pay down. The following story may be one of the most clear-cut cases for showing the potential benefits of a full system replacement that you're likely to find.

This story comes from one of my teammates, fellow software engineer Andrey Bibik, who joined LinkedIn at the tail end of its startup-hood.

“DNS was managed in two systems. One was just editing RCS files and loading them into BIND … and the other was a proprietary DNS system run on a separate Windows machine. There was no validation whatsoever. If BIND didn’t spit any errors from the file, it was fine. However, it was very easy to make a large error. People would upload old versions of files and delete loads of DNS records.”

This sort of system was manageable up to a certain point, but LinkedIn quickly outgrew it. It was especially difficult to imagine supporting a cloud environment on these systems, with automated IP allocations and various automated DNS manipulations.

I asked Andrey if he faced any resistance in replacing multiple production DNS systems. He did in fact meet some resistance, although surprisingly it mostly came from himself (“If it’s working, nobody wants to touch it, and I probably should not touch it”), and less so from others (“Well, it is mostly working. Do you want to spend more time on it?”). In any case, the overall sentiment was that it needed to be done. The shortcomings of the existing systems were too great to ignore, and too difficult to work around. So, Andrey, working with another early member of my team, Dima Pugachevich, architected and built the core system required to consolidate, validate, and operate as the source of truth for all internal DNS.

This system is now part of inOps, our data center infrastructure management system. It’s a system replete with the checks and balances we were sorely lacking, and validations against the complex models of our infrastructure we maintain. Many services across LinkedIn critically rely on inOps, and we’re very grateful to Andrey and Dima for having helped build a strong team to support the continued improvement of this highly-available system.
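To make the idea of validation concrete, here is a minimal sketch of the kind of check that would have caught the "upload an old file and delete loads of records" failure described above. It is not the inOps implementation. The function name, the name-to-address dictionary format, and the 10% deletion threshold are all assumptions made for illustration.

```python
import ipaddress


def validate_zone_change(current, proposed, max_delete_fraction=0.1):
    """Compare a proposed record set against the live one.

    Both arguments are hypothetical {hostname: ip_address} dictionaries.
    Returns a list of human-readable problems; an empty list means the
    change passed these (illustrative) checks.
    """
    problems = []

    # Check 1: every proposed address must parse as a valid IPv4/IPv6 address.
    for name, address in sorted(proposed.items()):
        try:
            ipaddress.ip_address(address)
        except ValueError:
            problems.append(f"{name}: invalid IP address {address!r}")

    # Check 2: refuse changes that delete an unusually large share of records,
    # which is what uploading a stale file tends to look like.
    removed = set(current) - set(proposed)
    if current and len(removed) / len(current) > max_delete_fraction:
        problems.append(
            f"change deletes {len(removed)} of {len(current)} records"
        )

    return problems


if __name__ == "__main__":
    live = {f"host{i}.example": f"10.0.0.{i}" for i in range(1, 21)}
    stale_upload = {"host1.example": "10.0.0.1"}
    for problem in validate_zone_change(live, stale_upload):
        print(problem)
```

A real system would validate against a much richer model of the infrastructure, but even a simple guard like this turns a silent mass deletion into a rejected change.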

Giving Autobuild a voice

“Building and imaging a server in our data center used to be a manual process … we could only build one rack per day,” said Nitin Sonawane, staff software engineer and co-creator of LinkedIn’s Autobuild system, our end-to-end automated build system for our data center servers. “We started by developing small scripts to automate pieces of the process, and it evolved into a fully automated system.”

Autobuild started as a set of improvements to a pre-existing production build process, but that project had a host of problems. For instance, the minimal boot image used for initial configuration bootstrapping had some significant issues. “The DiscoveryOS scripts would fail sometimes, and we wouldn’t know. There was no logging, the build just stopped,” Nitin said. This left his team in the dark, having to manually figure out where things went wrong. Since the rest of the system was otherwise working as designed, Nitin came across some opposition to rewriting these scripts, in the form of comments like “we don’t really need it,” “it already works,” and “this big of a change is risky.” Some of these rewrites would have entailed big changes indeed, and integrating these changes with the existing system would have made it an even bigger challenge.

But with the large amount of time the operations team spent debugging these failures, Nitin decided this work needed to be done, and he began working on his build system whenever he could. Nitin prototyped a solution to make the DiscoveryOS scripts persistent until successful, announcing their state to a central dashboard, and transformed the processes to be automated, monitored, and debuggable. But it required a significantly different implementation from what already existed. When his prototype was eventually revealed to his peers, it was clear that the originally-proposed rewrite was necessary. By then, his work was nearly ready to go to production.
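The "persistent until successful, announcing their state" pattern can be sketched roughly as follows. This is an illustration of the general idea, not Nitin's code. The `run_until_successful` function, the `report` callback (a stand-in for whatever sends status to a central dashboard), and the retry delay are hypothetical.

```python
import time


def run_until_successful(step, report, retry_delay=5.0, sleep=time.sleep):
    """Run `step` repeatedly until it succeeds, reporting every state change.

    `step` is any zero-argument callable that raises on failure.
    `report(state, attempt, detail)` is a hypothetical hook standing in for
    a call to a central status dashboard.
    """
    attempt = 0
    while True:
        attempt += 1
        report("running", attempt, None)
        try:
            result = step()
        except Exception as exc:
            # Instead of stopping silently, record the failure and try again.
            report("failed", attempt, str(exc))
            sleep(retry_delay)
            continue
        report("succeeded", attempt, None)
        return result


if __name__ == "__main__":
    outcomes = iter([RuntimeError("disk not found"), "inventory collected"])

    def flaky_step():
        outcome = next(outcomes)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    def print_report(state, attempt, detail):
        print(f"attempt {attempt}: {state}" + (f" ({detail})" if detail else ""))

    run_until_successful(flaky_step, print_report, retry_delay=0.0)
```

The important design choice is that every attempt produces a visible event, so a stalled build shows up as a stream of reported failures rather than as silence.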

With the support of his then-manager, Sergiy Zhuk, Nitin continued his work to create a system that defined and verified build-readiness to improve the likelihood of build success. Nitin’s tool was appropriately named “Build Readiness,” and it is an integral part of our build pipeline today. Justifying significant changes like these for the sake of quality tooling is, I think, a great sign of a maturing engineering culture.

From feature request to rewrite

Imagine for a moment that you want to add a great new feature to a tool you use regularly. Suddenly, you find yourself waist-deep in a system that needs significant help. You had no intention of rewriting it, and you did everything you could to avoid a rewrite. But little by little, it’s starting to look like a rewrite, until eventually, a rewrite has happened.

This is the situation Walter Marchuk found himself in while developing SysCache: a distributed search engine designed for collecting and indexing all sorts of server data for search (similar to how a web-spider collects data from a website). But SysCache had a humble start as a feature request.

“I always dreamt that, wouldn’t it be cool to automatically find and collect a file that wasn’t collected yet? To learn what users want and adapt to their requests? I started working on this feature to the ‘SysOps API’ system, and it quickly got out of hand ... I realized the system had way too many problems. Based on the logs, it was constantly failing.”

Chief among SysOps API’s issues was that it was built on a single Redis server. Processes continually pushed changes into Redis at high rates, straining its limits. As soon as there were two concurrent requests, the whole thing locked up. But Walter wasn’t dreaming of rewriting it yet. Not wanting to introduce a new layer of complexity to the system, he tried his best to fit all the server logic into Redis via Lua scripts. However, after running into scaling issues with this approach, Walter accepted that he would have to take the system under his wing and rebuild it to solve its major shortcomings. It was a huge undertaking, and as you can imagine, it wasn’t all smooth sailing.

“There were times when I had to completely scrap big portions of the code and start over because I realized it’s not going to go any further. A lot of things that I claimed would work didn’t work in reality at a larger scale. That’s where I spent a lot of time.”

Through Walter’s dogged persistence we ended up with SysCache, a tool that is crucially relied upon by many teams across our production operations. Installed on nearly all of the company’s servers, a SysCache agent collects and processes data at regular intervals throughout the day, allowing us to quickly analyze all systems, search file contents, normalize command outputs, and examine operating system attributes. And yes, SysCache can adapt to user requests by dynamically finding and collecting files that it hadn't seen before.

SysCache is quite well-architected. I truly enjoyed walking through the pieces and protocols. From rendezvous hashing for automatic agent routing, to the feature toggles that support flexible responsibilities across servers, it would be impressive work for a decently-sized team, let alone for one developer.
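For readers unfamiliar with rendezvous (highest-random-weight) hashing, here is a minimal sketch of the general technique. It is not SysCache's actual code. The server names, the agent key format, and the choice of SHA-256 as the scoring hash are assumptions for illustration.

```python
import hashlib


def _score(node, key):
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(nodes, key):
    """Route `key` to the node with the highest score for that key.

    Every client computes the same answer from the same node list, so no
    central routing table is needed. Ties are broken by node name.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    return max(nodes, key=lambda node: (_score(node, key), node))


if __name__ == "__main__":
    servers = ["server-a", "server-b", "server-c"]  # hypothetical names
    for agent in ["agent-1", "agent-2", "agent-3"]:
        print(agent, "->", pick_node(servers, agent))
```

The useful property is that when a server is removed, only the keys that were routed to that server move; every other key keeps its previous assignment, because its highest-scoring server is still in the list.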

Takeaways

None of these rewrites were undertaken lightly—in each instance, there was a strong argument for the change, and all projects were conducted with respect for the challenges of rewriting critically important systems. “The rewrite is a very useful tool if used wisely,” said Artur Makutunowicz, a staff software engineer in networking at LinkedIn. “With the rewrite you can take into account the changed environment, all the operational experiences, better understanding of the problem space, and just use it as an opportunity to modernize the software development toolkit.”

All of these examples tell the story of people stepping up and taking ownership to do what they believed needed to be done, even when that choice was not without some risk. These engineers wouldn’t settle for tools and processes that could barely do the job. They expected more: excellence from their tools, from themselves, and from their teams. They worked to achieve their goals, not only for themselves, but for everyone their work affected.

Acknowledgements

Many thanks to the engineers who allowed us to share their stories: Andrey Bibik, Dima Pugachevich, Nitin Sonawane, Sergiy Zhuk, and Walter Marchuk.

Thanks also to my peer reviewers, who suffered through some of the early drafts: Pradeep Sanders, Ronak Nathani, Brian Hart, Wilson Fung, and Artur Makutunowicz. Their feedback was instrumental in shaping this blog post.
