Every Day Is Monday In Operations
November 30, 2016
This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war stories, and walk through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.
We live in a world where our online services never sleep. Those of us who build and operate the services, however, do need to sleep—so ideally we build, monitor, alert on, and operate our services so that we can. Unfortunately, any service that is live 24/7 is in a state of change 24/7, and with change comes failures, escalations, and maybe even sleepless nights spent firefighting. Since our services must always be available, we must always be ready to answer the call. However, each problem solved is progress towards more restful nights in the future. Read on and we’ll share two war stories and lessons learned that explain why every day is Monday in operations.
Run the test twice, get two different results
“We cannot run the same tests twice and get the same results.” One of my very talented Engineering Directors made this comment about a pre-production test run before launching to production—not good. Blame was placed everywhere, including the code, the build system, the tests, transient test failures, configuration errors, the environment, even bad luck. But as we know, when it comes to testing, luck has nothing to do with it.
There is a famous line in the movie “The Magnificent Seven,” where Steve McQueen’s character says, “We deal in lead, friend.” (Note: for the uninitiated, he is talking about bullets in guns.) Well, I like to think “we deal in bits, friend,” and, as it turns out, in our little binary world, we are paid to build predictable systems.
So when I heard this from one of our best technical guys, I wanted to poke my eyes out. What do you do in that scenario? Go after each and every problem that is making the system unpredictable. Constantly attack the problem (note: do not attack the personnel).
Be relentless. The investment will pay off.
Learning from “Run the test twice, get two different results”
We titled this piece "Every Day is Monday in Operations" because every time you find a problem and fix it, you are better than you were before. If you can go further, finding and fixing a systematic failure, you will be a lot better than the moment before. But there is no end to this fight, thanks to constant change needed to keep making our site and services better. That's why we have 24/7 personnel to monitor the site, alert us when problems exist, remediate issues, and find better solutions going forward.
Betting against the odds
Speaking of systematic failures, let’s discuss what happens when you have a 1% chance of segfaulting a process per day and you operate a distributed service on 70,000 servers. As anyone who runs a large distributed system can tell you, the odds are not in your favor. Bugs that seem to have infinitesimally small chances of occurring will nevertheless affect your system constantly.
For anyone who didn’t already do the mental math, that means that on average, 700 segfaults per day will occur. That is a pretty big number of broken processes to deal with, but in the abstract it doesn’t sound so bad (still only 1%, after all), so let's add some reality into the mix.
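The mental math above is just an expected-value calculation; a few lines make it concrete (figures are the ones quoted in the story):

```python
# Expected daily segfaults across the fleet, assuming each process
# fails independently at the quoted 1%-per-day rate.
servers = 70_000
p_segfault_per_day = 0.01

expected_daily_segfaults = servers * p_segfault_per_day
print(expected_daily_segfaults)  # 700.0
```

At that rate, a "one in a hundred" bug is not rare at all; it is a steady background hum of hundreds of failures every single day.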
The big distributed system in our story is SaltStack, a key component of our deployment system. At first we began receiving scattered reports of Salt Minions (the distributed portion of SaltStack that runs on each of our machines) becoming unresponsive—but only to deployment requests. Everything else running on the machine continued to work. Over the course of two days, this problem grew from scattered reports until eventually 50% of all deployments were failing due to this behavior—and we didn’t know why it was happening.
We ran through the standard checklist: what had changed? We found no code updates to Salt, no updates to the deployment-specific code, no configuration changes, nothing. After tracing the deployment-specific subprocesses we finally found it: it wasn’t our code at all, or Salt’s code, but the Python interpreter itself that was segfaulting. Worse, this was affecting all Python processes across the company—but the odds of seeing it were so low that our deployment system (which was constantly spawning Python processes) was the only place it manifested.
It took us 750 engineer-hours, but we were able to mitigate this problem and identify the cause: a change we didn’t know about had been made to a related system responsible for distributing shared libraries. This change triggered a bug in the deployment system: a race condition that allowed the Python interpreter to read half-written files from disk. Only later, when the interpreter actually tried to call code from these modules, would it segfault.
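The standard defense against readers seeing half-written files is to write to a temporary file and atomically swap it into place. This is a common pattern, not the actual fix LinkedIn shipped (which we don't have details of), but it illustrates the failure mode: with a plain `open(path, "wb")` write, a concurrent reader can observe a partially written file; with an atomic rename, it sees either the old file or the complete new one.

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write `data` to `path` so that readers never see a partial file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem (a requirement for atomicity).
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ensure the bytes are on disk
        # os.replace() is a single atomic step on POSIX filesystems:
        # concurrent readers see the old file or the new one, never half.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on any failure
        raise
```

A distribution system that used this pattern for shared libraries would never expose a half-written `.so` for the interpreter to load.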
Once we discovered the half-written files, it took only about 20 minutes per broken host to gather debug information and repair it. With the fix in place, we saved the company about 233 engineer-hours per day; in four days, the time investment had paid for itself.
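The payback arithmetic is worth spelling out, using the figures above:

```python
# Rough break-even calculation, all figures as quoted in the story.
hours_invested = 750       # engineer-hours spent on the investigation
hours_saved_per_day = 233  # engineer-hours no longer lost to the bug

break_even_days = hours_invested / hours_saved_per_day
print(round(break_even_days, 1))  # ~3.2, i.e. paid off inside four days
```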
Learning from “Betting against the odds”
Every day things change, every day things break, and every day is Monday in Operations. We found a bug that had long existed and simply wasn’t being triggered until a change exposed it. By fixing this systematic issue, we made our situation much better. Sporadic and supposedly unpredictable failures plaguing Python processes all over LinkedIn vanished overnight.
Throughout this process, the engineers did their best and they did not give up. After many red herrings, false leads, and reminders to each other that the code doesn’t lie, they got to the bottom of it. If the code worked fine yesterday but stopped working today, then something changed, and we simply hadn’t cast our net widely enough to find it. Once we identified a subsystem that could have this effect, the code explained everything.
We didn’t know about a change to a related system. Once we knew to look there, the time it took us to isolate and fix the bug was only a few hours. We could have saved about 750 engineer-hours by having a more comprehensive list of changes that occurred during a time window.
There is never a point in the day when our systems do not change. Whether it is a major code upgrade or just the system clock ticking, change is constant. Some of this change results in failure. The high-profile failures get dealt with immediately (roll back the change), but the minor ones, like an integration test that has a 1% increased chance of failure, often slip through the cracks. These minor problems build on top of each other, resulting in a major problem with no obvious single cause. Regardless of the failure type, it is critical that we be aware of changes that occurred around the same time. Without a change set, you have a manhunt... with a change set, you have a police lineup.
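The "police lineup" idea amounts to filtering every known change across related systems down to a time window around the incident. A minimal sketch, with illustrative record names and data (none of this is LinkedIn's actual tooling):

```python
from datetime import datetime, timedelta

def changes_near(changes, incident_time, window_hours=24):
    """Return the changes within `window_hours` before the incident:
    a police lineup of suspects instead of a company-wide manhunt."""
    window = timedelta(hours=window_hours)
    return [c for c in changes
            if incident_time - window <= c["time"] <= incident_time]

# Hypothetical change log spanning several systems.
changes = [
    {"system": "shared-library distribution", "time": datetime(2016, 11, 28, 9)},
    {"system": "frontend config",             "time": datetime(2016, 11, 20, 14)},
]

suspects = changes_near(changes, incident_time=datetime(2016, 11, 29, 3))
print([c["system"] for c in suspects])  # ['shared-library distribution']
```

The hard part in practice is not the filter; it is making sure every related system, including the ones you didn't know about, reports into the same change log.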
Today’s stories “Run the test twice, get two different results” and “Betting against the odds” were experienced by David Henke and Benjamin Purgason respectively. To ask either of us questions directly, join the conversation here. We’ll be checking in on the comments throughout business hours, so let us know what you think!