MTTD and MTTR Are Key

This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war stories, and walk through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.

Today’s stories, “The Canary in the Coal Mine” and “Sequencing versus parallelization,” were experienced by David Henke and Benjamin Purgason, respectively. To ask questions directly, please join the conversation here.


Mean Time To Detect (MTTD) and Mean Time To Restore (MTTR) are metrics that describe how long it takes to discover a problem and how long it takes to restore service, both measured from the start of the outage. The shorter the MTTR, the less time spent in outage and the more availability your site retains. Given that services will inevitably break at some point (Every Day is Monday in Operations), we need to be adept at restoring service as quickly as possible. The service triage and restoration lifecycle is made up of several steps: detection (which requires monitoring and alerting), escalation, debugging, and remediation. Each segment of that lifecycle needs to be measured for efficiency and effectiveness in order to keep MTTR as short as possible.
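
To make those definitions concrete, here is a minimal sketch of how the two metrics could be computed from a set of incident records. The Incident fields and helper functions are illustrative, not part of any real tooling.

```python
# Minimal sketch: compute MTTD and MTTR from incident timestamps.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime    # when the outage actually began
    detected: datetime   # when monitoring/alerting surfaced it
    restored: datetime   # when service was fully restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean Time To Detect: average of (detected - started)."""
    return timedelta(seconds=mean(
        (i.detected - i.started).total_seconds() for i in incidents))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time To Restore: average of (restored - started)."""
    return timedelta(seconds=mean(
        (i.restored - i.started).total_seconds() for i in incidents))
```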

The canary in the coal mine

This story might also be appropriately titled “To make an omelet, you’ll have to break a few eggs.” No matter how well and thoroughly we test for functionality and performance in our pre-production environments, we will never catch everything, because we can’t perfectly mimic production’s unique properties. The reasons vary (some include traffic level, experiments being run, user behavior, and the number of changes being prepared for release), but that doesn’t mean we can’t do our best to get ahead of the problem. We just have to look to unusual sources for inspiration and adapt patterns proven to mitigate risk to our own situation. One such pattern is the “sentinel species,” also known as the “canary in the coal mine.”

In the old days, mining disasters involving numerous deaths were common. These were often due to the accumulation of toxic gases (such as carbon monoxide) in the mine. Carbon monoxide is colorless and odorless. In essence, if carbon monoxide was going to kill anyone, it would be a miner who was deep underground in an enclosed space with poor ventilation.

In the early 1900s, a sentinel species (an animal that is more susceptible than humans to a dangerous environmental condition and that shows clear signs when affected by it) was identified for carbon monoxide: the canary. The implementation of the sentinel species pattern was simple but effective: miners would carry a canary with them into the mine. If the canary began showing signs of distress, that was their warning to escape the mine quickly, before the humans were affected by the dangerous environment.

In the old days at LinkedIn, we used to ship code to all of production at the same time, following the pattern that many companies use today (early integration testing, then full production deployment). This pattern is woefully insufficient. It is capable of catching the obvious stuff (new build doesn’t work, API incompatibility, etc.), but it will never be able to catch performance, load, or user-behavior-driven problems. This wasn’t good enough for us, so we began looking for a better solution.

A couple of years ago, we adapted the “sentinel species” pattern to our own production releases. It turned out to be relatively simple, as long as we enforced backwards compatibility from one version of code to the next.

First, we test our code as thoroughly as possible in a private testing environment referred to as “early integration.” Once the code is ready to be deployed to production, we start with a canary. In this context, a “canary” means deploying the service to a single production node and then watching that node closely for half an hour to ensure it does not behave in a significantly different manner or show signs of distress. If the canary remains healthy, we deploy the updated code to the rest of the nodes for the service. However, if for any reason the canary shows signs of distress, failure, or behavior that differs from the untouched nodes, we roll back immediately, and a disaster is prevented.
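
Conceptually, the canary check comes down to comparing the canary node against the untouched nodes on a handful of key metrics. The sketch below shows one way that comparison could look; the metric names, thresholds, and the fetch_metrics helper are hypothetical placeholders for whatever your monitoring system exposes.

```python
# Simplified sketch of a canary health check: compare the canary against the
# average of the untouched nodes and roll back on significant deviation.
from statistics import mean

THRESHOLDS = {              # maximum allowed ratio of canary metric to baseline
    "error_rate": 1.5,      # canary may not exceed 1.5x the fleet error rate
    "p99_latency_ms": 1.2,  # canary may not exceed 1.2x the fleet p99 latency
}


def fetch_metrics(node: str) -> dict[str, float]:
    """Placeholder: pull the node's current metrics from your monitoring system."""
    raise NotImplementedError


def canary_is_healthy(canary: str, untouched: list[str]) -> bool:
    canary_metrics = fetch_metrics(canary)
    # Baseline each watched metric on the average of the untouched nodes.
    baseline = {m: mean(fetch_metrics(n)[m] for n in untouched) for m in THRESHOLDS}
    for metric, max_ratio in THRESHOLDS.items():
        if canary_metrics[metric] > baseline[metric] * max_ratio:
            return False  # significant deviation: roll the canary back
    return True
```

Keeping the comparison relative to the untouched nodes, rather than to absolute thresholds, is what turns the canary into a controlled experiment: the rest of the fleet provides the baseline.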

Learning from “The canary in the coal mine”

Using a canary is no longer a nice-to-have capability for internet companies—it’s a necessity. The traditional operational models of software system development, qualification, deployment, and operation simply cannot keep up with the velocity of change required for an internet-based company to thrive.

Put simply, when the minimum required velocity of change is high enough, you have no option but to take intelligent risks. It is far better that a single node die (for a very short amount of time) than an entire cluster. The key to using this technique is that it is a controlled experiment: the criteria for success are known in advance, metrics are in place, and a quick rollback plan is available.

Sequencing versus parallelization

It was 9:25 a.m. when I sat down in our daily site status review meeting, prepared to talk about the three major outages JIRA had experienced the previous day. We had squashed three different problems, including unexpected user behavior, memory exhaustion, and even a file descriptor problem. We had not, however, found the root cause. During our attempt to solve the file descriptor problem, two teams had made changes at around the same time that JIRA recovered, so it was unclear what impact each change had.

At 9:30 a.m., the leader of the site status meeting called the start of the meeting and hit refresh on the JIRA dashboard we used to run our meeting: Error 502, Bad Gateway. Refresh: Error 503, Service Unavailable. Refresh (30 seconds later): Error 500, Internal Server Error. I stood up, called for a war room, and asked for representatives from some of the teams to join me.

A few minutes later, we had a war room up, and representatives from all teams were on a conference call. We had been fighting this cluster of issues for the past day, and by now we had identified a number of possible causes, ranging from database performance degradation to JVM garbage collection, from NGINX misconfigurations to a potential memory inconsistency issue on the host running our service.

Acting as the coordinator for the war room, I parallelized the search for possible causes. I had each team set out to confirm, deny, or label as suspect the possible causes we had identified from the prior day’s investigations. No group was allowed to implement changes—just to report in on the status of each possible cause. As each group checked in, we filtered our list of causes, coming up with about 15 that were suspect, 0 confirmed, and more than 70 that were not contributing.
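
As a rough illustration of that parallel phase, the sketch below fans the candidate causes out, collects a status for each, and keeps only the suspects. The classify hook and status labels are hypothetical stand-ins for the reports each team called in, so treat it as a model of the bookkeeping rather than our actual tooling.

```python
# Sketch of the parallel phase: investigate every candidate cause at once,
# apply no fixes, and filter the list down to the suspects.
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Status(Enum):
    CONFIRMED = "confirmed"
    SUSPECT = "suspect"
    NOT_CONTRIBUTING = "not contributing"


def classify(cause: str) -> Status:
    """Placeholder: the owning team checks its area and reports a status."""
    raise NotImplementedError


def find_suspects(causes: list[str]) -> list[str]:
    # Fan out the checks; this phase only gathers information.
    with ThreadPoolExecutor() as pool:
        statuses = dict(zip(causes, pool.map(classify, causes)))
    return [cause for cause, status in statuses.items() if status is Status.SUSPECT]
```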

Once we had the suspect causes, we tested each of them in sequence, with only one being tested at any point in time. For each suspected cause, one team would apply a “fix,” all teams would monitor their areas, and then we would accept or reject the fix. The total cycle time was around five minutes per “fix.” In the end, we identified the cause: an ordering change in the parameters of the calls made to the database caused a slight increase in database latency, which, in turn, caused bad JVM behavior. We had our service restored just 20 minutes after we developed our list of suspect causes.
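
The sequential phase can be pictured as a simple loop: apply one candidate fix, give every team a full cycle to watch their metrics, then keep or revert it. The sketch below assumes hypothetical apply_fix, revert_fix, and all_teams_report_healthy hooks; it models the process, not the tooling we used.

```python
# Sketch of the sequential phase: exactly one candidate fix in flight at a
# time, a full monitoring cycle per fix, then accept or reject it.
import time

CYCLE_SECONDS = 5 * 60  # roughly the five-minute cycle we used per "fix"


def apply_fix(cause: str) -> None:
    """Placeholder: the owning team applies the single candidate fix."""
    raise NotImplementedError


def revert_fix(cause: str) -> None:
    """Placeholder: roll the candidate fix back if it did not help."""
    raise NotImplementedError


def all_teams_report_healthy() -> bool:
    """Placeholder: every team confirms its metrics look healthy again."""
    raise NotImplementedError


def test_suspects_in_sequence(suspects: list[str]) -> str | None:
    for cause in suspects:
        apply_fix(cause)           # only one change is ever in flight
        time.sleep(CYCLE_SECONDS)  # all eyes watch this single change
        if all_teams_report_healthy():
            return cause           # the fix held, so we found our cause
        revert_fix(cause)          # reject the fix and move to the next suspect
    return None
```

The point of the fixed cycle is attribution: every accepted or rejected fix corresponds to exactly one change, so there is never any doubt about what helped.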

Learning from “Sequencing versus parallelization”

When triaging the site, there are times when it’s best to sequence the work and times when it’s best to parallelize it. It is important to know when to do one instead of the other. If there is a site problem and we do not know the cause, it is normal to spin up people from multiple teams in an effort to find out what’s wrong. This is a good thing because it lets us quickly determine which areas are healthy and which are not. Further, it lets us narrow our search for the problem: the network team verifies network health, the database team confirms the database is responding fast enough, the app team confirms new exceptions are not appearing in the logs, and so on.

When it comes to actually implementing changes to the site, we need to sequence the work carefully. If two teams make changes simultaneously, it is often unclear which one had an impact and which one did not. At first glance, sequencing changes might look like it slows down the restoration process, but in reality this is where things get faster. All eyes are watching one change, so there is less chance of missing the impact each change has (good or bad), and far less chance of going down the rabbit hole of trying to untangle the impact caused by multiple simultaneous changes.

When it comes to MTTD and MTTR, lower is always better. The canary concept helps us prevent a site-wide service outage. After all, no outage is better than even the lowest measured MTTR. For issues that do manifest, the key is to move quickly through safe areas in the investigation (finding possible causes) while sequencing the testing of possible solutions so that you do not waste time trying to understand how more than one change affects your service.