What Gets Measured Gets Fixed

This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war stories, and walk through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.

Today’s stories “The 10g Massacre” and “TS3” come from David Henke and Benjamin Purgason respectively. To ask either of us questions directly, join the conversation here. We’ll be checking in on the comments throughout business hours, so let us know what you think!

 

“What gets measured gets fixed” is a famous adage taken from a company that knew something about measurements: Hewlett Packard. Long before printers, computers, and the internet, HP built test and measurement devices (e.g., oscilloscopes). They knew what they were doing: if you can measure something, you can reason about it, understand it, discuss it, and act upon it with confidence. If you cannot measure something, not only are you unable to fix it, you’re also unable to tell whether it’s even broken.

The 10g Massacre

One month into a new job at another technology company (not LinkedIn), my team faced a large-scale upgrade of the databases that sat at the core of the systems serving hundreds of thousands of advertisers.

In one large step scheduled to take place over a single weekend, the plan was to perform the following conversion/migration:

  • SPARC to Intel 
  • Single computer to Oracle RAC (multiple nodes)
  • 32 bits to 64 bits 
  • Solaris O/S to Linux O/S
  • Big Endian to Little Endian (actual byte representation, i.e., where the most significant byte lives; see the sketch after this list)
  • EMC to NetApp storage
  • Oracle 9i to Oracle 10g
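
To make the byte-order change concrete, here is a minimal sketch (illustrative only, not part of the actual migration tooling) of how the same 32-bit integer is laid out on a big-endian SPARC host versus a little-endian Intel host. Any format that bakes byte order into its on-disk representation has to be converted, not just copied.

```python
import struct

value = 0x0A0B0C0D  # an arbitrary 32-bit integer

# Big endian (SPARC/Solaris): most significant byte comes first.
big = struct.pack(">I", value)
# Little endian (Intel/Linux): least significant byte comes first.
little = struct.pack("<I", value)

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a
```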

The migration and target system were tested, retested, and load tested. But, of course, testing means different things to different people. After a weekend of migration work, the new databases were turned on. Once again, without peak load, they performed well. And then, once again, under load, Oracle started crashing, and crashing, and crashing.

I have never been fond of relational databases in general. I did not know what an ORA-600 was. I do now. It is a general error for “we have absolutely no idea what is wrong, but we are crashing the instance to preserve data integrity.”

During the first 48 hours, we fought the software valiantly as our databases kept crashing. Naturally, there was no simple return to the prior (working) state, as fundamental changes had been made to the current environment. We had a tiger team from Oracle working in shifts hand-in-hand with our engineers. The 64-bit release of Oracle 10g on RAC on Linux turned out to be buggy, and our load testing did not uncover this. What a nightmare! For the next two months we worked diligently to apply patch after patch to our new footprint. As we had to do in “Betting Against the Odds,” we worked through the change set, resolving the problems caused by each change and eventually restoring normal service.

Learning from “The 10g Massacre”

“Site up” and MTTR took a severe beating here. The biggest failure was the lack of a reasonable roll-back plan, which left us in a fix-forward hell, relying on a third party to sort out the key software patches required to move us forward. The second biggest failure was not recognizing that “what gets measured gets fixed” could also be stated as “what doesn’t get measured gets missed.” We did not understand how to measure traffic at peak, and therefore we did not test it well enough. We also changed many things at once, and running Oracle on the target footprint was not well-characterized. On a positive note, the team never gave up, in spite of the incredibly difficult circumstances, and we used our learnings to ensure that this would never happen again (all future major migrations had a roll-back plan, were thoroughly tested for performance as well as functionality, and in many cases could be done in phases).

TS3 (Tools Site Status Standup)

Now let’s talk about another kind of “massacre.”

After receiving scattered feedback from our engineers that our tools were a pain to work with, always had outages, and were frequently broken, we decided to investigate. A survey was put together to learn more; it was long, fairly time-consuming, and asked questions about everything you could imagine, to try to pin down why the perception of the tooling was suffering.

The summary metric was measured via NPS. For those of you unfamiliar with the term, a Net Promoter Score (NPS) is a metric designed to gauge how enthusiastic people are about recommending something to others. It is a single number in the range of -100 to 100. It’s fairly difficult to get a high (+50) score because of how it is phrased and measured. A canonical NPS question is: “How likely are you to recommend X to your friends and colleagues?” In essence, it is trying to determine whether you like something so much, or are so happy with it, that you’d put your name and reputation on the line to recommend it. The question is asked on an 11-point (0-10) scale: only scores of 9 or 10 (promoters) add to the overall score, scores of 0-6 (detractors) subtract from it, and scores of 7 or 8 (passives) are neutral.
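
As a rough illustration of the arithmetic (a sketch with made-up responses, not the survey tooling we used), the score is simply the percentage of promoters minus the percentage of detractors:

```python
def net_promoter_score(responses):
    """Compute NPS from survey answers on the 0-10 scale.

    Promoters (9-10) add to the score, detractors (0-6) subtract from it,
    and passives (7-8) only dilute it. The result ranges from -100 to 100.
    """
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return round(100 * (promoters - detractors) / len(responses))

# A hypothetical batch of answers skewed toward detractors lands deep
# in negative territory, much like our real result did.
print(net_promoter_score([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # -40
```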

When the results were calculated, they hit us like a truck: -46 NPS. After digging in more, a common theme appeared in the written responses: reliability. So what did that mean? It meant that the survey results agreed with the rumors that had prompted the survey in the first place. Further, the root of most of the problems was that the tools we said could be trusted to do something didn’t always do it, usually resulting in a fairly wide outage called a Global (Unexpected) Change Notification, or GCN.

Not content just to rely on the new, human-powered data source the survey had given us, we also mined our existing GCN tracking data (which hadn’t been exposed easily or regularly before this) to see if it lined up. It did. In fact, if we lined the GCNs up end-to-end, there would have been one GCN in progress for more than six months out of the preceding twelve.
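
The “end-to-end” figure is just arithmetic over the tracking records. A minimal sketch of that calculation, using hypothetical timestamps rather than our actual GCN data:

```python
from datetime import datetime, timedelta

# Hypothetical GCN records: (opened, closed) timestamps pulled from the
# tracking system. The real data set covered the preceding twelve months.
gcns = [
    (datetime(2015, 1, 5, 9, 0), datetime(2015, 1, 9, 17, 0)),
    (datetime(2015, 2, 1, 0, 0), datetime(2015, 2, 20, 12, 0)),
    (datetime(2015, 3, 3, 8, 0), datetime(2015, 3, 30, 8, 0)),
]

# Lay the outages end-to-end by summing their individual durations.
total = sum((closed - opened for opened, closed in gcns), timedelta())

print(total)                        # 50 days, 20:00:00
print(total / timedelta(days=365))  # fraction of a year spent inside a GCN
```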

You might be thinking, “Wow, that is a terrible discovery.” I disagree. This is what we needed: two sources of data that lined up and agreed, giving us evidence that we had a problem. Better, they showed us a place where we could invest and get a gigantic return on investment.

To help fix these problems, we founded TS3 on Jan. 2. We set the success criterion for the meeting as “the reduction of user impact.” Specifically, success meant reducing the mean time to detect an outage (MTTD) and the mean time to resolve an outage (MTTR). Additionally, we disavowed “total number of GCNs” as a success metric. By setting the incentives up this way, we made it a race to create the GCN and accurately communicate that there was a problem before users reported it, and then to resolve it as quickly as possible. If we had kept total number of GCNs as a success metric, we would have encouraged people to quietly try to fix issues before an internal user noticed, rather than announce them. If you know about a problem, your users also know about it; they just haven’t complained yet.

At TS3, the on-call members of every team responsible for our internal tooling meet once a day for 30 minutes. We talk about every single site issue that’s happened: how long it took to detect the problem (MTTD), how long it took us to resolve the problem (MTTR), and the follow-up items (how we’re going to prevent it from happening again). We also go over all planned manual changes for the day. We started measuring everything we could think of about how we dealt with these issues and reporting on it to the whole company.
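
The per-incident numbers are straightforward to derive once every issue carries a start, detect, and resolve timestamp; everything else is averaging. A minimal sketch of that bookkeeping (hypothetical structure, not our actual TS3 tooling):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    started: datetime   # when the problem actually began
    detected: datetime  # when we noticed and opened the GCN
    resolved: datetime  # when service was restored

def mttd(incidents):
    """Mean time to detect: how long problems go unnoticed on average."""
    return sum((i.detected - i.started for i in incidents), timedelta()) / len(incidents)

def mttr(incidents):
    """Mean time to resolve, measured here from detection to resolution."""
    return sum((i.resolved - i.detected for i in incidents), timedelta()) / len(incidents)

# A hypothetical day's worth of issues to review at the standup.
incidents = [
    Incident(datetime(2015, 11, 2, 9, 0), datetime(2015, 11, 2, 9, 5), datetime(2015, 11, 2, 9, 45)),
    Incident(datetime(2015, 11, 2, 13, 0), datetime(2015, 11, 2, 13, 20), datetime(2015, 11, 2, 15, 0)),
]

print("MTTD:", mttd(incidents))  # 0:05:00 and 0:20:00 average to 0:12:30
print("MTTR:", mttr(incidents))  # 0:40:00 and 1:40:00 average to 1:10:00
```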

The results were dramatic. By measuring these things, we showed their importance to the organization. Our engineers finally had the data needed to make better decisions and take more intelligent risks. Comparing January to November of this year, we reduced MTTD by 85.39% and MTTR by 85.57%. We changed our culture to explicitly include the concepts of site uptime, MTTD, and MTTR. What gets measured gets fixed.

Learning from “TS3 (Tools Site Status Standup)”

I saw a T-shirt recently that said: “Being an engineer is easy. It’s like riding a bike except the bike is on fire, you’re on fire, everything is on fire, and you’re in hell.” There are a lot of engineers (and leaders) out there who will simply accept that the world is chaos and that that’s just the way it works. It becomes normal to have problems (much like the “frog in the water” story we’ve already told), and that is the real problem.

When you don’t want to look at a problem, that’s when you know you need to. When we measured the problems we had, we showed our partners across the company and our own engineers that the reliability of the tooling was important. By consistently measuring the effects of our actions, we made what was once an accepted reality (that the tooling was unreliable) into an actionable thing that could be changed. That is the expanded version of why “what gets measured gets fixed” works.