Failure is Not an Option

Benjamin Purgason

Engineering Director, Server Engineering Productivity @ YouTube | Inventor

January 16, 2017

This is the final post of the series “Every Day Is Monday in Operations.” Throughout this series we’ve discussed our challenges, shared our war stories, and walked through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here.

If Operations fails, so does your company—it is, as you might say, mission-critical. The delicate trust a company holds with its customers can be shattered by a single sustained outage. Just look at one outage from 2012 which cost millions of dollars per minute while a bug was in production. When a major incident occurs, no customer cares about who was at fault, the extenuating circumstances, or how you’re going to do better next time. They leave, wondering why they trusted your company in the first place. In Operations, failure is not an option.

As we’ve retold our war stories over the course of this series, we can’t help but be reminded of this heated exchange from the movie “Apollo 13.” It’s a variant on a conversation we found ourselves having during many of our own major incidents: it doesn’t matter if it looks impossible; we must find a way to succeed.

David and I believe that the incredible combination of expectations, difficulty, and risk associated with each outage warrants the guidance we’ve put forward in this series. We’ve learned these lessons the hard way over the years and hope that by sharing them, readers can gain the benefits without the pain.

This, then, is the greatest gift we can give back to our colleagues: our experiences, the lessons learned, and the axioms we’ve developed from our 45 combined years of experience as Operations leaders.

Implementing a culture of reliability

When looking to apply these lessons yourself, there is some good news: you can start anywhere. For the most part, each of the axioms stands on its own, without relying on the others for merit. That said, I highly recommend starting with “what gets measured gets fixed.” Once you understand the problems, you can begin fixing them.

Whether you are the senior member of your team or new to the industry, these axioms will serve you well. Your tenure, experience, or years of service do not matter. By applying these 10 lessons, teams can create a culture of reliability backed by engineers who can think on their feet, are empowered to make the difficult calls, and can scale their group farther than you can imagine.

Conclusion

Understanding the 10 axioms we’ve presented is simple; implementing them can be a bit more challenging. This isn’t rocket science, but every day is Monday in Operations, and that means you have to overcome constant changes. You are only as good as your lieutenants, and to make site up your highest priority, you’ll need their help. Also, don’t assume that everyone is on the same page when you start to implement these lessons: you need to communicate, communicate, communicate about your goals and the processes you want to use to achieve them.

Operations is a team sport, and if you are at a loss for where to start, remember: what gets measured gets fixed. If you can only measure two things, remember that MTTD and MTTR are key. To improve how you handle the site incidents you do discover after measuring, you must attack the problem, not the person. When it’s time to attack the problem and get to the source, look to the code, because the code doesn’t lie.
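
If you are wondering what measuring MTTD and MTTR might look like in practice, here is a minimal sketch in Python. It is not taken from our tooling; the `Incident` record, its field names, and the sample timestamps are all hypothetical. It also assumes one common pair of definitions: MTTD is the average time from the start of an incident to its detection, and MTTR is the average time from the start of an incident to its resolution. Some teams start the MTTR clock at detection instead, so pick one definition and apply it consistently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    started: datetime   # when the problem began affecting the site
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when the site was healthy again


def mean_time_to_detect(incidents):
    """Average of (detected - started) across incidents."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents):
    """Average of (resolved - started) across incidents.

    Assumption: the clock starts when the incident starts,
    not when it is detected.
    """
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data, purely for illustration.
sample = [
    Incident(
        started=datetime(2017, 1, 9, 10, 0),
        detected=datetime(2017, 1, 9, 10, 5),
        resolved=datetime(2017, 1, 9, 10, 30),
    ),
    Incident(
        started=datetime(2017, 1, 11, 22, 0),
        detected=datetime(2017, 1, 11, 22, 15),
        resolved=datetime(2017, 1, 11, 23, 0),
    ),
]

print("MTTD:", mean_time_to_detect(sample))
print("MTTR:", mean_time_to_resolve(sample))
```

Tracking these two numbers over time, rather than looking at any single incident, is what shows whether detection and response are actually getting better.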

Finally: do your best and never give up.

To ask us questions about this post or the entire “Every Day is Monday” series, please join the conversation here. We’ll be checking in on the comments throughout business hours, so let us know what you think!
