SRE Articles

  • Iris mobile: An open source, mobile interface for incident management

    May 9, 2019

    At LinkedIn, our on-call incidents are managed using Iris and Oncall, two tools that we released as open source to the community about two years ago. Oncall allows our teams to manage their on-call shifts in a largely automated fashion, scheduling rotations without any human intervention. At the same time, it allows teams to be agile and adaptable when defining...

  • Coding Conversations: The “Perfect Storm” that Brought Down LinkedIn.com

    November 16, 2018

    Editor’s Note: This article originally appeared as a guest post on VentureBeat titled “What I learned by bringing down LinkedIn.com.” Reprinted here in full, the post tells the story of how Katie accidentally crashed LinkedIn.com. After the immediate problem was resolved, the incident resulted in sitewide technical improvements and turned out to be a growth...

  • LinkedOut: A Request-Level Failure Injection Framework

    May 24, 2018

    LinkedIn has made significant investments in resilience engineering over the past few years. As Site Reliability Engineers (SREs), we've consistently witnessed the effects of Murphy's Law: "Anything that can go wrong, will go wrong." In a complex, distributed technology stack, it's important to understand the points where things can go wrong in your system and...

  • Introducing and Open Sourcing shiv

    May 10, 2018

    At LinkedIn, we ship hundreds of command-line utilities to every machine in our data centers and to all of our employees’ workstations...

  • Evolution of Couchbase at LinkedIn

    May 1, 2018

    Author's note: My colleague, Michael Kehoe, wrote a blog post on the Couchbase Ecosystem at LinkedIn. I encourage you to read it if...

  • The Makeup of Successful Geographically-Distributed SRE...

    March 27, 2018

    In part one of this series, we discussed some of the key principles to consider when developing geographically distributed (GD) SRE...