SRE Articles

  • Building the SRE Culture at LinkedIn

    May 15, 2017

    Co-authors: Bruno Connelly and Bhaskaran Devaraj

    Being a Site Reliability Engineer (SRE) means having to talk about hard problems. Site outages, complex failure scenarios, and other technical emergencies are the things we have to be prepared to deal with every day. When we’re not dealing with problems, we’re discussing them. We regularly perform post-mortems...

  • TrafficShift: Load Testing at Scale

    May 11, 2017

    Co-authors: Anil Mallapur and Michael Kehoe

    LinkedIn started as a professional networking service in 2003, serving user requests out of a single data center. For any internet services company, availability is a key factor in its success. In any internet architecture, a lot of things can go wrong at any given time: network links can die, power fluctuations can...

  • Failure is Not an Option

    January 16, 2017

    This is the final post of the series “Every Day Is Monday in Operations.” Throughout this series we’ve discussed our challenges, shared our war stories, and walked through the learnings we’ve gained as Operations leaders. You can read the introduction and find links to the rest of the series here. If Operations fails, so does your company—it is, as you might say...

  • MTTD and MTTR Are Key

    December 12, 2016

    This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war...

  • What Gets Measured Gets Fixed

    December 5, 2016

    This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war...

  • Every Day Is Monday In Operations

    November 30, 2016

    This post is part of the series “Every Day Is Monday in Operations.” Throughout this series we discuss our challenges, share our war...