Resilience Articles

  • LinkedOut: A Request-Level Failure Injection Framework

    May 24, 2018

    LinkedIn has made significant investments in resilience engineering over the past few years. As Site Reliability Engineers (SREs), we've consistently witnessed the effects of Murphy's Law: "Anything that can go wrong, will go wrong." In a complex, distributed technology stack, it's important to understand the points where things can go wrong in your system and...

  • Improving Resiliency and Stability of a Large-scale Monolithic API Service

    November 28, 2017

    Co-authors: Maulin Patel, Erek Gokturk, and Chris Stufflebeam

    How do you increase the resiliency and stability of a monolithic API service that is used by three different platforms, serving 500+ million members, developed by over 400 engineers, deployed three times per day, and consuming almost 300 downstream services? The API layer service used by LinkedIn.com...

  • Resilience Engineering at LinkedIn with Project Waterbear

    November 10, 2017

    Co-authors: Bhaskaran Devaraj and Xiao Li

    Over the last several years, many companies have discussed ways to improve the resiliency of their services and infrastructure. Many projects, like Netflix’s Simian Army, have spawned open source projects that have been adopted by other companies. Other discussions about resilience engineering focus on cultural and...

  • Redliner: How LinkedIn Determines the Capacity Limits of...

    February 17, 2017

    Co-authors: Susie Xia and Anant Rao

    LinkedIn serves more than 467 million members on a global computing infrastructure through...

  • A Deep Dive into Simoorg

    March 28, 2016

    Failure induction is a process of non-functional testing in which a set of failures is induced against a perfectly healthy service....