Resilience Articles

  • Hodor: Detecting and addressing overload in LinkedIn microservices

    February 18, 2022

    LinkedIn launched in its initial form over 18 years ago, which is an eternity in the technology world. The early site was a single monolithic Java web application, and as it gained in popularity and the user base grew, the underlying technology had to adapt in order to support our ever-growing scale. We now operate well over 1,000 separate microservices running...

  • LinkedOut: A Request-Level Failure Injection Framework

    May 24, 2018

    LinkedIn has made significant investments in resilience engineering over the past few years. As Site Reliability Engineers (SREs), we've consistently witnessed the effects of Murphy's Law: "Anything that can go wrong, will go wrong." In a complex, distributed technology stack, it's important to understand the points where things can go wrong in your system and...

  • Improving Resiliency and Stability of a Large-scale Monolithic API Service

    November 28, 2017

    Co-authors: Maulin Patel, Erek Gokturk, and Chris Stufflebeam

    How do you increase the resiliency and stability of a monolithic API service that is used by three different platforms, serving 500+ million members, developed by over 400 engineers, deployed three times per day, and consuming almost 300 downstream services? The API layer service used by

  • Resilience Engineering at LinkedIn with Project Waterbear

    November 10, 2017

    Co-authors: Bhaskaran Devaraj and Xiao Li

    Over the last several years, many companies have discussed ways to improve the resiliency of...

  • Dyno: How LinkedIn Determines the Capacity Limits of Its...

    February 17, 2017

    Co-authors: Susie Xia and Anant Rao

    Editor's note: This blog has been updated due to the renaming of the project since publication....

  • A Deep Dive into Simoorg

    March 28, 2016

    Failure induction is a process of non-functional testing in which a set of failures is induced against a perfectly healthy service....