Resilience Articles

  • Improving Resiliency and Stability of a Large-scale Monolithic API Service

    November 28, 2017

    Co-authors: Maulin Patel, Erek Gokturk, and Chris Stufflebeam

    How do you increase the resiliency and stability of a monolithic API service that is used by three different platforms, serving 500+ million members, developed by over 400 engineers, deployed three times per day, and consuming almost 300 downstream services? The API layer service used by LinkedIn.com...

  • Resilience Engineering at LinkedIn with Project Waterbear

    November 10, 2017

    Co-authors: Bhaskaran Devaraj and Xiao Li

    Over the last several years, many companies have discussed ways to improve the resiliency of their services and infrastructure. Many projects, like Netflix’s Simian Army, have spawned open source projects that have been adopted by other companies. Other discussions about resilience engineering focus on cultural and...

  • Redliner: How LinkedIn Determines the Capacity Limits of Its Services

    February 17, 2017

    Co-authors: Susie Xia and Anant Rao

    LinkedIn serves more than 467 million members on a global computing infrastructure through hundreds of internal services. During processes such as new feature releases, capacity planning for traffic growth, and data center failover analysis, the following questions are raised frequently: “What is the maximum QPS (queries per...

  • A Deep Dive into Simoorg

    March 28, 2016

    Failure induction is a process of non-functional testing in which a set of failures is induced against a perfectly healthy service....