SRE Articles

  • Operating system upgrades at LinkedIn’s scale

    August 31, 2022

    Co-authors: Hengyang Hu, Dinesh Dhakal, Kalyanasundaram Somasundaram. Completing recurring operating system (OS) upgrades on time and without impacting users can be challenging. For LinkedIn, completing these upgrades at a massive scale has its own complexities as we’re often facing multiple upgrades. To secure our platform and protect our members’...

  • Spike detection in Alert Correlation

    December 22, 2021

    LinkedIn’s stack consists of thousands of different microservices and the associated complex dependencies among them. When a production outage happens due to an issue with misbehaving services, finding the exact service responsible for the outage is challenging and time-consuming. Although each service has multiple alerts configured in a distributed...

  • Rethinking site capacity projections with Capacity Analyzer

    March 16, 2021

    While site outages are inevitable, it’s our job to minimize both the duration of outages and the likelihood of an outage occurring. One of our preemptive measures is the way we determine overall site capacity and health every day: we load-test in production. There’s an elegant system to bucket and route members to specific data centers from...

  • Open source update: School of SRE

    February 3, 2021

    Co-authors: Akbar KM and Kalyanasundaram Somasundaram. Site up and secure is a fundamental element of how we operate, and site...

  • Fixing Linux filesystem performance regressions

    October 16, 2020

    As companies grow, adapt, morph, and mature, one item remains the same: the need for reinvention. Technical infrastructure is no...

  • The impact of slow NFS on data systems

    June 23, 2020

    Espresso is LinkedIn's de facto NoSQL database solution. It is an online, distributed, fault-tolerant database that powers most of...