SRE Articles

  • Saira joined our Bangalore site reliability engineering (SRE) team to tackle large-scale site engineering challenges and to grow. She highlights the impactful work she found here, from ushering in LinkedIn’s next-generation server query system, which runs over a fleet of 350,000 servers, to mentoring the next generation of female engineers: In my...

  • Operating system upgrades at LinkedIn’s scale

    August 31, 2022

    Co-authors: Hengyang Hu, Dinesh Dhakal, Kalyanasundaram Somasundaram. Introduction: Completing recurring operating system (OS) upgrades on time and without impacting users can be challenging. For LinkedIn, completing these upgrades at a massive scale has its own complexities, as we often face multiple upgrades. To secure our platform and protect our members’...

  • Spike detection in Alert Correlation

    December 22, 2021

    Introduction: LinkedIn’s stack consists of thousands of microservices with complex dependencies among them. When a misbehaving service causes a production outage, finding the exact service responsible is challenging and time-consuming. Although each service has multiple alerts configured in a distributed...

  • Rethinking site capacity projections with Capacity Analyzer

    March 16, 2021

    While site outages are inevitable, it’s our job to minimize both the duration and the likelihood of outages....

  • Open source update: School of SRE

    February 3, 2021

    Co-authors: Akbar KM and Kalyanasundaram Somasundaram. Site up and secure is a fundamental element of how we operate, and site...

  • Fixing Linux filesystem performance regressions

    October 16, 2020

    As companies grow, adapt, morph, and mature, one item remains the same: the need for reinvention. Technical infrastructure is no...