SRE Articles

  • Metal as a Service (MaaS): DIY server-management at scale

    May 11, 2023

    Guaranteeing that our servers are continually upgraded to secure and vetted operating systems is one major step that we take to ensure our members and customers can access LinkedIn to look for new roles, access new learning programs, or exchange knowledge with other professionals. LinkedIn has quite a large fleet of servers on-premise that depend on internal...

  • Saira joined our Bangalore site reliability engineering (SRE) team to tackle large-scale site engineering challenges and grow. She highlights for us the impactful work she found here, from ushering in LinkedIn’s next-generation server query system that runs over a fleet of 350,000 servers to mentoring the next generation of female engineers: In my...

  • Operating system upgrades at LinkedIn’s scale

    August 31, 2022

    Co-authors: Hengyang Hu, Dinesh Dhakal, Kalyanasundaram Somasundaram. Introduction: Completing recurring operating system (OS) upgrades on time and without impacting users can be challenging. For LinkedIn, completing these upgrades at a massive scale has its own complexities as we’re often facing multiple upgrades. To secure our platform and protect our members’...

  • Spike detection in Alert Correlation

    December 22, 2021

    Introduction: LinkedIn’s stack consists of thousands of different microservices and the associated complex dependencies among them....

  • Rethinking site capacity projections with Capacity Analyzer

    March 16, 2021

    While site outages are inevitable, it’s our job to minimize both the duration of outages and the likelihood of an outage occurring....

  • Open source update: School of SRE

    February 3, 2021

    Co-authors: Akbar KM and Kalyanasundaram Somasundaram. Site up and secure is a fundamental element of how we operate, and site...