Introduction: LinkedIn’s stack consists of thousands of microservices with complex dependencies among them. When a misbehaving service causes a production outage, finding the exact service responsible is challenging and time-consuming. Although each service has multiple alerts configured in a distributed...
SRE Articles
- Topics:
- SRE
While site outages are inevitable, it’s our job to minimize both the duration of outages and the likelihood of one occurring. One of our preemptive measures is how we determine overall site capacity and health on an everyday basis: we load-test in production. There’s an elegant system to bucket and route members to specific data centers from...
- Topics:
- Performance,
- infrastructure,
- SRE
Co-authors: Akbar KM and Kalyanasundaram Somasundaram. Site up and secure is a fundamental element of how we operate, and site reliability engineers (SREs) play a critical role in fulfilling that responsibility. Talent has always been the number one operating priority, and over the last few years, we’ve been running multiple programs to identify, hire, and...
- Topics:
- Open Source,
- SRE
As companies grow, adapt, morph, and mature, one item remains the same: the need for reinvention. Technical infrastructure is no...
- Topics:
- Performance,
- infrastructure,
- linux,
- SRE
Espresso is LinkedIn's de facto NoSQL database solution. It is an online, distributed, fault-tolerant database that powers most of...
- Topics:
- Performance,
- ESPRESSO,
- site speed,
- SRE
Co-authors: Viranch Mehta, Jon Sorenson, and Samir Jafferali. As LinkedIn has grown to more than 690 million members, we’ve expanded our...
- Topics:
- scale,
- infrastructure,
- SRE