Introduction LinkedIn’s stack consists of thousands of different microservices and the associated complex dependencies among them. When a production outage happens due to an issue with misbehaving services, finding the exact service responsible for the outage is challenging and time-consuming. Although each service has multiple alerts configured in a distributed...

Posts by Nishant Singh
-
- Topics:
- SRE
-
Eliminating toil with fully automated load testing
Nishant Singh December 6, 2019
Introduction In 2013, when LinkedIn moved to multiple data centers across the globe, we needed a way to redirect traffic from one data center to another in order to mitigate potential member impact in the event of a disturbance to our services. This need led to the birth of one of the most important pieces of engineering at LinkedIn, called TrafficShift. It...
- Topics:
- data center,
- Automation,
- SRE