Co-authors: Erik Krogen and Min Shen

In March 2015, LinkedIn’s Big Data Platform team experienced a crisis. As the team was preparing to head home for the day, signs of trouble began trickling in: our internal users were reporting that their applications were stalling or timing out. Job queues were backing up, and SLAs would be missed. A bit of investigation...