Project STAR*: Streamlining Our On-Call Process
January 10, 2018
Co-authors: Bef Ayenew and Adam Hobson
Consider the following conversation that used to be typical at LinkedIn:
"Folks, we may have an on-call problem this week..."
"Our Android engineer is missing."
"Where are they?"
"They’ve been on leave for two weeks now."
Welcome to the on-call rotation for the LinkedIn flagship mobile app and LinkedIn.com desktop and mobile website, better known as Voyager On-Call within LinkedIn. These on-call engineers are responsible for site-up support, as well as keeping the continuous integration and deployment process healthy and running smoothly. The surreal exchange above captures the more-than-occasional dysfunction that plagued Voyager On-Call in the past, in large part due to the scale and the pace of both the organization and the product. When you have over 600 engineers working on a massive product (10,434 total code commits in Q4 of 2017) and a new on-call rotation is assembled every week, it’s only natural to have your occasional breakdowns and scheduling hiccups.
In early 2017, a small group of our flagship app engineering managers started exploring ways to improve this fragile and often unreliable rostering and scheduling process. As we dug deeper, it became apparent that Voyager On-Call had room for improvement in several other areas as well. What started out as an effort to improve just one key aspect of Voyager On-Call soon snowballed into a broader revamp of the entire Voyager On-Call process, with a number of different objectives, including on-call effectiveness and engineer/manager happiness. This is a story about Project STAR* (more on the name later) and our modest efforts to make on-call effective at scale.
Before we dive into the details of the project, let’s first present a quick overview of the Voyager On-Call landscape prior to our changes. Voyager is the internal name of the LinkedIn flagship app, which is available on web, Android, and iOS. Each of these platforms, along with the frontend API, has a weekly on-call engineer who is pulled from a list of eligible Voyager engineers working on that platform. The on-call team also includes members from our Flagship Productivity and Site Reliability Engineering (SRE) organizations, as well as a manager responsible for leading and coordinating that particular on-call rotation.
As with any good project, the first thing we had to do was clearly articulate the problem statement. It was obvious that there were several different problem areas within Voyager On-Call, so we sat down and came up with four high-level buckets that covered the most serious issues facing On-Call at the time:
- Scheduling and communication
- Tools and technical debt
- Accountability
- Roles and responsibilities
We came up with the name Project STAR* as an acronym based on the four problem buckets.
With our four problem statements defined, we were ready to start looking for lasting solutions, as we had no interest in turning this project into a perpetual side gig. We sought solutions that could be automated and self-sustaining. Since we were just volunteers without edicts or mandates, we needed our solutions to get buy-in from stakeholders at all levels, including senior leadership, front-line managers, and the engineers.
We rolled out most of our strategic changes on an ongoing, iterative basis. As is best practice in software development, we wanted to be agile and responsive to user feedback when making changes and improvements to the Voyager On-Call process. We kicked off our project with a “road show” to the various Voyager platform meetups to socialize our project and garner feedback on our proposed changes.
To keep Project STAR* accountable to its mission and goals, we sent a weekly survey to Voyager On-Call teams and managers. Some of the questions were based on a similar survey from 2016 so we could make direct comparisons, but ultimately, the purpose of the survey was to help us measure our progress against the success criteria we had set for ourselves.
While attempting to quantify the results, we learned that we were missing some key data points and implemented a second survey targeted at the Voyager On-Call managers. The purpose of this survey was to supplement the first survey and to address some blind spots we had around measuring accountability. Between the two surveys, we felt that we were getting a strong enough signal to allow us to tweak things in all four problem areas and measure impact.
Scheduling and communication
From the early going, it was clear that our scheduling was a major problem area. But a closer look at the problem quickly revealed that it was not just the weekly scheduling that was broken.
The on-call rosters that the weekly schedules were drawn from were also severely flawed. First of all, these rosters were pulled from Voyager ACL files, which did not cover all engineers contributing to the Voyager code base. To make matters worse, the rosters were also overtaxing engineers who appeared in ACLs for multiple code bases. The manager roster was even less reliable, because it relied on managers (or their managers) being aware of the on-call process and adding their names to it. For example, if a manager did not add their own name, they would never be on-call.
As a result, these rosters were at best incomplete and stale; at worst, they were missing a lot of on-call eligible engineers. Again, the root cause here was the scale of the organization, so the challenge was to find a way to keep the rosters and the schedules current as people were moving within or in and out of the organization.
The schedules that were drawn from these rosters were pulled periodically, and if they didn’t go out far enough, it would be hard to find engineers at short notice, since they would already be committed to product engineering work. On the other hand, if the schedules went too far out, gaps would appear from the inevitable churn in eligible personnel. In addition, there was no regular communication when an engineer was added to a roster or scheduled, nor were there reminders when their shift was upcoming. Lastly, engineers did not commit to shifts, so when they would find out at the last minute that they were scheduled for on-call right then, they would occasionally fail to show (for example, if they were on vacation), leaving the on-call manager scrambling to find a replacement.
Our plan was to first define the rules around the roster and scheduling and then automate the entire process so that it would be self-sustaining and would require little to no manual intervention.
For rostering, we decided that any significant contributor to the Voyager code base would be eligible for the on-call roster. For automation purposes, we initially defined a “significant contributor” as anyone with one or more commits to a Voyager code base within the last month. Based on early feedback, this threshold was eventually raised to five or more commits. We also decided that any given engineer was only required to participate in a single roster, even if they contributed to multiple Voyager code bases.
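The rostering rule lends itself to a small filter. The following is a minimal sketch, not the tooling the team built: the function name, the tuple layout of the commit records, the 30-day approximation of "the last month", counting commits across all Voyager code bases, and the tie-break for engineers who contribute to several platforms are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

MIN_COMMITS = 5              # "five or more commits", per the article
WINDOW = timedelta(days=30)  # assumption: "the last month" taken as 30 days


def build_rosters(commits, today):
    """Place each significant contributor on exactly one platform roster.

    commits: iterable of (engineer, platform, commit_date) tuples (hypothetical format).
    Returns {platform: sorted list of engineer names}.
    """
    cutoff = today - WINDOW
    per_engineer = {}
    for engineer, platform, day in commits:
        if cutoff <= day <= today:
            per_engineer.setdefault(engineer, Counter())[platform] += 1

    rosters = {}
    for engineer, counts in per_engineer.items():
        # Assumption: the threshold applies to commits summed over all platforms.
        if sum(counts.values()) < MIN_COMMITS:
            continue
        # Assumption: an engineer joins the roster of the platform they commit
        # to most; ties are broken alphabetically by platform name.
        platform = min(counts, key=lambda p: (-counts[p], p))
        rosters.setdefault(platform, []).append(engineer)
    return {platform: sorted(names) for platform, names in rosters.items()}
```

Because each engineer is assigned to a single platform, someone who commits to both Android and iOS code lands on only one roster, which is the overtaxing problem the new rule was meant to remove.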
For the management roster, we determined that a manager was eligible for the manager on-call roster if they had one or more direct reports on an engineering on-call roster. Since we were adding dozens of new individuals to the on-call roster who had never participated before, we also created a shadow roster and process that allowed the newly added engineers and managers to shadow an existing on-call shift before joining the primary on-call rosters.
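The manager rule and the shadow roster can be sketched in the same spirit. This is an illustrative sketch only: the `reports_to` mapping, the `has_served_before` set, and both function names are hypothetical stand-ins for whatever org-chart and history data the real automation used.

```python
def eligible_managers(reports_to, engineer_rosters):
    """Return managers with at least one direct report on an engineering roster.

    reports_to: {engineer: direct manager} (hypothetical org-chart data).
    engineer_rosters: {platform: [engineer, ...]}.
    """
    on_roster = {name for names in engineer_rosters.values() for name in names}
    return sorted(
        {manager for engineer, manager in reports_to.items() if engineer in on_roster}
    )


def split_shadow(roster, has_served_before):
    """Split a roster into (primary, shadow) lists.

    Assumption: anyone who has not yet served an on-call shift shadows first.
    """
    primary = [name for name in roster if name in has_served_before]
    shadow = [name for name in roster if name not in has_served_before]
    return primary, shadow
```

Deriving the manager roster from the engineering rosters means a manager no longer has to know about the process and opt in by hand, which was the failure mode of the old manager roster.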
We used the Oncall tool that LinkedIn has open sourced to manage the various on-call schedules. We decided to schedule sixty days out to give early enough notice while minimizing the time for gaps to appear.
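To make the sixty-day horizon concrete, here is a minimal sketch that lists the weekly shifts still needing an assignment. It does not use or imitate the Oncall tool's API; the function name, the Monday shift start, and the set of already scheduled dates are assumptions for illustration.

```python
from datetime import date, timedelta

HORIZON_DAYS = 60        # "sixty days out", per the article
SHIFT_START_WEEKDAY = 0  # assumption: weekly shifts start on Monday


def shifts_to_schedule(today, already_scheduled):
    """Return shift start dates within the horizon that have no assignment yet."""
    # Days until the next shift start (0 if today is itself a shift start).
    days_ahead = (SHIFT_START_WEEKDAY - today.weekday()) % 7
    start = today + timedelta(days=days_ahead)
    horizon = today + timedelta(days=HORIZON_DAYS)

    missing = []
    while start <= horizon:
        if start not in already_scheduled:
            missing.append(start)
        start += timedelta(weeks=1)
    return missing
```

Run on a regular cadence, a check like this keeps the schedule filled to a fixed distance ahead instead of being pulled periodically in large batches.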
With the major scheduling concerns now addressed, we shifted our focus to proper communication. We decided that the most effective points of communication would be a notice when someone was added to a roster and another notice four weeks before they were scheduled to start a rotation, followed by reminders at two weeks and one week.
To ensure that the communication was received, we asked the assigned on-call managers to get commitments at both the four- and two-week reminder mark by taking a roll call. A tool was made available to on-call managers to help find replacements when there were gaps, but because of how early the communication started, the gaps were far less frequent now.
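The notification cadence described above reduces to simple date arithmetic. The sketch below is illustrative: the function name and the tuple layout are assumptions, and only the four-, two-, and one-week offsets and the roll calls at four and two weeks come from the text.

```python
from datetime import date, timedelta

# (weeks before the shift starts, whether the on-call manager takes a roll call)
REMINDERS = [(4, True), (2, True), (1, False)]


def reminder_plan(shift_start):
    """Return (send_date, weeks_before, needs_roll_call) for one on-call shift."""
    return [
        (shift_start - timedelta(weeks=weeks), weeks, roll_call)
        for weeks, roll_call in REMINDERS
    ]


for send_date, weeks, roll_call in reminder_plan(date(2018, 3, 5)):
    action = "notice + roll call" if roll_call else "reminder"
    print(f"{send_date}: {weeks} week(s) out, {action}")
```

Tying the roll call to the four- and two-week marks leaves at least two weeks to find a replacement when someone cannot commit to their shift.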
As a final, yet very important step, we automated every phase of the process to ensure that the system mostly ran itself with a little help from the on-call admins and on-call managers, leading to a far more decentralized and scalable on-call process.
The 2016 Voyager On-Call Survey revealed that 36% of on-call participants were dissatisfied with the scheduling process and communication leading up to their rotation. After Project STAR*, the share of dissatisfied participants has dropped to just 8%, a relative improvement of 78%.
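The improvement figure here is a relative reduction, not a difference in percentage points. A minimal sketch of both calculations follows; the function names are illustrative, and the inputs are the survey figures quoted above.

```python
def relative_improvement(before_pct, after_pct):
    """Reduction in the dissatisfied share, as a percentage of the starting share."""
    return (before_pct - after_pct) / before_pct * 100


def percentage_point_drop(before_pct, after_pct):
    """Absolute change between two percentages, in percentage points."""
    return before_pct - after_pct


# Scheduling and communication: 36% dissatisfied in 2016, 8% after Project STAR*.
print(round(relative_improvement(36, 8)))  # 78
print(percentage_point_drop(36, 8))        # 28
```

So the 78% figure means that roughly three out of four previously dissatisfied participants are no longer dissatisfied, which corresponds to a drop of 28 percentage points. The same two calculations apply to the other before-and-after survey figures in this article.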
Tooling and technical debt
Surveys conducted in 2016 showed that perhaps the single most important pain point for on-call developers was tooling support. LinkedIn has a fantastic Foundation Team that is responsible for developing and supporting dozens of tools that LinkedIn teams use on a daily basis. However, surfacing and solving tooling issues that appeared during an on-call rotation was a somewhat chaotic process, for reasons that could once again be traced to the scale of the organization. Our on-call engineers were not clear on how to deal with tooling issues, and this was leading to delays in service deployments and compromising the health of the overall product.
We had a stroke of good fortune on this front, as a dedicated SRE team for Voyager formed at just the right time, allowing us to introduce Voyager SRE as a central part of our new proposal.
We worked with this new Voyager SRE team to add deployment tools support as part of their charter. Additionally, Voyager SRE committed to creating playbooks for deployment tool issues. These playbooks would be used by on-call engineers as the first line of support, protecting the SRE team from fielding inquiries on the same issues rotation after rotation.
However, not all tooling issues could be resolved by the Voyager SRE team. So, based on their charters, we arrived at an arrangement where the SRE team would address deployment tooling issues and the Flagship Productivity team would address build time issues. In both cases, having a single point of contact within Voyager was instrumental in the following:
- Assigning clear ownership of tooling issues so there can be continuity across on-call rotations in following up on issues through resolution.
- Enabling proper triaging of flagship tooling issues so the Foundation Team is clear on which ones to prioritize.
- Enabling proper auditing of tooling issues so no duplicate issues are filed with the Foundation Team.
- Making it easier for the Foundation Team to reach out with any questions and suggestions they have on all things flagship-related.
In the end, we settled on a process that had the support of the Foundation Team and, perhaps even more importantly, gave on-call engineers clear and effective options when they ran into tooling issues.
Previously, 16% of Voyager On-Call participants indicated their dissatisfaction with the support they received related to tooling issues. After leveraging Productivity Engineering and SREs in the new process, that number has dropped to 6% for issues with trunk health tooling and 2% for issues with deployment tooling, relative improvements of 63% and 88%, respectively.
Of equal importance was the fact that the process improvements we made were also well-received by our partner teams in the Foundation Organization, which is responsible for the trunk and deployment tools. More specifically, they felt that they now had to do less hand-holding with Voyager on-call engineers and also appreciated the drop in duplicate issues being filed against them.
Accountability
To borrow the words of our former VP of Engineering, “Voyager On-Call is a tax the teams within Voyager have to pay to be part of Voyager and use its vast community resources.” And much like the government, the system can survive and thrive only if teams and individuals hold themselves accountable and honor their commitments.
Although most teams and individuals were meeting the expectations of Voyager On-Call, there were some who were not, in large part because of the misalignment between their team objectives, their personal objectives, and the On-Call objectives. In fact, the 2016 on-call survey showed that a stunning 90% or more of on-call engineers did not know whether their on-call participation was in any way factored into their annual evaluation, highlighting this lack of alignment in objectives.
The same survey also revealed that a lot of on-call engineers were unable to be effective because their product engineering commitments were interfering during a week when they were supposed to be 100% committed to on-call. To make a long story short, Voyager On-Call was not getting everyone’s best effort, and we had to find a way to change that without going to the extremes of the IRS on tax evasion.
Voyager On-Call is one of the most important roles in Flagship because on-call is responsible for the overall health of the flagship app. And yet, there was a huge disconnect in the attitude many engineers and managers held towards their on-call duties. This was never more evident than in the first feedback we received in our weekly survey. When asked to agree or disagree with the statement, “My regular work did not get in the way of my on-call commitments,” an on-call engineer used the free-form response to write, “I put disagree for 'regular work' [because] actually my on-call REALLY got in the way of my regular work.” This was an attitude we needed to fix urgently.
It’s not a surprise that engineers held this attitude. They were reviewed, promoted, and rewarded largely based on their product engineering work. Voyager On-Call was just something they had to deal with for a week once or twice a year. In reality, however, Voyager On-Call is far more important than even the most important project. If the site goes down, even the greatest revenue-doubling project is dead in the water. If the site or mobile apps can’t be released or the trunk of the main code repository is unhealthy (regression tests are failing), all projects are dead in their engineers’ sandbox.
We took a three-pronged approach to improving this attitude and instituting accountability. The first approach was to instill the importance of Voyager On-Call with a roadshow of Project STAR* for all Voyager engineers and managers. We presented not just the proposed changes but the importance of on-call to LinkedIn as well.
Our second approach was to boost accountability via feedback. On-call participation is a named requirement in LinkedIn’s Career Progression Plans (CPPs), which set the expectations for LinkedIn engineers across levels and are used for promotions and annual reviews. However, there had never been a good method for accounting for on-call participation. We implemented a simple feedback process, whereby the on-call manager would provide feedback on their on-call engineers to each engineer’s direct manager.
On-call managers were also encouraged to award Bravos, a spot bonus used within LinkedIn, to on-call engineers exceeding expectations. This feedback mechanism would provide the context needed for managers to properly evaluate their engineers’ on-call contributions at promotion and evaluation time, which in turn would give engineers an incentive to perform their on-call rotation well.
The third approach was to recognize that accountability goes both ways. Engineers can only be held accountable for meeting or exceeding the expectations of on-call if the on-call support system meets their expectations in turn. We therefore implemented a survey for each on-call rotation to solicit feedback on how well the engineers were supported during their on-call shift. Specifically, we asked for feedback on the effectiveness of scheduling and communication, productivity engineering, SREs, and the on-call manager. We also included an explicit question asking if regular work was getting in the way of on-call.
With this weekly feedback, we could now ensure that the on-call engineers were receiving the support they needed to perform their jobs as effectively as possible, and if not, then there was accountability to fix the issue and provide the support needed.
The single largest metric improvement we saw was related to accountability. According to the 2016 survey, 84% of Voyager On-Call participants felt their regular feature and product development was getting in the way of their on-call responsibilities during their rotation. That number has now dropped to 9%, an improvement of 89%, with an absolute improvement of 75 percentage points.
In addition, the manager surveys support the substantial increase in accountability, in that no Voyager On-Call participant has failed to meet expectations since the survey was first conducted.
Roles and responsibilities
We had a fundamental issue around the clear understanding of the roles and the responsibilities of on-calls, often leading to confusion and, on occasion, even dereliction of duty. This problem was in part a function of the sheer volume of loosely organized on-call documentation, which made it a challenge for developers to grok and digest this critical information in a short time. The other reason was simply the absence of documentation to account for the changes the process has seen over time. In short, some of the documentation was far from current.
In defining roles more clearly, our goal was to clarify what was expected of each on-call engineer in terms of their general function as part of the on-call team. In clarifying responsibilities, we wanted to highlight different scenarios the on-call engineers may face and explain what would be expected of the on-call engineer in each.
The existing process to pass on knowledge and information on Voyager On-Call roles and responsibilities was broken in multiple places. Some areas were ill-defined or simply not defined at all. When responsibilities were defined, they were mixed together with technical instructions and implementation details in a giant, practically unreadable wiki page. Lastly, this single wiki page was the only source of official training to onboard new on-call engineers and managers.
We addressed these gaps using two primary approaches: fixing the existing documentation and providing onboarding training to new on-call engineers and managers.
Before we could fix the documentation gaps, we had to fix the information architecture of the Voyager On-Call documentation. This involved explicitly separating the documentation for the roles and responsibilities (the “what”) from the process and technical documentation (the “how to”).
We worked to distill the primary responsibilities into three key areas: (1) help keep the site up, (2) help keep the trunk healthy, and (3) help deploy the application. Even though the technical process and implementation may differ depending on platform, all on-call engineers and managers are responsible for the same three primary areas.
We then clearly set the expectation that on-call was considered the primary responsibility and was expected to consume 100% of an engineer’s capacity for their week of rotation.
Once the information architecture was fixed and the high-level roles and responsibilities were defined, we were able to fill in the remaining documentation gaps, defining communication channels, points of contact on partner teams, and a RAPID process for deciding when to roll back API and web deployments and when to hotfix or skip Android and iOS releases.
In addition to documentation, we also presented the new high-level roles and responsibilities to each sub-team within Voyager to ensure the information was disseminated to veteran engineers, who may be less likely to reread documentation or retrain.
Lastly, as previously mentioned, we defined a shadow rotation and process to onboard new engineers to Voyager On-Call by providing them with a training shift and pairing them with veteran on-call engineers and managers.
Unfortunately, the 2016 Voyager On-Call Survey did not provide a question to measure feedback on roles and responsibilities, so we are left with no baseline. In the current Project STAR* survey, only 7% of on-call engineers and 0% of on-call managers indicated that they did not understand what they were responsible for. Only 3.5% of respondents indicated that their Voyager On-Call manager was not effective. Since we do not have a baseline for these metrics, we have been following up with these individuals and clarifying areas of improvement in our documentation.
We knew from the start that automation would be a key to Project STAR*, but even then we underestimated its importance. We initially planned to automate just the roster construction and the email communication to on-call rotation members. As we progressed through the project, we discovered more and more opportunities to automate or create tools to solve pain points. What we couldn’t automate, we documented so future on-call rotations could learn from documentation rather than tribal knowledge.
Our iterative approach and policy of open and transparent communication were validated in spades. We socialized nearly all proposals with key engineering community leaders to solicit and adjust to feedback. We then presented those proposals to the wider engineering community, again seeking feedback. We used not one but two surveys to make sure we truly understood how both on-call engineers and on-call managers felt. The final result was all the more successful thanks to the feedback and validation we received along the way. Our constant communication also made everyone feel like they were a critical part of the process, making it more likely for people to accept our proposals.
A lot of people believe that culture change is difficult or even impossible. In the beginning, quite a few engineers felt that being on call was not a core part of their job; we needed to convince them that it was. Our business depended on it. We succeeded in doing that through training and roadshows, and by getting everyone's buy-in. Not only were we able to convince everyone of the importance of being on-call, but after Project STAR*, people actually reported being happier about being on-call.
Project STAR* is an example of one of LinkedIn’s core values, “Relationships matter.” A project this large could never have been as successful without strong collaboration between multiple parties.
Special thanks to Andrew Pottenger and Warren Zhang for helping drive different phases of this project; to Baofa Feng and Vicky Chen for building out the automation and tooling that enabled Project STAR*’s success; to Fellyn Silliman for serving as our liaison with the SRE team that owns the Oncall tool; and to Rashmi Jain and Venkat Sundramarthy for the smooth transition to Productivity ownership at the sunset of the project.
We’d also like to thank Kamini Dandapani for her overall leadership and guidance; Felipe Salum and Isaac Finnegan for their support with aligning the SRE team; Surya Nistala for running the 2016 Voyager On-Call survey; and Xin Sun, Félix Pageau, and Nicholas Swartzendruber for providing feedback and validation.