Open Sourcing Iris and Oncall
June 29, 2017
At a company as large as LinkedIn, service degradation isn’t a question of “if” so much as “when,” and when things do break, we need to escalate as quickly as possible to make sure the problem gets fixed. This usually takes the form of calling up an on-call engineer, but what if this person doesn’t answer the phone? In the past, LinkedIn addressed this question manually, with NOC engineers escalating incidents to on-call team members. As one might imagine, this was a very hands-on process with a lot of ambiguity. The NOC found it difficult to determine who to contact as a secondary point of escalation and how to get in touch with them. To make matters worse, the number of alerts at LinkedIn was growing extremely quickly (as it continues to do today), and the relatively small number of engineers in our NOC couldn’t possibly keep up with the skyrocketing number of alerts.
Example of alerts vs. NOC scaling
Rather than continuing to rely on this manual process, in the summer of 2015 we decided to create a new, automated system. Iris, named after the Greek goddess of messages, is our open-sourced solution for incident escalation and reliable messaging, and has provided LinkedIn with fast, automated escalations for almost two years now. Iris solves the problem of ambiguity by allowing its users to specifically define an escalation plan that it will automatically follow in the event of an incident. Here’s an example of an Iris escalation plan:
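The plan itself is easiest to picture as structured data. As an illustrative sketch only — the field names and layout below are assumptions for this post, not Iris’ actual schema — a two-step plan along the lines of the example might look like:

```python
# Hypothetical sketch of an escalation plan as structured data.
# Field names ("role", "priority", "repeat", "wait") are illustrative,
# not Iris' actual schema.
plan = {
    "name": "monitoring-infra-escalation",
    "steps": [
        [  # step 1: two notifications sent in parallel
            {"role": "team", "target": "monitoring-infra",
             "priority": "medium", "repeat": 0, "wait": 0},
            {"role": "oncall-primary", "target": "monitoring-infra",
             "priority": "high", "repeat": 2, "wait": 300},
        ],
        [  # step 2: escalate further if nobody claims the incident
            {"role": "manager", "target": "monitoring-infra",
             "priority": "urgent", "repeat": 1, "wait": 300},
        ],
    ],
}
```

Note that each step is a list of notifications sent together, and each notification carries its own repeat count and wait time.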
Plans are defined as a series of steps. If Iris doesn’t hear back from the user after executing a step, it proceeds to the next step in the plan, continuing until someone responds or the plan has no further steps defined. Focusing on step one, we see two cards that define the messages sent in this step. Here, we send a medium-priority message to the whole Monitoring Infrastructure team, along with a high-priority message to the primary on-call. These are sent in parallel, and either the team or the on-call engineer can acknowledge the incident. A plan’s notifications can be configured to repeat, with a waiting period between repetitions; in this example, the high-priority message on the right is repeated twice, with five minutes between messages.
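In pseudocode terms, this escalation logic boils down to a loop over the plan’s steps. The sketch below is a simplification, not Iris’ real sender: `send_message` and `is_claimed` are hypothetical callbacks standing in for vendor delivery and claim tracking, and repeated sends within a step are elided.

```python
import time

def escalate(plan, incident, send_message, is_claimed, poll_interval=60):
    """Walk a plan's steps until the incident is claimed or steps run out.

    A simplified sketch: `send_message` and `is_claimed` are hypothetical
    callbacks standing in for vendor delivery and Iris' claim tracking;
    repeated sends within a step are elided for brevity.
    """
    for step in plan["steps"]:
        if is_claimed(incident):
            return True
        # All notifications in a step go out together.
        for notification in step:
            send_message(incident, notification)
        # Wait out the step (including its repeats) before escalating.
        remaining = max((n["repeat"] + 1) * n["wait"] for n in step)
        while remaining > 0:
            if is_claimed(incident):
                return True              # someone responded; stop escalating
            time.sleep(poll_interval)
            remaining -= poll_interval
    return is_claimed(incident)          # plan exhausted
```

The key property, mirrored from the text, is that escalation stops as soon as anyone acknowledges, and otherwise marches through every step the plan defines.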
One of the more surprising features of an Iris plan is that no specific modes of contact are specified anywhere. Instead, we define priorities, from low to urgent, and allow users to map contact modes to these priorities. This gives our users a high level of flexibility in defining how they wish to be contacted. One person may prefer to receive Slack messages rather than email, while another may opt for SMS, and Iris is fully able to support these preferences.
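One way to picture this indirection is a per-user lookup table with a default fallback. This is a sketch of the idea only — the default mapping and user preferences below are made-up assumptions, not Iris’ configuration format:

```python
# Sketch of priority-to-mode indirection: plans name only priorities,
# and each user maps those priorities to contact modes. The defaults
# and per-user preferences here are illustrative assumptions.
DEFAULT_MODES = {"low": "email", "medium": "email",
                 "high": "sms", "urgent": "call"}

USER_MODES = {
    "alice": {"medium": "slack", "high": "sms"},  # prefers Slack to email
    "bob": {"low": "sms", "medium": "sms"},       # prefers SMS across the board
}

def resolve_mode(user, priority):
    """Return the contact mode a given user wants for a given priority."""
    return USER_MODES.get(user, {}).get(priority, DEFAULT_MODES[priority])
```

For example, `resolve_mode("alice", "medium")` yields `"slack"`, while a user with no preferences simply falls back to the defaults — the plan author never has to know either way.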
Architecture and design
The diagram above shows the general architecture of Iris. To make better sense of it, let’s track the lifetime of an incident through the Iris pipeline. Beginning at the top, in green, an application triggers an incident by sending a POST request to Iris’ REST API, which tracks the incident in its database. Then, the Iris sender uses this incident data to generate messages according to the incident’s escalation plan, forwarding the notifications to external messaging vendors such as Twilio or Slack for delivery. A user then receives the message and responds to it to claim the incident, either by using the Iris frontend or by sending a reply to the vendor. If a claim is processed through the vendor, an additional trip through the Iris relay is needed to provide access to Iris’ internals through our production firewalls. Finally, the API receives the user’s request to claim the new incident, marking the incident as acknowledged. After the incident has been claimed, Iris’ job is done; it has guaranteed successful message delivery and confirmed that someone is responding to messages, so it ceases to escalate further.
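From an application’s point of view, triggering an incident is just an HTTP POST. Here is a hedged sketch using only Python’s standard library — the endpoint path, base URL, and payload fields are illustrative assumptions, not Iris’ documented API:

```python
import json
import urllib.request

def build_incident_request(base_url, plan_name, context):
    """Build a POST request for a hypothetical Iris-style incident endpoint.

    The "/incidents" path and the payload fields are illustrative
    assumptions; consult Iris' own API documentation for the real
    interface.
    """
    payload = json.dumps({"plan": plan_name, "context": context}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/incidents",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it is then one call (commented out so the sketch stays offline):
# resp = urllib.request.urlopen(build_incident_request(
#     "http://iris.example.com/v0", "monitoring-infra-escalation",
#     {"host": "app42", "metric": "error_rate"}))
```

Everything downstream of that POST — sender, vendors, relay, claim — happens without any further involvement from the triggering application.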
One thing to highlight in Iris’ design is its modularity. In its architecture, everything external to Iris is pluggable, general, and abstract. Though LinkedIn currently uses Twilio for much of our message delivery, we have also designed Iris to be generic and independent of its external applications. Iris stands alone as a central, reliable messaging hub, working alongside our metrics pipeline, rather than within it. Originally, we attempted to create a service that was tightly woven into this pipeline, but this approach proved to be far too brittle, with too many assumptions made about use cases. Iris solves these issues with abstraction, remaining flexible and adaptable.
One example of this flexibility lies in the abstraction of roles in Iris plans. In our previous solution, escalation paths were rigidly defined, based on manager information in LDAP. However, this meant that incidents handled by senior-level technical staff would directly escalate to directors and VPs! Rather than specifically defining users to escalate to, Iris supports custom roles, with pluggable methods of role lookups. Currently, LinkedIn makes heavy use of the “team,” “on-call,” and “manager” roles, each of which is determined dynamically from an outside source of truth. This gives Iris a high level of flexibility in determining whom to contact. Iris itself doesn’t need to track these details; it can instead rely on another service to provide these details and focus purely on message delivery.
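That pluggability can be pictured as a small registry of lookup functions, one per role, each deferring to an outside source of truth. The names and registry mechanism below are hypothetical; Iris’ real plugin interface may differ:

```python
# Sketch of pluggable role resolution: each role name maps to a lookup
# function that consults some outside source of truth. The role names
# match the post; the registry mechanism is an assumption.
ROLE_LOOKUPS = {}

def register_role(name):
    """Decorator registering a lookup function under a role name."""
    def wrap(fn):
        ROLE_LOOKUPS[name] = fn
        return fn
    return wrap

@register_role("team")
def team_members(target):
    # Hypothetical stand-in for a directory / LDAP query.
    return ["alice", "bob", "carol"]

@register_role("oncall-primary")
def oncall_primary(target):
    # Hypothetical stand-in for a query against Oncall's API.
    return ["alice"]

def resolve_role(role, target):
    """Expand a (role, target) pair into concrete usernames."""
    return ROLE_LOOKUPS[role](target)
```

Because plans only ever name `(role, target)` pairs, swapping the source of truth — say, replacing the directory query behind `"manager"` — never touches the plans themselves.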
Oncall, Iris’ source of truth
To provide Iris with this source of truth for determining who is on-call for a given team, we introduced another product: Oncall. Oncall allows managers to define rotating schedules for on-call shifts, and provides a calendar for viewing and changing these shifts as needed. The image below gives us a good look at what Oncall has to offer:
Oncall comes with built-in support for follow-the-sun schedules, and provides a clean UI for swapping, editing, and deleting events. It supports a number of different event types, and has built-in shortcuts for overriding an existing shift, should the need arise for a substitution. It acts as a specialized calendar of sorts, making management of on-call schedules fast, clean, and painless.
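At its core, a rotating schedule is mostly arithmetic: given a roster, a rotation start, and a shift length, the current on-call falls out of integer division. The sketch below is a deliberate simplification — Oncall’s real scheduler must also handle time zones, overrides, swaps, and the different event types mentioned above:

```python
from datetime import datetime, timedelta

def current_oncall(roster, rotation_start, shift_length, now=None):
    """Return who is on call `now` for a simple rotating schedule.

    A simplified sketch: real scheduling (including follow-the-sun
    handoffs, overrides, and swaps) is considerably more involved.
    """
    now = now or datetime.utcnow()
    shifts_elapsed = int((now - rotation_start) / shift_length)
    return roster[shifts_elapsed % len(roster)]
```

With a three-person roster and week-long shifts starting January 2, 2017, the rotation simply cycles: alice, then bob, then carol, then alice again.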
One of the advantages of having Oncall as a separate service is the ability to provide teams with an on-call scheduling tool without necessarily tying in escalation. For example, many IT teams at LinkedIn don’t own critical applications, but instead use Oncall to define a first point of contact for the team. In these cases, Iris’ escalation plan definitions aren’t needed, and teams can use Oncall alone as a specialized calendar.
Operating Iris and Oncall at scale
Automation and tuning
Of course, building these solutions for on-call management was only half the problem; operating these tools came with challenges of its own. Early in Iris’ lifetime, one of the first difficulties we faced was a huge influx of incoming incidents resulting from a noisy alerting system. Previously, since the escalation process was entirely manual, the rate of outgoing messages was limited to the rate at which humans could send them. If many alerts fired at the same time, NOC engineers would still place only one call, acting as a sort of manual aggregator for alerts. Naively migrating all alerts to immediately trigger an incident with Iris removed this “manual aggregator,” and correlated alerts firing at the same time became a huge issue for early adopters.
We introduced a number of features to help our users handle these alert storms, including automatic batching of related messages. In addition, we introduced a mechanism to automatically lower the priority of incoming messages if too many were sent to a particular user in a given time window. Ultimately, though, we found that the only true solution was to resolve the underlying problem of noisy alerting. By grouping and tuning alerts, our users were able to cut down on noise and keep Iris incidents at a manageable level. In this case, we found that a properly configured alerting system was critical to reliability, especially once automation was thrown into the mix. Iris didn’t create any problems in and of itself; it instead highlighted existing issues that manual escalation had swept under the rug. At the end of the day, our alerting system became not only more reliable with Iris, but also less noisy.
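The priority-lowering mechanism can be pictured as a sliding-window counter per user. This is a sketch of the idea only — the window, threshold, and downgrade table below are made-up values, and real bookkeeping would also need to be thread-safe:

```python
from collections import defaultdict, deque

class PriorityDamper:
    """Lower message priority when a user is flooded in a time window.

    A sketch of the idea only: the window, threshold, and downgrade
    table are illustrative assumptions, not Iris' actual values.
    """
    DOWNGRADE = {"urgent": "high", "high": "medium",
                 "medium": "low", "low": "low"}

    def __init__(self, window_seconds=300, threshold=5):
        self.window = window_seconds
        self.threshold = threshold
        self.sent = defaultdict(deque)   # user -> timestamps of recent sends

    def adjust(self, user, priority, now):
        """Record a send at `now`; return the (possibly lowered) priority."""
        recent = self.sent[user]
        while recent and now - recent[0] > self.window:
            recent.popleft()             # drop sends outside the window
        recent.append(now)
        if len(recent) > self.threshold:
            return self.DOWNGRADE[priority]
        return priority
```

Because only the priority is lowered, and users map priorities to contact modes themselves, a storm degrades gracefully — say, from phone calls down to emails — instead of paging one person dozens of times.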
Since then, Iris has grown quickly to become LinkedIn’s primary incident escalation mechanism, handling hundreds of incidents a day.
Over the year and a half that we’ve been using Iris, we’ve onboarded many new services, and consequently have seen an explosion in the number of incidents it handles. However, with that growth comes the responsibility of guaranteeing reliable escalation for our users. Because of Iris’ unique place in LinkedIn’s monitoring and alerting stack, a single small misstep can result in a huge impact to the entire company. Absolute reliability is incredibly important, since so much of LinkedIn’s monitoring is routed through Iris.
Given this position, perhaps it is surprising that Iris itself has experienced only one major outage in its lifetime at LinkedIn. Though no system is perfect, Iris is remarkably reliable, and its stability is in large part due to one of our core design principles: keeping Iris simple. Iris actually has a very small number of moving parts; message delivery is abstracted away by Twilio and other messaging vendors, and alerting is controlled by outside triggers. This means that Iris only concerns itself with ensuring that incidents are acknowledged. Though it has additional quality-of-life features to make incident acknowledgement easier, at its heart, Iris is just a messenger. Limiting the scope of Iris to delivering reliable messages has allowed it to become a focused, elegant, resilient service that is a cornerstone of our alerting system today.
Culture of contribution
Another key influence on the internal development of Iris and Oncall at LinkedIn has been the contributions other teams have made to both projects. Much like an external open source project, the positive reputation of Iris and Oncall led many teams to extend them for new use cases. This creates a virtuous cycle: the projects become applicable to more users, and therefore attract more contributions as a result.
Future plans and development
In designing and developing Iris, we decided to build our own escalation system partially based on cost, but also based on the advantages of being able to tailor the system to our own specific use cases. In addition, we found the incident escalation and on-call management domains to be mostly unfilled in the open source community, and we’re happy to fill in the gaps by presenting Iris and Oncall.
By providing Iris and Oncall as open-sourced products, we can offer the community a production-ready escalation system that is free, open, and growing. We have lots of plans in store for these products, ranging from making Iris’ sender more reliable to improving UX in dealing with Oncall’s automatic scheduler. Code and documentation for these products can be found at https://iris.claims and https://oncall.tools, and we welcome any potential users or contributors. You can also reach our team with general questions about either project by emailing firstname.lastname@example.org. Iris’ and Oncall’s stories are just beginning, and we’re excited to see our products evolve and grow into a world-class incident management system.
Iris and Oncall were created from the hard work of Wen Cui, Saif Ebrahim, Joe Gillotti, Qingping Hou, Fellyn Silliman, Jessi Reel, and Daniel Wang. Thanks goes to Richard Waid and the entire Monitoring Infrastructure team at LinkedIn, as well as the SRE organization as a whole.