Open Sourcing iris-message-processor
August 30, 2023
One measure of a successful network is uptime - providing consistent, reliable service for members and customers. If there are frequent connection errors or downtime notifications, it becomes difficult to deliver an experience where people can connect and interact with ease. When faced with uptime challenges, being able to quickly escalate issues to network engineers helps ensure that people can work the way that they want to.
At LinkedIn, escalations encompass various events, including alerts, system change notifications, and automated actions that require an engineer's acknowledgment to proceed. These events follow a customizable escalation plan that generates notifications (often with increasing urgency) until an engineer claims the event or the needed steps are completed.
To manage our on-call escalations, we built Iris and Oncall, two open-source tools that we introduced to the community approximately six years ago. Oncall enables our teams to efficiently handle their on-call shifts through automated scheduling and a suite of calendar management tools. Iris leverages the data provided by Oncall to promptly alert on-call engineers in case of any issues and escalate matters if required. Developers have the ability to create personalized escalation plans and message templates, granting them control over who receives alerts and the specific content delivered in those alerts. Because of its ease of use and flexibility, Iris has also become LinkedIn's internal message delivery platform, sending out alerts, deployment notifications, security notices, and more.
Together, these tools deliver flexibility, customization, and simplicity in managing on-call escalations and can be used as low-cost replacements for off-the-shelf incident response platforms like PagerDuty. Currently, Iris and Oncall have more than 350 forks and 1,700 stars on GitHub.
In this post, we’ll discuss how we used Iris to both scale up (~2000%) and speed up (~86x) our incident management system. We’ll also share how this journey resulted in an incredibly robust and effective system that we open-sourced, allowing it to be easily and freely deployed by any company.
Growth of Iris within LinkedIn
When we first open sourced Iris, we published this blog post that included the following graph showing the month-by-month growth of Iris escalations at LinkedIn. At the time, adoption of Iris had steadily been growing and Iris was integrated into the major components of our alerting infrastructure.
Figure 1: Graph showing monthly Iris escalations 2015-2017
Six years later, that graph looks very different. Iris has become a ubiquitous service at LinkedIn, with a 700% increase in the number of other services directly integrated via its API. At the same time, the scope and complexity of LinkedIn’s engineering footprint have grown massively. As a result, the number of escalations Iris processes each month has grown by 2,300%.
Figure 2: Graph showing monthly Iris escalations in 2023
The difference is greater still when we look at messages sent by Iris. The role expansion of Iris as a generalized message sending API made its usage skyrocket. Iris currently sends, on average, over 700,000 messages daily with bursts of more than 3,000 messages per second!
Figure 3: Graph showing daily Iris messages June 2023
At this scale, we were starting to run into the limits of Iris’ original design, and we recognized that changes were needed to continue meeting our very strict reliability and performance standards.
Designing for scale
When it became clear that the existing design would not be able to continue serving our needs, we embarked on a project to re-architect Iris into a service that could scale across the next “10x” of growth at LinkedIn, minimizing the need for future redesigns. To do so, we first had to identify the most pressing issues with the current design.
One of the most crucial issues that we wanted to address was the way Iris-api handled message processing and escalations. Previously, Iris-api relied on an iris-sender Python subprocess running on a single leader node that would ingest all the escalations from the database. It would then evaluate and render each message one by one before passing it off to other senders in the cluster for sending. Because it processed escalations and rendered messages serially, any sufficiently large burst of escalations could cause massive delays, up to tens of minutes in the worst cases. Failures of the leader also had the potential to cause outsized impact: if the sender process got stuck, all escalation processing would grind to a halt.
Additionally, the iris-sender design relied on strong consistency in its database, which we achieved by using a Galera cluster. However, as the volume of escalations and especially messages grew dramatically, the demands started outpacing the capabilities of the system. The single-leader iris-sender subprocess used the database as a message queue, which was becoming increasingly untenable as the volume of messages grew, especially since both the Iris-api and Galera clusters were geographically distributed across three different data centers. This resulted in several issues, mainly Galera replication deadlocks that caused requests to fail intermittently.
To address these issues, we created a new service written in Go called iris-message-processor, a fully distributed replacement for the iris-sender subprocess. With this new service, Iris escalations are split into buckets that are dynamically assigned and reassigned to iris-message-processor nodes as nodes join or leave the cluster. Each node concurrently processes the escalations and messages assigned to it and sends them directly to the appropriate vendors. Instead of relying on a single sender leader, the iris-message-processor cluster can be scaled horizontally with virtually no limits to accommodate ever-expanding escalation or message volume. Additionally, because the database is no longer used as a message queue, the demands on the existing cluster are much lower, and a separate, easier-to-scale, eventually consistent database can be used to store the resulting messages for tracking.
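The post does not say how buckets are mapped to nodes, so the following is only a minimal sketch of one way such an assignment could work, using rendezvous (highest-random-weight) hashing. The bucket count, node names, and every function name here are assumptions made for illustration and are not taken from the iris-message-processor code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets is a hypothetical bucket count chosen for illustration.
const numBuckets = 8

// bucketFor maps an escalation ID to a bucket.
func bucketFor(escalationID uint64) int {
	return int(escalationID % numBuckets)
}

// score hashes a (node, bucket) pair. The node with the highest score
// for a bucket is treated as that bucket's owner.
func score(node string, bucket int) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s/%d", node, bucket)
	return h.Sum64()
}

// ownerOf returns the owner of a bucket among the live nodes,
// or "" when there are no nodes.
func ownerOf(bucket int, nodes []string) string {
	owner := ""
	var best uint64
	for _, n := range nodes {
		s := score(n, bucket)
		// Ties are broken by node name so the result is deterministic.
		if owner == "" || s > best || (s == best && n < owner) {
			owner, best = n, s
		}
	}
	return owner
}

// assign computes the full bucket-to-node mapping for the live nodes.
func assign(nodes []string) map[int]string {
	out := make(map[int]string, numBuckets)
	for b := 0; b < numBuckets; b++ {
		out[b] = ownerOf(b, nodes)
	}
	return out
}

func main() {
	before := assign([]string{"node-a", "node-b", "node-c"})
	after := assign([]string{"node-a", "node-c"}) // node-b has left the cluster
	fmt.Println("escalation 10 lands in bucket", bucketFor(10))
	for b := 0; b < numBuckets; b++ {
		fmt.Printf("bucket %d: %s -> %s\n", b, before[b], after[b])
	}
}
```

With this scheme, only the buckets owned by a departing node change hands; every other bucket keeps its owner, which keeps rebalancing work proportional to the size of the membership change.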
An additional bonus of a major rewrite is that it gave us the opportunity to tackle long-standing tech debt and add features that improve reliability and performance. These include a global per-mode rate limit to avoid being rate limited by vendors like Slack; per-application, per-mode round-robin distribution of message rate limits, so that we can accept large concurrent volumes of messages from one client without degrading another client’s experience; prioritization of critical alerting escalations over “out-of-band notifications”; better introspection into the escalation and message queues; and more.
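To make the round-robin idea concrete, here is a small sketch of a queue for a single mode that keeps one FIFO per application and serves the applications in turn. It illustrates the general technique only: the Message type, the RoundRobinQueue type, and the application names are hypothetical, and the real service may implement this differently.

```go
package main

import "fmt"

// Message is a hypothetical, simplified message record.
type Message struct {
	App  string
	Body string
}

// RoundRobinQueue holds one FIFO per application and serves them in turn.
type RoundRobinQueue struct {
	order  []string             // applications with pending messages, in service order
	queues map[string][]Message // pending messages per application
}

// NewRoundRobinQueue returns an empty queue.
func NewRoundRobinQueue() *RoundRobinQueue {
	return &RoundRobinQueue{queues: make(map[string][]Message)}
}

// Push appends a message to its application's FIFO. An application that
// had nothing pending joins the back of the service order.
func (q *RoundRobinQueue) Push(m Message) {
	if len(q.queues[m.App]) == 0 {
		q.order = append(q.order, m.App)
	}
	q.queues[m.App] = append(q.queues[m.App], m)
}

// Pop returns the next message, taking one message per application per turn.
// The second return value is false when the queue is empty.
func (q *RoundRobinQueue) Pop() (Message, bool) {
	if len(q.order) == 0 {
		return Message{}, false
	}
	app := q.order[0]
	q.order = q.order[1:]
	pending := q.queues[app]
	m := pending[0]
	if len(pending) > 1 {
		q.queues[app] = pending[1:]
		q.order = append(q.order, app) // still has work: back of the line
	} else {
		delete(q.queues, app)
	}
	return m, true
}

func main() {
	q := NewRoundRobinQueue()
	for i := 0; i < 3; i++ {
		q.Push(Message{App: "bulk-app", Body: fmt.Sprintf("bulk %d", i)})
	}
	q.Push(Message{App: "alerts", Body: "page on-call"})
	for m, ok := q.Pop(); ok; m, ok = q.Pop() {
		fmt.Println(m.App, m.Body)
	}
}
```

In this example the single message from "alerts" is served second, after one message from "bulk-app", rather than waiting behind the whole bulk burst. A global per-mode rate limit could then be layered on top by controlling how often Pop is called.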
Figure 4: Architecture diagram of the new Iris ecosystem
The performance gains we achieved with iris-message-processor were a great success. To test the effectiveness of our changes, we load tested Iris escalation processing before and after switching to iris-message-processor and measured the time between an escalation creation request being received and the delivery of that escalation’s messages to the appropriate vendors. We found that:
Under an average load (100 escalations per minute) iris-message-processor was ~4.6x faster than iris-sender
Under a high load (2,000 escalations per minute) iris-message-processor was ~86.6x faster than iris-sender
We also recreated a previous outage scenario where a burst of 6,000 simultaneous escalations created a massive slowdown in Iris escalation processing that took almost 30 minutes to clear as shown in Figure 5:
Figure 5: Graphs detailing old iris-sender processing ~6,000 escalation burst
When subjected to the same load, the iris-message-processor took less than 10 seconds to process all the escalations.
Figure 6: Graphs detailing new iris-message-processor processing ~6,000 escalation burst
We also stopped 50% of the iris-message-processor nodes simultaneously to test the automatic rebalancing of escalation buckets, and it worked well. The whole cluster rebalanced automatically in less than 30 seconds, and escalation time (the time it takes to process all currently active escalations) stayed under three seconds even under an above-average load.
Figure 7: Graph detailing iris-message-processor rebalancing its nodes after dropping 50% of member nodes
A system can be tested very thoroughly, but what ultimately matters is how it performs in a real-world production scenario. At the time of writing, we have been running iris-message-processor at LinkedIn for about a year with no outages, consistently beating SLOs of 1000ms/msg. In that time, we have encountered many cases that would have caused issues with the previous iris-sender, and so far the iris-message-processor has proven to be a worthwhile investment.
How to use iris-message-processor
Iris was and continues to be one of the many critical pieces of infrastructure at LinkedIn. Because of that, we wanted to make our changes as transparent as possible to anyone using or interfacing with Iris. We achieved that by keeping the existing Iris-api as is and having iris-message-processor act as a drop-in replacement for the iris-sender subprocess. By doing so, we can maintain the same UI and API interfaces while delivering performance gains under the hood.
Additionally, we took extra measures to ensure the stability of the platform during the rollout. These included preserving backwards compatibility with the iris-sender script, so that it was easy to toggle back to iris-sender in case of an outage, and designing the system with the capability to gradually ramp a percentage of messages and escalations for verification and testing purposes. At this point, the iris-message-processor has been thoroughly battle tested, so it is safe to move directly to processing all messages and escalations with it.
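A percentage ramp of this kind is often implemented by hashing a stable identifier and comparing the result against the configured percentage. The sketch below shows that general pattern only. The function name useNewProcessor and the rampPercent parameter are hypothetical and are not taken from the Iris configuration or code.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// useNewProcessor reports whether an escalation should be handled by the
// new processor for a ramp percentage between 0 and 100. Hashing the ID
// keeps the decision stable for a given escalation across calls.
func useNewProcessor(escalationID uint64, rampPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(strconv.FormatUint(escalationID, 10)))
	return h.Sum32()%100 < rampPercent
}

func main() {
	for _, pct := range []uint32{0, 10, 50, 100} {
		routed := 0
		for id := uint64(1); id <= 10000; id++ {
			if useNewProcessor(id, pct) {
				routed++
			}
		}
		fmt.Printf("ramp %d%%: %d of 10000 escalations routed to the new processor\n", pct, routed)
	}
}
```

Because the decision depends only on the ID and the percentage, raising the percentage only adds escalations to the new path and never moves one back, and setting it to 0 acts as the toggle back to the old sender.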
The iris-message-processor can be installed and run separately from Iris-api. With just a few simple configuration changes on the Iris-api side, the two services will talk to each other seamlessly. Full instructions and code can be found in the iris-message-processor repo.
We recently released iris-message-processor as open source software, joining the existing Iris and Oncall repos. These projects are meant to operate outside of LinkedIn’s internal environment and serve as complete replacements for existing off-the-shelf incident management systems in any environment.
We welcome any potential users or contributors. You can also reach our team with general questions about the project by opening an issue in the GitHub repository or emailing email@example.com.
The iris-message-processor project was made possible through the efforts of Diego Cepeda, Joe Gillotti, James Won, Colin Yang, and Fellyn Silliman. Additional thanks to Sam Moffatt, Kahnan Patel, and Michael Herstine for their invaluable support.