Iris mobile: An open source, mobile interface for incident management

Daniel Wang

Site Reliability Engineer at LinkedIn

May 9, 2019

At LinkedIn, our on-call incidents are managed using Iris and Oncall, two tools that we released as open source to the community about two years ago. Oncall allows our teams to manage their on-call shifts in a largely automated fashion, scheduling rotations without any human intervention. At the same time, it allows teams to be agile and adaptable when defining who is on-call by providing mechanisms to substitute, swap, and edit shifts when needed. Iris then uses the information provided by Oncall to reach out to on-call engineers when something goes wrong, and escalate issues if necessary. Users can define custom escalation plans and message templates, and have the power to control not only who is alerted, but also what content is delivered in their alerts.

These two tools offer flexibility, customizability, and simplicity in managing on-call incident escalation. However, when we open-sourced Iris and Oncall, we still had one big gap to be filled: a mobile interface. When engineers are away from the workplace and away from their desktop towers, it can be cumbersome and annoying to have to open a desktop site for on-call incident management. Today, we’re excited to announce that we’re closing that gap by releasing an open source mobile app for Iris, available for both iOS and Android devices. We’re glad to share this improved user experience with the broader community, and we hope that the new mobile interface will make more teams consider Iris and Oncall as their on-call escalation option.

A brief overview

The Iris mobile app focuses on incidents, providing users with a list of recent alerts in a way that’s analogous to browsing through email.

Tapping an incident opens a more detailed view of the alert. The content shown in this view is easily customizable via configured settings in Iris.

Here, we see alert details and graph images that are specific to the use cases we have at LinkedIn. Iris relies on user-defined templates to determine this layout. Layouts can and will differ for different alerting methods. This means that each application that raises incidents through Iris can define its own detailed layout to best display what is important to the end user.
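As a rough illustration of per-application layouts, the sketch below picks a template by application name and falls back to a default layout when none is defined. It is not Iris code: the registry, the application name, and the field names are hypothetical, and Python's `string.Template` stands in for the real templating engine.

```python
import html
from string import Template

# Hypothetical default layout, used when an application defines none.
DEFAULT_LAYOUT = Template("<h2>$title</h2><p>$summary</p>")

# Hypothetical registry mapping application name -> layout template.
APP_LAYOUTS = {
    "example-metrics-app": Template(
        "<h2>$title</h2><p>Host: $host</p><p>$summary</p>"
    ),
}


def render_incident(application: str, context: dict) -> str:
    """Render incident context with the application's layout, or the default."""
    layout = APP_LAYOUTS.get(application, DEFAULT_LAYOUT)
    # Escape values so incident data cannot inject markup into the page.
    escaped = {key: html.escape(str(value)) for key, value in context.items()}
    # safe_substitute leaves unknown placeholders untouched instead of raising.
    return layout.safe_substitute(escaped)
```

Keeping the lookup in one place means a new integration only has to register a template; everything else about rendering stays shared.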

Hybrid platforms

When developing this app, we evaluated a number of different approaches, such as native implementation and several different hybrid platforms. In the end, we settled on the Ionic platform, which provided a modern framework for development and allowed us to leverage our existing knowledge. To make this decision, we first needed to analyze the tradeoffs between native and hybrid app development. The Iris team is composed of only a few dedicated engineers, and none of us had much experience working with native mobile applications. With limited resources, we didn’t want to dedicate too much time to maintaining two separate applications. One of the biggest advantages of a hybrid approach is code reusability; by following this approach, we had the bonus capability of creating two apps from almost the same codebase.

Iris prides itself on its flexibility: it provides an interface to render custom templates for all applications that integrate with it. These templates allow applications to define the layout of incidents on the web interface, using Jinja and Handlebars. Since these services are already generating HTML, our template design naturally lends itself to a hybrid platform. This gives developers working on additional application integrations a more seamless experience, and enables them to write templates for mobile in a way that is very similar to writing templates for web browsers.

Having settled on a hybrid approach, we did some analysis on two of the largest players in this space: Ionic and React Native. These platforms have differing philosophies: React Native describes theirs as “learn once, write anywhere,” while Ionic prefers “write once, run anywhere.” One of the greatest advantages of React Native is that it allows its users to leverage existing React knowledge. Our team lacked this experience, however, meaning that we would need to ramp up not only on React Native, but also React. On the other hand, the team was already comfortable with the Ionic framework and other hybrid platforms like it, greatly reducing the learning curve required. Pairing that with our small development team, Ionic’s approach of “write once, run anywhere” was very appealing, and helped us get off the ground much faster.

One of the primary advantages of React Native is performance; a React Native app is able to communicate directly with platform-specific APIs, while most other hybrid approaches must work with the overhead of a webview. However, in our experience, we’ve found that the applications generated by Ionic have been limited primarily by the speed of the network and the resulting speed of the API calls that it makes, not the framework itself. Non-native feel is another possible drawback, but we’ve found that the components provided by Ionic do a good job of reproducing native design idioms, though these occasionally need platform-specific configuration.

Managing authentication

Iris relies on external authentication mechanisms to manage user accounts, usually an organization’s Active Directory. Exposing this internal system to the internet would create a large security risk, so we rely on single sign-on (SSO) providers for mobile app authentication. At LinkedIn, we leverage a third-party service for this purpose, setting up a SAML flow that can verify a user’s identity even off premises. This reduces our threat surface to an entity that is already thoroughly analyzed: Iris does not authenticate users’ identities itself, and instead leaves that job to another, more focused piece of our infrastructure. As a result, our users can authenticate away from the office in a way that upholds our high standards of security.

The Iris API is hosted on an internal network, so all communication with it is handled through Iris relay, our outward-facing proxy that handles third-party integrations requiring bi-directional communication. This is again tightly controlled, with a limited surface for attack. Within the app, we leverage the in-app browser to handle the SAML flow, which redirects users to our SSO provider’s sign-in page. After successful authentication with the SSO provider, we provide users with a session key used for authenticating with Iris relay.
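The article does not describe how the session key is used on each request. One common option, shown here purely as an assumption, is to sign every request with an HMAC derived from the key so that the key itself never travels over the wire. The message format, the function names, and the 30-second clock-skew limit below are all hypothetical.

```python
import base64
import hashlib
import hmac


def sign_request(session_key: bytes, method: str, path: str,
                 body: str, timestamp: int) -> str:
    """Return a URL-safe HMAC-SHA512 signature over the request details."""
    message = f"{timestamp} {method} {path} {body}".encode("utf-8")
    digest = hmac.new(session_key, message, hashlib.sha512).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")


def verify_request(session_key: bytes, method: str, path: str, body: str,
                   timestamp: int, signature: str, now: int,
                   max_skew: int = 30) -> bool:
    """Check a signature on the proxy side (hypothetical server logic)."""
    # Reject stale or future-dated requests to limit replay attacks.
    if abs(now - timestamp) > max_skew:
        return False
    expected = sign_request(session_key, method, path, body, timestamp)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

In a scheme like this, the app would send the timestamp and signature alongside each request, and the proxy would recompute the signature from the session key it issued after the SSO flow.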

Providing dynamic graph content off-network

Our internal metrics visualization tools provide “graph snapshots” as images, which we can pass along to the user to provide instant visual feedback corresponding to the issue at hand. To facilitate this, we pass along an image URL with our alerting incidents that we then include along with the other incident context. However, a problem arises here: the URL where the image is hosted is only available internally, requiring a pass through the Iris relay. In addition, users often find it useful to see not only the graph at the time of the alert, but also the graph at the current time.

In order to facilitate this dynamic graph content, we’ve built a custom component that can be included in templates, similar to how one might use an Ionic component. This graph block triggers a call to Iris relay to fetch graph images for both the original alert time and the current time. The logic of handling user input while switching tabs is delegated to the backend, and the template writer only needs to include the basic markup.
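To make the "two snapshots" idea concrete, here is a minimal sketch of how a client might build the pair of image requests, one window ending at the alert time and one ending now. The endpoint, the query parameter names, and the one-hour window are assumptions for illustration, not the actual Iris relay API.

```python
from urllib.parse import urlencode


def graph_snapshot_urls(base_url: str, graph_id: str, alert_time: int,
                        now: int, window_seconds: int = 3600) -> dict:
    """Build URLs for the graph at alert time and at the current time.

    Times are Unix timestamps in seconds; each window ends at the given time.
    """
    def url_for(end: int) -> str:
        query = urlencode({
            "graph": graph_id,              # hypothetical parameter names
            "start": end - window_seconds,
            "end": end,
        })
        return f"{base_url}?{query}"

    return {"original": url_for(alert_time), "current": url_for(now)}
```

A component built this way can switch between the two views without the template author writing any request logic.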

To display incident data, the Iris mobile app uses a templating scheme based on Handlebars, similar to the Iris UI. For example, this is the template we use for our internal alerting platform:

This results in the following sample user display:

In our app, we define custom CSS classes that allow users to display data in a human-readable format without needing to write CSS themselves. If needed, however, almost all aspects of the incident detail page can be controlled by users by just writing HTML. We also provide a sensible default incident page for applications that have not defined a custom layout. You’ll also notice the use of a “<graph-block>” tag, which isn’t found in standard HTML. This is our custom Ionic component and is the way we control graph input. Our templating engine uses the attributes of the graph-block pseudo-tag to determine how to get this information, and then displays it to our users.
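As a sketch of the attribute-driven approach, the snippet below scans rendered template HTML for `graph-block` tags and collects their attributes. The app itself is built on Ionic, so this Python version only illustrates the idea; the attribute names in the usage example (`src`, `label`) are hypothetical.

```python
from html.parser import HTMLParser


class GraphBlockFinder(HTMLParser):
    """Collect the attributes of every <graph-block> tag in a document."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "graph-block":
            self.blocks.append(dict(attrs))


def find_graph_blocks(rendered_html: str) -> list:
    """Return one attribute dict per graph-block tag, in document order."""
    finder = GraphBlockFinder()
    finder.feed(rendered_html)
    finder.close()
    return finder.blocks
```

For example, `find_graph_blocks('<graph-block src="https://graphs.example.com/cpu.png" label="CPU"></graph-block>')` yields a single dictionary holding the two attributes, which a renderer could then use to decide what to fetch.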

While this is currently the only custom component available in our templating, it’s possible that other use cases requiring more dynamic interactions will arise. We hope to use a similar strategy in these cases: maintaining user templating and customizability without sacrificing power.

What’s next?

This app focuses primarily on providing a mobile frontend for incident escalations via Iris. While such escalations represent the largest part of the on-call experience for most engineers, we’re still missing a mobile experience for on-call shift management and scheduling. Adding an interface to Oncall, in addition to Iris, will be the next step. Beyond that, we’re always looking for ways to improve our on-call experience. We hope that this app represents a solid starting point for our mobile experience, and we’re excited to see how it moves forward and evolves.

Contributing

The GitHub repository for this app can be found at https://github.com/linkedin/iris-mobile. Contributions are welcome and encouraged. Please direct any questions or feedback to iris-oncall@linkedin.com, and we’ll be happy to help however we can.

Acknowledgements

The Iris mobile app was made possible through the efforts of Stephen Collier, Saif Ebrahim, and Diego Cepeda. Additional thanks to Kyle Johnson and the Monitoring Infrastructure team at LinkedIn.

Topics: Developer Experience/Productivity, Open Source, Product Design