LinkedOut: A Request-Level Failure Injection Framework

Logan Rosen

Sr. Staff Software Engineer @ LinkedIn | Observability

May 24, 2018

LinkedIn has made significant investments in resilience engineering over the past few years. As Site Reliability Engineers (SREs), we've consistently witnessed the effects of Murphy's Law: "Anything that can go wrong, will go wrong." In a complex, distributed technology stack, it's important to understand the points where things can go wrong in your system and also know how these failures might manifest themselves to end users.

In late 2017, we wrote about the Waterbear project, an SRE-led effort to help developers hit resiliency problems head-on by both replicating system failures and adjusting frameworks to handle failures gracefully and transparently. That post gave a brief overview of LinkedOut, our request-level failure injection framework; this post dives deeper into the components of that project and the insights we gained from it.

High-level overview

There are many ways to inject failures into a distributed system, but the most fine-grained way to do it is at the request level. People depend on LinkedIn to find opportunities, and we don't want to impede this by causing member or guest impact while conducting our testing.

This need for controlled experimentation was the impetus for creating LinkedOut, which, at its core, is a request filter in our Rest.li stack (the open-source portion can be found in the r2-disruptor and restli-disruptor modules). It's currently able to create three types of failures:

Error: The Rest.li framework has several default exceptions thrown when there are communication or data issues with the requested resource. We throw a DisruptException within the filter to mock unavailability of the resource, which bubbles up as a RestliResponseException. You can set an amount of latency to inject before throwing the exception.
Delay: You can pass in an amount of latency, and the filter will delay the request for that much time before passing it on downstream.
Timeout: The filter waits for the timeout period specified in the D2 Zookeeper configuration for that endpoint and then throws a TimeoutException.
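To make the three failure modes concrete, here is a minimal Python sketch of a filter that applies them before handing a request downstream. It is not the Rest.li disruptor code (which is Java); the `Disruption` record, the `disrupt_filter` signature, and the injectable `sleep` parameter are all assumptions made for illustration, and Python's built-in `TimeoutError` stands in for the framework's TimeoutException.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class DisruptException(Exception):
    """Stand-in for the exception thrown to mock an unavailable resource."""


@dataclass
class Disruption:
    kind: str            # "error", "delay", or "timeout"
    latency_ms: int = 0  # latency to inject (used by error and delay)


def disrupt_filter(
    request: dict,
    next_handler: Callable[[dict], dict],
    disruption: Optional[Disruption],
    endpoint_timeout_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Apply an optional disruption, then pass the request downstream."""
    if disruption is None:
        return next_handler(request)
    if disruption.kind == "error":
        # Optional latency first, then fail without calling downstream.
        sleep(disruption.latency_ms / 1000)
        raise DisruptException("injected error")
    if disruption.kind == "delay":
        # Hold the request, then let it continue normally.
        sleep(disruption.latency_ms / 1000)
        return next_handler(request)
    if disruption.kind == "timeout":
        # Wait out the endpoint's configured timeout, then fail.
        sleep(endpoint_timeout_ms / 1000)
        raise TimeoutError("injected timeout")
    raise ValueError(f"unknown disruption kind: {disruption.kind}")
```

Passing `sleep` as a parameter is only a convenience for this sketch: it lets a caller observe the injected waits without actually blocking.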

Working through these errors during feature development gives engineers the opportunity to validate that their code is robust. Testing in production, however, gives external parties both confidence in that robustness and evidence of it.

We have two primary mechanisms to invoke the disruptor while limiting impact to the member experience. One of these is LiX, our framework for A/B testing and feature gating at LinkedIn. It allows us to target failures on multiple levels, from an individual request for a single member to a percentage of all members for an entire downstream cluster. This was the first triggering mechanism added to LinkedOut, and it allows engineers to set up resilience tests that target specific segments of traffic.

More recently, though, another mechanism was added to inject failures into requests, via the invocation context (IC). The IC is a LinkedIn-specific, internal component of the Rest.li framework that allows keys and values to be passed into requests and propagated to all of the services involved in handling them. We built a new schema for disruption data that can be passed down through the IC, so that failures take effect immediately for that request.
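The idea of context-carried disruption data can be sketched roughly as follows. The IC and its disruption schema are internal, so everything here is hypothetical: the `disrupt` key, the `resource` and `type` fields, and the helper names are assumptions chosen only to show one context travelling with a request and each hop checking it.

```python
import json
from typing import Optional

# Hypothetical key under which disruption data travels in the context.
DISRUPT_KEY = "disrupt"


def make_context(disruptions: list) -> dict:
    """Build a context carrying JSON-encoded disruption data."""
    return {DISRUPT_KEY: json.dumps(disruptions)}


def disruption_for(context: dict, resource: str) -> Optional[dict]:
    """Return the disruption targeting `resource`, if the context has one."""
    raw = context.get(DISRUPT_KEY)
    if raw is None:
        return None
    for entry in json.loads(raw):
        if entry["resource"] == resource:
            return entry
    return None


def call_chain(context: dict, resources: list) -> list:
    """Walk a chain of services, passing the same context to each hop."""
    outcomes = []
    for resource in resources:
        hit = disruption_for(context, resource)
        outcomes.append((resource, hit["type"] if hit else "ok"))
    return outcomes
```

Because the same context reaches every hop, a failure can be aimed at a service several calls deep without touching the services in between.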

The LiX and IC methods are handy ways to trigger failures, but how do we get actual people inside the company to use them? The answer is building easy-to-use user interfaces that make it simple for anyone at LinkedIn to test the resiliency of their services.

Web application

Developed using our internal Flyer (Flask + Ember) framework and designed using the Art Deco patterns, our LinkedOut web application makes it easy to perform failure tests at a larger scale. It provides two main modes of operation: automated testing and feature targeting-based ramping.

Automated testing
Single pages on LinkedIn can depend on several downstreams in order to return the proper data to the member. The velocity of code changes at LinkedIn can change both services' dependency graphs and their ability to handle downstream failures, so we knew we needed a way to allow for automated failure testing. However, there were several questions we had to ask ourselves in designing this feature:

Which user will be making the requests? This is especially important when considering access to paid features on LinkedIn (such as Sales Navigator), if we want to be able to failure test everything.
How do we run these failure tests at scale? Due to the number of downstream services involved in a given LinkedIn page, testing one at a time would take hours.
How do we determine success in an automated failure test? We have several frontends at LinkedIn where a 200 response code doesn't necessarily denote total success, so we needed a different way to determine if we're gracefully degrading.
What is the most effective way to convey automated test results to the user? Users probably would be overwhelmed by raw failure data for every endpoint involved in a request, so we needed a better way to present it.

These questions, and the corresponding answers, led us to our current implementation of automated failure testing. We created a service account (not associated with a real member) and gave it access to all of our products. This way, we could be confident that engineers could run tests on almost every part of the LinkedIn experience.

As for running at scale, we devised a two-fold solution. We first needed to scale the automated testing across our LinkedOut web application hosts, for which we leveraged the Celery distributed task queue framework for Python. Using a Redis broker, we're able to create tasks for testing each downstream (based on call tree data) and then distribute them evenly across the workers on our hosts.
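The fan-out pattern described above (one task per downstream, spread across workers) can be illustrated with a small sketch. It deliberately uses the standard library's `ThreadPoolExecutor` as a stand-in for Celery workers and a Redis broker, so it runs anywhere; the call tree shape and the function names are assumptions, not the real data format.

```python
from concurrent.futures import ThreadPoolExecutor


def downstreams_from_call_tree(tree: dict) -> list:
    """Collect every distinct downstream named in a nested call tree."""
    found = set()
    stack = list(tree.get("calls", []))
    while stack:
        node = stack.pop()
        found.add(node["service"])
        stack.extend(node.get("calls", []))
    return sorted(found)


def check_downstream(page_url: str, downstream: str) -> dict:
    """Placeholder for one failure test; a real task would drive a browser."""
    return {"url": page_url, "downstream": downstream}


def run_failure_tests(page_url: str, tree: dict, workers: int = 4) -> list:
    """Create one task per downstream and spread them over a worker pool."""
    targets = downstreams_from_call_tree(tree)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: check_downstream(page_url, d), targets))
```

A thread pool only spreads work within one process; a distributed task queue is what allows the same per-downstream tasks to be spread across multiple hosts.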

For the actual testing of the pages, we leverage an internal framework at LinkedIn that allows for Selenium testing at scale. You can use a traditional Selenium WebDriver and point it at this framework's control host, and it'll run your commands on a remote host running your desired browser. We send commands to inject the disruption info into the invocation context via a cookie (which only functions on our internal network), authenticate the user, and then load the URL defined in the test.

We considered a few ways to determine success after injecting failures (user-contributed DOM elements to look for, etc.), but, for our first iteration, we decided to simply provide default matchers for "oops" pages and blank pages. If the page loaded by Selenium matched one of these default patterns, we would consider the page to not have gracefully degraded. We definitely want to make this more extensible in the future, so that users can define how their pages should look when they load successfully.
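A default matcher of this kind might look roughly like the sketch below, which treats a page as failed when it has no visible text or when its text matches an error pattern. The pattern itself is a guess for illustration; the real matchers are not described in detail here, and a production check would need the actual signature of the "oops" page.

```python
import re
from html.parser import HTMLParser

_SKIPPED = ("script", "style", "title")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script, style, and title contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


# Hypothetical pattern; a real "oops" page would need its own signature.
OOPS_PATTERN = re.compile(r"\boops\b|something went wrong", re.IGNORECASE)


def visible_text(html: str) -> str:
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def degraded_gracefully(html: str) -> bool:
    """False if the page is blank or matches the default error pattern."""
    text = visible_text(html)
    if not text:
        return False
    return OOPS_PATTERN.search(text) is None
```

Matching on rendered text rather than the status code is what lets this kind of check catch a page that returns 200 but shows nothing useful.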

Finally, we needed an effective way to present these test results to our users. We figured that some users would like to see the firehose of data (every failure for every endpoint), but others would want a simpler view of regressions and new failures for defined tests. So we made both:

The individual automated test report

The automated test delta report
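One plausible way to compute such a delta is sketched below. It assumes each run is a mapping from downstream name to whether the page degraded gracefully, and it assumes a particular reading of the two categories: a "regression" is a downstream that passed before and fails now, and a "new failure" is a failing downstream that the previous run did not cover. Both the data shape and those definitions are assumptions.

```python
def delta_report(previous: dict, current: dict) -> dict:
    """Compare two runs mapping downstream name -> True if the page passed."""
    regressions, new_failures, fixed = [], [], []
    for downstream, passed in current.items():
        if downstream not in previous:
            # Not covered last time; only report it if it fails.
            if not passed:
                new_failures.append(downstream)
        elif previous[downstream] and not passed:
            regressions.append(downstream)
        elif not previous[downstream] and passed:
            fixed.append(downstream)
    return {
        "regressions": sorted(regressions),
        "new_failures": sorted(new_failures),
        "fixed": sorted(fixed),
    }
```

Downstreams whose result is unchanged between runs are left out entirely, which is what keeps the delta view shorter than the full report.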

Lessons learned
As it turns out, service accounts don't always mirror real members' experiences on LinkedIn. We learned this in an amusing way: an SRE created a test to check for graceful degradation on the Profile Views page. The results were astonishing: every single downstream failure resulted in a test failure, meaning that the page returned an error.

But logging in as the test user revealed the problem: because this test user had no connections on LinkedIn, and nobody was visiting its profile, the Profile Views page returned an error, even without any failures injected. We proceeded to give it data (by viewing the test user's profile), but it brought the issue to light that test users aren't always great representations of what people really see on LinkedIn.

Our plan to avoid this in the future is to allow users of LinkedOut to provide their own test users (prepopulated with the data they want them to have). This will make sure that people performing automated failure testing can do so across all user scenarios.

Feature targeting-based ramping
While the automated testing provides a basic framework for checking whether or not pages serve content to members while encountering downstream failures, there's also a need for ramping these tests to more than just a service account.

Users can define failure plans, which can include one or more failures for endpoints that a service calls, and then they can target people in these plans. But we wanted to make sure that people weren't being targeted in these failure plans without opting into this testing.

So, we built an opt-in mechanism that involves agreeing to an end-user license agreement (EULA). By signing into the LinkedOut web application and registering their member IDs, LinkedIn employees are accepting that they may be opted into failure testing. This way, our CEO doesn't wake up to a malfunctioning feed on LinkedIn and assume that the site is on fire, when in reality he had just been subjected to a failure experiment on LinkedOut.

The mechanism of triggering failures via feature targeting is really simple due to the maturity and power of our LiX experimentation framework here at LinkedIn. We create a targeting experiment based on the failure parameters the user chooses. Once the experiment is activated, the disruption filter picks up the change, via a LiX client, and fails the corresponding requests.

Using LiX also allows us to easily terminate failure plans gone wrong via the LinkedOut web application. We include a big "terminate" button with all the tests, and it tells LiX to turn off the experiment. Within minutes, the originally requested failures stop happening in production.
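The lifecycle of a failure plan (opt-in targeting, activation, termination) can be modelled with a small sketch. This is not the LiX API, which is internal; the `FailurePlan` class and its fields are hypothetical, and termination here is instantaneous, whereas the real experiment takes minutes to wind down.

```python
from dataclasses import dataclass, field


@dataclass
class FailurePlan:
    """Hypothetical plan: failures for endpoints plus targeted members."""
    failures: list  # e.g. [{"endpoint": "feed", "type": "error"}]
    targets: set = field(default_factory=set)
    active: bool = False

    def add_targets(self, member_ids, opted_in: set) -> list:
        """Target only members who have opted in; return the ones rejected."""
        rejected = sorted(m for m in member_ids if m not in opted_in)
        self.targets.update(m for m in member_ids if m in opted_in)
        return rejected

    def activate(self) -> None:
        self.active = True

    def terminate(self) -> None:
        """Stand-in for the "terminate" button: stop all failures."""
        self.active = False

    def failures_for(self, member_id: int, endpoint: str) -> list:
        """What a filter would apply for this member and endpoint."""
        if not self.active or member_id not in self.targets:
            return []
        return [f for f in self.failures if f["endpoint"] == endpoint]
```

Keeping the active flag separate from the target list means termination is a single switch, independent of how many members or endpoints the plan covers.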

Chrome extension

The IC injection mechanism opened the door for quick, one-off testing in the browser by injecting IC disruption data via a cookie. But we took a reasonable guess that nobody would want to construct cookies on their own, adhering to our JSON schema, just to run failure experiments. This led us to a crossroads: do we build a bookmarklet or a browser extension?

Bookmarklets are really popular internally at LinkedIn, but the advent of Content Security Policy (CSP) has caused many people to declare that bookmarklets are dead. Browser extensions are self-contained and don't require that you inject content into the existing page, so we decided to start off with a Chrome extension (which we can later port to Firefox and other browsers).

From the start, we wanted to build a UI that requires minimal effort on the part of the user to apply failures to services involved in their requests. So we developed a simple flow:

Click a button to discover all of the services involved in the request.
Select the services for which you want to inject failures.
Click a button to refresh the page with those failures injected.

To discover the downstreams, we leverage a service within LinkedIn called "Call Tree," which consumes Kafka events produced by services when they handle requests and builds call trees that show all the steps involved. Call Tree allows you to set a grouping key as a cookie with your requests, which links together all the call trees it discovered for a given request.
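The role of the grouping key can be shown with a short sketch that links call events sharing a key and lists the services each group touched. The event fields (`grouping_key`, `caller`, `callee`) are invented for illustration; the real Call Tree events and their Kafka schema are internal.

```python
from collections import defaultdict


def downstreams_by_group(events: list) -> dict:
    """Group call events by grouping key and list the services each touched."""
    grouped = defaultdict(set)
    for event in events:
        key = event.get("grouping_key")
        if key is None:
            continue  # requests sent without the cookie are not linked
        grouped[key].add(event["callee"])
    return {key: sorted(services) for key, services in grouped.items()}
```

For example, events carrying the key "abc" for calls from a frontend to "feed" and from "feed" to "ads" would yield `{"abc": ["ads", "feed"]}`, which is the list a UI could then present for selection.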

Consequently, we designed the Chrome extension to: refresh the page with the Call Tree grouping key cookie set, discover the downstreams involved, and then display them in the UI. We used the Vue.js framework to implement it, due to its flexibility and simplicity. It looks like this:

There can be several services involved in a request at LinkedIn—in fact, there can be hundreds. So we added a search box that lets the user quickly filter to the endpoints/services that they care about. And, due to the granularity in the disruption filter, users are able to, for a given endpoint, inject a failure for only a specific Rest.li method.

Once the user selects failure modes for all applicable resources, the extension creates a disruption JSON blob for these failures, sets a cookie to inject it into the IC, and then refreshes the page with the failure applied. It's a seamless experience with little work required on the part of the user.
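The step from UI selections to a cookie value might be sketched as follows. The extension itself is written in JavaScript with Vue.js, so this Python version only illustrates the data flow; the cookie name, the field names, and the selection format are all assumptions, since the actual JSON schema is not given here.

```python
import json
from urllib.parse import quote, unquote

# Hypothetical cookie name; the real one is internal.
COOKIE_NAME = "disrupt"


def build_disruption_blob(selections: list) -> str:
    """Turn UI selections into a compact JSON string, dropping "none" choices."""
    entries = []
    for sel in selections:
        if sel["mode"] == "none":
            continue
        entry = {"resource": sel["resource"], "type": sel["mode"]}
        if sel.get("method"):  # restrict the failure to one method, e.g. "GET"
            entry["method"] = sel["method"]
        if sel.get("latency_ms"):
            entry["latencyMs"] = sel["latency_ms"]
        entries.append(entry)
    return json.dumps(entries, separators=(",", ":"), sort_keys=True)


def cookie_header(blob: str) -> str:
    """Percent-encode the blob so it is safe to use as a cookie value."""
    return f"{COOKIE_NAME}={quote(blob, safe='')}"


def blob_from_cookie(header: str) -> str:
    """Reverse of cookie_header: recover the JSON blob from the cookie."""
    return unquote(header.split("=", 1)[1])
```

Percent-encoding matters because raw JSON contains commas, quotes, and spaces, none of which are valid in an unquoted cookie value.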

Lessons learned
Building the Chrome extension was not without its challenges. Although pilot users praised the functionality and design of the UI, they had a significant pain point—if the extension lost focus (and the popup went away), reopening the UI would just put it back in its default state, with no information about the endpoints the page calls or the failures that are applied.

This was understandably a huge inconvenience to users, and we needed a solution quickly. The first thought was to use Chrome's local storage to save state. But it's hampered by limited storage space and slow performance, so we ended up implementing a custom messaging solution between the Chrome extension and a background page, so we could store the state in memory.

Conclusion

We're really proud of the LinkedOut project, and without the dedicated efforts of several engineers and the support from our management, it wouldn't be where it is today. It's an extremely powerful platform that allows for all sorts of failure testing, and it can only get better and more extensible from here.

Acknowledgments

Pranav Bhatt, Bhaskaran Devaraj, David Hoa, Xiao Li, Maksym Revutskyi, Logan Rosen, Sean Sheng, Ted Strzalkowski, Nishan Weragama, Brian Wilcox, Jen Wohlner, Benson Wu, Jonathan Yip, and Xu Zhang.
