Resilience Engineering at LinkedIn with Project Waterbear

Bhaskaran Devaraj

Founder @ AI Stealth Startup

November 10, 2017

Coauthors: Bhaskaran Devaraj and Xiao Li

Over the last several years, many companies have discussed ways to improve the resiliency of their services and infrastructure. Many projects, like Netflix’s Simian Army, have spawned open source projects that have been adopted by other companies. Other discussions about resilience engineering focus on cultural and process-driven approaches to making teams more aware of, and responsive to, potential problems before they happen.

At LinkedIn, SREs have been working cross-functionally with service owners and their teams on a project we dubbed “Waterbear” (after the nickname for the tardigrade, a notoriously resistant creature that can survive, among other places, in the vacuum of space). Think of Waterbear as providing “application resilience” as a service; SRE teams own the domain and the problem. As SREs, we measure, analyze, and provide best practices to help improve the resilience of each application for the application owners and engineering teams.

What follows is an outline of the problems Waterbear addresses, our approach to addressing those problems, and an overview of the systems and processes we implemented. The goal of this post is to share our lessons learned with the broader SRE community and contribute to the discussion about resilience engineering approaches in the broader technical community.

Waterbear: A holistic resilience engineering approach

For our internal “hierarchy of needs,” having a site up and secure is the foundation on which all other services at LinkedIn must be built. From multiproduct development to the API-driven approach of our internal microservices, the other levels of our hierarchy pyramid enable us to do amazing things with feature mobility, rapidly scaling out new services, etc. But as LinkedIn has scaled and grown, this increased complexity has also brought with it more interdependency, an increasing number of points of failure, and new failure domains. As SREs, we understand that with complexity, failure is inevitable. One of the things we work to do is to limit those failures.

At the outset, Waterbear had several design goals:

Making sure we were running on a resilient cluster of resources
Creating or maintaining robust infrastructure for our services
Handling failures intelligently
Gracefully degrading when required
Increasing SRE happiness by designing self-healing systems

While this seems like an extensive list of requirements, we felt that anything less would fall short of our ambition to build a resilient system, not just for LinkedIn’s present but also for the company’s future. In short, we felt that it was the only way to scale.

After discussing the changes that were feasible to make to achieve these goals, we were able to categorize these requirements into three large “buckets” that encapsulated broad requirements for the project:

Chaos engineering: Projects that would directly demand increased resilience from our applications and infrastructure.
Cultural changes: This included increasing transparency for problematic services and designing our applications and infrastructure to take advantage of the concept of “graceful degradation” in the event of service failure.
Rest.li improvement: Rest.li is the open source REST framework created and used at LinkedIn. Our project necessitated changing the default behavior of the framework itself, namely by providing the value for important settings of the framework and introducing resilience-focused features to ensure that our planned graceful degradation functionality could be built into everything we do.

Below, we’ll discuss these aspects of the projects in detail.

Chaos engineering

Application failure (LinkedOut)

LinkedOut is a framework and tooling to test how user experience will degrade in different failure scenarios associated with downstream calls. It provides a seamless way to simulate failures across our application stack with minimal effort.

LinkedOut is implemented within the Rest.li client, which gives most applications at LinkedIn this functionality "for free," if they desire it. Enabling LinkedOut for a given service is as simple as bumping a dependency and enabling a configuration value. From there, any LinkedIn employee can use our Ember web application or Chrome extension to trigger failures throughout the stack. We also use LiX targeting (described below) to ensure that downstream members are not affected by failure testing.

When creating a failure injection tool, we wanted to make it as granular as possible in order to minimize the blast radius of failure testing. We leveraged the LiX A/B testing and experimentation framework, which is the most prevalent and efficient treatment targeting system already in use at LinkedIn, to give us this control over where and for whom failure will occur.

With LiX targeting, we can narrow a disruption to something as specific as a single user making a specific request with a specific Rest.li method. We've also implemented three failure modes that can be enabled for a given request: timeout, exception, and delay, all of which are tunable to the user's needs.

Furthermore, with the ability to target failures to specific requests, we encourage developers to run these tests in production to provide an accurate picture of what will happen during downstream failure scenarios. The targeting gives us confidence that, even if a user is testing failures for all downstream services at once, there will be no impact on LinkedIn members.

Our latest home page depends on more than 550 different endpoints in its dependency tree. It is very difficult for developers to ensure expected “graceful” degradation on the home page for every failure scenario involving this many endpoints. With LinkedOut, we implemented a “disruptor” filter in the Rest.li client. In this filter, we inspect the LiX context for member ID and treatment. The member ID will decide if this request should be disrupted, and the LiX context will determine which failure mechanism we will introduce. To lower the learning curve, we also created a Chrome extension, which will provide all dependency downstream information for any page in the LinkedIn.com domain and will allow users to click and select specific downstreams to simulate failure by passing in the LiX context to the request.
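To make the disruptor idea concrete, here is a minimal Python sketch of a client-side filter that decides, per request, whether to inject one of the three failure modes. It is an illustration only, not the Rest.li implementation: the `Treatment` class, the `disrupt` function, and all field names are hypothetical stand-ins for the LiX context and the Rest.li filter chain.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Treatment:
    """Hypothetical stand-in for a LiX treatment that targets one downstream call."""
    member_id: int            # only this member's requests are disrupted
    resource: str             # downstream resource to disrupt, e.g. "profiles"
    method: str               # request method to disrupt, e.g. "GET"
    mode: str                 # "timeout", "exception", or "delay"
    delay_seconds: float = 0.0


class DisruptedError(Exception):
    """Raised when the filter injects an exception instead of calling downstream."""


def disrupt(call: Callable[[], T], member_id: int, resource: str, method: str,
            treatment: Optional[Treatment],
            sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, injecting a failure only if the treatment targets this request."""
    targeted = (
        treatment is not None
        and treatment.member_id == member_id
        and treatment.resource == resource
        and treatment.method == method
    )
    if not targeted:
        return call()  # every other member and request is untouched

    if treatment.mode == "exception":
        raise DisruptedError(f"injected exception for {method} {resource}")
    if treatment.mode == "timeout":
        raise TimeoutError(f"injected timeout for {method} {resource}")
    if treatment.mode == "delay":
        sleep(treatment.delay_seconds)  # slow the call down, then let it proceed
        return call()
    raise ValueError(f"unknown failure mode: {treatment.mode}")
```

Because the check matches on member, resource, and method together, a tester can fail every downstream call for their own account while requests from all other members take the normal path.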

A few interesting things about LinkedOut include the use of LiX as the targeting system for treatments and the decision to adopt client-side failure logic in Rest.li.

LiX treatment targeting: Using a well-known and internally popular system, instead of using other methods like Zookeeper or invocation context (IC) headers, lowers the cost of onboarding and avoids system duplication and adding additional complexity to the LinkedIn stack.
Client-side failure logic: This approach increases the adaptability and scalability of the implementation. Integrating the testing framework is as simple as bumping up your Rest.li client version and enabling the testing mode via a config change.

Infrastructure failure induction (FireDrill)

FireDrill provides an automated, systematic way to trigger/simulate infrastructure failure in production, with the goal of helping build applications that are resistant to these failures.

For example, let’s look at a short list of how we classify various failure states:

Host half failure: DNS pollution, time out of sync, disk write failure, high network latency
Host offline: host power failure, network disconnected
Rack fail: whole rack offline
Data center failure: data center lost data link

We started this project by creating host-level failures. The types of failures we chose to simulate initially were:

Network failure
Disk failure
CPU/Memory failure

We use modules written in SaltStack to simulate these failures. Our goal with FireDrill is to move to an automated setup where we can simulate these failures constantly in production. Once we are comfortable simulating host-level failures, we intend to use FireDrill in the next phase to create power, switch, and rack failures in our data centers.

Cultural changes

Graceful degradation

The term “graceful degradation” is almost self-explanatory. One part of this concept is that when a non-core dependency experiences failure (for example, an ad failing to load on a page), the page will still load with reduced information, which affects the user experience far less than sending the member to an error/“wiper” page would.

The process for thinking through a graceful degradation scenario follows a specific set of steps that includes SREs working directly with service owners and other teams:

Identify “core flows” that should NOT fail
Identify and reduce “core dependencies”
Creatively come up with ways to “gracefully degrade”
Determine best practices and continuous checking in production

In the event of a core dependency failure, there are several scenarios that could be activated to ensure a better user experience:

Trade with time: Assuming a core dependency in another datacenter or PoP didn’t fail, make a cross-colo call to get the dependency info.
Trade with complexity: Look into alternative sources for similar data or use cached data to mitigate the impact.
Trade with feature/relevance/content: For example, if complex search isn’t available, degrade to simple search. If personalization info isn’t available, degrade to a default experience within that feature for everyone.
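The three trade-offs above can be read as an ordered fallback chain. The following Python sketch shows one way to express that chain; the function `fetch_with_degradation`, the strategy names, and the simulated failures are all hypothetical and are not taken from LinkedIn's code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def fetch_with_degradation(
    strategies: Sequence[Tuple[str, Callable[[], T]]],
    default: T,
) -> Tuple[str, T]:
    """Try each strategy in order; fall back to a default experience if all fail."""
    for name, fetch in strategies:
        try:
            return name, fetch()
        except Exception:
            continue  # this source failed, so degrade to the next one
    return "default", default


def failing(reason: str) -> Callable[[], list]:
    """Build a fetcher that simulates an unavailable dependency."""
    def fetch() -> list:
        raise ConnectionError(reason)
    return fetch


strategies = [
    ("local", failing("dependency down in this data center")),
    ("cross-colo", failing("remote data center unreachable")),  # trade with time
    ("cache", lambda: ["cached recommendation"]),                # trade with complexity
]
# Trade with feature/relevance/content: an empty, non-personalized default.
source, data = fetch_with_degradation(strategies, default=[])
print(source, data)  # prints: cache ['cached recommendation']
```

Returning the name of the source alongside the data lets the caller record how often each level of degradation is actually used in production.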

Implementation notes

Cultural changes are stereotypically among the hardest changes to implement in large engineering organizations. To roll out these changes, we took a measured approach that showcased successes with pilot services to secure buy-in from the other teams.

We also implemented a full-blown communications strategy to advocate for more teams to get involved in the Waterbear project. This included:

Roadshows to other eng teams and orgs
Internal tech talks and training videos
Adding resilience-focused sessions in new hire onboarding training (coming soon)
Designing competitive games/challenges to encourage people to discover/fix the resilience issues in their product (coming soon)
A blog post about Waterbear (the one you are reading!)

Finally, we are also considering whether to implement a “Waterbear certified” status for services, which would be a more formal version of the recognition we give out for projects rated as "A+" by service scorecard (SSC), an internal benchmarking tool we use to gamify operational excellence.

Rest.li client changes

Degrader tuning

Almost all LinkedIn inter-service communication is conducted over the Rest.li protocol. If we can make the protocol more failure-tolerant, it will help us cover a lot of common resilience oversights.

The Rest.li framework is very powerful, but it makes users turn a lot of knobs to get it working properly. As a result, most service owners just use the default settings. However, this can be suboptimal, because different services have dramatically different profiles (QPS, latency, etc.).

We noticed that user-facing latency/error spikes were being caused by single-node failures in relatively large clusters. In theory, the Rest.li degrader should act as an intelligent load balancer in this case and quickly take the problematic node out of rotation. In reality, because the degrader algorithm takes the default settings as input, it waits for 10 seconds before marking a request as “degraded,” which makes it far too lenient and slow in punishing a badly behaving node for services that take thousands of QPS and expect less than 10ms latency; for these services, it almost never acts properly. As a result, the single-node failure bubbles up as user-facing errors.
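A toy model shows why a 10-second threshold almost never fires for a low-latency service. The Python sketch below is a deliberately simplified degrader, not the actual Rest.li algorithm: the function `next_weight`, the halving penalty, and the sample latencies are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Sequence


def next_weight(weight: float, latencies_ms: Sequence[float],
                threshold_ms: float, penalty: float = 0.5) -> float:
    """Cut a node's traffic weight when its mean latency exceeds the threshold.

    Simplified model: a node over the threshold loses half its weight each
    round; a node under the threshold recovers, capped at full weight (1.0).
    """
    if mean(latencies_ms) > threshold_ms:
        return weight * penalty
    return min(1.0, weight / penalty)


# A sick node in a service that normally answers in under 10 ms.
sick_node_latencies_ms = [800.0, 950.0, 1200.0]

default_weight = 1.0  # degrader using a 10-second (10,000 ms) threshold
tuned_weight = 1.0    # degrader using a threshold tuned to this service
for _ in range(3):
    default_weight = next_weight(default_weight, sick_node_latencies_ms, 10_000.0)
    tuned_weight = next_weight(tuned_weight, sick_node_latencies_ms, 50.0)

print(default_weight)  # prints: 1.0   (node is never punished)
print(tuned_weight)    # prints: 0.125 (node is quickly drained)
```

Under the 10-second threshold, a node that is roughly 100 times slower than expected still looks healthy, so it keeps receiving its full share of traffic.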

D2Tuner is a project we’ve worked on that recommends D2 settings based on each service’s historical data. Ultimately, we want D2Tuner to allow every service owner to tune their service degrader properly, which will allow the Rest.li framework to take care of most single-node turbulence in the production environment and prevent it from bubbling up to users.
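As a rough illustration of recommending a setting from historical data, the Python sketch below derives a latency threshold from a high percentile of past latencies plus some headroom. This is not how D2Tuner works internally, which this post does not describe; the function names, the 99th percentile, the 3x headroom, and the 5 ms floor are all assumptions.

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def recommend_threshold_ms(history_ms: Sequence[float], pct: int = 99,
                           headroom: float = 3.0, floor_ms: float = 5.0) -> float:
    """Suggest a degrader latency threshold from a service's own latency history."""
    return max(floor_ms, headroom * percentile(history_ms, pct))


# Hypothetical latency history (ms) for a fast, high-QPS service.
history_ms = [4.0, 5.0, 6.0, 5.0, 7.0, 6.0, 5.0, 8.0, 6.0, 5.0]
print(recommend_threshold_ms(history_ms))  # prints: 24.0
```

For this sample service, the suggested threshold is 24 ms rather than 10 seconds, so a node that starts answering in hundreds of milliseconds would be flagged almost immediately.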

Lessons learned

Reflecting on this project, we found two lessons that we think other organizations could benefit from.

Values and culture matter

There are many factors that go into problem-solving, but one that does not get discussed enough is the role a company’s values play in the process. LinkedIn Engineering is a values-driven organization, and we feel that these values can contribute positively to the way a company approaches a problem like resilience engineering. When we discussed our “hierarchy of needs” earlier in this post, we were actually talking about “craftsmanship,” a LinkedIn Engineering core company value.

At a company level, one of our core values is the concept of “members first.” Serving 500 million members in more than 200 countries leads to a huge potential impact for any site outage. This influenced our design thinking, because we needed to begin with the idea of minimizing the impact that any failures had on a member’s experience.

Every team can contribute to resilience engineering efforts

Finally, another common refrain in the LinkedIn SRE organization is, “attack the problem, not the person” (a reflection of our company values, “act like an owner” and “relationships matter”). Knowing that an individual service owner won’t be shamed or verbally attacked when a problem occurs creates a much more positive environment for problem-solving. This also makes it easier for the SRE organization to influence the decision-making process throughout the engineering teams in order to make changes to shared infrastructure (Rest.li, etc.) and get the rest of the company to think about resilience as part of the product design process. In many ways, Waterbear may not have been possible with a different company culture.

One additional pleasant surprise for us has been the general reaction from leadership and our development partners. We were expecting hard pushback, because introducing failures in production and requesting resilience features can conflict with product delivery timelines. However, the development and leadership teams have been very supportive. As long as we have scientific approaches to validate our hypotheses against failure scenarios, the ability to limit the blast radius of failure, the capability to derive clear action items to improve system resilience, and proper tooling/systems that make running such tests extremely easy, every team at LinkedIn can contribute to resilience engineering efforts.

Acknowledgments

Our team would like to thank several individuals who provided invaluable help during the design, testing, and implementation of Waterbear. Specifically, we’d like to thank Anurag Bhatt, David Hoa, Anil Mallapur, Nina Mushiana, Ben Nied, Jaroslaw Odzga, Maksym Revutskyi, Logan Rosen, Sean Sheng, Ted Strzalkowski, Nishan Weragama, Brian Wilcox, Benson Wu, Jonathan Yip, and Xu Zhang.
