Automating Your Oncall: Open Sourcing Fossor and Ascii Etch

Steven Callister

Staff Site Reliability Engineer at DoorDash

December 14, 2017

One of our sayings in Site Reliability Engineering (SRE) is that the goal of your job is to “automate yourself out of the job.” While some may have concerns about being replaced by robots, SREs see the value of automating work. It opens up time, removes tedious or repetitive tasks from a workflow, and allows our engineers to spend their valuable time on more complex issues. Used properly, automation opens the door for us to do more thorough investigations of site issues. And of course, if you’re the oncall engineer when a service breaks at 3 a.m. (as I have been many times before), the ability to automate aspects of diagnosing and repairing the issue is very welcome.

Today, we are pleased to announce that we have open sourced two new tools: Fossor and Ascii Etch. Fossor is a plugin-oriented Python tool and library for automating the investigation of broken hosts and services. Ascii Etch is a Python library that takes streams of numbers and turns them into visual graphs using ASCII characters; it was originally created to help display output from Fossor. This post covers the challenges that led us to create these tools and shows how Fossor can be tailored to specific uses through plugins.

Background

One of the most powerful aspects of automation is harnessing a computer’s ability to perform tasks in parallel and thereby parse through vast amounts of data quickly. A typical site issue investigation requires performing a sequence of investigative steps, such as the 10 useful commands listed in this Netflix engineering blog post. However, manually running these commands and keeping track of their output takes valuable time, especially when dealing with increased latency or a full outage. Having experienced the pain of performing the same repetitive steps again and again during my own oncall shifts, I concluded that a tool that performs some of these basic checks in parallel would reduce the mean time to resolution. Taking the idea even further, I wanted a tool that could perform checks tailored specifically to my services while still having the flexibility to incorporate newly developed checks in the future. Fossor was created to do just that.

Fossor architecture and design

In Latin, the word fossor means “grave digger” or “one who digs,” which fits well with Fossor’s purpose of helping users to dig into server or application issues. From its initial conception, an important feature in Fossor’s design was to allow others to easily expand its abilities by adding their own checks through the use of plugins. To ensure optimum performance, even with a potentially large plugin library, Fossor was designed with several key features.

First, to mitigate the problem of having too much output that could potentially obscure key data, Fossor only reports information to the user when it is deemed helpful, as defined by each plugin. This tailored output allows for easy access to reported information. The incorporation of Ascii Etch in certain plugins also allows for a graphical output of data, making the reports easier to read.

Second, to help curb the introduction of performance- or application-breaking bugs into the Fossor tool, Fossor separates its code into two parts: the engine, and the plugins. The engine is responsible for coordinating plugin execution. It collects the plugins and then carefully runs each one in its own process. By isolating each plugin in its own process, the main engine is protected from a single plugin failing and crashing the application. This plugin resiliency was specifically built in to allow Fossor to safely manage plugins from many contributors, thereby creating a platform for the bridging of expertise among users.
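To make the isolation idea concrete, here is a minimal sketch of an engine loop that starts every plugin in its own process and keeps going when one of them dies. This is not Fossor's engine code: `run_isolated`, `GoodPlugin`, and `CrashingPlugin` are hypothetical names, and the sketch assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp
import os


class GoodPlugin:
    """Hypothetical plugin that finds something worth reporting."""
    def run(self, variables):
        return "load average is high"


class CrashingPlugin:
    """Hypothetical plugin that takes its whole process down."""
    def run(self, variables):
        os._exit(1)  # simulate a hard crash, not just an exception


def _child(plugin_cls, variables, conn):
    # Runs inside the child process: execute the plugin, send back its output.
    conn.send(plugin_cls().run(variables))
    conn.close()


def run_isolated(plugin_classes, variables, timeout=5.0):
    """Run each plugin in its own process; return (outputs, failed_names)."""
    ctx = mp.get_context("fork")  # assumption: POSIX-only start method
    jobs = []
    for cls in plugin_classes:
        reader, writer = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=_child, args=(cls, variables, writer))
        proc.start()
        writer.close()  # the parent only reads from this pipe
        jobs.append((cls.__name__, proc, reader))

    outputs, failed = {}, []
    for name, proc, reader in jobs:
        output = None
        try:
            if reader.poll(timeout):
                output = reader.recv()
        except EOFError:
            pass  # the child exited without sending anything
        proc.join(timeout)
        if proc.is_alive():  # plugin hung: stop waiting for it
            proc.terminate()
            proc.join()
        if proc.exitcode != 0:
            failed.append(name)
        if output:
            outputs[name] = output
    return outputs, failed
```

Calling `run_isolated([GoodPlugin, CrashingPlugin], {})` still returns the output of `GoodPlugin`; the crash only costs the one child process. All processes are started before any result is read, so the plugins also run in parallel.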

Plugin anatomy

Plugins are small classes that must implement a single method: the run method. If the run method returns output, this indicates the output is “interesting” and should be reported back to the user. Below are two examples of plugins. The run method accepts a single argument, a Python dict named “variables,” used to optionally provide external information to the plugin. All plugin types use this same basic structure.

Example of a Check plugin

Example of a Variable plugin

The Variable and Check plugin examples are nearly identical. Both examples implement the run method with a single argument, variables. Both examples return a sample string if the variable “Debug” exists and is True. The only difference between these two plugins is the class that each inherits from. ExampleCheck inherits from the Check class, and ExampleVariable inherits from the Variable class.
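The structure described here can be sketched in a few lines. The `Plugin`, `Check`, and `Variable` base classes below are local stand-ins defined only so the snippet runs on its own; in a real plugin they would be imported from Fossor, and the returned strings are placeholders.

```python
class Plugin:
    """Stand-in base class (assumption; a real plugin imports Fossor's)."""
    def run(self, variables):
        raise NotImplementedError


class Check(Plugin):
    """Stand-in for the base class of check plugins."""


class Variable(Plugin):
    """Stand-in for the base class of variable plugins."""


class ExampleCheck(Check):
    def run(self, variables):
        # Returning a value marks the result as "interesting".
        # Falling through returns None, so nothing is reported.
        if variables.get("Debug"):
            return "This is an example check"


class ExampleVariable(Variable):
    def run(self, variables):
        # Same shape as the check; only the parent class differs.
        if variables.get("Debug"):
            return "This is an example variable"
```

With `{"Debug": True}` both plugins return their string; with an empty dict both return `None` and stay silent.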

Plugin types

There are three types of plugins that Fossor supports: variable gathering plugins, check plugins, and report plugins. The Fossor engine executes plugins in the flow shown below.

Variable gathering plugins
Variable plugins are an optional way to gather and share information that several check plugins may depend on. The advantage of using these plugins is that shared information only needs to be gathered a single time. If a variable plugin returns an object, that object is stored under a variable named after the plugin’s class.

Just like check plugins, variable plugins can depend on variables as well. The engine will continue executing variable plugins that have yet to return results until no new variables have been found. This means it is possible to chain dependent variable discovery, as is shown below. In this case, the user provided the product when Fossor was run. The pid variable plugin does not return a result unless it has a product. The LogFiles variable requires a pid, so on the third iteration, this plugin returns a result.
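The chained discovery described above can be sketched as a loop that keeps re-running unresolved variable plugins until a full pass finds nothing new. This is an illustration of the idea, not the engine's code: `gather_variables` is a hypothetical helper, and the pid value and log path are made up.

```python
class Pid:
    """Hypothetical variable plugin: needs a product before it can answer."""
    def run(self, variables):
        if "product" in variables:
            return 4242  # made-up pid for the sketch


class LogFiles:
    """Hypothetical variable plugin: needs the Pid variable first."""
    def run(self, variables):
        if "Pid" in variables:
            product, pid = variables["product"], variables["Pid"]
            return [f"/var/log/{product}/{pid}.log"]  # made-up path


def gather_variables(plugin_classes, variables):
    """Re-run plugins without a result until a pass discovers nothing new."""
    variables = dict(variables)  # do not mutate the caller's dict
    pending = list(plugin_classes)
    found_new = True
    while pending and found_new:
        found_new = False
        for cls in list(pending):
            result = cls().run(variables)
            if result is not None:
                # Stored under the same name as the plugin class.
                variables[cls.__name__] = result
                pending.remove(cls)
                found_new = True
    return variables
```

Even when `LogFiles` is listed before `Pid`, `gather_variables([LogFiles, Pid], {"product": "example-service"})` resolves both: the first pass finds only the pid, and the second pass lets `LogFiles` use it. If the user supplies no product, neither plugin returns anything and the loop stops after one pass.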

Check plugins
Check plugins perform a simple investigative action. If something “interesting” is found, they output this information as a string. Unlike variable plugins, each check plugin is run a single time. Check plugins can be generic or application-specific. Here are some examples of both generic plugins and plugins that we use for specific applications/platforms at LinkedIn:

Generic plugins	Application/platform-specific LinkedIn plugins
Memory fragmentation	Recent deployments to this host or application
Recent kernel messages (dmesg)	Is this application a canary (beta) release?
High load averages	Do any of this application’s downstream services have latency?
Error patterns in the logs	Does this application have any non-standard downstreams?
High memory usage	Does this application/host have any outstanding alerts currently firing?
Network errors
High disk usage

Here are some example check plugin results.

Downstream Latency plugin (LinkedIn-specific)

This plugin polls LinkedIn’s service metrics to check each downstream service for latency. If the latency appears abnormal, the plugin prints an ASCII graph back to the user using the Ascii Etch library.

BuddyInfo Memory Fragmentation plugin

Since this plugin is generic and not LinkedIn-specific, it is available in Fossor by default. The BuddyInfo plugin reads the kernel’s free-page statistics (/proc/buddyinfo on Linux), and if they show signs of memory fragmentation, it outputs this information to the user.
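For readers unfamiliar with this kind of check, here is a rough sketch of how fragmentation can be spotted in the text of Linux's /proc/buddyinfo, where each row lists the number of free blocks per order for one memory zone. The heuristic and the `min_order` threshold are assumptions chosen for illustration; they are not taken from Fossor's BuddyInfo plugin.

```python
def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, one per order."""
    zones = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node":
            continue
        node = fields[1].rstrip(",")
        zones[(node, fields[3])] = [int(n) for n in fields[4:]]
    return zones


def check_fragmentation(text, min_order=4):
    """Return a report string if any zone looks fragmented, else None.

    Assumed heuristic: a zone is suspicious when it still has free small
    blocks but no free block of order >= min_order.
    """
    suspicious = []
    for (node, zone), counts in parse_buddyinfo(text).items():
        small, large = counts[:min_order], counts[min_order:]
        if sum(small) > 0 and sum(large) == 0:
            suspicious.append(f"node {node} zone {zone}")
    if suspicious:
        return "Possible memory fragmentation: " + ", ".join(suspicious)
    return None
```

A plugin built on this would read /proc/buddyinfo in its run method and return the string from `check_fragmentation`, so that healthy hosts produce no output at all.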

Report plugins
Report plugins take the output from the check plugins and display a report back to the user. The default report displays a table back to the user on the command line. The Fossor engine streams the output from each check plugin asynchronously to the report plugin. This means that the user can immediately begin reading/analyzing the output from faster-running plugins while the longer-running plugins continue to gather and analyze data.
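The streaming behavior can be illustrated with a short sketch in which results are handed to the report as soon as each check finishes. Threads are used here only to keep the example small, whereas Fossor itself runs plugins in separate processes; `stream_results`, `print_report`, and the three check functions are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fast_check(variables):
    return "disk usage is at 95%"


def slow_check(variables):
    time.sleep(0.3)  # pretend to query a slow metrics backend
    return "downstream latency looks abnormal"


def quiet_check(variables):
    return None  # nothing interesting, so nothing is reported


def stream_results(checks, variables):
    """Yield (name, output) as each check finishes, skipping empty output."""
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {pool.submit(fn, variables): name for name, fn in checks.items()}
        for future in as_completed(futures):
            output = future.result()
            if output:
                yield futures[future], output


def print_report(results):
    """Minimal report: print each result the moment it arrives."""
    for name, output in results:
        print(f"[{name}] {output}")
```

Passing `stream_results({"slow": slow_check, "fast": fast_check, "quiet": quiet_check}, {})` to `print_report` prints the fast result first, even though the slow check was submitted earlier.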

Plugin discovery

When the Fossor engine is initialized, it checks two locations for plugins. The first is the Fossor module. The second is the disk location /opt/fossor. The engine scans both these locations recursively, adding all plugins that properly inherit from a Fossor plugin class, such as a variable, check, or report. The disk location can be overridden using the --plugin-dir flag on the command line.

Ascii Etch

Ascii Etch was originally created to help display meaningful latency graphs back to users on the command line in Fossor. We’ve found that this is more helpful than simple text for quickly identifying anomalies in data. The original downstream latency plugin for Fossor displayed latency average, minimum, and maximum. While these are useful stats, a quick graph shows much more clearly whether there is actually latency downstream. Now that Ascii Etch is open sourced, any project wishing to visualize data in this manner can do so.

Using Ascii Etch is quite simple. Here is an example that graphs a list of integers:

Ascii Etch also supports vertical value scaling, as well as horizontal value compression. This means that if we give it 1,000 values of varying height, it will still fit the graph to the desired width of 50 and height of 10.
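To show what scaling and compression mean in practice, here is a small independent sketch. It is not Ascii Etch's implementation or API: `compress`, `scale`, and `render` are hypothetical helpers, the averaging strategy is an assumption, and the output is a plain scatter of `*` characters.

```python
def compress(values, max_width):
    """Horizontal compression: average neighbours into max_width buckets."""
    n = len(values)
    if n <= max_width:
        return list(values)
    buckets = []
    for i in range(max_width):
        chunk = values[i * n // max_width:(i + 1) * n // max_width]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


def scale(values, max_height):
    """Vertical scaling: map values linearly onto rows 0..max_height-1."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)  # flat input sits on the bottom row
    return [round((v - low) / (high - low) * (max_height - 1)) for v in values]


def render(values, max_width=50, max_height=10):
    """Return a multi-line string with one '*' per (compressed) value."""
    if not values:
        return ""
    rows = scale(compress(values, max_width), max_height)
    lines = []
    for level in range(max_height - 1, -1, -1):  # draw the top row first
        line = "".join("*" if row == level else " " for row in rows)
        lines.append(line.rstrip())
    return "\n".join(lines)
```

With this sketch, `render(list(range(1000)), max_width=50, max_height=10)` produces ten lines that are at most 50 characters wide, because the 1,000 inputs are first averaged into 50 buckets and then mapped onto 10 rows.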

Here are some additional examples of randomly generated lists being rendered in Ascii Etch using the above code:

Future plans and developments

Since being introduced at LinkedIn, Fossor has helped to speed up the process of identifying the cause of application issues by performing investigative checks in parallel and reporting back only the identified useful information. This has helped streamline the debugging process through a single command. We have also found that Fossor’s usefulness extends into gathering information that is helpful for the user who is oncall. Examples of this include identifying application tags, ownership of services, and listing an application’s non-standard downstreams.

The advantage of Fossor’s plugin-based approach is that it can be incredibly specific through the creation of distinct plugins, yet also vast in its library of contributions. Since Fossor becomes more useful with each additional plugin, we hope the open source community finds value in using this automation tool and continues to contribute to its budding library of investigative checks.

Acknowledgements

Fossor and Ascii Etch were conceived and written by Steven Callister. Many members of the LinkedIn Data SRE team have participated in the brainstorming and review of these tools. The following individuals have also helped expand Fossor’s functionality by contributing plugins internally at LinkedIn: Brent McKee, Cameron Berkenpas, Christopher Walker, Eric Manuel, Greg Banks, Jesse Ward, Jamie Luck, Joel Gomez, Matt Knecht, and Nick Brown. The Fossor logo was created by Néna Riley.
