A Deep Dive into Simoorg
Our Open Source Failure Induction Framework
March 28, 2016
Failure induction is a form of non-functional testing in which a set of failures is induced against a perfectly healthy service. The process is critical because it provides valuable insight into how the service behaves under unexpected failure scenarios. Data gathered during failure induction can be used to make improvements that avoid similar issues entirely, or that significantly cut troubleshooting time when they do occur under production traffic. We evaluated the existing open source solutions, but they did not meet our requirements, mainly due to a lack of customization and extensibility.
We wanted a system that would allow us to explore multiple custom failure scenarios while being flexible enough to work across several different platforms. Other well-known failure induction frameworks like Netflix's Chaos Monkey are usually designed for a specific hosting environment and were found to be less adaptable than what we needed.
Knowing this, we developed Simoorg as our failure induction framework. The main motivation behind creating Simoorg was to have a powerful yet simple and extensible framework that could be used with a variety of applications. Simoorg takes its name from a creature in Persian mythology much like the phoenix; we cheekily chose it because our framework causes clusters to fail, only to rise again from the ashes. The entire framework is written in Python and is publicly available as an open source package on GitHub.
A failure is any undesirable scenario that can impact the performance of a given service. Failure induction can be considered similar to runtime fault injection, but unlike runtime injection it is not limited to a single software system and is not meant to be used only as part of the software development cycle, even though failure induction frameworks can serve that purpose as well.
The main motivation behind the process is to observe how a healthy cluster responds to a particular failure, or set of failures. This observation tells us many things about the service: the amount of traffic it can handle without performance impact, how it behaves in a low-latency environment versus a high-latency one, the kinds of exceptions raised when memory is corrupted, and the duration of and changes during a full GC, among others. This knowledge can be used both for designing improvements and for troubleshooting effectively when issues arise in real life.
Failure induction can also be considered a means of resilience testing. We can create failures that force one or more hosts to shut down ungracefully and then restart the application. Failure induction is also useful for other non-functional work: performance benchmarking, resource optimization, determining whether the application can support a given use case, stress testing, scalability testing, and operational tasks such as threshold determination and monitoring.
The process of failure induction generally involves three steps:
- Step 1: Introducing the failure. (An optional health check can be performed beforehand to verify the state of the cluster.)
- Step 2: Observing the cluster in the impacted state. (The time for which the cluster is in the impacted state can vary based on the tool/framework used.)
- Step 3: Reverting the cluster back to healthy state.
The methods of failure introduction and reversion can differ based on the tool or framework used, as well as on the cluster being tested.
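The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the cycle, not Simoorg's implementation; the callables stand in for cluster-specific scripts.

```python
import time

def run_failure_cycle(induce, revert, impact_duration_sec, health_check=None):
    """One induce/observe/revert cycle. `induce`, `revert`, and
    `health_check` are hypothetical stand-ins for cluster-specific
    scripts."""
    # Optional pre-flight gate: don't induce against an unhealthy cluster.
    if health_check is not None and not health_check():
        return "skipped"
    induce()                         # Step 1: introduce the failure
    time.sleep(impact_duration_sec)  # Step 2: observe the impacted state
    revert()                         # Step 3: revert to a healthy state
    return "completed"
```

How long the cluster stays in the impacted state (Step 2) is exactly the knob that varies between tools, as noted above.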
Several other frameworks and tools, like Chaos Monkey, are currently available, but Simoorg differentiates itself from the rest with several key features:
- Custom Failures: Users can develop their own failure scenarios and reversion scripts specific to the cluster at hand, like IO failures, traffic surge, graceful restart, and data corruption. Simoorg also provides several basic failure scripts such as ungraceful shutdown and graceful restart.
- Scheduling: Failures can be scheduled to go off in a particular order at particular times (deterministic scheduling), in a completely random manner within a given time frame (nondeterministic scheduling), or in a combination of the two. Furthermore, the time between failure induction and reversion can also be configured as needed.
- Configurable constraints: Users can set constraints such as the number of hosts affected at a time, the minimum gap between failures, and the maximum duration of a failure. Each tested service will have a separate config file associated with it.
- HealthCheck: The framework can perform a health check (optional) on the cluster before each failure is introduced.
- Comprehensive logging: The framework logs its activity in detail, from failure induction and reversion down to the workings of individual components.
- OS independent: Since it is written in Python, the framework runs on a variety of operating systems.
- Pluggable architecture: The framework has a pluggable architecture in the sense that most of the important components can be customized to support service-specific requirements. These configurable components (plugins, discussed later on) also have default classes provided along with the Simoorg package.
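To make the constraint settings concrete, here is what a fragment of a per-service config might look like, expressed as a Python dict. The field names are invented for this example and are not the exact Simoorg schema:

```python
# Illustrative config fragment only; real Simoorg fatebooks are
# per-service config files whose exact keys may differ.
fatebook = {
    "service": "example-service",        # hypothetical service name
    "constraints": {
        "max_impacted_hosts": 2,         # hosts affected at any one time
        "min_gap_between_failures": 300, # seconds between failures
        "max_failure_duration": 600,     # seconds a failure may persist
    },
    "failures": [
        {"name": "graceful_restart",    "induce": "restart.sh", "revert": "noop.sh"},
        {"name": "ungraceful_shutdown", "induce": "kill9.sh",   "revert": "start.sh"},
    ],
}
```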
High level architecture
Here are the architectural components of Simoorg:
- Moirai
- Atropos
- Topology
- Scheduler
- Handler
- Logger
- Journal
- HealthCheck
- API Server
Next, we'll explain each component in more depth.
Moirai
Named after the Moirai, the three Fates of Greek mythology, this component (also known as the root observer) is the central single-threaded process that monitors all the activities of the framework. Moirai also works as the entry point from which the API server obtains details about the testing process. For each tested service, Moirai spawns an independent Atropos process; each Atropos process is responsible for an individual service.
Atropos
Atropos (also known as the observer) is an individual process spawned by Moirai to take care of all activities related to a given service; it is responsible only for that service. The name also comes from Greek mythology: Atropos is the oldest of the three Fates, the unturnable. Each Atropos has its own instances of the Logger, Topology, Scheduler, HealthCheck, Journal, and Handler.
Upon initialization, Atropos reads the fatebook of its service. The fatebook is a config file containing the details of all the failures to be induced, as well as their reversion methods. It can also specify constraints, such as the maximum number of simultaneous failures and the gap between failures. Atropos sleeps until the conditions for inducing a failure on the cluster are met; once they are, the failure is introduced using the Handler (see below). The conditions to be met are:
- Dispatch time for the particular failure has arrived.
- The number of failures (or affected hosts) is less than specified in the fatebook.
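The two conditions reduce to a simple predicate. The function below is an illustrative sketch of that check, not Simoorg's actual internals:

```python
def ready_to_dispatch(now, trigger_time, active_failures, max_failures):
    """Return True when a failure may be handed to the Handler.
    Signature and names are illustrative, not Simoorg's real code."""
    time_arrived = now >= trigger_time            # dispatch time has arrived
    under_limit = active_failures < max_failures  # fatebook constraint holds
    return time_arrived and under_limit
```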
The Atropos communicates with Moirai using two standard Python queues:
- The service information queue gives details like the service name, the topology, and the plan that is being followed for the testing.
- The event queue captures the status of each of the failures and reversions performed on the target cluster.
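A minimal sketch of that hand-off using multiprocessing queues follows. The payload shapes and names here are invented for illustration; only the two-queue split comes from the design described above.

```python
from multiprocessing import Queue

# Two queues, mirroring the design above: one for service metadata,
# one for per-failure events (payload fields are illustrative).
service_info_queue = Queue()
event_queue = Queue()

def atropos_report(service_info_q, event_q):
    # On start-up, an Atropos publishes what it is testing...
    service_info_q.put({"service": "example-service",
                        "topology": ["host1", "host2"],
                        "plan": {"graceful_restart": 1700000000}})
    # ...and later reports the status of each induction/reversion.
    event_q.put({"failure": "graceful_restart",
                 "event": "induced", "host": "host1", "ok": True})

atropos_report(service_info_queue, event_queue)
```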
Topology
The Topology component allows users to configure the hosts on which failures can be induced. The set of hosts can be restricted to a particular subset of the actual cluster, the whole cluster, or a combination of hosts running different instances of the service.
A default Static Topology class is available with the Simoorg open source package; it accepts a list of hosts on which failures can be induced and provides a function that returns one random host from that list at a time.
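The static topology idea fits in a few lines. This is a sketch with an invented class and method name, not the packaged implementation:

```python
import random

class StaticTopology:
    """Sketch of a static topology plugin: a fixed host list plus a
    method returning one random host per call (names illustrative)."""
    def __init__(self, hosts):
        self.hosts = list(hosts)

    def get_random_node(self):
        return random.choice(self.hosts)
```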
Scheduler
The Scheduler component generates the plan of failure induction. In essence, a plan is the order in which failures will be induced and reverted: it specifies the time each failure is to be induced and the time it is to be reverted. As mentioned before, the plan can be deterministic or nondeterministic, depending on the end user's needs.
A plan is implemented as a simple Python dictionary whose keys are failure names and whose values are trigger times. The Handler takes a failure name and executes the induction script associated with it in the fatebook.
The default scheduler present in the Simoorg package is nondeterministic.
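A nondeterministic plan of the shape described above (failure name mapped to trigger time) can be generated like this. The function name and window parameters are invented for the example:

```python
import random

def build_nondeterministic_plan(failure_names, window_start, window_end):
    """Sketch of nondeterministic scheduling: each failure gets a
    random trigger time inside the allowed window, producing the
    {failure_name: trigger_time} dict described above."""
    return {name: random.uniform(window_start, window_end)
            for name in failure_names}
```

A deterministic scheduler would instead read fixed trigger times from the config rather than drawing them at random.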
Handler
The Handler is the unit responsible for implementing the failures as well as reverting them. Each failure definition should have a handler associated with it. The observer (Atropos) uses the Handler to execute the failure scripts according to the schedule.
The most commonly used handler is the SSH handler, which runs actions on target hosts as shell scripts over SSH; it is the default in the Simoorg package. Other handlers can be written for Salt, AWS, and Rackspace, among others.
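One way to run a shell script on a remote host over SSH is to stream it to a remote shell via the system ssh client. This is a sketch under the assumption of key-based auth, with an invented signature, not the real Simoorg handler interface; `ssh_cmd` is injectable so another transport could be substituted.

```python
import subprocess

def run_remote_script(host, script, ssh_cmd=("ssh",)):
    """Feed a shell script to `sh -s` on `host` over ssh.
    Illustrative sketch only; assumes passwordless key-based auth.
    With ssh_cmd=() and host=None, the script runs locally instead."""
    cmd = list(ssh_cmd) + ([host] if host else []) + ["sh", "-s"]
    result = subprocess.run(cmd, input=script.encode(), capture_output=True)
    return result.returncode == 0, result.stdout.decode()
```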
Logger
The Logger, as the name suggests, records the workings of the observer. It can be used to monitor the activities of the observer as a whole, as well as of its individual components. As is standard, there are multiple logging levels, ranging from [INFO] to [VERBOSE], which can be specified in the configs.
Journal
The Journal keeps track of the failures induced and reverted in the cluster and ensures that the number of affected nodes does not exceed the specified constraint. At the Journal level, failure induction and reversion are treated as separate “events.”
Beyond protecting the target cluster, the Journal also stores the state of the observer and helps with recovery in case of an observer crash. This feature is not fully supported in the current release available on GitHub; it will be completed in a coming revision.
HealthCheck
The HealthCheck component ensures the stability of the cluster and its individual nodes before each failure is induced. Custom HealthCheck scripts can be added for the particular target application. The HealthCheck must return a success value before induction; otherwise the Scheduler skips the current failure cycle. This ensures that we do not aggravate any existing issues and that the cluster can go through its self-healing routines and recover. It also allows us to start testing from a known state. If no HealthCheck is defined, failures are induced as scheduled, on the assumption that the cluster was able to recover.
API Server
The API server acts as a front end for obtaining details about the individual observer units. It runs as a separate process that talks to Moirai through Linux FIFOs (named pipes). It can provide all the details of the currently active observer units, such as the services on which failure induction is being carried out and the individual hosts currently affected. However, it does not provide historical data: if the original Simoorg instance crashes, we can no longer fetch any data from the API.
Pluggable components
Some components of the architecture are completely customizable as needed. In Simoorg there are four such components:
- Topology
- Scheduler
- Handler
- HealthCheck
These four components can be modified to be compatible with the environment and to leverage its available functionality to the fullest extent. This is done by creating new classes that implement a set of basic functions for each component. In other words, each of these components has a set of base classes and interfaces that can be inherited and implemented for full extensibility.
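The base-class pattern looks roughly like this. Both class names and the `check` method are invented for illustration; Simoorg's actual base classes and interfaces may differ.

```python
import socket

class HealthCheckBase:
    """Sketch of a pluggable-component contract: the base class
    declares what a plugin must provide (names illustrative)."""
    def check(self, host):
        raise NotImplementedError

class PortProbeHealthCheck(HealthCheckBase):
    """Hypothetical custom plugin: a host is considered healthy if
    its service port accepts TCP connections."""
    def __init__(self, port, timeout=1.0):
        self.port = port
        self.timeout = timeout

    def check(self, host):
        try:
            with socket.create_connection((host, self.port), self.timeout):
                return True
        except OSError:
            return False
```

The framework only needs the contract (`check` here); everything environment-specific lives in the subclass.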
Getting Started with Simoorg
If this overview has piqued your interest, please visit the GitHub repo to learn more. There you will find everything you need to get started with Simoorg, including further implementation details, detailed documentation of the Simoorg API, and (of course) the full source code for this project.
I would like to thank the Simoorg development team: Mayuresh Gharat, Tofig Suleymanov, and Sarath Sreedharan, especially Tofig, for providing good documentation of the design and for taking the time to give a proper walkthrough. Finally, this project would not have been possible without support from the LinkedIn Online Infrastructure leadership: Hardik Kheskani, Kevin Krawez, and Sriram Subramanian.