Open Sourcing Dr. Elephant

Self-Serve Performance Tuning for Hadoop and Spark

April 8, 2016

We are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows. We first presented Dr. Elephant to the community last year during the eighth annual Hadoop Summit, a leading conference for the Apache Hadoop community.

Our Motivation

Hadoop is a framework for the distributed storage and processing of large datasets, built from a number of components that interact with one another. Because the framework is large and complex, it is important to make sure every component performs optimally. While we can always optimize the underlying hardware resources, network infrastructure, OS, and other components of the stack, only users have control over optimizing the jobs that run on the cluster.

The Birth of Dr. Elephant

To help users understand and optimize their flows, we scheduled regular training sessions on how to tune the jobs, but this didn’t really solve our problem. At LinkedIn, we have employees with different levels of experience with Hadoop using different frameworks to run their Hadoop jobs. Additionally, the number of Hadoop users keeps increasing. This means that having regular sessions for different users on different frameworks is not an easy task, and it’s surely not scalable.

Up until a few years ago, the Hadoop team at LinkedIn analyzed flows on behalf of employees, gave advice on how to tune them, and approved them to run in production. As a first step to optimization, we looked at obvious optimization patterns based on some simple rules and gave advice to the users. But as the number of users grew, it became difficult to provide sufficient support, and tuning was further delayed while we waited on users to act. There was no way to verify that we had achieved optimal performance for a job or to guarantee performance coverage. We therefore needed to standardize and automate the process.

The Hadoop experts reviewing the flows observed several common recurring optimization patterns, and based on this, we decided to embark on a new experimental project to optimize both Hadoop developer and Hadoop user time. This led to the birth of Dr. Elephant.

What is Dr. Elephant?

Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark. It automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption. Its goal is to improve developer productivity and increase cluster efficiency by making it easier to tune the jobs. It analyzes the Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights on how a job performed, and then uses the results to make suggestions about how to tune the job to make it perform more efficiently.

Why Dr. Elephant?

Most of the Hadoop optimization tools out there, whether open source or proprietary, are designed to collect system resource metrics and monitor cluster resources. They are focused on simplifying the deployment and management of Hadoop clusters. Very few tools are designed to help Hadoop users optimize their flows. The ones that are available are either inactive or have failed to scale and support the growing number of Hadoop frameworks. Dr. Elephant supports Hadoop with a variety of frameworks and can be easily extended to newer frameworks. It also has support for Spark. You can plug in and configure as many custom heuristics as you like. It is designed to help the users of Hadoop and Spark understand the internals of their flows and to help them tune their jobs easily.

How does Dr. Elephant work?

At regular intervals, Dr. Elephant gets a list of all recently succeeded and failed applications from the YARN resource manager. The metadata for each application (namely, the job counters, configurations, and task data) is fetched from the Job History server. Once it has all the metadata, Dr. Elephant runs a set of heuristics on it and generates a diagnostic report on how the individual heuristics and the job as a whole performed. Each heuristic and each job is then tagged with one of five severity levels to indicate potential performance problems.
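The analysis step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Dr. Elephant's actual code: the severity level names, the two example heuristics, their thresholds, the metadata keys, and the choice to rate a job by its worst heuristic are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List, Tuple


class Severity(IntEnum):
    """Five ordered levels; the names are assumed for illustration."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    SEVERE = 3
    CRITICAL = 4


@dataclass
class HeuristicResult:
    heuristic: str
    severity: Severity


# Application metadata (counters and task data) reduced to a flat dict.
AppData = Dict[str, float]
# A heuristic is any function from application metadata to a result,
# which is what makes the set of rules pluggable.
Heuristic = Callable[[AppData], HeuristicResult]


def severity_from_thresholds(value: float, limits: List[float]) -> Severity:
    """Count how many of the four ascending limits the value has reached."""
    return Severity(sum(1 for limit in limits if value >= limit))


def mapper_spill(app: AppData) -> HeuristicResult:
    # Hypothetical rule: spilled records per map output record.
    ratio = app["spilled_records"] / max(app["map_output_records"], 1)
    return HeuristicResult(
        "Mapper Spill", severity_from_thresholds(ratio, [2.0, 2.2, 2.5, 3.0])
    )


def gc_overhead(app: AppData) -> HeuristicResult:
    # Hypothetical rule: share of CPU time spent in garbage collection.
    ratio = app["gc_ms"] / max(app["cpu_ms"], 1)
    return HeuristicResult(
        "GC Overhead", severity_from_thresholds(ratio, [0.01, 0.02, 0.03, 0.04])
    )


def analyze(
    app: AppData, heuristics: List[Heuristic]
) -> Tuple[List[HeuristicResult], Severity]:
    """Run every heuristic and rate the job by its worst result (an assumption)."""
    results = [heuristic(app) for heuristic in heuristics]
    overall = max((r.severity for r in results), default=Severity.NONE)
    return results, overall


app = {
    "spilled_records": 250,
    "map_output_records": 100,
    "gc_ms": 10,
    "cpu_ms": 10_000,
}
results, overall = analyze(app, [mapper_spill, gc_overhead])
for result in results:
    print(f"{result.heuristic}: {result.severity.name}")
print(f"Job severity: {overall.name}")
```

Because each rule is just a function with its own thresholds, adding a heuristic in this sketch means appending one more function to the list passed to `analyze`.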

Dr. Elephant's Dashboard

Starting Dr. Elephant’s UI will load the dashboard. This will show several cluster statistics, like how many jobs ran on the cluster, the number of jobs that need some amount of tuning, and the number of jobs that are critical based on the heuristic analysis. Below these numbers you will find all the recent jobs analyzed by Dr. Elephant in the last 24 hours.

Dr. Elephant’s Search Page

Dr. Elephant has a search feature that allows users to filter and search jobs based on the job/application ID, the flow execution ID, the user who submitted the job, the type of the job (Pig, Hive, etc.), the severity of the job, the severity of a specific heuristic, and the job finish date.

Dr. Elephant’s Job Page

When you click a particular search result, you can view complete information on the job. This information page gives details specific to an individual MapReduce or Spark job. It includes reports on how the heuristics performed and some statistics that are helpful to the users. In addition, you can get information on the actual identity of the job, such as the workflow reference, the job reference, and the job history server reference. It also provides easy access links to view the job’s history and all the jobs that belong to the given job’s workflow.

A Flow History View from Dr. Elephant

A Job History View from Dr. Elephant

In addition to reports on individual jobs, Dr. Elephant’s job and flow history pages provide a historic view of a job or flow and help you compare a particular execution with previous executions. Each page computes a performance score for every execution based on all the heuristic severities and plots the scores on a graph. This graph helps you analyze why a particular execution performed poorly compared to another. For each point in the graph, the page also lists the top three jobs or stages that need attention. Each colored dot represents a job in the flow history page and a heuristic in the job history page, while the color represents the heuristic severity. Hovering over a dot shows more information on the individual job or heuristic.
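As a rough sketch of how such a score could be derived, the snippet below sums numeric severity values per job, adds the job scores up for the whole flow execution, and picks the three jobs most in need of attention. The scoring formula, the function names, and the sample data are assumptions for illustration; the formula Dr. Elephant actually uses may weight heuristics differently.

```python
from typing import Dict, List, Tuple

# Severity is encoded as an integer from 0 (no issue) to 4 (critical).
JobSeverities = Dict[str, int]  # heuristic name -> severity value


def job_score(severities: JobSeverities) -> int:
    """Assumed formula: a job's score is the sum of its heuristic severities."""
    return sum(severities.values())


def summarize_execution(jobs: Dict[str, JobSeverities]) -> Tuple[int, List[str]]:
    """Return the execution's total score and up to three jobs needing attention."""
    scores = {name: job_score(sev) for name, sev in jobs.items()}
    total = sum(scores.values())
    # Only jobs with a non-zero score need attention; worst first.
    flagged = [name for name in scores if scores[name] > 0]
    top_three = sorted(flagged, key=lambda name: scores[name], reverse=True)[:3]
    return total, top_three


# Hypothetical flow execution with five jobs and two heuristics each.
execution = {
    "job_a": {"Mapper Spill": 0, "GC Overhead": 0},
    "job_b": {"Mapper Spill": 4, "GC Overhead": 2},
    "job_c": {"Mapper Spill": 1, "GC Overhead": 0},
    "job_d": {"Mapper Spill": 3, "GC Overhead": 0},
    "job_e": {"Mapper Spill": 2, "GC Overhead": 0},
}
total, top_three = summarize_execution(execution)
print(total, top_three)
```

Computing the same summary for each past execution gives one point per execution, which is the kind of series a history graph can plot; a lower score indicates a healthier run under this assumed formula.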

Dr. Elephant’s Expertise

Dr. Elephant has evolved since its birth in mid-2014 to include several useful features based on expert recommendations and suggestions from users. Broadly, here is a list of Dr. Elephant’s skills and capabilities:

  • Pluggable and configurable rule-based heuristics that diagnose a job;
  • Out-of-the-box integration with Azkaban scheduler and support for adding any other Hadoop scheduler, such as Oozie;
  • Representation of historic performance of jobs and flows;
  • Job-level comparison of flows;
  • Diagnostic heuristics for MapReduce and Spark;
  • Easily extensible to newer job types, applications, and schedulers;
  • REST API to fetch all the information.

A Family Doctor

Dr. Elephant is very popular at LinkedIn, where people love it for its simplicity. Like a family doctor, it is always on call and solves around 80 percent of the problems through simple diagnosis. It is designed to be self-explanatory and focused toward helping Hadoop users understand and optimize their flows by providing job-level suggestions rather than cluster-level statistics. Like a real doctor diagnosing a problem, Dr. Elephant analyzes problems through simple flowcharts. You can add as many heuristics or rules into Dr. Elephant as you like.

We use Dr. Elephant for a number of different tasks, including monitoring how a flow is performing on the cluster, understanding why a flow is running slowly, knowing what can be tuned and how to improve a flow, comparing a flow against previous executions, and troubleshooting, to name a few. Other tools use Dr. Elephant to generate useful reports using its REST API. For instance, one tool green-lights the performance of a flow using Dr. Elephant, a prerequisite to run jobs on production clusters.

Dr. Elephant has been thoroughly integrated into our Hadoop ecosystem. At LinkedIn, we made it compulsory for developers to use Dr. Elephant as part of their development cycle. It is mandatory to get a green signal from Dr. Elephant for a flow to run in production. For any user issues, we first ask for Dr. Elephant’s report. This encourages users to write their jobs optimally and try to make all of their jobs appear green in Dr. Elephant. Dr. Elephant has been a part of LinkedIn’s culture for more than a year and has been helping everyone.

Next Play

Many new features are planned to take Dr. Elephant to the next level. We are constantly looking for new ideas that help improve developer productivity and cluster usage. Apart from adding and improving heuristics and extending to newer job types, planned upgrades include:

  • Job-specific tuning suggestions based on real-time metrics;
  • Visualizations of jobs’ cluster resource usage and trends;
  • Better Spark integration;
  • Integrating more schedulers.

Code and Documentation

Dr. Elephant is open sourced under the Apache v2 License. You can find the source code and documentation on our GitHub page.

Dr. Elephant also has a Google Group where you can post queries and discuss ideas. Contributions and suggestions are welcome.

Friends of Dr. Elephant

Dr. Elephant is under active development by the Hadoop Dev Team at LinkedIn. Thanks to all the contributors: Akshay Rai, Anant Nag, Fangshi Li, Mark Wagner, Min Shen, Ratandeep Ratti, Shida Li, Subbu Subramaniam, and Yitong Zhou, with technical guidance from Carl Steinbach, Shankar M, and Vijay Ramachandran. Thanks to Suja Viswesan, Abhishek Agrawal, Kapil Surlaker, and Igor Perisic for supporting this project.