A Checkup with Dr. Elephant: One Year Later

Carl Steinbach

March 6, 2017

This post has been updated to note the release of Pepperdata's Application Profiler, a commercial project based on Dr. Elephant.

Last April, we announced the first open source release of Dr. Elephant, a performance monitoring and tuning service for Hadoop and Spark jobs. That announcement marked the culmination of two years of internal development work and more than a year of production use of Dr. Elephant on LinkedIn’s Hadoop clusters. This blog post covers some of the key lessons learned from tuning Hadoop jobs at LinkedIn, describes some exciting new features added since we open sourced the project last year, and outlines some ways in which you can get involved in the community in 2017.

The origins of Dr. Elephant

The Hadoop ecosystem at LinkedIn is very diverse. Backend metrics systems, experimentation systems, data products, and over a dozen data processing frameworks run on our Hadoop infrastructure. Everything from business analyst reporting to the systems our members interact with on a daily basis (e.g., People You May Know) uses Hadoop. Close to a thousand users interact with this infrastructure, and hundreds of thousands of data flows run on it every month. These numbers continue to grow.

The efficient operation of a Hadoop cluster requires careful tuning of both the cluster infrastructure and the jobs that run on it. Tuning Hadoop jobs is a nontrivial task for several reasons:

Not all users have extensive experience with Hadoop, making it more difficult for them to correctly interpret performance feedback and make adjustments accordingly.
Users are often dealing with hundreds of configuration parameters at a time, some of which interact with each other and can impact performance.
Frameworks often have different solutions for the same problems—something that can be a challenge for anyone who is helping a user tune their job, to say nothing of presenting a learning curve for the user themselves.
Critical information about job performance is scattered across client-side logs, task logs, the resource manager, multiple counters for each task, a global view, etc. None of these sources provides a user-centric view of job performance.

Early attempts at treating pain points

While we do have a team of Hadoop experts at LinkedIn, we realized that it was a very inefficient use of their time to make them responsible for helping every user optimally tune their own Hadoop jobs. At the same time, it would be equally inefficient to try to train the thousands of Hadoop users at the company on the intricacies of the tuning process.

Another approach we considered involved setting up a council where users submitted jobs for approval before they could run. However, simply telling a user that their job can or can’t run doesn’t give them any information about how they can improve future jobs. The result, once again, is an inefficient process that burdens users with an extra approval step without the added reward of increasing their understanding of Hadoop tuning.

While there were a variety of existing operator tools for tuning clusters at the global level, we soon realized that there were no good solutions aimed at users at the workflow and job level. The difficulty of building such a tool is compounded by the diversity and velocity of the Hadoop ecosystem, as well as by the challenge of making any solution accessible to users with a wide variety of backgrounds and skill levels.

Over the years of working with Hadoop, these separate factors—users with varying levels of Hadoop experience, a large number of systems using the Hadoop infrastructure, and a smaller core team of experts—led to recurring issues at LinkedIn. We found that sub-optimized jobs were wasting the time of our users, using our hardware in an inefficient manner, and making it difficult for us to scale the efforts of the core Hadoop team.

The craftsmanship cure

At LinkedIn Engineering, one of our core tenets is craftsmanship: the idea that engineers should take pride in their work because it’s directly reflective of their own judgment and skills. Consequently, when we looked to create a solution to our Hadoop-tuning problem, we wanted one that would enhance users’ tuning capabilities without automating the process entirely for them, thereby preserving the important component of human quality assurance.

What we needed to introduce to the job-tuning equation was a series of questions like those asked by a physician making a diagnosis: a step-by-step process that guides the user toward a solution while also educating them along the way.

So we created Dr. Elephant, a system that automatically detects under-performing jobs, diagnoses the root cause, and guides the owner of the job through the treatment process. Dr. Elephant makes it easy to identify jobs that are wasting resources, as well as jobs that can achieve better performance without sacrificing efficiency. Perhaps most importantly, Dr. Elephant makes it easy to act on these insights by making job-level performance tuning accessible to users regardless of their previous skill level. In the process, Dr. Elephant has helped to ease the tension that previously existed between user productivity on one side and cluster efficiency on the other.

Like any physician, Dr. Elephant provides advice but doesn’t force you to follow it. With so many users and different use cases, we didn’t want to build a system that automatically adjusted jobs without a user’s input. After all, this would make Hadoop job tuning a black box, defeating the purpose of using Dr. Elephant as a tool that incrementally teaches our users more about how to tune their jobs through practice. Furthermore, a system that automatically tunes jobs would have to be nearly 100 percent accurate in all cases, whereas a system that offers guidance but ultimately defers to the user’s discretion combines the best of man and machine.
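To make the "advise, don't enforce" idea concrete, here is a minimal sketch of what an advisory check might look like. It is not Dr. Elephant's actual code: the `Diagnosis` type, the `diagnose_input_skew` function, the severity labels, and the threshold ratios are all hypothetical names and values chosen for illustration. The check looks at how much input each task of a job read, flags the job if one task read far more than the average, and returns a suggestion for the job owner rather than changing any configuration itself.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Diagnosis:
    severity: str  # "none", "moderate", or "severe" (hypothetical scale)
    advice: str    # human-readable suggestion; never applied automatically


def diagnose_input_skew(task_input_bytes, moderate_ratio=2.0, severe_ratio=4.0):
    """Flag a job whose largest task reads far more data than the average task.

    The ratio thresholds are illustrative assumptions, not tuned values.
    """
    if not task_input_bytes:
        raise ValueError("need at least one task")

    avg = mean(task_input_bytes)
    # Ratio of the largest task's input to the average task's input.
    ratio = max(task_input_bytes) / avg if avg > 0 else 1.0

    if ratio >= severe_ratio:
        return Diagnosis(
            "severe",
            f"Largest task reads {ratio:.1f}x the average input; "
            "consider repartitioning the input data.",
        )
    if ratio >= moderate_ratio:
        return Diagnosis(
            "moderate",
            f"Largest task reads {ratio:.1f}x the average input; "
            "check for skewed keys.",
        )
    return Diagnosis("none", "Input is evenly distributed across tasks.")


# The result is only a report: the job owner decides whether to act on it.
report = diagnose_input_skew([100, 100, 100, 900])
print(report.severity, "-", report.advice)
```

Because the function only returns a diagnosis, the decision to change the job stays with its owner, which is the property the paragraph above argues for.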

Dr. Elephant isn’t the first attempt at creating a tool that tunes Hadoop jobs. Other projects, like Vaidya, were introduced to address many of the same problems. We believe that a few design and organizational considerations, many of which we didn’t fully appreciate at the time, have led to Dr. Elephant’s widespread adoption:

Dr. Elephant is a service as opposed to a command line tool. This means that users can easily visit Dr. Elephant like any other webpage. Additionally, we can monitor traffic to Dr. Elephant and measure interest from the user base in real time.
We provide a global view of all jobs in the system, not just a single user’s job. This has led to the emergent behavior of positive social pressure, which in some ways mimics the pressures of open source software projects: no one wants to have a poorly tuned Hadoop job on the Dr. Elephant dashboard, just like no one wants to submit poorly written code to an open source project.
At LinkedIn, multiple teams share our Hadoop infrastructure, making it much more important for engineers to be “good citizens” when it comes to running efficient jobs. We have a hunch that if teams were given control of their own clusters, there would be much more variability between teams as to how much time was spent tuning their processes.

Open source

A lot has happened in the ten months since we open-sourced Dr. Elephant. Activity on GitHub and the Dr. Elephant mailing list has been strong since day one, and the Dr. Elephant developers at LinkedIn have made it a priority to answer questions and handle pull requests. Most of the development goals listed in the original Dr. Elephant blog post have been accomplished, and many of these (including support for the Oozie and Airflow workflow schedulers, improved metrics, and enhancements to the Spark history fetcher and Spark heuristics) were contributed by developers outside of LinkedIn. We have also been happy to see that many people have been able to benefit from running Dr. Elephant, including companies like Airbnb, Foursquare, Hulu, Pinterest, and more. Many of these new users have already contributed back to Dr. Elephant, and we’ve even gotten interest from companies that wish to integrate Dr. Elephant into their commercial product offerings, including Pepperdata and their new Application Profiler product.

Try it out and get involved

If you use Hadoop or Spark, we hope you will give Dr. Elephant a try. The Dr. Elephant wiki provides setup instructions and a guide for administrators, and help is available on the Dr. Elephant mailing list and via GitHub Issues. If you are a developer and want to contribute to the project, we recommend taking a look at the contributor guide and attending one of the weekly meetings on Google Hangouts. And finally, if you’re in town for Strata + Hadoop World in San Jose, we hope you will consider attending the Dr. Elephant Meetup on March 16th.
