As the use of Hadoop grows in an organization, scheduling, capacity planning, and billing become critical concerns. These are all open problems in the Hadoop space, and today, we’re happy to announce we’re open sourcing LinkedIn’s solution: White Elephant.
At LinkedIn, we use Hadoop for product development (e.g., predictive analytics applications like People You May Know and Endorsements), descriptive statistics for powering our internal dashboards, ad-hoc analysis by data scientists, and ETL. To better understand the usage of our Hadoop cluster across all of our use cases, we created White Elephant.
While tools like Ganglia provide system-level metrics, we wanted to be able to understand what resources were being used by each user and at what times. White Elephant parses Hadoop logs to provide visual drill downs and rollups of task statistics for your Hadoop cluster, including total task time, slots used, CPU time, and failed job counts.
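For context, the job history logs that pre-YARN Hadoop writes are line-oriented records of `KEY="value"` pairs. As a rough illustration of the kind of parsing involved (the record format shown and the field names are assumptions for this sketch, not White Elephant's actual parser), extracting fields from one such line might look like:

```python
import re

# One KEY="escaped value" pair; values may contain escaped quotes (\").
PAIR_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def parse_history_line(line):
    """Parse one job-history record into (record_type, fields).

    Example record (shape assumed for illustration):
      Job JOBID="job_201301010000_0001" USER="alice" FINISH_TIME="1358500000000" .
    """
    record_type, _, rest = line.partition(" ")
    fields = {k: v.replace('\\"', '"') for k, v in PAIR_RE.findall(rest)}
    return record_type, fields

line = 'Job JOBID="job_201301010000_0001" USER="alice" FINISH_TIME="1358500000000" .'
rtype, fields = parse_history_line(line)
# rtype is "Job"; fields maps JOBID, USER, and FINISH_TIME to their values.
```

From records like these, per-user and per-job statistics such as task time and failure counts can be accumulated.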
White Elephant fills several needs:
- Scheduling: with a handful of periodic jobs, it’s easy to reason about when they should run, but this quickly breaks down as the number of jobs grows. The ability to schedule jobs at periods of low utilization helps maximize cluster efficiency.
- Capacity planning: to plan for future hardware needs, operations needs to understand how the resource usage of jobs grows over time.
- Billing: Hadoop clusters have finite capacity, so in a multi-tenant environment it’s important to weigh the resources a product feature consumes against its business value.
In this post, we'll go over White Elephant's architecture and showcase some of the visualizations it offers. While you're reading, feel free to head over to the GitHub page to check out the code and try it out yourself!
Here's a diagram outlining the White Elephant architecture:
In this example, there are three Hadoop grids (A, B, and C) for which White Elephant computes statistics as follows:
- Upload Task: a task that periodically runs on the Job Tracker for each grid and incrementally copies new log files into a Hadoop grid for analysis.
- Compute: a sequence of MapReduce jobs, coordinated by a Job Executor, parses the uploaded logs and computes aggregate statistics.
- Viewer: a viewer app incrementally loads the aggregate statistics, caches them locally, and exposes a web interface which can be used to slice and dice statistics for your Hadoop clusters.
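To make the Compute step concrete, here is a hedged sketch, in plain Python rather than the MapReduce jobs White Elephant actually runs, of the kind of rollup it produces: aggregating per-task records into total task hours per user per week. The record shape and field names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

def week_of(epoch_ms):
    """Bucket a millisecond timestamp into an ISO year-week label."""
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    year, week, _ = dt.isocalendar()
    return f"{year}-W{week:02d}"

def rollup_task_hours(task_records):
    """Aggregate task records into total task hours keyed by (user, week)."""
    totals = defaultdict(float)
    for rec in task_records:
        hours = (rec["finish_ms"] - rec["start_ms"]) / 3_600_000
        totals[(rec["user"], week_of(rec["start_ms"]))] += hours
    return dict(totals)

# Hypothetical parsed task records: two tasks for alice, one for bob.
records = [
    {"user": "alice", "start_ms": 1358500000000, "finish_ms": 1358507200000},
    {"user": "alice", "start_ms": 1358500000000, "finish_ms": 1358503600000},
    {"user": "bob",   "start_ms": 1358500000000, "finish_ms": 1358501800000},
]
totals = rollup_task_hours(records)
```

In the real system this aggregation runs as MapReduce over all uploaded logs, and the viewer slices the resulting per-user, per-week statistics.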
Let’s go through a real use case: we've noticed an increase in cluster usage over the last few months, but we don't know who is responsible. We can use White Elephant to investigate.
The graph below shows a sample data set with aggregate hours used per week for a cluster over the last several months. You'll notice that since mid-January, weekly cluster usage increased by about 4k hours from a baseline of about 6k hours.
In the graph above, "Aggregate selected" is checked, so the data for all users is grouped together. Let’s instead look at a stacked graph of the top 20 users by unchecking "Aggregate selected" and setting "Max to graph" to 20.
Now we can see individual usage per week by the top 20 users. The remaining 46 users have been grouped together into a single metric. Several users stand out suspiciously in terms of cluster usage, so we'll dig deeper.
We can highlight one of these users by hovering over the username in the legend.
Using drag and drop we can rearrange the list so these users appear at the bottom.
It looks like four users show significant usage increases: User-1 and User-2 began increasing in mid-January, while User-43 and User-65 began a steady climb around December.
Did we miss anyone? If we want to see what cluster usage would look like without these users we can deselect them in the legend.
Once we exclude these users, we can see that cluster usage has not significantly changed during this time period, so we've identified all of our culprits.
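The same sanity check can be done programmatically on exported data. A small sketch (the data shape is hypothetical) that removes the suspect users and totals the remainder per week:

```python
def weekly_totals_excluding(per_user_weekly, excluded):
    """Sum per-week hours across all users not in `excluded`.

    per_user_weekly: {user: {week_label: hours}}
    Returns {week_label: total_hours} for the remaining users.
    """
    totals = {}
    for user, weeks in per_user_weekly.items():
        if user in excluded:
            continue
        for week, hours in weeks.items():
            totals[week] = totals.get(week, 0.0) + hours
    return totals

# Hypothetical per-user weekly hours.
usage = {
    "User-1":  {"2013-W02": 500.0,  "2013-W06": 2500.0},
    "User-43": {"2013-W02": 300.0,  "2013-W06": 1200.0},
    "User-9":  {"2013-W02": 2900.0, "2013-W06": 3000.0},
}
remaining = weekly_totals_excluding(usage, {"User-1", "User-43"})
# If the suspects explain the growth, `remaining` stays roughly flat week to week.
```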
Let's drill down to just these four users. Users can be selected with a multi-select control, and a filter makes it easy to search for particular users by name.
How do these four users compare with everyone else? For convenience, the remaining users are aggregated together and included as well: just select the aggregate metric and move it to the top.
And there you have it: with White Elephant, we've tracked down the problem with ease thanks to the visibility it gives us into our Hadoop usage. We even get a table of the underlying data, which we can export as a CSV.
White Elephant is open source and freely available here under the Apache 2 license. As always, we welcome contributions, so send us your pull requests.