Sizr: Visualizing HDFS utilization at LinkedIn

Co-authors: Vamshi Hardageri, Brian Jue

Sizr is an interactive visualization tool developed at LinkedIn for the Hadoop Distributed File System (HDFS). It provides insights into HDFS disk space and namespace utilization. It can forecast and track weekly, monthly, and quarterly growth, and detect inefficient file storage. This post outlines the need for this tool, its architecture and components, and how we use Sizr at LinkedIn.

The need to measure HDFS utilization

LinkedIn is a data-oriented company that relies heavily on Big Data processing systems like Hadoop to power many decisions and features in our consumer-facing applications. With a multitude of teams running their workflows on Hadoop, allocating and monitoring the HDFS space associated with these projects becomes a challenge. Time and again, we have had to deal with issues arising from unexpected growth in space usage. These issues often went unnoticed until a workflow broke or the Hadoop cluster underperformed due to over-utilization.

The Sizr tool is built from an operations and planning standpoint and can be used to gather insights into LinkedIn's HDFS space usage and growth on any cluster, with the ability to answer questions like:

  1. What is the growth in namespace or disk space for a cluster or a particular dataset over a period of 90 days?

  2. Given an HDFS directory, what is the nature of its growth: is it seasonal or organic? Sizr allows us to forecast this behavior.

  3. Which datasets are responsible for growth and are likely candidates for cleanup? Knowing this gives HDFS users ample time to change how they store their data, perform cleanup, or request additional space in the case of organic growth.

  4. Which datasets occupy the majority of the space? Is the storage efficient? If not, how does it compare to its peers?


Components of the Sizr application

[Figure: Browser tab]

Browser: The HDFS browser provides the ability to navigate the file system and has the following two components:

  • Current snapshot of the file system in the form of a table, which presents information about the datasets under a given HDFS path, such as aggregated namespace (number of files and folders), disk space, permissions, ownership, and last modified time.

  • Historical trend of disk space and namespace presented in the form of a stacked area chart, which displays how child datasets contribute to the overall utilization of a parent dataset. The trend is graphed over the previous 90 days, with the ability to zoom in and out over the timeline. A sketch of how this series might be assembled appears below.
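
The post doesn't describe how the chart's underlying series is put together, so the following is a minimal Python sketch under assumed inputs: rows of (date, child path, disk bytes) pulled from the date-partitioned usage table described later in the architecture section (the row shape is hypothetical), grouped into one series per child directory for the stacked area chart.

```python
from collections import defaultdict

# Hypothetical rows from the date-partitioned usage table: one record per
# direct child of the browsed directory per day (date, child_path, disk_bytes).
rows = [
    ("2015-09-19", "/data/tracking", 10 * 2**40), ("2015-09-19", "/data/derived", 4 * 2**40),
    ("2015-09-20", "/data/tracking", 12 * 2**40), ("2015-09-20", "/data/derived", 4 * 2**40),
]

def stacked_series(rows):
    """Group daily rows into {child_path: [(date, disk_bytes), ...]},
    one series per child, ready to plot as a stacked area chart."""
    series = defaultdict(list)
    for date, child, size in sorted(rows):
        series[child].append((date, size))
    return dict(series)

print(stacked_series(rows))
```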
[Figure: Analyzer tab]

Analyzer: This view provides usage reports and a utilization breakdown of a particular space. Unique datasets are identified, and reports are generated based on their historical usage information. There are two components to this:

  • Quota usage and forecasting: The graph indicates the disk space and namespace utilization over the last 90 days. Linear regression is used to forecast usage for the next 30 days based on the past 90 days of usage. This is particularly useful for spaces that are bound by a quota, where the graph helps identify organic growth and provides justification for a quota increase.
  • Breakdown of utilization: This is a report that identifies the datasets under a space, shown in descending order of disk space and namespace utilization. The rate of increase or decrease in utilization for every dataset is computed over a week, a month, and a quarter (a sketch of this computation follows the list). Datasets with high growth rates are highlighted and complemented with trend lines, which give a quick overview of the growth pattern for these datasets.
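
The post doesn't spell out the rate computation; here is a minimal sketch, assuming one utilization sample per day (oldest to newest) and comparing the latest value against the value one window earlier. The 25% weekly threshold is made up for illustration.

```python
# Hypothetical growth-rate computation over trailing windows.
WINDOWS = {"week": 7, "month": 30, "quarter": 90}

def growth_rates(daily_bytes):
    """Relative change of the latest sample vs. the sample one window earlier."""
    latest = daily_bytes[-1]
    rates = {}
    for name, days in WINDOWS.items():
        if len(daily_bytes) > days and daily_bytes[-1 - days] > 0:
            rates[name] = (latest - daily_bytes[-1 - days]) / daily_bytes[-1 - days]
    return rates

def flag_high_growth(datasets, weekly_threshold=0.25):
    """datasets: {name: [daily_bytes]}; flag anything growing >25% week-over-week."""
    return [name for name, series in datasets.items()
            if growth_rates(series).get("week", 0) > weekly_threshold]

series = {"dataset1": [100 + d for d in range(91)],          # slow linear growth
          "dataset2": [100 * 1.05 ** d for d in range(91)]}  # ~5% daily growth
print(flag_high_growth(series))  # ['dataset2']
```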
[Figure: Visualizer tab]

Visualizer: This view provides a high-level visual breakdown of the HDFS space on two variables: size and average file size. The block size in the treemap visualization indicates the actual disk space utilized, and the color scale indicates the average file size. A small average file size is a matter of concern, as it indicates inefficient file storage: the NameNode tracks every file and block in memory, so many small files inflate namespace usage without contributing much data. A sketch of how these two variables might be derived follows.
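To make the two treemap variables concrete, here is a small sketch that derives block area (total disk space) and color (average file size) per directory; the inputs and color thresholds are made up for illustration.

```python
# Hypothetical per-directory totals taken from the rolled-up table.
dirs = {
    "subdir3": {"disk_bytes": 400 * 2**40, "file_count": 900_000_000},
    "subdir8": {"disk_bytes": 2 * 2**40, "file_count": 4_000},
}

def color_for(avg_bytes):
    """Map average file size to a color: small averages (many tiny files) are red."""
    if avg_bytes < 1 * 2**20:        # under 1 MB: inefficient storage
        return "red"
    if avg_bytes < 128 * 2**20:      # under a typical HDFS block size
        return "yellow"
    return "green"

for name, d in dirs.items():
    avg = d["disk_bytes"] / d["file_count"]   # color variable
    area = d["disk_bytes"]                    # treemap block area variable
    print(f"{name}: area={area / 2**40:.0f} TiB, avg={avg / 2**20:.1f} MB, {color_for(avg)}")
```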

System Architecture

[Figure: Components of Sizr data processing and storage]

1. ETL workflow: This workflow has two parts:

  • HDFS snapshot processing: The HDFS snapshot is an image of the entire HDFS space, published on a daily basis. A Pig script processes this snapshot, recursively extracts the disk space and namespace information of every leaf node (bottom-level directory) along with the HDFS directory structure, and loads the data into a staging table.

  • Aggregation and roll-up: The aggregation (combining the namespace, disk space, and modification time for each leaf node) and roll-up (propagating the information from all child directories up through each parent directory to the top directory) are then performed in the staging area, and the results are stored in a date-partitioned MySQL table (a sketch of the roll-up appears after this list).

2. Identifying unique datasets: A custom algorithm is applied to identify all datasets with business value from a pool of directories (a hypothetical heuristic is sketched after this list).

3. Forecasting: For every identified dataset, linear regression is applied to estimate the utilization for the next 30 days based on the past 90 days of data (a minimal sketch follows).
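
The Pig script and roll-up code aren't included in the post; the following is a minimal Python sketch of the roll-up idea under assumed record shapes: given per-leaf aggregates, every ancestor directory accumulates the namespace and disk-space totals of its descendants.

```python
from collections import defaultdict
from posixpath import dirname

# Hypothetical per-leaf aggregates from the staging table:
# path -> (file_count, disk_bytes)
leaf_stats = {
    "/data/tracking/events/2015-09-20": (1_200, 50 * 2**30),
    "/data/tracking/events/2015-09-21": (1_300, 52 * 2**30),
    "/data/derived/metrics":            (40,    3 * 2**30),
}

def roll_up(leaves):
    """Propagate each leaf's file count and disk space to all of its ancestors."""
    totals = defaultdict(lambda: [0, 0])  # path -> [file_count, disk_bytes]
    for path, (files, size) in leaves.items():
        node = path
        while True:
            totals[node][0] += files
            totals[node][1] += size
            if node == "/":
                break
            node = dirname(node)
    return dict(totals)

rolled = roll_up(leaf_stats)
print(rolled["/data/tracking/events"])  # sum over both daily partitions
```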
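The custom dataset-identification algorithm itself isn't described in the post. Purely as an illustration (and not LinkedIn's actual algorithm), one plausible heuristic is to treat a directory as a single dataset root when most of its children look like date partitions rather than independent datasets:

```python
import re

# Made-up partition pattern: children named like 2015-09-20, 2015/09, or 20150920.
PARTITION_RE = re.compile(r"^\d{4}([-/]\d{2}){0,2}$|^\d{8}$")

def dataset_roots(children_by_dir, threshold=0.8):
    """children_by_dir: {directory_path: [child_basenames]} -> likely dataset roots."""
    roots = []
    for path, children in children_by_dir.items():
        if not children:
            continue
        partition_like = sum(bool(PARTITION_RE.match(c)) for c in children)
        if partition_like / len(children) >= threshold:
            roots.append(path)
    return roots

example = {
    "/data/tracking/events": ["2015-09-19", "2015-09-20", "2015-09-21"],
    "/data/derived": ["metrics", "features"],
}
print(dataset_roots(example))  # ['/data/tracking/events']
```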
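The post states that linear regression over the past 90 days drives the 30-day forecast; here is a minimal sketch using NumPy's ordinary least-squares fit, assuming one utilization sample per day (the synthetic data is made up):

```python
import numpy as np

def forecast(daily_usage_bytes, horizon_days=30):
    """Fit a line to the historical samples and extrapolate `horizon_days` ahead."""
    history = np.asarray(daily_usage_bytes, dtype=float)
    days = np.arange(len(history))
    slope, intercept = np.polyfit(days, history, deg=1)  # least-squares line
    future = np.arange(len(history), len(history) + horizon_days)
    return slope * future + intercept

# 90 days of synthetic usage growing ~0.5 TB/day, plus noise.
rng = np.random.default_rng(0)
usage = 100e12 + 0.5e12 * np.arange(90) + rng.normal(0, 2e12, size=90)
projected = forecast(usage)
print(f"projected usage 30 days out: {projected[-1] / 1e12:.1f} TB")
```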

Use cases

1. Identify the cause of growth

[Figure: utilization trend with a sudden increase highlighted]

Unexpected growth can be identified visually. The region highlighted in red represents a sudden increase in utilization around September 20th. On further drill-down (image below), we can zero in on the exact dataset, dataset2, that caused this growth.

[Figure: drill-down view identifying dataset2 as the source of growth]

2. Identify quota breach and forecast utilization

[Figure: quota usage and forecast plot for a dataset]

The Analyzer determines the efficiency of space utilization. The first circle indicates that utilization hit the quota around mid-September, and hence an increase in quota was requested. The forecast plot (second circle) shows the organic growth of this dataset and estimates that it will reach its quota again by the end of November. This helps users gauge their space requirements proactively.

3. Identify the datasets responsible for the above growth

[Figure: utilization breakdown report with high-growth datasets highlighted]

The above report lists all the unique datasets under the data space in consideration and highlights the ones with high growth rates. This gives the user a yardstick for identifying which datasets are potential candidates for cleanup or warrant further investigation of unexpected behavior.

4. Identify inefficient file storage

[Figure: treemap visualization of disk space and average file size]

The treemap visualization provides an overview of both the breakdown of space occupied and the corresponding storage efficiency. From the above chart, we can identify that subdir3 occupies the majority of the space but performs poorly on storage efficiency (many small files). At the other end, subdir8 does not occupy a lot of space, but it has the highest efficiency in file storage.

What’s next?

As HDFS and the broader Hadoop ecosystem play an ever-more important role at LinkedIn, we see a need to track and measure operational efficiency. Some of the tools in the pipeline are:

  • Planner: This tool is planned for a future release and involves storage capacity planning for Hadoop clusters. It can be used to forecast growth on a current cluster, or to model a new cluster based on current utilization.
  • Intelligent platform: We believe that users will benefit from the system's ability to predict growth and detect anomalies within the collected metadata. We are also working towards making this a self-serve platform where users can subscribe to certain growth metrics and get alerted in advance about potential breaches.