Open-Sourcing the LinkedIn Gradle Plugin and DSL for Apache Hadoop

August 13, 2015

I'm proud to announce that the Hadoop Dev Team at LinkedIn has open-sourced the LinkedIn Gradle Plugin for Apache Hadoop ("Hadoop Plugin"), which includes the LinkedIn Gradle DSL for Apache Hadoop ("Hadoop DSL"). You can get the Hadoop Plugin on GitHub today!

A couple of years ago, LinkedIn adopted Gradle as our primary build system. With Gradle, developers can easily extend the build system by defining their own plugins. We developed the Hadoop Plugin to help our Hadoop application developers more effectively build, test and deploy Hadoop applications. The Plugin includes the Hadoop DSL, a domain-specific language for specifying jobs and workflows for Hadoop workflow managers like Azkaban and Apache Oozie.

In particular, the Hadoop Plugin includes tasks that will help you more easily work with a number of Hadoop application frameworks. Since no one tool is perfect for every kind of job, Hadoop jobs at LinkedIn are written using a number of different application frameworks. The Hadoop Plugin enables developers to organize their Hadoop projects in a consistent fashion regardless of the particular tool they choose for the job.

Long before the Hadoop Plugin, Hadoop developers at LinkedIn had realized that writing individual Hadoop jobs was only part of the challenge in using Hadoop effectively. Most data-driven features that appear on LinkedIn are actually generated by processing pipelines that may consist of dozens of individual Hadoop jobs chained together into workflows managed in Azkaban or Oozie.

Understanding the relationships between jobs in a workflow and managing the workflow specification files became a challenge in itself. For example, it takes hundreds of job files to specify some of the big data processing workflows we run at LinkedIn. It became enough of a problem that engineers at the company wrote several home-grown tools to make it easier to manage their workflows. Since these tools were written using a mix of Ant, Maven and Ruby, they prevented LinkedIn from completing its company-wide migration to Gradle, and over time became increasingly fragile and difficult to maintain.

To solve these problems, we developed the Hadoop DSL, which is included with the Hadoop Plugin. The Hadoop DSL is an embedded Groovy domain-specific language with natural syntactic constructs for specifying jobs and workflows for Hadoop workflow managers.

  • Since it's an embedded Groovy DSL, you can use Groovy (or Java) anywhere throughout the DSL!
  • Using the DSL shields you from some of the painful details of creating Azkaban or Oozie workflow files.
  • The DSL is statically compiled into job and workflow files at build time. Since it's statically compiled, it can be statically checked! The static checker will catch a number of common problems with your workflow files at build time, so you don't have to run your Hadoop workflow only to have it error out hours later.
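
To make the "compile, then check at build time" idea concrete, here is a minimal sketch in Python. It is not the Hadoop DSL and not how the Hadoop Plugin is implemented: the dictionary-based workflow description and the function names (check_workflow, render_job_file, compile_workflow) are hypothetical, invented for this illustration. The only thing it borrows from the real world is the Azkaban convention of key=value job files with a "type" and a comma-separated "dependencies" property.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def check_workflow(jobs):
    """Statically check a workflow description; return a list of problems.

    `jobs` maps a job name to a dict with a "type", optional "properties",
    and optional "dependencies" (hypothetical structure for this sketch).
    Nothing is executed: only the declared structure is inspected.
    """
    problems = []
    for name, spec in jobs.items():
        for dep in spec.get("dependencies", []):
            if dep not in jobs:
                problems.append(f"{name}: unknown dependency '{dep}'")
    if not problems:
        graph = {name: set(spec.get("dependencies", [])) for name, spec in jobs.items()}
        try:
            # prepare() raises CycleError if the dependency graph has a cycle.
            TopologicalSorter(graph).prepare()
        except CycleError as err:
            problems.append("cycle: " + " -> ".join(err.args[1]))
    return problems


def render_job_file(spec):
    """Render one job as Azkaban-style key=value properties text."""
    lines = [f"type={spec['type']}"]
    for key, value in sorted(spec.get("properties", {}).items()):
        lines.append(f"{key}={value}")
    deps = spec.get("dependencies", [])
    if deps:
        lines.append("dependencies=" + ",".join(deps))
    return "\n".join(lines) + "\n"


def compile_workflow(jobs):
    """Check first, then 'compile' each job into the text of a .job file."""
    problems = check_workflow(jobs)
    if problems:
        raise ValueError("; ".join(problems))
    return {f"{name}.job": render_job_file(spec) for name, spec in jobs.items()}


if __name__ == "__main__":
    workflow = {
        "extract": {"type": "command", "properties": {"command": "echo extract"}},
        "count": {
            "type": "command",
            "properties": {"command": "echo count"},
            "dependencies": ["extract"],
        },
    }
    for filename, text in compile_workflow(workflow).items():
        print(f"--- {filename}\n{text}")
```

The point of the sketch is the ordering: a misspelled dependency or a cycle fails the "build" immediately with a readable message, before any job file is written or any cluster time is spent.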

For a complete reference and to learn by example, take a look at the Hadoop DSL Language Reference.

The Hadoop Plugin and Hadoop DSL have been embraced as the standard way to develop Hadoop workflows at LinkedIn. If you are writing Hadoop jobs using Gradle as your build system, you should definitely consider using the Hadoop Plugin! It will save you time and energy in developing your Hadoop workflows.

We welcome contributions of all kinds, including pull requests, bug reports, documentation enhancements, and ideas or feedback!