Recap: Improving Hadoop Performance by (up to) 1000x

January 2, 2012

Co-authored by Daniel Tunkelang.

Daniel Abadi recently visited LinkedIn and talked about "Improving Hadoop Performance by (up to) 1000x." Abadi is an assistant professor at Yale University and the Chief Scientist of Hadapt, a company he co-founded to commercialize the HadoopDB project. This post is a recap of Abadi's talk, which focused on how HadoopDB could combine the best of Hadoop and parallel databases to achieve dramatic performance improvements in data analytics tasks.

A Comparison of Approaches to Large-Scale Data Analysis

Abadi began by recounting one of the controversial research papers in recent database history, the SIGMOD 2009 paper entitled "A Comparison of Approaches to Large-Scale Data Analysis" in which he, Michael Stonebraker, and others compared Hadoop to parallel databases and concluded that each paradigm had its own advantages:

  • Hadoop: better fault tolerance and adaptation to slowdowns.
  • Parallel databases: better performance for workloads involving database joins.

At the time, many people concluded that the two approaches represented different tools for different problems, like a hammer and a drill.

Two years later, however, Abadi rejects this analogy. Instead, he sees the two approaches as having many core features in common: they are both scalable data processing systems that run over a cluster of commodity nodes, partition data across those nodes, and parallelize processing work across those nodes.

Abadi suggests that we don’t have a hammer and a drill, but rather two different varieties of drill. And what we need is one drill with the right variety of drill bits. Abadi offers HadoopDB as such a solution: it adds drill bits to Hadoop that make it high performing for a diverse set of analytics tasks.

Drill Bit #1: database storage layer

HadoopDB's first drill bit replaces HDFS as the underlying Hadoop storage layer with a database system, either row-based or column-based, optimized for relational queries. In either case, each node hosts a single database instance, while Hadoop tracks storage at the file/block level and maintains replication for fault tolerance.

HadoopDB then adds a query optimizer that accepts Hive queries, determines if each query can be run outside of Hadoop (i.e., handled by a single database instance), and then pushes as much processing as possible to the database storage layer.

Drill Bit #2: graph processing

HadoopDB’s second drill bit focuses on graph processing workloads. In this case, HadoopDB stores the graph as RDF (i.e., as a collection of <vertex, edge, vertex> triples), partitioning the triples across the Hadoop nodes. A naive hash-based partitioning of triples across Hadoop servers would be inefficient, since it often would not co-locate all triples needed to answer a query. Hence, HadoopDB takes a two-pronged approach to partitioning graphs in Hadoop:

  1. Use a standard partitioning algorithm that aims to minimize the number of edges that cross a partition.
  2. Replicate parts of the graph using an "n-hop" guarantee: for each vertex stored in a given partition, all vertexes up to n hops away are replicated on that partition (assuming they are not already mapped to it).

Then, as with relational data, HadoopDB pushes as much query processing as possible to the graph data storage layer. Abadi's experimental results run on the Lehigh University Benchmark showed a 1000x speedup over a standard Hadoop system.

Hadoop at LinkedIn

At LinkedIn, we not only use Hadoop extensively, but we also contribute significantly to its development, as well as to related open source projects such as Giraph, Pig, and Hive. Moreover, we use Hadoop to perform graph processing on a massive scale, such as scoring over a hundred billion relationships a day to compute People You May Know.

Given that our graph keeps growing and we keep finding more creative ways to use it, we are keenly interested in ways to perform large-scale graph computations efficiently - in particular, improving our offline analytics and bringing new functionality into the online setting. While we continue to use Hadoop directly, we are delighted to see efforts like HadoopDB innovating in this space.

It was a pleasure to host Daniel Abadi and we hope the over 250 people who attended enjoyed his talk as much as we did.