Apache Giraph, a framework for large-scale graph processing on Hadoop, reaches 0.1 milestone
February 6, 2012
I am excited to announce that the first release of Apache Giraph has been approved by the Giraph PPMC and the Apache Incubator PMC. Giraph is an implementation of Google's Pregel model for large-scale graph computation that is rapidly being developed within the Apache Software Foundation.
With this release, Giraph reaches its 0.1 milestone, which, despite the low number, represents a significant amount of work since its entry into the Apache Incubator process. LinkedIn is eagerly evaluating Giraph to see how it can help improve the speed of many of our graph-intensive workflows. Check out the release notes for what is included in 0.1 and read on for more information about Giraph.
The limitations of MapReduce
It's true: everything is a network. Quite a few of these networks hold valuable insights just waiting to be mined. While it is possible to do processing on graphs with MapReduce, this approach is suboptimal for two reasons:
- MapReduce's view of the world as keys and values is not a natural fit for graphs, and it often takes significant effort to pound graph-shaped problems into MapReduce-shaped solutions.
- Most graph algorithms involve repeatedly iterating over the graph state, which in a MapReduce world requires multiple chained jobs. This, in turn, requires the state to be saved and reloaded between iterations, operations that can easily dominate the overall runtime of the computation.
The Giraph approach
Giraph attempts to alleviate these limitations by providing a more natural way to model graph problems:
- Think like a vertex!
- Keep the graph state in memory during the whole of the algorithm, only writing out the final state (and possibly some optional checkpointing to save progress as we go).
Rather than implementing mapper and reducer classes, one implements a Vertex, which has a value and edges and can send messages to, and receive messages from, other vertices in the graph as the computation iterates. This approach makes graph computations such as PageRank simple enough that we use it as our Hello World example.
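To give a feel for the "think like a vertex" style, here is a minimal, single-process sketch of PageRank written from a vertex's point of view. It is not Giraph's API: `ToyVertex`, `compute`, `run`, and the in-memory `outbox` map are hypothetical names made up for this illustration, and the sketch assumes every vertex has at least one outgoing edge and uses the conventional damping factor of 0.85. It only shows the shape of the model: each vertex reads its incoming messages, updates its value, and messages its neighbors, with the graph staying in memory between iterations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single-process simulation of vertex-centric PageRank. Not Giraph's API. */
final class VertexCentricPageRank {
    static final double DAMPING = 0.85; // conventional PageRank damping factor

    /** Hypothetical vertex type: a value, out-edges, and an inbox of messages. */
    static final class ToyVertex {
        final int id;
        final List<Integer> outEdges;
        double value;
        List<Double> inbox = new ArrayList<>();

        ToyVertex(int id, List<Integer> outEdges, double initialValue) {
            this.id = id;
            this.outEdges = outEdges;
            this.value = initialValue;
        }

        /** One superstep: read the inbox, update the value, message neighbors. */
        void compute(int superstep, int lastSuperstep, int numVertices,
                     Map<Integer, List<Double>> outbox) {
            if (superstep > 0) {
                double sum = 0.0;
                for (double contribution : inbox) {
                    sum += contribution;
                }
                value = (1.0 - DAMPING) / numVertices + DAMPING * sum;
            }
            // No messages are sent in the last superstep; nobody would read them.
            if (superstep < lastSuperstep && !outEdges.isEmpty()) {
                double share = value / outEdges.size();
                for (int target : outEdges) {
                    outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(share);
                }
            }
        }
    }

    /** Assumes every vertex id is a key in graph and has at least one out-edge. */
    static Map<Integer, Double> run(Map<Integer, List<Integer>> graph, int iterations) {
        int n = graph.size();
        Map<Integer, ToyVertex> vertices = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : graph.entrySet()) {
            vertices.put(e.getKey(), new ToyVertex(e.getKey(), e.getValue(), 1.0 / n));
        }
        // Superstep 0 only sends initial ranks; supersteps 1..iterations update them.
        for (int step = 0; step <= iterations; step++) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (ToyVertex v : vertices.values()) {
                v.compute(step, iterations, n, outbox);
            }
            // Barrier between supersteps: messages sent now are read in the next one.
            for (ToyVertex v : vertices.values()) {
                v.inbox = outbox.getOrDefault(v.id, new ArrayList<>());
            }
        }
        Map<Integer, Double> ranks = new HashMap<>();
        for (ToyVertex v : vertices.values()) {
            ranks.put(v.id, v.value);
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Small example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
        Map<Integer, List<Integer>> graph = new HashMap<>();
        graph.put(0, Arrays.asList(1, 2));
        graph.put(1, Arrays.asList(2));
        graph.put(2, Arrays.asList(0));
        System.out.println(run(graph, 30));
    }
}
```

Note that the loop in `run` plays the role the framework would play: it delivers messages and enforces the barrier between iterations, so the only algorithm-specific code is the body of `compute`. In a MapReduce formulation, each pass of that loop would be a separate job with the graph written out and read back in between.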
Yahoo! donated the Giraph code to the Apache Incubator nearly six months ago. Since that time, the nascent community has resolved nearly 100 issues. These issues have focused on improving memory usage, providing a better out-of-the-box experience for new users, and clarifying Giraph's evolving API.
During this time, the community has grown significantly as well, adding four new committers from around the world and seeing contributions from many more. The industry is well represented with contributions from Twitter, Facebook and LinkedIn, and we've also seen significant contributions from academia, with new committers from VU University Amsterdam, Technische Universität Berlin, and Korea University.
If you're interested in large-scale graph computation but have not yet contributed, we provide 'newbie' issues to help you get started with the mechanics of creating patches in Apache. Here at LinkedIn, we've devoted significant resources to making Giraph easier to use and able to handle ever-larger graphs. We're always interested in technology that can help us derive more value from our social graph for our users.
Giraph is still very much a work in progress, and there is much to do to make it production-ready at web scale. The Giraph team is looking to remove the complexity involved in supporting multiple Hadoop versions, to improve the scalability of the current RPC system, and to take advantage of the new YARN platform that was released as part of Hadoop 0.23. Additionally, we're hoping to provide more examples of graph algorithms implemented via Giraph to demonstrate its wide-ranging usefulness.
As a committer and PPMC member on Giraph, I'm looking forward to continuing to help this new open-source community grow. I encourage you to get involved: take Giraph for a spin and start contributing!