Apache Samza: LinkedIn's Real-time Stream Processing Framework

September 16, 2013

We're excited to announce that we've open sourced Samza, LinkedIn's stream processing framework. It is now an incubator project with the Apache Software Foundation. Samza helps you build applications that process feeds of messages—update databases, compute counts and other aggregations, transform messages, and a lot more.

A lot of work has led up to this. Several years ago we open sourced Apache Kafka. We used Kafka to transition to a real-time architecture for our data flows. LinkedIn now has real-time multi-subscriber Kafka feeds for all activity data, operational metrics, service call traces, log data, as well as application messages—over 26 billion unique messages per day are written to hundreds of message feeds, and read by thousands of subscribers. These subscribers are a mixture of real-time services as well as our Hadoop ETL Pipeline. This allows us to make each data feed available for any subscribers that need it as well as mirror it into HDFS for offline processing.

From the beginning, Kafka was designed to support not just the transport of data, but also to provide the infrastructure primitives that would enable real-time data processing. Samza provides elastic, fault-tolerant processing on top of real-time feeds. A simple analogy to the batch domain (for those familiar with Hadoop) is that Kafka plays a role similar to HDFS and Samza a role similar to MapReduce.

The kind of processing that Samza enables is often called stream processing. This domain poses a number of interesting technical challenges. The expected time to get output from a stream process is usually much lower than batch processing, frequently in the sub-second range. Processors often accumulate a significant amount of state if they are aggregating events (counting events by member ID, for example), or trying to buffer or join streams over a window of time. Managing these demands in the face of machine failures is hard. Doing all of this at scale in a partitioned, distributed setting is even harder. Samza is a light weight framework that makes it easier for our engineers to process our real-time data feeds without having to worry about so many of these problems.

Samza builds on top of Hadoop's YARN framework, which is also the foundation of MapReduce in the most recent Hadoop releases. YARN handles the distributed execution of Samza processing across a shared cluster of machines, helping to manage CPU and memory usage across a multi-tenant cluster of machines. Our thanks goes out to the YARN team for building such a fantastic piece of infrastructure!

We chose to release Samza as an Apache project based on a good experience with the community involvement in Kafka and Helix, LinkedIn's previous donations to the ASF. We also think this helps signal our own interest in building a strong, broad-based development community for these projects beyond the LinkedIn committers. Patches accepted!

To get started learning about Samza, we have created some documentation that covers the basic concepts and motivation. The code is available here. We've also created a simple quick start that shows how to process the live wikipedia edit stream to help people get up and running.

One of our main motivations in pursuing open source is to get feedback from the rest of the world. So we would love to get feedback or comments on our project mailing list.