Apache Samza: LinkedIn’s Stream Processing engine

Navina Ramesh

January 8, 2015

This post originally appeared as a contributed piece on The New Stack.

LinkedIn began processing “big data” on Apache Hadoop six years ago. As time passed, we recognized that some of our use cases couldn’t be implemented in Hadoop due to the large turn-around time that batch processing needed. We wanted our results to be calculated incrementally and available immediately at any time.

Around the same time, LinkedIn developed Apache Kafka, which is a low-latency distributed messaging system. Unlike Hadoop, which is optimized for throughput, Kafka is optimized for low-latency messaging. We built a processing system on top of Kafka to allow us to react to the messages: to join, filter, and count them. The new processing system, Apache Samza, solved our batch processing latency problem and has allowed us to process data in near real-time.

The Origin of Stream Processing

A few decades ago, there weren’t many internet-scale applications. With the emergence of the Web, N-Tier architectures became a common solution to increasing scale: The “presentation tier” (websites, desktop applications) processed only mandatory requests before transmitting the rest to a high-throughput queue referred to as a “middle tier.” Asynchronous (typically stateless) backend processes would then act on this “stream of events” and update relational databases in a backend tier.

Numerous application servers emerged to host the middle tier logic. The throughput of these early solutions was sufficient for the needs of the time. But as the scale of web applications grew exponentially, monolithic relational databases started giving way to scalable, partitioned NoSQL databases and HDFS. Querying using Hive/Pig over petabytes of data in Hadoop replaced individual queries to relational databases. The era of big data was here.

Not surprisingly, middle tier products also evolved to meet the needs of big-data ingestion. Apache Kafka and other distributed messaging products started supporting millions of messages/sec by spreading queue partitions over clusters of machines to scale message throughput linearly.

The increased scale and partitioned consumption model that all of these systems necessitated created the need for a framework to easily process “streams of events” at scale.

Introducing Apache Samza

At LinkedIn, we created Apache Samza to solve various kinds of stream processing requirements in the company. The framework, originally open sourced by LinkedIn, helps you build applications to process feeds of messages. Samza has been an Apache Incubator project since September 2013.

Samza's goal is to provide a lightweight framework for continuous data processing. Unlike batch processing systems such as Hadoop, which typically have high-latency responses (sometimes hours), Samza continuously computes results as data arrives, which makes sub-second response times possible.

Samza can help you update databases, compute counts or other aggregations, transform messages, or perform any number of other operations. It's been in production at LinkedIn for several years and currently runs on hundreds of machines across multiple data centers. Our largest Samza job processes over 1,000,000 messages per second during peak traffic hours.

Architecture & Concepts

Streams and Jobs

Streams and Jobs are the building blocks of a Samza application.

A stream is composed of immutable sequences of messages of a similar type or category. In order to scale the system to handle large-scale data, we break down each stream into partitions. Within each partition, the sequence of messages is totally ordered and each message’s position is uniquely identified by its offset. At LinkedIn, streams are provided by Apache Kafka.
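The partition and offset model described above can be illustrated with a toy in-memory stream. This is only a sketch, not Kafka's or Samza's API: the `PartitionedStream` class and its CRC32-based partitioner are hypothetical names chosen for the example.

```python
import zlib


class PartitionedStream:
    """Toy in-memory stream; each partition is an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, message):
        # Hypothetical partitioner: a stable hash of the key picks the partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        # The offset is the message's position within its partition.
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        # A (partition, offset) pair uniquely identifies one message.
        return self.partitions[partition][offset]


stream = PartitionedStream(num_partitions=4)
first = stream.append("member-42", "page_view")
second = stream.append("member-42", "click")
```

Because both messages share a key, they land in the same partition with consecutive offsets, so their relative order is preserved. Messages with different keys may land in different partitions, where no relative order is defined.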

A job is the code that consumes and processes a set of input streams. In order to scale the throughput of the stream processor, jobs are broken into smaller units of execution called tasks. Each task consumes data from one or more partitions for each of the job’s input streams. Because no ordering is defined for messages across partitions, tasks can operate independently of one another.
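One way to picture the mapping from partitions to tasks is the sketch below. It assumes a simple grouping in which task N reads partition N of every input stream; the function name `assign_partitions` and the stream names are hypothetical, and Samza's actual grouping strategy may differ.

```python
def assign_partitions(input_streams):
    """Map task id -> list of (stream, partition) pairs.

    input_streams: dict of stream name -> partition count.
    Assumption for this sketch: task N consumes partition N of each stream.
    """
    num_tasks = max(input_streams.values())
    tasks = {task_id: [] for task_id in range(num_tasks)}
    for stream, partition_count in sorted(input_streams.items()):
        for partition in range(partition_count):
            tasks[partition].append((stream, partition))
    return tasks


assignment = assign_partitions({"page_views": 3, "clicks": 2})
# Task 0 reads partition 0 of both streams; task 2 reads only page_views/2.
```

No partition is shared between tasks, which is what lets each task run without coordinating with the others.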

Samza assigns groups of tasks to be executed inside one or more containers - UNIX processes running a JVM that execute a set of Samza tasks for a single job. Samza’s container code is single threaded (when one task is processing a message, no other task in the container is active), and is responsible for managing the startup, execution, and shutdown of one or more tasks.
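The relationship between tasks and containers can be sketched as follows. The round-robin grouping and the names `group_tasks` and `run_container` are assumptions made for illustration, not Samza's implementation; the point is that a container drives its tasks from a single loop, one message at a time.

```python
def group_tasks(task_ids, num_containers):
    """Spread tasks over containers round-robin (an assumed policy)."""
    containers = [[] for _ in range(num_containers)]
    for index, task_id in enumerate(task_ids):
        containers[index % num_containers].append(task_id)
    return containers


def run_container(tasks, messages):
    """Single-threaded loop: only one task handles a message at any moment.

    tasks: dict of task id -> callable taking one message.
    messages: iterable of (task id, message) pairs.
    """
    for task_id, message in messages:
        tasks[task_id](message)


containers = group_tasks([0, 1, 2, 3, 4], num_containers=2)

log = []
# Build one handler per task in the first container; each records what it saw.
tasks = {t: (lambda msg, t=t: log.append((t, msg))) for t in containers[0]}
run_container(tasks, [(0, "a"), (2, "b"), (0, "c")])
```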

For more detailed explanation of Samza concepts, please refer here.

Architecture

Samza’s architecture is composed of three components:

A streaming layer - responsible for providing partitioned streams that are replicated and durable
An execution layer - responsible for scheduling and coordinating tasks across the machines
A processing layer - responsible for processing the input stream and applying transformations

The actual implementation of the streaming layer and execution layer is pluggable. The streaming layer can be provided by any of the existing implementations: Kafka (topics), Hadoop (a directory of files in HDFS), or a database (table). Similarly, systems like Apache YARN and Apache Mesos can be plugged in as the job execution system. Samza has built-in support for Apache YARN and Apache Kafka.

For details on the integration of these components in Samza, refer to our architecture docs.

Fault Tolerance & Isolation

Samza provides fault tolerance by restarting containers that fail (potentially on another machine) and resuming processing of the stream. Samza resumes from the same offset by using “checkpoints”. The Samza container periodically checkpoints the current offset for each input stream partition that a task consumes. When the container starts up again after a failure, it looks for the most recent checkpoint and starts consuming messages from the checkpointed offsets. This guarantees at-least-once processing of messages.
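The checkpoint-and-resume behavior, and why it yields at-least-once rather than exactly-once processing, can be seen in a small simulation. This is a sketch under stated assumptions: `run_task`, the dict used as a checkpoint store, and the checkpoint interval are all hypothetical stand-ins, not Samza's checkpointing API.

```python
def run_task(partition, checkpoints, output, checkpoint_every=3, fail_after=None):
    """Process one partition, checkpointing the next offset to read.

    checkpoints: dict that survives restarts (stand-in for a durable store).
    output: list that records every processed message, duplicates included.
    fail_after: simulate a crash after this many messages in the current run.
    """
    offset = checkpoints.get("offset", 0)  # resume from the last checkpoint
    handled = 0
    while offset < len(partition):
        if fail_after is not None and handled == fail_after:
            raise RuntimeError("simulated container failure")
        output.append(partition[offset])
        offset += 1
        handled += 1
        if handled % checkpoint_every == 0:
            checkpoints["offset"] = offset  # periodic, not per message


partition = [f"m{i}" for i in range(7)]
checkpoints, output = {}, []

try:
    run_task(partition, checkpoints, output, fail_after=5)  # crashes after m4
except RuntimeError:
    pass

run_task(partition, checkpoints, output)  # restart resumes at offset 3
```

The first run processes m0 through m4 but only checkpoints after m2, so the restarted run replays m3 and m4. Every message is processed at least once and nothing is lost, but a job that must not double-count needs to tolerate such replays.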

One of the advantages of being tightly integrated with Kafka is that a failure in a downstream job does not cause back-pressure on upstream jobs. Normally, when a job fails, the job producing input for the failed job must decide what to do: it has messages that need to be sent, but the downstream job is not available to process them. It can drop the messages, block until downstream processing resumes, or store them locally until the job comes back. Samza avoids this problem by sending all messages between jobs to Kafka. This allows upstream jobs to continue processing at full speed without worrying about losing their output, even when the jobs that consume that output are down. Messages are stored on Kafka brokers even when a job is unavailable.
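A minimal sketch of this decoupling, assuming a toy append-only log in place of a Kafka topic (the `DurableLog` class and the event names are hypothetical): the upstream job only ever appends, and the downstream job tracks its own read position, so a downstream outage simply leaves a backlog to catch up on.

```python
class DurableLog:
    """Toy stand-in for a Kafka topic: an append-only list that outlives readers."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the appended message

    def read_from(self, offset):
        return self.messages[offset:]


log = DurableLog()
consumer_offset = 0

# Both jobs are healthy: downstream keeps up with upstream.
for i in range(2):
    log.append(f"event-{i}")
consumer_offset += len(log.read_from(consumer_offset))

# Downstream is down: upstream keeps appending without blocking or dropping.
for i in range(2, 5):
    log.append(f"event-{i}")

# Downstream restarts and reads everything it missed.
backlog = log.read_from(consumer_offset)
```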

By using independent tasks which process different partitions of the input streams and isolating the task execution by means of containers, Samza achieves process isolation and fault tolerance. Since each container is a separate UNIX process, the execution framework (that Samza integrates with) can easily migrate a process from one machine to another, in case any of the containers starts hogging the resources of a machine. Process isolation also means that when one job fails, it does not impact other jobs in the cluster. There are some caveats to this, which are documented here.

Stateful Stream Processing

One of the unique features that sets Samza apart from other stream processing systems is its built-in support for stateful stream processing. Some stream processing tasks are stateless and operate on one record at a time, but other uses such as counts, aggregation or joins over a window in the stream require state to be buffered in the system.

We can use a remote data-store to maintain state information. However, a remote-store based model does not scale. The message-processing rate of a stream task is much higher than the rate at which a DB processes requests. In addition, if a task fails, we cannot rollback mutations to the DB. This means that a task cannot recover with the correct state.

Samza solves this problem by bringing the data closer to the stream processor. Each Samza task has its own data-store that is co-located on the same machine as the task. This improves the read/write performance by many orders of magnitude (compared to a remote datastore). All writes to the local data-store are replicated to a durable changelog stream (typically, Kafka). When a machine fails, the task can consume the changelog stream to restore the contents of the local data-store to a consistent state.
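The local-store-plus-changelog idea can be sketched in a few lines. This is an illustration under assumptions, not Samza's key-value store API: `ChangelogBackedStore` is a hypothetical class, a dict stands in for the local store, and a plain list stands in for the durable changelog stream.

```python
class ChangelogBackedStore:
    """Local key-value store whose writes are mirrored to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stand-in for a durable stream
        self.local = {}             # stand-in for the on-machine store

    def put(self, key, value):
        self.local[key] = value
        self.changelog.append((key, value))  # replicate every write

    def get(self, key, default=None):
        return self.local.get(key, default)  # reads never leave the machine

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state on a new machine by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store.local[key] = value  # later entries overwrite earlier ones
        return store


changelog = []
store = ChangelogBackedStore(changelog)
for member in ["alice", "bob", "alice"]:
    store.put(member, store.get(member, 0) + 1)  # count page views per member

# The machine fails and the local store is lost; replay the changelog elsewhere.
recovered = ChangelogBackedStore.restore(changelog)
```

Replaying the changelog in order reproduces the last written value for every key, so the recovered task sees the same counts the failed one had.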

By allowing stateful tasks, Samza opens up the possibilities for sophisticated stream processing jobs, such as joining input streams and aggregating groups of messages.
Samza allows different storage engines to be plugged in for stateful processing and currently supports RocksDB out of the box.

Case Study: Call Graph Assembly at LinkedIn

LinkedIn uses a distributed service-oriented architecture for assembling pages as quickly as possible while still being resilient to failures. Each bit of content is provided by a separate service, each of which will often call upon other subsequent services in order to finish its work. Building a service call graph helps provide good insights into the site’s performance and Samza allows us to do this in real-time!
Consider a front-end service like Homepage that assembles content from multiple downstream services like People You May Know (PYMK), Pulse news, updates, relevant ads, etc. The initial request creates dozens of parallel REST requests to other services. We associate a GUID called treeID with the initial request. Every time one of these REST calls is made, the server sends a log of the request, along with the treeID.

A call graph would then look something like this:

We have built a Samza workflow around this: the Call Graph Assembly (CGA) pipeline. It consists of two Samza jobs: the first repartitions the events coming from the sundry service call Kafka topics, creating a new key from their treeIDs, while the second job assembles those repartitioned events into trees corresponding to the original calls from the front-end request. This two-stage approach looks quite similar to the classic MapReduce approach where mappers direct records to the correct reducer and those reducers then aggregate them together in some fashion.
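The two-stage shape of the pipeline can be sketched as below. This is not the CGA code: the event fields `caller` and `callee`, the function names, and the CRC32 partitioner are assumptions made for the example. Only the treeID key and the repartition-then-assemble structure come from the description above.

```python
import zlib
from collections import defaultdict


def repartition(events, num_partitions):
    """Job 1: re-key events by treeID so one tree's events share a partition."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        index = zlib.crc32(event["treeID"].encode("utf-8")) % num_partitions
        partitions[index].append(event)
    return partitions


def assemble(partition):
    """Job 2: group one partition's events into trees (caller -> callees)."""
    trees = defaultdict(lambda: defaultdict(list))
    for event in partition:
        trees[event["treeID"]][event["caller"]].append(event["callee"])
    return {tree_id: dict(edges) for tree_id, edges in trees.items()}


events = [
    {"treeID": "t1", "caller": "homepage", "callee": "pymk"},
    {"treeID": "t2", "caller": "homepage", "callee": "pulse"},
    {"treeID": "t1", "caller": "homepage", "callee": "ads"},
    {"treeID": "t1", "caller": "pymk", "callee": "profile"},
]

call_graphs = {}
for partition in repartition(events, num_partitions=4):
    call_graphs.update(assemble(partition))  # each treeID lives in one partition
```

Because the first stage routes every event for a given treeID to the same partition, the second stage can build each tree from purely local data, which mirrors the mapper and reducer roles mentioned above.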

For more information on CGA, refer here.

Samza is also used for other use cases at LinkedIn such as site speed monitoring, data standardization, metrics/monitoring, etc. For more information on these use cases, refer here.

Why Choose Samza?

The stream processing space is very active right now. After Hadoop became popular, developers began to realize that batch processing has limitations. There are many cases where the turnaround time of a batch job is too long. Projects began to spring up to address low-latency asynchronous processing, and Samza is one such project.

Samza's most distinctive feature is its approach to managing processor state. Unlike other stream processing systems that read state remotely from a database, which can cause throughput problems and break isolation, Samza stores the state for each task locally on disk on the same machine that the Samza container is running on. This drastically improves read performance, which helps make stream processing with state much easier. If you are doing stateful processing, Samza is likely a good fit for your use case.

In addition to stateful processing, Samza is tightly integrated with Apache Kafka. Most stream processing systems have a very light-weight concept of a stream, and use transport layers such as ZeroMQ, Netty, or raw TCP connections. They don’t require streams to be ordered, durable, replicated, etc. Instead, these features are bolted on top of the underlying protocol. This typically complicates the stream processing framework, and makes the architecture fragile in the presence of back-pressure, which is common when you run such jobs in production. Samza has a much stronger model for streams: they must be ordered, highly available, partitioned, and durable. These strong requirements fit perfectly with Kafka’s stream model, and allow Samza to push a lot of hard problems into the underlying stream layer. This heavy reliance on Kafka, and Samza’s conscious effort to integrate with Kafka’s full feature set, makes Samza a great fit if your organization is already running Kafka.

What’s Next?

Samza has already had two Apache Incubator releases. We plan to graduate from Apache Incubator as a full-fledged project.
We're beginning to improve Samza's usability and performance. This work will include updating Samza's configuration API, automatically creating streams when required, and improving Samza's write-throughput.
The complete list of upcoming 0.9.0 features is available here.
If you’re interested in learning more about Samza, try Hello Samza, join the mailing lists, and close a few bugs!