Samza 1.0: Stream Processing at Massive Scale
November 27, 2018
We are pleased to announce today the release of Samza 1.0, a significant milestone in the history of the project. Apache Samza is a distributed stream processing framework that we developed at LinkedIn in 2013. Samza became a top-level Apache project in 2014. Fast-forward to 2018, and we currently have over 3,000 applications in production leveraging Samza at LinkedIn. Use cases include detecting anomalies, combating fraud, monitoring performance, notifications, real-time analytics, and many more. Today, Samza integrates not only with Apache Kafka, but also with many other systems, including Azure EventHubs, Amazon Kinesis, HDFS, ElasticSearch, and Brooklin. Multiple companies like Slack, TripAdvisor, eBay, and Optimizely have adopted Samza.
We view Samza 1.0 as a step towards our vision of making stream processing universally accessible. In this post, we describe our journey in building and scaling a distributed stream processing system. We also present the key features in Samza 1.0: a rich high-level API, event-time-based processing, integration with Apache Beam, Samza SQL, a standalone mode to run Samza without YARN, and a new test framework for Samza applications.
Starting with a solid foundation
Like Apache Kafka, Samza has its roots at LinkedIn. Back in 2012, we standardized on Kafka as the transport mechanism for all tracking data. With several terabytes of data generated daily into Kafka, our applications needed to obtain insights from it. These applications dealt with a common set of problems as they consumed messages from Kafka, such as checkpointing, management of local state, handling failures, scaling-out processing, etc. Apache Samza was built to tackle these fundamental problems in stream processing.
Samza started out as a distributed, streaming-first execution engine with a simple API. We scaled applications by breaking them into multiple units of parallelism called “tasks.” Each task processed a subset of our input partitions and had its own local storage. We made local state fault-tolerant by replicating each write into a Kafka-based changelog. This allowed us to recover the data from the changelog in the event of failures.
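To make the changelog idea concrete, here is a minimal, self-contained sketch (plain Java, not Samza's actual API): every write to a task's local store is also appended to a changelog, and recovery simply replays that changelog into a fresh store. In Samza the changelog is a compacted Kafka topic; the in-memory list here just stands in for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of changelog-backed local state: each write is
// replicated into a changelog (a Kafka topic in Samza), so a failed
// task can rebuild its store by replaying the changelog.
public class ChangelogStore {
    // One changelog entry: the key-value pair written.
    record Entry(String key, String value) {}

    private final Map<String, String> store = new HashMap<>();
    private final List<Entry> changelog;   // stands in for a Kafka topic

    public ChangelogStore(List<Entry> changelog) {
        this.changelog = changelog;
    }

    // Write locally and replicate the write into the changelog.
    public void put(String key, String value) {
        store.put(key, value);
        changelog.add(new Entry(key, value));
    }

    public String get(String key) {
        return store.get(key);
    }

    // Recovery after a failure: replay the changelog into an empty store.
    public static ChangelogStore restore(List<Entry> changelog) {
        ChangelogStore fresh = new ChangelogStore(new ArrayList<>(changelog));
        for (Entry e : changelog) {
            fresh.store.put(e.key(), e.value());
        }
        return fresh;
    }
}
```

Because Kafka retains the changelog independently of the task's host, the store survives machine failures; compaction keeps the replay bounded by the number of live keys rather than the number of writes.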
We later built features like incremental checkpoints and host affinity to make large-scale stateful processing viable. This enabled us to push the envelope in stream processing, with a peak performance of 1.2M messages/sec on a single machine. Building on this baseline, we began making usability improvements to Samza in 2017, an effort that culminated in version 1.0. A highly stable, performant core gave us a solid foundation for that work.
Layering a rich API on top
While stability has been one of Samza’s core strengths, we began to realize that our programming API was fairly low-level. Samza offered a simple callback-based API, allowing operations to be specified at the level of an individual message. Developers had to implement complex operations such as windows and joins by themselves on top of this API. In addition, multiple Samza jobs needed to be wired together using Kafka topics, with the output of one job feeding into another. This made building applications time consuming and error-prone.
To address this in Samza 1.0, we built a high-level API with built-in operators like map, filter, join, window, etc. This allows you to express complex data pipelines easily by combining multiple operators.
The high-level API works seamlessly for both YARN and stand-alone deployments, thereby enabling you to write code once and run it in any environment. Under the hood, it uses the same execution engine as the low-level API, inheriting all its performance benefits.
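The shape of such a pipeline can be sketched in plain Java (this is illustrative, not Samza's actual API): a filter, a map, and a grouped aggregation chained together, where the low-level API would have required hand-written per-message callbacks and manually maintained state.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual sketch of the filter -> map -> aggregate shape that the
// high-level operators express. Input records look like "member:action".
public class PipelineSketch {
    public static Map<String, Long> clicksPerMember(List<String> events) {
        return events.stream()
            .filter(e -> e.endsWith(":click"))          // filter operator
            .map(e -> e.substring(0, e.indexOf(':')))   // map operator
            .collect(Collectors.groupingBy(e -> e,      // aggregate per key
                                           Collectors.counting()));
    }
}
```

In Samza the same chain is declared once and compiled into a DAG of operators that the engine partitions and executes across tasks, so the composition style carries over to distributed execution.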
Ensuring portability with Apache Beam
Even though we added a native, high-level API in Samza, our application logic was not portable across execution engines. Additionally, stream processing was still confined to JVM-based languages, since Samza is written entirely in Java and Scala.
Apache Beam is an open source project that provides a unified API allowing pipelines to be ported across execution engines, including Samza, Spark, or Flink. It also allows for data processing in other languages, including Python, that are heavily used in the data science community.
As a part of Samza 1.0, we are happy to announce that applications written using the Beam API can be executed on Samza. Our goal here is to marry the power and portability of Apache Beam with the scale of Samza’s engine. With this integration, we are actively working to support Samza jobs in other languages, like Python. Furthermore, the integration allows Samza to use the ecosystem of connectors developed by the Beam community.
Supporting advanced event-time based windowing
Integrating with Apache Beam offered us another key benefit: it enabled Samza to generate accurate results independent of when the input data arrives. This is a hard problem for streaming engines because they often have to deal with incomplete data. Input data could arrive late or out-of-order for various reasons, such as delays in upstream pipelines (e.g., clients on a spotty network). With the Samza Beam integration, Samza supports event-time-based processing, including sliding, tumbling, and session windows on your streams. You can also specify sophisticated triggering criteria for when results should be emitted.
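The core idea behind event-time windowing can be shown in a few lines of self-contained Java (a conceptual sketch, not Samza's or Beam's API): each event is assigned to a window by its event timestamp, not its arrival time, so a late or out-of-order event still lands in the correct window.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of event-time tumbling windows: bucketing is by the
// event's own timestamp, so arrival order does not affect the result.
public class TumblingWindows {
    private final long sizeMs;
    // window start time (ms) -> count of events in that window
    private final Map<Long, Long> counts = new HashMap<>();

    public TumblingWindows(long sizeMs) {
        this.sizeMs = sizeMs;
    }

    public void add(long eventTimeMs) {
        long windowStart = (eventTimeMs / sizeMs) * sizeMs;
        counts.merge(windowStart, 1L, Long::sum);
    }

    public long countFor(long windowStart) {
        return counts.getOrDefault(windowStart, 0L);
    }
}
```

The hard parts that the Beam model adds on top of this, watermarks to decide when a window is complete and triggers to control when (and how often) its result is emitted, are what make the results accurate even when data arrives arbitrarily late.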
Enabling stream processing with SQL
Just as Hive made batch processing far more accessible than writing Map-Reduce jobs, we believe Samza SQL will bring the benefits of declarative data processing to the streaming world.
Released as a part of Samza 1.0, Samza SQL empowers engineers to create streaming pipelines without a line of Java code. Engineers can now declaratively specify what they need and not worry about details like capacity provisioning, resource management, or operability. For instance, here’s the definition of a pipeline that reads from a Kafka topic of member profiles, filters those who are product managers at LinkedIn, and writes the results to a new topic.
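A sketch of what such a Samza SQL statement looks like (the topic and field names here are illustrative assumptions, not LinkedIn's actual schema):

```sql
-- Read member profiles from Kafka, keep product managers at LinkedIn,
-- and write the matches to a new Kafka topic.
INSERT INTO kafka.ProductManagers
SELECT memberId, profileName
FROM kafka.MemberProfiles
WHERE title = 'Product Manager' AND company = 'LinkedIn'
```

The statement is the entire pipeline: partitioning, checkpointing, and deployment are handled by the engine underneath.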
Samza SQL also supports implementing custom user logic by specifying user-defined functions (UDFs) in Java. Our support for SQL leverages Apache Calcite for its implementation and builds on the foundations offered by Samza’s core engine.
Joining streams with tables made easy
Event-driven applications typically need access to additional data (in databases or in other REST-based services) to process their events. For example, consider a streaming pipeline that ranks notifications to be sent to LinkedIn members. Sending a notification requires access to the member’s profile—which devices they have the LinkedIn app installed on, their notification settings, etc. The “Samza Table API” simplifies scenarios like these where data in a stream needs to be joined with additional data from other sources. It provides common features like throttling and caching when accessing datasets.
The Table API also allows for composition—i.e., you can build composite tables by combining individual ones. For example, if you already have a table backed by a remote web service, you can add Couchbase as a cache in front of it. At LinkedIn, we have added integrations with Couchbase, Espresso, and Rest.li. We are excited about the endless possibilities this unlocks in simplifying access to your data.
Features provided by the Table API include:
Throttling: Streaming systems can usually ingest and process messages at a high rate. Making remote calls to external services at the same rate when accessing datasets could bring them down. For this reason, Samza Tables enforce quotas on the client-side, allowing you to specify read and write limits for your services.
Caching: Samza Tables can also provide caching to further lower access latencies. In-memory and disk-backed options are currently offered for caching data locally. Alternatively, you can use a remote cache if your storage requirements are more than a single disk.
Async IO: When accessing remote tables or datasets exposed through web services, we can often issue non-blocking requests and improve overall throughput. Samza Tables natively support async interactions when accessing remote sources.
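The throttling behavior described above can be sketched with a simple token bucket (a conceptual illustration, not the Table API's actual rate limiter): reads are allowed only while tokens remain, and tokens refill over time up to the configured rate.

```java
// Conceptual sketch of client-side throttling: a token bucket that caps
// requests per second before they reach a remote service.
public class TokenBucket {
    private final long capacity;
    private double tokens;
    private final double refillPerMs;
    private long lastRefillMs;

    public TokenBucket(long permitsPerSecond, long nowMs) {
        this.capacity = permitsPerSecond;
        this.tokens = permitsPerSecond;
        this.refillPerMs = permitsPerSecond / 1000.0;
        this.lastRefillMs = nowMs;
    }

    // Returns true if a request may proceed at time nowMs.
    public boolean tryAcquire(long nowMs) {
        tokens = Math.min(capacity, tokens + (nowMs - lastRefillMs) * refillPerMs);
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

With separate buckets for reads and writes, a Samza job can consume from Kafka at full speed while its calls to a downstream service stay within that service's provisioned capacity.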
Samza standalone: Bring your own Cluster Manager
Prior to Samza 1.0, Samza required YARN for resource management and distributed execution of applications. This worked well when running stream processing as a managed service on a YARN cluster. But as Samza gained momentum, our users desired the flexibility to run stream processing in any environment: Kubernetes, Mesos, or the cloud. Samza 1.0 addresses this by offering a standalone mode of running applications.
This mode allows Samza to be embedded as a lightweight library within an application and run on any resource manager of your choice. You can increase parallelism by simply spinning up more instances of your application. The individual instances will then coordinate among themselves using Zookeeper to distribute their tasks. When an instance fails, its tasks are assigned to the remaining instances that are live.
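The rebalancing behavior can be illustrated with a minimal sketch (plain Java; the real coordination happens through Zookeeper): tasks are spread round-robin over the live instances, so rerunning the assignment after an instance fails redistributes its tasks over the survivors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of standalone-mode task distribution: a deterministic
// round-robin assignment of task IDs over the currently live instances.
public class TaskAssigner {
    public static Map<String, List<Integer>> assign(List<String> liveInstances,
                                                    int numTasks) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        for (String instance : liveInstances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            String owner = liveInstances.get(task % liveInstances.size());
            assignment.get(owner).add(task);
        }
        return assignment;
    }
}
```

When Zookeeper detects that an instance's session has expired, the remaining instances recompute the assignment over the new membership list, which is exactly a fresh call with the shrunken instance set.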
The standalone mode does not yet support stateful stream processing like windowing and joins. We are actively working to address this by taking data locality into account when assigning tasks to hosts.
Improving testability of Samza applications
Testability is one of the key challenges when building any data processing framework. Samza users have typically tested their applications by spinning up a local Kafka cluster, producing a few messages to it, and verifying their output results by consuming from Kafka. This usually involved starting multiple components to set up the test environment. It also meant the tests took longer to run.
Starting with the 1.0 release, we are excited to announce a new framework for unit-testing Samza applications. This is a significant step towards improving developer productivity. The framework allows you to provide inputs to your application using in-memory collections and run your logic through them. You can also run assertions on the contents of these collections and inspect results.
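The testing pattern looks roughly like this (a conceptual sketch in plain Java; the method and class names are illustrative, not the real test framework's API): the job's logic runs over an in-memory input collection and the test asserts on the in-memory output, with no Kafka cluster involved.

```java
import java.util.List;
import java.util.stream.Collectors;

// Conceptual sketch of in-memory pipeline testing: the application logic
// consumes an in-memory collection instead of a Kafka topic.
public class InMemoryPipelineTest {
    // The logic under test: keep error events and extract their codes.
    static List<String> extractErrorCodes(List<String> lines) {
        return lines.stream()
            .filter(l -> l.startsWith("ERROR "))
            .map(l -> l.substring("ERROR ".length()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("INFO ok", "ERROR 500", "ERROR 503");
        List<String> output = extractErrorCodes(input);   // no Kafka needed
        assert output.equals(List.of("500", "503"));
    }
}
```

Because the input and output are ordinary collections, such tests run in milliseconds and can assert on exact contents rather than polling a broker.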
Samza is 1.0, but we are far from being done
With an ever-expanding list of use cases, we are at an exciting juncture in stream processing. While 1.0 is a significant milestone for the project, there is still a lot more to be done on improving our ease of use. It is a great time to be involved in the community. You can read up on Samza, check out our hello-samza tutorials, or even contribute some bug-fixes.
Here are some areas we are actively investing in:
Adding support for other languages, like Python
Hot-standby containers to support applications with strict downtime requirements
Making it easy to auto-scale and auto-tune Samza applications
Supporting machine learning related use cases
Enabling end-to-end exactly-once processing
Want to work on similar problems in large-scale distributed systems? The Streams Infrastructure team at LinkedIn is hiring engineers at all levels!
This release would not have been possible without the valuable contributions from the Samza team at LinkedIn. Thanks to Xinyu Liu, Wei Song, Prateek Maheshwari, Yi Pan, Aditya Toomula, Hai Lu, Srinivasulu Punuru, Ray Matharu, Shanthoosh Venkataraman, Pawas Chhokra, Chris Pettitt, Deepthi Sridharan, Bharath Kumarasubramanian, Sanil Jain, Boris Shkolnik, Cameron Lee, Daniel Nishimura, Dong Lin, Weiqing Yang, Vignesh Gurumoorthy, Shenoda Gurgios, Gaurav Kulkarni, Celia Kung, Shun-ping Chiu, Fred Ji, Daniel Chen, Eric Pan, and Ahmed Abdul Hamid for all the contributions. Our thanks to the Samza SREs: Abhishek Shivanna, Jon Bringhurst, and Stephan Soileau, for operating the system and ensuring it stays highly available. Lastly, thanks to Jacob Maes, Samarth Shetty, Miguel Sanchez, Clark Haskins, and Kartik Paramasivam for their continued support and guidance.