Open sourcing Brooklin: Near real-time data streaming at scale
July 16, 2019
Editor's note: This blog has been updated.
Brooklin—a distributed service for streaming data in near real-time and at scale—has been running in production at LinkedIn since 2016, powering thousands of data streams and over 2 trillion messages per day. Today, we are pleased to announce the open-sourcing of Brooklin and that the source code is available in our Github repo!
At LinkedIn, our data infrastructure has been constantly evolving to satisfy the rising demands for scalable, low-latency data processing pipelines. Challenging as it is, moving massive amounts of data reliably at high rates was not the only problem we had to tackle. Supporting a rapidly increasing variety of data storage and messaging systems has proven to be an equally critical aspect of any viable solution. We built Brooklin to address our growing needs for a system that is capable of scaling both in terms of data volume and systems variance.
What is Brooklin?
Brooklin is a distributed system intended for streaming data across multiple different data stores and messaging systems with high reliability at scale. It exposes a set of abstractions that make it possible to extend its capabilities to support consuming and producing data to and from new systems by writing new Brooklin consumers and producers. At LinkedIn, we use Brooklin as the primary solution for streaming data across various stores (e.g., Espresso and Oracle) and messaging systems (e.g., Kafka, Azure Event Hubs, and AWS Kinesis).
Brooklin supports streaming data from a variety of sources to a variety of destinations (messaging systems and data stores)
There are two major categories of use cases for Brooklin: streaming bridge and change data capture.
Data can be spread across different environments (public cloud and company data centers), geo-locations, or different deployment groups. Typically, each environment adds additional complexities due to differences in access mechanisms, serialization formats, compliance, or security requirements. Brooklin can be used as a bridge to stream data across such environments. For example, Brooklin can move data between different cloud services (e.g., AWS Kinesis and Microsoft Azure), between different clusters within a data center, or even across data centers.
A hypothetical example of a single Brooklin cluster being used as a streaming bridge to move data from AWS Kinesis into Kafka and data from Kafka into Azure Event Hubs.
Because Brooklin is a dedicated service for streaming data across various environments, all of the complexities can be managed within a single service, allowing application developers to focus on processing the data and not on data movement. Additionally, this centralized, managed, and extensible framework enables organizations to enforce policies and facilitate data governance. For example, Brooklin can be configured to enforce company-wide policies, such as requiring that any data flowing in must be in JSON format, or any data flowing out must be encrypted.
Prior to Brooklin, we were using Kafka MirrorMaker (KMM) to mirror Kafka data from one Kafka cluster to another, but we were experiencing scaling issues with it. Since Brooklin was designed as a generic bridge for streaming data, we were able to easily add support for moving enormous amounts of Kafka data. This allowed LinkedIn to move away from KMM and consolidate our Kafka mirroring solution into Brooklin.
One of the largest use cases for Brooklin as a streaming bridge at LinkedIn is to mirror Kafka data between clusters and across data centers. Kafka is used heavily at LinkedIn to store all types of data, such as logging, tracking, metrics, and much more. We use Brooklin to aggregate this data across our data centers to make it easy to access in a centralized place. We also use Brooklin to move large amounts of Kafka data between LinkedIn and Azure.
A hypothetical example of Brooklin being used to aggregate Kafka data across two data centers, making it easy to access the entire data set from within any data center. A single Brooklin cluster in each data center can handle multiple source/destination pairs.
Brooklin’s solution for mirroring Kafka data has been tested at scale, as it has fully replaced Kafka MirrorMaker at LinkedIn, mirroring trillions of messages every day. This solution has been optimized for stability and operability, which were our major pain points with Kafka MirrorMaker. By building this Kafka mirroring solution on top of Brooklin, we were able to benefit from some of its key capabilities, which we’ll discuss in more detail below.
In the Kafka MirrorMaker deployment model, each cluster could only be configured to mirror data between two Kafka clusters. As a result, KMM users typically need to operate tens or even hundreds of separate KMM clusters, one for each pipeline; this has proven to be extremely difficult to manage. However, since Brooklin is designed to handle several independent data pipelines concurrently, we are able to use a single Brooklin cluster to keep multiple Kafka clusters in sync, thus reducing the operability complexities of maintaining hundreds of KMM clusters.
A hypothetical example of Kafka MirrorMaker (KMM) being used to aggregate Kafka data across two data centers. In contrast with the Brooklin mirroring topology, more KMM clusters are needed (one for each source/destination pair).
Dynamic provisioning and management
With Brooklin, creating new data pipelines (also known as datastreams) and modifying existing ones can be easily accomplished with just an HTTP call to a REST endpoint. For Kafka mirroring use cases, this endpoint makes it very easy to create new mirroring pipelines or modify existing pipelines’ mirroring allowlists without needing to change and deploy static configurations.
Although the mirroring pipelines can all coexist within the same cluster, Brooklin exposes the ability to control and configure each individually. For instance, it is possible to edit a pipeline’s mirroring allowlist or add more resources to the pipeline without impacting any of the others. Additionally, Brooklin allows for on-demand pausing and resuming of individual pipelines, which is useful when temporarily operating on or modifying a pipeline. For the Kafka mirroring use case, Brooklin supports pausing or resuming the entire pipeline, a single topic within the allowlist, or even a single topic partition.
Brooklin also exposes a diagnostics REST endpoint that enables on-demand querying of a datastream’s status. This API makes it easy to query the internal state of a pipeline, including any individual topic partition lag or errors. Since the diagnostics endpoint consolidates all findings from the entire Brooklin cluster, this is extremely useful for quickly diagnosing issues with a particular partition without needing to scan through log files.
Since it was intended as a replacement for Kafka MirrorMaker, Brooklin’s Kafka mirroring solution was optimized for stability and operability. As such, we have introduced some improvements that are unique to Kafka mirroring.
Most importantly, we strived for better failure isolation, so that errors with mirroring a specific partition or topic would not affect the entire pipeline or cluster, as it did with KMM. Brooklin has the ability to detect errors at a partition level and automatically pause mirroring of such problematic partitions. These auto-paused partitions can be auto-resumed after a configurable amount of time, which eliminates the need for manual intervention and is especially useful for transient errors. Meanwhile, processing of other partitions and pipelines is unaffected.
For improved mirroring latency and throughput, Brooklin Kafka mirroring can also run in flushless-produce mode, where the Kafka consumption progress is tracked at the partition level. Checkpointing is done for each partition instead of at the pipeline level. This allows Brooklin to avoid making expensive Kafka producer flush calls, which are synchronous blocking calls that can often stall the entire pipeline for several minutes.
By migrating all of LinkedIn’s Kafka MirrorMaker deployments over to Brooklin, we were able to reduce the number of mirroring clusters from hundreds to about a dozen. Leveraging Brooklin for Kafka mirroring purposes also allows us to iterate much faster, as we are continuously adding features and improvements.
Change data capture (CDC)
The second major category of use cases for Brooklin is change data capture. The objective in these cases is to stream database updates in the form of a low-latency change stream. For example, most of LinkedIn’s source-of-truth data (such as jobs, connections, and profile information) resides in various databases. Several applications are interested in knowing when a new job is posted, a new professional connection is made, or a member’s profile is updated. Instead of having each of these interested applications make expensive queries to the online database to detect these changes, Brooklin can stream these database updates in near real-time. One of the biggest advantages of using Brooklin to produce change data capture events is better resource isolation between the applications and the online stores. Applications can scale independently from the database, which avoids the risk of bringing down the database. Using Brooklin, we built change data capture solutions for Oracle, Espresso, and MySQL at LinkedIn; moreover, Brooklin’s extensible model facilitates writing new connectors to add CDC support for any database source.
Change-data capture can be used to capture updates as they are made to the online data source and propagate them to numerous applications for nearline processing. An example use case is a notifications service/application to listen to any profile updates, so that it can display the notification to every relevant user.
At times, applications may need a complete snapshot of the data store before consuming the incremental updates. This could happen when the application starts for the very first time or when it needs to re-process the entire dataset because of a change in the processing logic. Brooklin’s extensible connector model can support such use cases.
Many databases have transaction support, and for these sources, Brooklin connectors can ensure transaction boundaries are maintained.
For more information about Brooklin, including an overview of its architecture and capabilities, please check out our previous engineering blog post.
In Brooklin’s first release, we are pleased to introduce the Kafka mirroring feature, which you can test drive with simple instructions and scripts we provided. We are working on adding support for more sources and destinations to the project—stay tuned!
Have any questions? Please reach out to us on Gitter!
Brooklin has been running successfully for LinkedIn workloads since October 2016. It has replaced Databus as our change-capture solution for Espresso and Oracle sources and is our streaming bridge solution for moving data amongst Azure, AWS, and LinkedIn, including mirroring trillions of messages a day across our many Kafka clusters.
We are continuing to build connectors to support additional data sources (MySQL, Cosmos DB, Azure SQL) and destinations (Azure Blob storage, Kinesis, Cosmos DB, Couchbase). We also plan to add optimizations to Brooklin, such as the ability to auto-scale based on traffic needs, the ability to skip decompression and re-compression of messages in mirroring scenarios to improve throughput, and additional read and write optimizations.
We want to thank Kartik Paramasivam, Swee Lim, and Igor Perisic for supporting the Brooklin engineering team throughout our journey. We also are grateful to the Kafka, Samza, Nuage, and Gobblin teams for being wonderful partners. Last but not least, a huge shout out to the members of the Brooklin engineering team, who put their hearts and souls into this product and shaped what it is today.