Kafka Ecosystem at LinkedIn

Joel Koshy

Software Engineer at Databricks

April 19, 2016

Apache Kafka is a highly scalable messaging system that plays a critical role as LinkedIn’s central data pipeline. Kafka was developed at LinkedIn back in 2010, and it currently handles more than 1.4 trillion messages per day across over 1400 brokers. Kafka’s strong durability and low latency have enabled us to use Kafka to power a number of newer mission-critical use cases at LinkedIn. These include replacing MySQL replication in Espresso with Kafka-based replication, Venice and supporting the next generation of Databus (which is under development).

As our Kafka usage continues to rapidly grow, we had to solve some significant problems to make all of this possible, so we developed an entire ecosystem around Kafka. In this post, I will summarize some of our solutions, which can be useful to others using adopted Kafka, and highlight several upcoming events where you can learn more.

The above figure does not fully capture the various data pipelines and topology at LinkedIn but serves to illustrate the key entities in LinkedIn’s Kafka deployment and how they interact together.

We run several clusters of Kafka brokers for different purposes in each data center. We have nearly 1400 brokers in our current deployments across LinkedIn that receive over two petabytes of data every week. We generally run off Apache Kafka trunk and cut a new internal release every quarter or so.

The mirror-maker enables us to make copies of clusters by consuming from a source cluster and producing into a target cluster. There are multiple mirroring pipelines that run both within data centers and across data centers. Todd Palino’s post summarizes how we use this to facilitate multi-colo for Kafka at LinkedIn.

We have standardized on Avro as the lingua franca within our data pipelines at LinkedIn. So each producer encodes Avro data, registers Avro schemas in the schema registry and embeds a schema-ID in each serialized message. Consumers fetch the schema corresponding to the ID from the schema registry service in order to deserialize the Avro messages. While there are multiple schema registry instances across our data centers, these are backed by a single (replicated) database that contains the schemas.

Kafka REST is a HTTP proxy that we provide for non-Java clients. Most of our Kafka clusters have an associated REST proxy. Kafka REST also serves as an official gateway for administrative operations on topics.

Kafka is self-service for the most part: users define their event schema and start producing to the topic. The Kafka broker automatically creates the topic with default configurations and partition counts. Finally, any consumer can consume the topic, making Kafka completely open-access.

As Kafka’s usage grows and new use-cases emerge, a number of limitations become apparent in the above approach. First, some topics require custom configurations that require special requests to Kafka SREs. Second, it is hard for most users to discover and examine metadata (e.g., byte-rate, audit completeness, schema history, etc.) about the topics that they own. Third, as Kafka incorporates various security features, certain topic owners may want to restrict access to their topics and manage ACLs for it on their own.

Nuage is the self-service portal for online data-infrastructure resources at LinkedIn, and we have recently worked with the Nuage team to add support for Kafka within Nuage. This offers a convenient place for users to manage their topics and associated metadata. Nuage delegates topic CRUD operations to Kafka REST which abstracts the nuances of Kafka’s administrative utilities.

The LiKafka producer wraps the open source producer, but also does schema registration, Avro encoding, auditing and supports large messages. The audit events count the number of events sent to each topic in 10 minute windows. Likewise, the consumer wraps the open source consumer and does schema lookups, Avro decoding and auditing.

The Kafka push job is used to ship enriched data from Hadoop into Kafka for consumption by online services. The push job runs on Hadoop clusters in our CORP environment and produces data into the CORP data-deployment Kafka cluster. A mirror maker copies this data into the PROD data-deployment cluster.

Gobblin is LinkedIn’s new ingestion framework and deprecates Camus, which was our previous Kafka to Hadoop bridge. It is basically a large Hadoop job that copies all the data in Kafka into Hadoop for offline processing.

This is a continuously running set of validation tests for Kafka deployments that we leverage to validate new releases as well as to monitor existing deployments. We currently monitor basic but critical metrics such as end-to-end latency and data loss. We envision that in the future we will use this framework in test clusters to also continuously test the correctness of administrative operations (such as partition reassignment) and even leverage a fault-injection framework such as Simoorg to ensure we are able to meet our availability SLAs even in the presence of various failures.

There are two key components within our audit trail infrastructure:

A Kafka auditor service that consumes and recounts all the data in a Kafka cluster and emits audit events with counts similar to the tracking producer. This allows us to reconcile counts on the Kafka cluster with the producer counts and detect data loss if any.
A Kafka audit verification service, which continuously monitors data completeness and provides a UI for visualizing the audit trail. This service consumes and inserts audit events into an audit DB and alerts when data is either delayed or lost. We use the audit DB to investigate alerts as they occur and precisely pin-point where the data was lost.

Burrow is an elegant answer to the tricky problem of monitoring Kafka consumer health and provides a comprehensive view of consumer status. It provides consumer lag checking as a service without the need for specifying thresholds. It monitors committed offsets for all consumers at topic-partition granularity and calculates the status of those consumers on demand.

Samza is LinkedIn’s stream processing platform that empowers users to get their stream processing jobs up and running in production as quickly as possible. The stream processing domain has been buzzing with activity and there are many open source systems which are doing similar things. Unlike other stream processing systems that focus on a very broad feature set, we concentrated on making Samza reliable, performant and operable at the scale of LinkedIn. Now that we have a lot of production workloads up and running, we can turn our attention to broadening the feature set. This earlier blog post goes into details of our production use-cases around relevance, analytics, site-monitoring, security, etc., as well as a number of newer features that we are working on.

If you are interested in learning more about our Kafka ecosystem, how we deploy and troubleshoot Kafka, and our newer features/use cases, we invite you to attend these upcoming talks:

April 26: Espresso database replication with Kafka @ Kafka summit: Espresso is LinkedIn's distributed document store which hosts some of our most important member data. Tom Quiggle will present why Espresso is switching from MySQL’s built-in replication mechanism to Kafka and how Espresso will leverage Kafka as the replication stream - a use case that puts Kafka’s durability and availability guarantees to the test!
April 26: More data centers, more problems @ Kafka summit: Todd Palino will talk about fundamental architectures for multi-datacenter and multi-tier Kafka clusters and give practical tips on how to monitor the entire ecosystem.
April 26: Kafkaesque days at LinkedIn in 2015 @ Kafka summit: Joel Koshy will deep-dive into some of the most difficult and prominent Kafka production issues that LinkedIn hit in 2015. This talk will go over each outage and its impact as well as approaches to detection, investigation and remediation.
May 10: Building a self-serve Kafka system @ Apache: Big Data: Joel Koshy will provide an in-depth look into what it takes to make Kafka a truly multi-tenant service by weaving together security, quotas, RESTful APIs and Nuage.
May 9: Secrets behind scaling stream processing applications @ Apache: Big Data: Navina Ramesh will describe Apache Samza’s approach for state-management and fault-tolerance and discuss how it can be effectively used to scale stateful stream processing applications.
June 28-30: Lambda-less Stream Processing @ Scale in LinkedIn @ Hadoop Summit: Yi Pan and Kartik Paramasivam will highlight Samza’s key strengths as a real-time stream processing platform with a hands-on overview of its usage at LinkedIn.

We hope to see you there!

Topics: Open Source Talent