Kafka at LinkedIn: Current and Future

January 29, 2015

The LinkedIn engineering team has developed and built Apache Kafka into a powerful open source solution for managing streams of information. We use Kafka as the messaging backbone that helps the company’s applications work together in a loosely coupled manner. LinkedIn relies heavily on the scalability and reliability of Kafka and a surrounding ecosystem of both open source and internal components. We are continuing to invest in Kafka to ensure that our messaging backbone stays healthy as we ask more and more from it.

Use Cases at LinkedIn

Today, some of the common scenarios at LinkedIn that leverage Kafka include:

1. Monitoring: All hosts at LinkedIn emit metrics about their system and application health through Kafka. These metrics are then collected and processed to create monitoring dashboards and alerts. A deeper read on this can be found here. In addition to standard metrics, richer call-graph analysis is performed by leveraging Apache Samza to process the events in real time.

2. Traditional messaging: Various applications at LinkedIn leverage Kafka as a traditional messaging system for standard queuing and pub-sub messaging. These applications range from Search to Content Feed to Relevance, and they publish processed data into online serving stores such as Voldemort.

3. Analytics: LinkedIn tracks data to better understand how our members use our products. Events such as which page was viewed and which content was clicked on are sent to a Kafka cluster in each data center (a minimal producer sketch follows this list). These events are all centrally collected and pushed onto our Hadoop grid for analysis and daily report generation.

4. As a building block (log) in various distributed applications/platforms: Kafka is also leveraged as a core building block (distributed log) by other products like our big data warehousing solution Pinot. We are also working on using Kafka as an internal replication and change propagation layer for our distributed database Espresso.
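To make the pub-sub and tracking use cases above concrete, here is a minimal sketch of publishing a tracking event with the Kafka Java producer. The broker addresses, topic name, key, and JSON payload are illustrative assumptions, not LinkedIn's actual tracking schema.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: publish a page-view tracking event to a Kafka topic.
// Broker addresses, topic name, key, and payload are illustrative only.
public class PageViewTracker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker-1:9092,kafka-broker-2:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

        // Key by member id so all events for a member land in the same partition.
        String event = "{\"memberId\": 12345, \"page\": \"/feed\", \"time\": 1422489600000}";
        producer.send(new ProducerRecord<String, String>("page-view-events", "12345", event));

        producer.close(); // flushes any buffered records before exiting
    }
}
```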

Kafka Ecosystem at LinkedIn

Apache Kafka has to be augmented with a set of components to enable these usage scenarios. At LinkedIn, the Kafka ecosystem comprises the following components in addition to Apache Kafka itself.

1. MirrorMaker: This is an open source project used to move data between Kafka clusters. In many situations, business logic needs to operate on events generated in multiple data centers; MirrorMaker is used to aggregate these events across data centers.

2. A REST interface: This interface enables non-Java applications to easily publish messages to and consume messages from Kafka using a thin client model.

3. A schema registry: At LinkedIn we have, for the most part, standardized on Avro for event schemas. We have a layered API for sending and receiving Avro events on top of the core Kafka APIs. This API implicitly uses a schema registry service to serialize and deserialize events as they are sent to and received from Kafka (see the serialization sketch after this list).

4. Auditing service: Events are generated in one LinkedIn data center, but they typically get moved to a different data center for much of the offline processing. During this move, it is important for the consuming application (e.g., MapReduce jobs) to know when it has received all events that were generated in a particular time window, so that it can begin the offline processing. The audit service, which is also built on top of Kafka, is used to solve this problem.

5. A bridge to push data from Hadoop into Kafka: Most of the data derived in our MapReduce (Hadoop) clusters is pushed back to our serving databases (such as Voldemort) through this bridge.
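As an illustration of the layered Avro API described above, the following sketch shows how an event might be serialized with Apache Avro before being handed to a Kafka producer. The schema, record fields, and the comment about a registry-assigned schema id are assumptions for illustration; LinkedIn's actual schema registry and wire format are internal.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Sketch: serialize an Avro event into bytes suitable for a Kafka message value.
public class AvroEventSerializer {
    // Illustrative schema, not a real LinkedIn event schema.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
      + "{\"name\":\"memberId\",\"type\":\"long\"},"
      + "{\"name\":\"page\",\"type\":\"string\"}]}");

    public static byte[] serialize(long memberId, String page) throws IOException {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("memberId", memberId);
        record.put("page", page);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // In a registry-backed layered API, a schema id obtained from the registry
        // would typically be written before the payload so consumers can look the
        // schema up on receipt; that step is omitted here.
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}
```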

The Future of Kafka at LinkedIn

As LinkedIn scales, we know we need to stay ahead of the growth curve and ensure that our messaging backbone stays healthy. As the company grows, we expect the number of teams and applications that use Kafka, and the diversity of their needs, to increase. As the number of users (developers and site reliability engineers) grows, we also expect a wider range in how well they understand Kafka, so we need to make sure that simple errors can be avoided.

We are planning to focus on the following key areas of development in 2015:

1. Security: We need to add basic authentication and authorization capabilities to the Kafka broker, covering both management and runtime operations. Even within a secure network, this is necessary for certain types of data and helps avoid human error. A lot of this work has already started in the open source community.

2. Quotas: Given that a wide variety of applications share the same Kafka cluster, it is important to make sure that one application doesn't accidentally consume all of the system resources and negatively impact the other applications on the cluster. In Kafka, we have to worry about saturating the network card on each broker host as well as the top-of-rack switches. When applications read from the beginning of the Kafka log, or catch up after a prolonged pause, they can saturate the network, which can have unintended consequences for many other applications sharing it. This will be an important area of focus for us.

3. Reliability and Availability: We pick up Kafka from the open source trunk, which has a good number of contributors. As in any project with many engineers, a certain degree of rigor is required in upgrade, failover, and stress testing before we roll the bits out to production. We will continue to invest in this area. In addition, a lot of time is spent finding and fixing issues that surface in our day-to-day usage of Kafka.

4. Core functionality: Over the past year we created a new set of APIs in Kafka that allow pipelining of publish operations, and hence better performance when publishing events into Kafka (a sketch of this pattern follows this list). Over the course of the next year, we hope to also work towards a new consumer API, which will remove the Kafka client's dependency on Zookeeper and enable a complete security model.

In addition to these improvements to our thick client, we are investing heavily in our thin client for Kafka. Essentially, we want to make the REST interface to Kafka first class, with support for a core set of SLAs and monitoring.

5. Being Cost Efficient: As the company scales, it is important to make sure that the Kafka broker clusters can scale as well. We currently handle up to 550 billion events a day, with peaks of 8 million incoming messages per second and 32 million outgoing messages per second. Kafka is designed for this scale. There is, however, work to be done to ensure that we don't keep solving the scale problem by throwing more and more hardware at it.

6. New Initiatives: We have some new initiatives in our data infrastructure group that will leverage Kafka in new scenarios. Specifically, our distributed database Espresso will start leveraging Kafka for replication between the primary Espresso storage nodes and their secondaries. Currently, LinkedIn mostly uses Kafka for throughput; these additional use cases will also require lower latency. We will also have to measure the failover time for Kafka broker primaries more closely and potentially tune it.

7. Improving Operability: We currently have many manual processes around Kafka. Our site reliability engineers (SREs) have created a lot of cool tools to help with this, but we need some of these capabilities built into Kafka itself. For example, when we move partitions, we have to be careful not to saturate the network; today SREs do this carefully by hand, and we need the software to take care of it to reduce human error. When a node is added, it would be great if Kafka automatically moved the appropriate set of partitions onto it while keeping the cluster balanced.
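As a rough illustration of the pipelined publish pattern mentioned in item 4, here is a sketch using the new Java producer, where send() returns immediately and the client batches requests to each broker so many sends can be in flight before any acknowledgement arrives. The broker address, topic, and tuning values (linger.ms, batch.size) are illustrative assumptions, not our production settings.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Sketch: pipelined, asynchronous publishing with the new Java producer.
public class PipelinedPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker-1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "5");        // wait briefly to pack more records per request
        props.put("batch.size", "65536");   // batch up to 64 KB per partition

        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);

        for (int i = 0; i < 1000; i++) {
            ProducerRecord<String, String> record =
                new ProducerRecord<String, String>("metrics", Integer.toString(i), "value-" + i);
            // send() enqueues the record and returns immediately; the callback
            // fires once the broker acknowledges (or rejects) the batch.
            producer.send(record, new Callback() {
                public void onCompletion(RecordMetadata metadata, Exception e) {
                    if (e != null) {
                        e.printStackTrace(); // handle failed sends asynchronously
                    }
                }
            });
        }

        producer.close(); // blocks until all in-flight batches are flushed
    }
}
```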

Collaboration with Open Source Community

A few months ago some of the original engineers on Apache Kafka departed LinkedIn to start Confluent, a startup focused on Kafka. We wish them well and are excited about the broadening of the Kafka community. We will continue to collaborate and work closely with all contributors to Apache Kafka to take it to the next level.
