First Apache release for Kafka is out!
January 6, 2012
We are pleased to announce the first release of Kafka from the Apache incubator. Kafka is a distributed, persistent, high-throughput messaging system for collecting and delivering large volumes of data with low latency. The 0.7.0 release adds compression and mirroring features.
At LinkedIn, Kafka is primarily used for tracking the huge volume of activity events generated by the website. These activity events are critical for monitoring user engagement as well as improving relevance in our data-driven products. Apart from activity tracking, we use Kafka to feed Hadoop for offline analytics and to track internal operational metrics that feed real-time graphs.
Incubator Progress
In July 2011, Kafka entered the Apache incubator with the goal of growing the Kafka community. Since then, we've seen around 10x growth on the community mailing list and have received several patches. In addition, several companies have adopted Kafka in their production pipelines. These deployments range from news feed applications to email queuing systems.
Last week, we announced the first Apache release: 0.7.0. With this release, Kafka offers two major new features: compression and mirroring.
Compression
Kafka 0.7.0 offers end-to-end block compression. For a data pipeline that needs to send and receive messages at high throughput, the bottleneck is the network, not the CPU. This is particularly true for data pipelines that span multiple data centers. With end-to-end compression, the publisher compresses messages in batches. Each batch is then sent, stored on the servers, and delivered to the subscribers in compressed form. It is decompressed only at the subscriber, just before the messages are handed to the application.
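To make the data flow concrete, here is a minimal sketch of batched, end-to-end compression in Python. It is not Kafka's API or wire format: the `compress_batch` and `decompress_batch` functions, the `Broker` class, and the 4-byte length-prefix framing are all hypothetical, and gzip is used only as a stand-in codec. The point it illustrates is that the server stores and serves the compressed batch as an opaque blob, so compression cost is paid once at the publisher and decompression once at the subscriber.

```python
import gzip
import struct


def compress_batch(messages):
    """Publisher side: frame each message with a 4-byte big-endian length
    prefix, then compress the whole batch as one block."""
    framed = b"".join(struct.pack(">I", len(m)) + m for m in messages)
    return gzip.compress(framed)


def decompress_batch(blob):
    """Subscriber side: decompress the block and split it back into messages."""
    data = gzip.decompress(blob)
    messages = []
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        messages.append(data[offset:offset + size])
        offset += size
    return messages


class Broker:
    """Hypothetical server: stores compressed batches without inspecting them."""

    def __init__(self):
        self.log = []

    def append(self, blob):
        self.log.append(blob)

    def fetch(self, index):
        return self.log[index]


if __name__ == "__main__":
    events = [b'{"event":"page_view","member":%d}' % i for i in range(1000)]
    broker = Broker()
    broker.append(compress_batch(events))           # sent and stored compressed
    received = decompress_batch(broker.fetch(0))    # decompressed at the subscriber
    print(len(received), "messages round-tripped")
```

Compressing a whole batch rather than individual messages lets the codec exploit redundancy across messages, which is where most of the savings come from for small, similar events.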
Mirroring
Besides compression, we also added mirroring support in Kafka. This feature allows you to easily set up a replica of another Kafka cluster. Mirroring is often used in cross-data-center scenarios, where inter-data-center communication latency limits the ability to send individual messages to clusters in multiple data centers. Mirroring sets up Kafka clusters as consumers of another cluster and enables a high-throughput pipeline for maintaining real-time replicas in different data centers.
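The idea of a mirror cluster acting as a consumer of a source cluster can be sketched with a small in-memory simulation in Python. Everything here is hypothetical and for illustration only: the `Cluster` and `Mirror` classes, the `fetch` and `poll_once` methods, and the `batch_size` parameter are not part of Kafka. The sketch shows the mirror tracking its own position in the source log and pulling messages in batches, rather than the publisher sending each message to every data center.

```python
class Cluster:
    """Hypothetical cluster modeled as a single append-only log."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def publish(self, message):
        self.log.append(message)

    def fetch(self, offset, max_messages):
        """Return up to max_messages starting at the given log position."""
        return self.log[offset:offset + max_messages]


class Mirror:
    """Consumes from a source cluster and republishes into a target cluster."""

    def __init__(self, source, target, batch_size=100):
        self.source = source
        self.target = target
        self.batch_size = batch_size
        self.offset = 0  # position in the source log already mirrored

    def poll_once(self):
        """Pull one batch from the source; return how many messages were copied."""
        batch = self.source.fetch(self.offset, self.batch_size)
        for message in batch:
            self.target.publish(message)
        self.offset += len(batch)
        return len(batch)

    def catch_up(self):
        """Keep polling until the mirror has copied everything in the source."""
        while self.poll_once() > 0:
            pass


if __name__ == "__main__":
    primary = Cluster("dc-east")
    replica = Cluster("dc-west")
    for i in range(250):
        primary.publish(f"event-{i}".encode())

    mirror = Mirror(primary, replica)
    mirror.catch_up()
    print(len(replica.log), "messages mirrored")
```

Because the mirror pulls in batches, the cost of a cross-data-center round trip is spread over many messages instead of being paid per message.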
Future Work
Going forward, we plan to do bug-fix releases roughly every two months. Our next big feature release will add built-in replication to Kafka. Stay tuned.