Apache Kafka meetup during Hadoop Summit

May 28, 2014

Every year, LinkedIn hosts a meetup for Apache Kafka enthusiasts attending Hadoop Summit.

Kafka is a high-throughput publish-subscribe messaging system rethought as a horizontally scalable, fault-tolerant, low-latency distributed commit log. Kafka was developed at LinkedIn a few years ago and is now a top-level Apache project with a growing community of more than 700 active members. LinkedIn has moved virtually all of its data flows to real-time structured logs (we capture more than 60 billion events per day), and Kafka is used in production at numerous companies, where it forms a vital core of their data pipelines.
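If you're new to Kafka, the core idea is that producers append keyed events to a partitioned log that consumers read independently. Below is a minimal sketch using the Java producer client from more recent Kafka releases; the broker address, topic name, key, and value are all hypothetical, chosen just to show the shape of the API.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Append one event to the "page-views" topic (hypothetical name);
      // the key determines which partition of the log the event lands in.
      producer.send(new ProducerRecord<>("page-views", "member-42", "viewed /jobs"));
    }
  }
}
```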

The meetup will be hosted at LinkedIn on Tuesday, June 3, from 6.30 - 9 pm in the Unite conference room, 2025 Stierlin Court, Mountain View. Space is limited and this meetup is very popular; we recommend that you RSVP at the Apache Kafka meetup page.

The agenda is as follows:

6.30 - 7 pm: Registration and Networking

7 - 7.30 pm: Operating Kafka at scale in production by Todd Palino and Clark Haskins (LinkedIn)

We will discuss Kafka from an operations point of view, covering use cases and the tools LinkedIn has been developing to improve the management of deployed clusters. We'll also talk about some of the challenges of running a multi-tenant data service and how to avoid getting woken up at 3 AM.

7.30 - 8 pm: Secure Kafka by Rajasekar Elango (Salesforce)

Salesforce has built scalable, near-real-time monitoring on top of Kafka. Their production cluster collects application metrics (exposed via JMX) as well as system metrics from the machines themselves; the metrics are then consumed and reported to graphing tools such as Graphite. Rajasekar will describe the architecture of the metrics reporting system and how cross-data-center traffic was secured by implementing mutual SSL authentication between the Kafka brokers and the producers/consumers, as sketched below.
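Salesforce's work predates native transport security in Kafka (built-in SSL support arrived later, in Kafka 0.9), so the sketch below is not their implementation; it only illustrates the mutual SSL idea using the later built-in client configuration: each side presents its own certificate (keystore) and verifies the other's (truststore). The broker address, file paths, and passwords are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

public class SecureProducerConfig {
  public static Producer<String, String> create() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker.dc2.example.com:9093"); // hypothetical remote-DC broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Mutual SSL: the client authenticates the broker via the truststore,
    // and presents its own certificate from the keystore. On the broker
    // side, setting ssl.client.auth=required enforces client authentication.
    props.put("security.protocol", "SSL");
    props.put("ssl.keystore.location", "/etc/kafka/client.keystore.jks");   // hypothetical path
    props.put("ssl.keystore.password", "changeit");
    props.put("ssl.key.password", "changeit");
    props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // hypothetical path
    props.put("ssl.truststore.password", "changeit");

    return new KafkaProducer<>(props);
  }
}
```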

8 - 8.20 pm: Apache Samza: Large scale stream processing on Kafka by Chris Riccomini (LinkedIn)

Stream processing is an essential part of real-time data systems, such as news feeds, live search indexes, real-time analytics, metrics, and monitoring. But writing stream processing jobs is still hard, especially when you're dealing with so much data that you have to distribute it across multiple machines. How can you keep the system running smoothly, even when machines fail and bugs occur?

Apache Samza is a new framework for writing scalable stream processing jobs. Much as Hadoop MapReduce does for batch processing, it takes care of the hard parts of running your message-processing code on a distributed infrastructure, so that you can concentrate on writing your application using simple APIs. This talk will introduce Samza and cover a specific use case: how LinkedIn aggregates operational data from web services across its data centers.
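To give a flavor of those APIs, here is a minimal sketch using Samza's low-level StreamTask interface: a task that keeps a running count of events per key and publishes updated counts to a Kafka topic. The output topic name and the counting logic are illustrative, not taken from the talk; the input stream would be wired up in the job's configuration file.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Consume an input stream, maintain an in-memory count per key, and send
// each updated count to the "event-counts" Kafka topic (hypothetical name).
public class EventCounterTask implements StreamTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "event-counts");
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey(); // e.g. a service or page name
    int updated = counts.merge(key, 1, Integer::sum);
    collector.send(new OutgoingMessageEnvelope(OUTPUT, key, updated));
  }
}
```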

8.30 - 9 pm: Q&A session with the committers of Kafka

Pizza and drinks will be served. Hope to see you there! Don't forget to RSVP!
