Open Sourcing Kafka Monitor

Apache Kafka has become a standard messaging system for large-scale streaming data. At companies like LinkedIn, it serves as the backbone for various data pipelines and powers a variety of mission-critical services. It has become a core piece of infrastructure that must be extremely robust, fault-tolerant and performant.

In the past, Kafka Site Reliability Engineers (SREs) have relied on metrics reported by Kafka servers (e.g. bytes-in rate, offline-partition-count, under-replicated-partition-count) to monitor the availability of a Kafka cluster. If any of these metrics is unavailable, or if any metric’s value is abnormal, something is probably going wrong and an SRE needs to step in to investigate. However, deriving the availability of a Kafka cluster from these metrics is not as easy as it sounds: a low bytes-in or bytes-out rate does not necessarily tell us whether the cluster is available, and these metrics cannot provide a fine-grained measurement of the availability experienced by end users (say, when only a subset of partitions goes offline). The ability to reliably and accurately measure the availability of Kafka clusters becomes increasingly important as our Kafka clusters grow larger and serve an increasing number of critical services.

In addition to monitoring availability, it is necessary to monitor trunk stability and catch any regression in functionality or performance as early as possible. Apache Kafka includes a suite of unit tests and system tests that run in virtual machines to detect bugs. Yet we still see occasional bugs that do not manifest until Kafka has been deployed in a real cluster for days or even weeks. These bugs can cause a lot of operational overhead or even service disruption. Sometimes the issue is hard to reproduce, and SREs may need to roll back to an earlier version before developers can figure out the cause, which increases Kafka’s development and operational costs. In many cases, these bugs could have been detected earlier if we had tested Kafka deployments under a variety of failover scenarios, over prolonged durations and with production traffic.

Kafka Monitor is a framework for monitoring and testing Kafka deployments that helps address the above deficiencies by providing the capability to: (a) continuously monitor SLAs in production clusters and (b) continuously run regression tests in test clusters. We announced at the recent Kafka Summit that we have open sourced Kafka Monitor on GitHub. Moving forward, we will continue to develop Kafka Monitor and use it as our de facto Kafka validation solution. We hope it can also benefit other companies that want to validate and monitor their own Kafka deployments.

Kafka Monitor makes it easy to develop and execute long-running Kafka-specific system tests in real clusters and to monitor the user-defined SLAs of existing Kafka deployments.

Developers can create new tests by composing reusable modules to emulate various scenarios (e.g. GC pauses, broker hard kills, rolling bounces, disk failures, etc.) and collect metrics; users can run Kafka Monitor tests that execute these scenarios on a user-defined schedule, on a test cluster or production cluster, and validate that Kafka still functions as expected in these scenarios. To achieve these goals, Kafka Monitor is modeled as a manager for a collection of tests and services.

A given Kafka Monitor instance runs in a single Java process and can spawn multiple tests and services in the same process. The diagram below illustrates the relationships between services, tests and a Kafka Monitor instance, as well as how Kafka Monitor interacts with a Kafka cluster and its users.

[Diagram: services and tests inside a Kafka Monitor instance, and how the instance interacts with a Kafka cluster and users]

A typical test emulates a variety of scenarios on a pre-defined schedule, which could involve starting producers and consumers, reporting metrics and validating those metrics against predefined assertions. For example, Kafka Monitor can start one producer and one consumer and bounce a random broker every five minutes (say, if it is monitoring a test cluster). Kafka Monitor can then measure availability and message loss rate and expose these via JMX metrics, which users can display on a health dashboard in real time. It can send alerts if the message loss rate exceeds a threshold determined by the user’s specific availability model.
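
As a rough sketch of what alerting on such a metric could look like, the snippet below polls a JMX attribute and complains whenever the loss rate crosses a threshold. The object name, attribute name and threshold are hypothetical placeholders, not Kafka Monitor’s documented metric names:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Minimal sketch: poll a (hypothetical) Kafka Monitor JMX metric in the same
// JVM and alert when the message loss rate exceeds a user-defined threshold.
public class LossRateWatcher {
    private static final double LOSS_RATE_THRESHOLD = 0.0; // example SLA: no tolerated loss

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Placeholder object name; substitute the metric your deployment actually exposes.
        ObjectName lossRate =
            new ObjectName("kmf.services:type=consume-service,name=message-loss-rate");

        while (true) {
            double value = ((Number) server.getAttribute(lossRate, "Value")).doubleValue();
            if (value > LOSS_RATE_THRESHOLD) {
                // Hook your paging/alerting system in here.
                System.err.println("ALERT: message loss rate " + value + " exceeds threshold");
            }
            Thread.sleep(60_000); // check once a minute
        }
    }
}
```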

To make it easy to compose tests from reusable modules, we encapsulate the logic of emulating periodic or long-running scenarios in services. A service spawns its own thread(s) to execute these scenarios and measure metrics. For example, we currently have the following services:

  • Produce service, which produces messages to Kafka and measures metrics such as produce rate and availability.
  • Consume service, which consumes messages from Kafka and measures metrics including message loss rate, message duplicate rate and end-to-end latency. This service depends on the produce service to provide messages that embed a message sequence number and timestamp (see the sketch after this list).
  • Broker bounce service, which bounces a given broker at some pre-defined schedule.

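To make the produce/consume contract above concrete, here is a minimal sketch of the idea: each message embeds a monotonically increasing sequence number and a send timestamp, which lets the consuming side derive message loss, duplication and end-to-end latency. The payload layout and class names are illustrative assumptions, not Kafka Monitor’s actual wire format.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative payload and consumer-side bookkeeping; Kafka Monitor's actual
// message format and metric names may differ.
public class MonitorMessageCheck {

    // What the produce service embeds in every message it sends.
    static final class MonitorMessage {
        final long sequence;   // monotonically increasing per topic partition
        final long sendTimeMs; // producer-side timestamp
        MonitorMessage(long sequence, long sendTimeMs) {
            this.sequence = sequence;
            this.sendTimeMs = sendTimeMs;
        }
    }

    private final Set<Long> seen = new HashSet<>();
    private long nextExpected = 0;
    private long lost = 0;
    private long duplicated = 0;

    // Called by the consume service for every message it receives. Late
    // out-of-order arrivals are not reconciled here, for brevity.
    void onReceive(MonitorMessage msg, long nowMs) {
        long latencyMs = nowMs - msg.sendTimeMs;     // end-to-end latency
        if (!seen.add(msg.sequence)) {
            duplicated++;                            // already seen => duplicate
        } else if (msg.sequence > nextExpected) {
            lost += msg.sequence - nextExpected;     // gap in the sequence => loss
        }
        nextExpected = Math.max(nextExpected, msg.sequence + 1);
        System.out.printf("latencyMs=%d lost=%d duplicated=%d%n", latencyMs, lost, duplicated);
    }

    public static void main(String[] args) {
        MonitorMessageCheck check = new MonitorMessageCheck();
        long now = System.currentTimeMillis();
        check.onReceive(new MonitorMessage(0, now - 5), now); // in order
        check.onReceive(new MonitorMessage(2, now - 3), now); // sequence 1 missing => loss
        check.onReceive(new MonitorMessage(2, now - 3), now); // duplicate
    }
}
```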
A test is composed of services and validates various assertions over time. For example, we can create a test that includes one produce service, one consume service and one broker bounce service. The produce service and consume service are configured to send and receive messages using the same topic. The test then validates that the message loss rate stays at zero.
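
The sketch below illustrates how such a composition might look. The Service interface and the concrete service classes are hypothetical stand-ins for Kafka Monitor’s modules, meant only to show how a test wires reusable services together and asserts on their metrics:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of composing a test from reusable services; the
// interface and class names are illustrative, not Kafka Monitor's actual API.
interface Service {
    void start();
    void stop();
}

// Stub services standing in for the real produce/consume/bounce modules.
class ProduceService implements Service {
    ProduceService(String topic) {}
    public void start() { /* begin producing monitor messages to the topic */ }
    public void stop() {}
}

class ConsumeService implements Service {
    ConsumeService(String topic) {}
    public void start() { /* begin consuming and validating monitor messages */ }
    public void stop() {}
    double messageLossRate() { return 0.0; } // derived from sequence-number gaps
}

class BrokerBounceService implements Service {
    BrokerBounceService(String broker) {}
    public void start() { /* bounce the broker on a pre-defined schedule */ }
    public void stop() {}
}

public class ProduceConsumeBounceTest {
    public static void main(String[] args) throws InterruptedException {
        String topic = "kafka-monitor-topic"; // placeholder topic name
        ConsumeService consume = new ConsumeService(topic);
        List<Service> services = Arrays.asList(
                new ProduceService(topic), consume, new BrokerBounceService("broker-0"));
        services.forEach(Service::start);
        for (int i = 0; i < 10; i++) {        // check the assertion periodically
            if (consume.messageLossRate() > 0) {
                System.err.println("ASSERTION FAILED: message loss detected");
            }
            Thread.sleep(60_000);
        }
        services.forEach(Service::stop);
    }
}
```

Because each service runs its own threads and reports its own metrics, the same modules can be reused in other tests with different schedules or assertions.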

While all services within the same Kafka Monitor instance must run on the same physical machine, we can start multiple Kafka Monitor instances in different clusters that coordinate to orchestrate a distributed end-to-end test. In the test described by the diagram below, we start two Kafka Monitor instances in two clusters. The first instance contains one produce service that produces to Kafka cluster 1. Messages are then mirrored from cluster 1 to cluster 2. Finally, the consume service in the second instance consumes those messages from the same topic in cluster 2 and reports the end-to-end latency of this cross-cluster pipeline.

[Diagram: a cross-cluster test with two Kafka Monitor instances: a produce service writing to cluster 1, mirroring from cluster 1 to cluster 2, and a consume service reading from cluster 2]
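
On each side of such a pipeline, the client configuration might look roughly like the sketch below, where the bootstrap server addresses, topic name and group id are placeholders: the produce side writes to cluster 1, the consume side reads the mirrored topic from cluster 2, and latency is derived from the timestamp embedded by the producer.

```java
import java.util.Properties;

// Placeholder configuration for a cross-cluster monitoring pipeline: one Kafka
// Monitor instance produces to cluster 1, another consumes the mirrored topic
// from cluster 2. Addresses, topic and group id are illustrative.
public class CrossClusterConfig {
    public static void main(String[] args) {
        String monitorTopic = "kafka-monitor-topic";

        // Instance 1: the produce service writes to cluster 1.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "cluster1-broker1:9092,cluster1-broker2:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Instance 2: the consume service reads the mirrored topic from cluster 2 and
        // reports end-to-end latency as (receive time - embedded send timestamp).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "cluster2-broker1:9092,cluster2-broker2:9092");
        consumerProps.put("group.id", "kafka-monitor-cross-cluster");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        System.out.println("produce to " + monitorTopic + " on cluster 1; consume it from cluster 2");
    }
}
```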

In early 2016, we deployed Kafka Monitor to monitor the availability and end-to-end latency of every Kafka cluster at LinkedIn. The project wiki goes into the details of how these metrics are measured. These basic but critical metrics have been extremely useful for actively monitoring the SLAs provided by our Kafka cluster deployment.

As an earlier blog post explains, we have a client library that wraps the vanilla Apache Kafka producer and consumer to provide various features that are not available in Apache Kafka, such as Avro encoding, auditing and support for large messages. We also have a REST client that allows non-Java applications to produce to and consume from Kafka. It is important to validate the functionality of these client libraries against each new Kafka release. Kafka Monitor allows users to plug custom client libraries into its end-to-end workflow. We have deployed Kafka Monitor instances that use our wrapper client and REST client in tests, to validate that their performance and functionality meet our requirements for every new release of these client libraries and Apache Kafka.
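
Conceptually, the plug-in point can be thought of as a thin producer/consumer interface that a wrapper or REST client implements. The interface below is a hypothetical illustration of that idea rather than Kafka Monitor’s actual extension API, which is documented in the repository:

```java
// Hypothetical plug-in interface illustrating how a custom client library could
// be substituted into the monitoring workflow; the actual extension points are
// documented in the Kafka Monitor repository.
interface MonitorProducer {
    void send(String topic, String key, String value) throws Exception;
    void close();
}

// Example adapter wrapping an organization-specific client (here just a stub).
class WrapperClientProducer implements MonitorProducer {
    @Override
    public void send(String topic, String key, String value) {
        // Delegate to the in-house wrapper client, which may add Avro encoding,
        // auditing, large-message support, etc.
        System.out.printf("send(%s, %s, %s) via wrapper client%n", topic, key, value);
    }

    @Override
    public void close() { /* release wrapper client resources */ }
}

public class PluggableClientExample {
    public static void main(String[] args) throws Exception {
        MonitorProducer producer = new WrapperClientProducer(); // chosen via configuration in practice
        producer.send("kafka-monitor-topic", "0", "{\"sequence\":0,\"sendTimeMs\":0}");
        producer.close();
    }
}
```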

We generally run off Apache Kafka trunk and cut a new internal release every quarter or so to pick up new features from Apache Kafka. A significant benefit of running off trunk is that deploying these builds in LinkedIn’s production clusters often surfaces problems in Apache Kafka trunk that can then be fixed before official Apache Kafka releases.

Given the risk of running off Apache Kafka trunk, we take extra care to certify every internal release in a test cluster, which accepts traffic mirrored from production cluster(s), for a few weeks before deploying the new release in production. For example, we do rolling bounces or hard-kill brokers while checking JMX metrics to verify that there is exactly one controller and no offline partitions, in order to validate Kafka’s availability under failover scenarios. In the past, these steps were manual, which was very time-consuming and did not scale well with the number of events and types of scenarios we want to test. We are switching to Kafka Monitor to automate this process and cover more failover scenarios on a continual basis.
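
Part of that checklist is straightforward to script. The sketch below connects to a single broker’s JMX endpoint (the host and port are placeholders) and reads Kafka’s standard controller MBeans to verify the two conditions mentioned above; it is a simplified illustration rather than our actual certification tooling:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Simplified health check against one broker's JMX endpoint; the host/port is a
// placeholder and error handling is omitted for brevity.
public class BrokerHealthCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            int activeControllers = ((Number) mbs.getAttribute(
                new ObjectName("kafka.controller:type=KafkaController,name=ActiveControllerCount"),
                "Value")).intValue();
            int offlinePartitions = ((Number) mbs.getAttribute(
                new ObjectName("kafka.controller:type=KafkaController,name=OfflinePartitionsCount"),
                "Value")).intValue();

            // Across the whole cluster we expect the per-broker ActiveControllerCount
            // values to sum to exactly 1, and zero offline partitions everywhere.
            System.out.println("ActiveControllerCount=" + activeControllers
                + " OfflinePartitionsCount=" + offlinePartitions);
        } finally {
            connector.close();
        }
    }
}
```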

Kafka Monitor is potentially useful to other companies that want to validate their own client libraries and Kafka clusters. Indeed, Microsoft has an open source project on GitHub that also monitors availability and end-to-end latency of Kafka clusters. Similarly, in a blog post, Netflix describes a monitoring service that sends continuous heartbeat messages and measures their latency. Kafka Monitor differentiates itself by focusing on extensibility, modularity and support for custom client libraries and scenarios.

The source code for Kafka Monitor is available on GitHub under the Apache 2.0 license. Instructions for usage, design and future work are documented in the README and the project wiki. We would love to hear your feedback on the project.

While Kafka Monitor is designed as a framework for testing and monitoring Kafka deployments, it includes one basic but useful test that you can use out of the box to monitor your Kafka deployment. This test measures availability, end-to-end latency, message loss rate and message duplicate rate by running one producer and one consumer against the same topic. You can view the metrics in the terminal, programmatically fetch their values with an HTTP GET request, or even view their values over time on a simple (quick-start) GUI as shown in the screenshots below. Please refer to the project website for instructions on how to run this test and view the various metrics.
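
For example, assuming a Kafka Monitor instance exposes its metrics report over HTTP on a local port (the host, port and path below are placeholders; check the project documentation for the actual endpoint and response format), fetching the current values programmatically takes only a few lines:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Fetch Kafka Monitor's reported metrics over HTTP. The URL below is a
// placeholder; use the port/path configured for your Kafka Monitor instance.
public class FetchMonitorMetrics {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8000/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // metric names and values as reported by the instance
            }
        }
        conn.disconnect();
    }
}
```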

There are a number of improvements that we plan to work on to make Kafka Monitor even more useful.

Additional Test Scenarios

Apache Kafka contains an extensive suite of system tests that run on every check-in. We plan to implement similar tests in Kafka Monitor, deploy them to LinkedIn’s test clusters and have them run continuously. This will allow us to catch performance regressions as they happen and validate the functionality of features such as quotas, admin operations, authorization, etc.

Integration with Graphite and Similar Frameworks

It is useful for users to be able to view all Kafka-related metrics from a single web service in their organization. We plan to improve the existing reporter service in Kafka Monitor so that users can export Kafka Monitor metrics to Graphite or other metrics frameworks of their choice.

Integration with Fault Injection Frameworks

We also plan to integrate Kafka Monitor with a fault injection framework (called Simoorg) to test Kafka under a more comprehensive collection of failover scenarios, such as disk failure and data corruption.

Kafka Monitor has been designed and implemented thanks to the efforts of the Kafka team at LinkedIn.

The streams infrastructure group at LinkedIn is hiring software developers and site reliability engineers for Kafka, our stream processing platform (Samza) and our next generation change capture technology. Contact Kartik Paramasivam for details.