Revisiting Burrow: Burrow 1.1

Todd Palino

Technical Leader in Reliability Engineering

May 15, 2018

Two and a half years ago, the Data Infrastructure SRE team at LinkedIn introduced Burrow, an advanced way to monitor Apache Kafka consumer clients, in a previous blog post. At the time, we knew it was a gap that needed to be filled, but we were still surprised by how quickly Burrow was adopted by many other companies. Over time, a number of issues were identified, and the lack of key features (such as deletion of topics from the Kafka cluster) became more of a problem. In the last few months, we have undertaken a project to completely rewrite the application, and as a result of those efforts, we are pleased to announce version 1.1!

At its heart, Burrow is still designed to be a system for monitoring consumers that does not rely on thresholds. It works by evaluating how the consumer group is handling each partition that it owns and providing a status indicator for that group, as well as each partition, stating whether it is in a good state, a degraded state, or has stopped entirely. This lets us monitor the status of even the largest consumer groups, handling thousands of topics without overwhelming engineers with alert fatigue.
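To make the idea of threshold-free evaluation concrete, here is a minimal sketch in Go. It is not Burrow's actual evaluator: the `Status`, `Sample`, `EvaluatePartition`, and `EvaluateGroup` names and the three simplified rules are assumptions chosen for illustration. The point is that each partition is judged by how its offsets and lag behave over a window of observations, not by comparing lag to a fixed number.

```go
package main

import "fmt"

// Status is a hypothetical three-level indicator matching the states
// described above: good, degraded, or stopped.
type Status int

const (
	StatusOK       Status = iota // consumer is keeping up
	StatusDegraded               // consumer is advancing but falling behind
	StatusStopped                // consumer is not advancing at all
)

func (s Status) String() string {
	return [...]string{"OK", "DEGRADED", "STOPPED"}[s]
}

// Sample is one observation of a partition: the committed consumer
// offset and the lag measured at that moment. (Illustrative type.)
type Sample struct {
	Offset int64
	Lag    int64
}

// EvaluatePartition classifies a window of samples (oldest first)
// without comparing lag to any fixed threshold.
func EvaluatePartition(window []Sample) Status {
	if len(window) < 2 {
		return StatusOK // not enough history to judge
	}
	// Rule 1: if lag reached zero anywhere in the window, the consumer caught up.
	for _, s := range window {
		if s.Lag == 0 {
			return StatusOK
		}
	}
	// Rule 2: lag is nonzero and the offset did not advance across the window.
	if window[len(window)-1].Offset == window[0].Offset {
		return StatusStopped
	}
	// Rule 3: offsets advance, but lag grew at every step of the window.
	for i := 1; i < len(window); i++ {
		if window[i].Lag <= window[i-1].Lag {
			return StatusOK
		}
	}
	return StatusDegraded
}

// EvaluateGroup reports the worst status among a group's partitions.
func EvaluateGroup(partitions map[string][]Sample) Status {
	worst := StatusOK
	for _, window := range partitions {
		if s := EvaluatePartition(window); s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	group := map[string][]Sample{
		"orders-0": {{Offset: 100, Lag: 5}, {Offset: 110, Lag: 0}, {Offset: 120, Lag: 3}},
		"orders-1": {{Offset: 100, Lag: 5}, {Offset: 105, Lag: 9}, {Offset: 110, Lag: 14}},
	}
	fmt.Println("group status:", EvaluateGroup(group))
}
```

Because the rules look only at trends, the same logic applies to a partition with a lag of ten messages and one with a lag of ten million, which is what removes the need for per-topic tuning.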

Improved code quality

One of the largest problems with the initial Burrow code base was that it was difficult to maintain, both for us and for contributors. The code had no tests, which meant that any proposed changes needed to be tested internally at LinkedIn via a manual process. There was no other way for us to ensure that changes did not break the code. As you can imagine, this was slow and error-prone. And changes to code that LinkedIn did not use were frequently flawed.

Tests were therefore an absolute requirement of the rewrite, which also informed the way the code would be structured. No longer was it acceptable to have 100-line functions with convoluted logic; everything needed to be broken down into less complex routines that could be checked for their desired behavior. This makes it much easier for contributors to understand the logic behind the code and to see the effect of the changes they are making.

Creating tests also allowed us to utilize continuous integration for the source repository. This lets both contributors and repository owners know when a pull request can be merged, and when it needs more work. Not only do we run the tests against the code, but we now also check code complexity, style, and overall test coverage. For repository owners, clicking the merge button is far less risky than it was previously. This will let us have much more community involvement as we move forward.

Modular design

We also needed to recognize that Kafka has a varied ecosystem, meaning we needed to support new types of consumers and different ways of notifying users of problems. This led us to modularize Burrow so that each component has a well-defined interface that contributed code can implement. This lets us create new consumer modules to support storing information in places other than Kafka's standard offsets topic. It also lets us link into external monitoring and alerting systems far more easily; adding a new notification method is now as easy as writing a small amount of code to tell Burrow how to send a message.
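As a rough illustration of what a well-defined interface for a notification module could look like, consider the Go sketch below. The `Notifier` interface, the `MemoryNotifier` type, and the `Registry` are hypothetical names invented for this example and do not reflect Burrow's real module API. The sketch only shows the general pattern: the core dispatches through an interface, and a new notification method is a small type that implements it.

```go
package main

import "fmt"

// Notifier is a hypothetical module interface: anything that can deliver
// a status message for a consumer group.
type Notifier interface {
	Name() string
	Notify(group, status string) error
}

// MemoryNotifier is a toy implementation that records messages in memory.
// A real module might send email or call an HTTP endpoint instead.
type MemoryNotifier struct {
	Sent []string
}

func (m *MemoryNotifier) Name() string { return "memory" }

func (m *MemoryNotifier) Notify(group, status string) error {
	m.Sent = append(m.Sent, fmt.Sprintf("group=%s status=%s", group, status))
	return nil
}

// Registry holds the configured notifiers and fans a message out to all of them.
type Registry struct {
	notifiers []Notifier
}

// Register adds a notifier module to the registry.
func (r *Registry) Register(n Notifier) {
	r.notifiers = append(r.notifiers, n)
}

// Dispatch sends the status to every registered notifier and stops at
// the first error, wrapping it with the failing module's name.
func (r *Registry) Dispatch(group, status string) error {
	for _, n := range r.notifiers {
		if err := n.Notify(group, status); err != nil {
			return fmt.Errorf("notifier %s: %w", n.Name(), err)
		}
	}
	return nil
}

func main() {
	mem := &MemoryNotifier{}
	reg := &Registry{}
	reg.Register(mem)

	if err := reg.Dispatch("payments", "STOPPED"); err != nil {
		fmt.Println("dispatch failed:", err)
		return
	}
	for _, msg := range mem.Sent {
		fmt.Println(msg)
	}
}
```

With this shape, the dispatching code never needs to change when a new delivery mechanism is added; only a new type satisfying the interface is required.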

We didn't stop there, however. The module that is used for evaluating a consumer can be swapped out if someone would like to try a radically new way of evaluating consumers. We can also change how cluster and consumer information is stored, allowing for the ability to utilize an external database. Even the Kafka cluster module can be swapped out, allowing users to monitor something that isn't Kafka at all. As long as the cluster and consumer information can be coerced into a model of partitions and increasing offsets, Burrow can work with it. This will let Burrow easily adapt to whatever the "next big thing" is.

Feature updates

The biggest feature that we needed to address was how to handle the deletion of topics in Kafka clusters. Burrow originally just ignored deletion, which led to a lot of false alerts as the feature began to be used more. The way it was structured, it just wasn't possible to handle deletion of data in a simple manner, not to mention the fact that Burrow had no way of detecting that deletion had happened. The 1.0 rewrite makes handling of both the broker and consumer data much more consistent, however, and provides the methods that are needed for Burrow to not only detect that a topic no longer exists, but also to then remove that topic from any consumer data.
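The two steps described here, detecting that a topic no longer exists and then removing it from consumer data, can be sketched in Go as follows. This is an illustrative model only: the `ConsumerOffsets` type and `PruneDeletedTopics` function are assumed names, and Burrow's real storage layer is more involved than a nested map.

```go
package main

import (
	"fmt"
	"sort"
)

// ConsumerOffsets is a hypothetical in-memory model of consumer data:
// group name -> topic name -> committed offset per partition.
type ConsumerOffsets map[string]map[string][]int64

// PruneDeletedTopics removes every topic that is absent from brokerTopics
// from all consumer groups, and returns the sorted names of removed topics.
func PruneDeletedTopics(consumers ConsumerOffsets, brokerTopics map[string]bool) []string {
	removed := map[string]bool{}
	for _, topics := range consumers {
		for topic := range topics {
			if !brokerTopics[topic] {
				// Deleting from a map while ranging over it is safe in Go.
				delete(topics, topic)
				removed[topic] = true
			}
		}
	}
	names := make([]string, 0, len(removed))
	for topic := range removed {
		names = append(names, topic)
	}
	sort.Strings(names)
	return names
}

func main() {
	consumers := ConsumerOffsets{
		"billing": {"invoices": {120, 98}, "legacy-events": {40}},
		"search":  {"legacy-events": {12}, "documents": {7, 9, 11}},
	}
	// Topics currently reported by the cluster; "legacy-events" was deleted.
	brokerTopics := map[string]bool{"invoices": true, "documents": true}

	fmt.Println("removed:", PruneDeletedTopics(consumers, brokerTopics))
	fmt.Println("billing topics left:", len(consumers["billing"]))
}
```

Once the stale topic is gone from the consumer data, it can no longer be evaluated as a partition that has stopped, which is the source of the false alerts mentioned above.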

There were also issues with the way Burrow presented data. Because all of the data was driven by consumer offset commits, strange things happened when the consumer stopped working. The most obvious issue was that Burrow would stop reporting an accurate value for the lag of the consumer (i.e., the number of messages the consumer was behind). Lag was updated only when the consumer committed offsets, so when a consumer stopped committing, its reported lag stayed frozen at the last value even as new messages kept arriving. While this made sense from the point of view of Burrow's code, it was non-intuitive for users.
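A small Go sketch shows why commit-driven lag goes stale. The `PartitionState` type and its methods are hypothetical and exist only for this illustration. Lag is taken here as the broker's latest offset minus the consumer's committed offset, and the sketch compares a value recorded only at commit time with one computed against the current broker offset.

```go
package main

import "fmt"

// PartitionState is a hypothetical model of one partition as seen by a monitor.
type PartitionState struct {
	HeadOffset      int64 // latest offset on the broker
	CommittedOffset int64 // last offset committed by the consumer
	ReportedLag     int64 // lag recorded at commit time only
}

// Produce simulates n new messages arriving on the broker.
func (p *PartitionState) Produce(n int64) {
	p.HeadOffset += n
}

// Commit simulates a consumer offset commit. The reported lag is
// refreshed here and nowhere else, mirroring commit-driven updates.
func (p *PartitionState) Commit(offset int64) {
	p.CommittedOffset = offset
	p.ReportedLag = p.HeadOffset - offset
}

// CurrentLag computes lag against the broker's present head offset,
// regardless of when the consumer last committed.
func (p *PartitionState) CurrentLag() int64 {
	return p.HeadOffset - p.CommittedOffset
}

func main() {
	p := &PartitionState{}

	p.Produce(100)
	p.Commit(90)
	fmt.Println("after commit: reported", p.ReportedLag, "current", p.CurrentLag())

	// The consumer stops; messages keep arriving but no commit happens.
	p.Produce(500)
	fmt.Println("after stall:  reported", p.ReportedLag, "current", p.CurrentLag())
}
```

After the stall, the commit-time value still reads 10 while the consumer is actually 510 messages behind, which is the mismatch users found confusing.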

Easier to run

There have also been numerous changes made to make it easier to run Burrow. For instance, we have updated the configuration to utilize Viper. This lets configuration files be specified in a variety of formats, such as TOML and JSON, rather than in a fixed INI-style format. Companies can now more easily match the configuration to their own systems and utilize a format that is more familiar. We've also made it easier to embed Burrow inside another application, which is useful for environments where it is necessary to wrap internal logic (whether for configuration, deployment, monitoring, or other purposes) around an open source application.

In addition to all the other changes, the rewrite gave us a chance to fix numerous bugs that had cropped up in the old version. This means that Burrow 1.1 is far more stable than its predecessor, a critical feature for a monitoring application. And all the dependencies have been updated, which means we can properly support the latest Kafka versions, including SSL and authenticated connections.

Miles to go

There is still a significant amount of work to be done in Burrow. Notably, the following items are on our radar for the near future:

Monitoring: Burrow needs metrics that describe its own operation.
Custom modules: New modules for consumers and notifiers still need to be part of the Burrow code repository. We want to be able to allow users to create custom modules without having to maintain their own fork of Burrow.
Improvements to consumer status evaluation: We are constantly updating how consumer status is evaluated to make it as correct and fast as possible.
Distributing the work: Burrow is quite efficient at handling multiple Kafka clusters with thousands of consumers in a single instance, but it would be nice to run it as a distributed application with multiple instances sharing the workload.

For community members who would like to get involved, we encourage you to check out the GitHub repository for Burrow and dive right in on writing some code. There are always open issues for new feature requests. We have also adopted Gitter for chat, whether it is to get some help with a problem or to discuss a new code contribution.
