Kafkaesque Days at LinkedIn – Part 1

Joel Koshy

Software Engineer at Databricks

May 24, 2016

This is the first post in a blog series adapted from my talk at the inaugural Kafka Summit.

Apache Kafka is the backbone for various data pipelines and asynchronous messaging at LinkedIn and beyond. In fact, we were excited to learn at the Kafka Summit last month that a number of other big adopters like Netflix and Microsoft have hit similar levels of scale in terms of message volume and deployment sizes, firmly establishing Kafka in the Four Comma Club.

While Kafka is an extremely stable piece of infrastructure, we occasionally run into some very interesting bugs at these scales. In this post and subsequent posts in this series, I will dive into some of the most critical incidents that we encountered at LinkedIn in the past year. While their root causes were obvious in hindsight, the incidents appeared “Kafkaesque” — an adjective that has come to represent anything having a nightmarishly complex, bizarre or illogical quality — at first due to interplays between multiple bugs, unusual client behavior and gaps in monitoring. This post will review how we detected, investigated and remediated each of these issues and summarize some of the features in Kafka that will help eliminate or mitigate such incidents in the future.

The Kafka broker assigns monotonically increasing offsets to every message that it appends to a topic partition’s log. The fetch requests that consumers issue to consume messages specifies which offset to consume from. If a fetch request contains an offset that is out of bounds of the log it will receive an OffsetOutOfRange error. The consumer will then set its offset to either the latest or earliest valid offset according to its auto.offset.reset configuration. These resets can cause significant impact to applications — resetting to the earliest offset will cause duplicate consumption while resetting to the latest offset means potentially losing some messages that may have arrived between the offset reset and the next fetch.

Kafka consumers periodically checkpoint their consumer offsets (i.e., positions) in each topic-partition so that they can resume from the last checkpoint should the consumer need to restart. For example, this can happen if a consumer fails, or if more partitions are added to the topic and the partition distribution across consuming instances needs to change. In Kafka 0.8.2, we introduced support for Kafka-based consumer offset management. As of early 2015, this was still a relatively new feature and we occasionally saw offset resets. Troubleshooting these incidents turned out to be extremely tricky and resulted in various fixes in offset management, log compaction and monitoring.

Kafka brokers generally don’t maintain any metrics on client-side errors such as offset resets. However, there are several indicators that a rewind, which is when consumers fall back to the earliest offset, may have occurred. The clearest server-side indicator is a large outflux of messages from the Kafka cluster without a corresponding increase in influx as shown below:

The obvious client-side indicator is a sudden increase in consumer lag. In this particular incident, we noted a spike in the mirror maker’s lag:

Lag monitoring can be somewhat tricky however, since not all spikes are necessarily a cause for alarm. It is more important to monitor trends in lag which is exactly what our lag monitoring service (Burrow) does.

CRT is LinkedIn’s continuous deployment platform and we use it to monitor builds and deploy services. CRT sends out emails for various nomination or deployment events. In early July, a number of developers (myself included) reported duplicate CRT emails. These reports lined up with an offset reset on one of our mirroring pipelines which happened to mirror the Kafka topic that contains these deployment events.

One typical cause of offset resets is unclean leader election. This happens for example when the leader broker of a partition fails and none of the other replicas are fully caught up — i.e., an out-of-sync replica has to take over as the leader which effectively truncates the log. This means a consumer that was reading from the end of the log may suddenly go out of range. It is a good practice to monitor the cluster’s unclean leader election rate. However, we did not see any unclean leader elections during this incident which occurred in July:

However, the consumer logs clearly indicate that there were offset resets for several partitions. In order to debug this incident, it is necessary to understand how consumer offset management works.

Consumers commit offsets by sending the OffsetCommitRequest to a designated broker (called the offset manager). The request contains a list of partitions and consumer offsets (or positions) in those partitions. The offset manager appends keyed messages to a special __consumer_offsets topic. The key is composed of consumerGroup-topic-partition and the value is the offset. We also maintain an in-memory cache for the latest entry for a given key so that we can serve OffsetFetchRequests quickly without having to scan through the offsets topic log. If the offset manager fails for any reason, a new broker will become the offset manager and repopulate its offset cache by scanning through the offset topic. In the example below, the mirror-maker is at position 321 for partition 0 of the PageViewEvent topic.

The offsets topic is log-compacted, which means older redundant entries for a given key will eventually be purged. Since there are transient consumers (such as console consumers) that come and go, the offset manager periodically scans through the offset cache and removes offset entries for dead consumer groups (i.e., that have not been updated in a configurable staleness period). This process also appends a tombstone to the offsets topic log so that it will also be removed from the persistent log during the next run of log compaction.

Here are excerpts from the consumer log around the time of the reset:

There were several consumer rebalances preceding the reset. Rebalances typically occur when a consumer leaves or joins the consumer group, or if new topics or partitions become available for consumption. During a rebalance, consumers stop consuming, commit their offsets, redistribute partitions across the group, fetch offsets for newly-owned partitions and then resume consumption. In the above logs, the initOffset lines indicate the offsets from which they resume. The first two rebalances resumed from a valid offset but the third rebalance attempted to resume from what is clearly a very old offset. When debugging these incidents, it is often useful to take a dump of the offsets topic as soon as possible (since compaction may run at any time). You can dump the offsets topic with a console consumer:

(You must remember to set the exclude.internal.topics property to false since the offsets topic is an internal topic.)

The dump of the offsets topic gives further insight into what happened:

The bad offset (6811737) was checkpointed over a month prior to the incident. However, there were also recent (good) checkpoints as well that should have been returned on an offset fetch. So why did the offset manager return such a stale offset?

The offset manager logs below clearly indicate that there was an offset manager movement and a subsequent issue while loading offsets into the offset cache:

This was due to a bug that crept in while making a change to the offsets topic message format. The effect was that the cache loading process on offset manager movement quit shortly after inserting the earlier, stale offset into the cache. This is why the offset fetch returned the stale offset.

However, the mirror maker in this pipeline happened to set its auto.offset.reset configuration to “latest”, which in theory means there should never be duplicates. So we need to look closer at what happened with the specific topic that contains the deployment events:

Although this topic also exhibited the same pattern — i.e., the offset fetches after the earlier rebalances returned current offsets while the offset fetch for the last rebalance returned a stale offset — that stale offset happened to still be valid even though it was over a month old. This was initially surprising because we set retention to four days for topics in this Kafka cluster. The root cause here is that this topic (being a record of deployments) happened to have fairly low volume and the log segment never rolled over as it did not hit the size-based threshold for rolling. Time-based retention relies on the last modified timestamp of each segment which does not work very well for low-volume topics that don’t roll their segments very often.

The offset management bug was fixed a while ago, while the bigger issue of accuracy in time-based retention is currently being addressed in two major improvement proposals (KIP-32 and KIP-33).

This incident occurred in one of our data-deployment pipelines in which Hadoop push-jobs send data to Kafka in the CORP (i.e., non-production) environment which we then mirror to PROD for consumption by online services. We received lag alerts that clearly indicated that the mirror makers were stuck for some reason. A bounce cleared the issue but a number of duplicates were generated downstream which suggests an offset rewind.

The consumer logs in this instance showed that the offset fetch returned -1 (which means no offset was ever checkpointed). In other words, the offset manager had somehow lost previously checkpointed offsets:

The offset manager logs offer a clue as to what might have happened:

There was an offset manager movement and the offset cache load took 17 minutes. The offsets topic is a relatively small topic so a load should generally take only a few seconds. This clearly suggests that the log compaction process must have stopped for some reason, causing the offsets topic to grow unchecked. While this is concerning, it should not have resulted in any loss of checkpointed offset information; offset fetches during a load receive an error code that indicates that a load is in progress so a consumer just has to retry the offset fetch. In this instance, the offset fetch actually returned with the equivalent of “no valid checkpoint found.”

It turns out that since the log compaction process had quit a while ago, there were several much older offsets still in the offsets topic. The offset cache load process would have loaded these into the cache. This by itself is fine since the more recent offsets in the log would eventually overwrite the older entries. The problem is that the stale offset cleanup process kicked in during the long offset load and after clearing the old entries also appended tombstones at the end of the log. The offset load process in the meantime continued and loaded the most recent offsets into the cache, but only to remove those when it saw the tombstones. This explains why the offsets were effectively lost.

So this issue was again an interplay between two orthogonal issues. We made various improvements to make log compaction more resilient and also fixed the race condition with the cache cleanup process.

Kafka-based offset management is now the default offset management mechanism for Kafka consumers. It is useful to have a good understanding of how it works in order to use it properly and troubleshoot any issues. My earlier presentation provides a more detailed overview on how offset management works and some monitoring guidelines. A brief summary follows.

Consumer lag alerts are essential to monitoring consumer health and detecting offset rewinds. However, doing this correctly without encountering false positives is tricky. Burrow helps takes the guesswork out of this and offers a useful lag-monitoring service to every consumer in your organization. It is also critical to actively monitor log compaction metrics — especially the max-dirty-ratio sensor — in addition to the offset management metrics such as the offset-cache-size, commit-rate and group-count sensors. When investigating a suspected offset rewind it is sometimes useful to take a dump of the __consumer_offsets topic. You will also want to check for typical triggers such as unclean leader elections and carefully examine the offset manager and consumer logs.

Stay tuned for future posts about other incidents we hit in 2015 such as cluster unavailability and backward incompatible changes. In the next post, we will deep dive into a cluster outage that was triggered by a subtle performance issue in an internal library. The investigation of this incident yielded important insights into monitoring Kafka’s socket server health and optimizing request handling latencies.

The streams infrastructure group at LinkedIn is hiring software developers and site reliability engineers for Kafka, our stream processing platform (Samza) and our next generation change capture technology. Contact Kartik Paramasivam for details.

Topics: Culture Open Source