Samza Aeon: Latency Insights for Asynchronous One-Way Flows
April 19, 2018
You can’t fix something if you don’t know there’s a problem. Measuring and tracking the latency of requests through your system is essential to identifying and resolving issues quickly. Many systems have built-in tooling that allows developers to monitor, set alerts on, and inspect the latency of a set of endpoints. But what if a request to your system begins a chain of events that moves through several other systems before completing? That request may return immediately and then continue processing asynchronously, making it appear in standard monitoring as a very fast endpoint. How do you know how long such a flow actually takes? To measure these asynchronous one-way flows, we built a tool called Aeon using Samza, an open source stream processing framework developed here at LinkedIn. In this blog post, we’ll describe how we built Aeon and why asynchronous latency monitoring matters.
Traditional request-response metrics are fairly easy to measure. There is a clearly defined start time (when the request is received) and a clearly defined end time (when the last byte of the response is sent and the connection is closed).
But how do you get the latency for portions of a round-trip request? What if what you’re really interested in is the amount of time it took to transform the request before it reached the backend? You might create a mental model of the request that looks like this:
To solve this, we made use of LinkedIn’s near-ubiquitous practice of emitting tracking events to Kafka to power a latency monitoring solution built with Samza.
We started by determining what flow we wanted to measure and selected some start and end points that capture the flow we were interested in. Say, for example, that we want to understand the latency within LinkedIn of sending a push notification for a message to a member’s phone. Our start event should be when a request to create a message is received by the messaging platform; let’s call this our StartTrackingEvent. Our end event will be when our push notification platform publishes a push notification to Apple Push Notification Service (APNS) or Google Cloud Messaging (GCM); we’ll call this event EndTrackingEvent. We need to process all of these events, so we’ll emit them to a messaging queue, such as Apache Kafka, to be processed later. At LinkedIn, we use Kafka extensively for tracking and monitoring purposes.
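As a rough illustration, the two events might carry a shared identifier and an emission timestamp. The event names come from this post, but the fields and the JSON encoding below are assumptions for the sake of the sketch, not LinkedIn’s actual tracking schema:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class StartTrackingEvent:
    # Emitted when the messaging platform receives the create-message request.
    pushId: str
    emittedAtMs: int

@dataclass
class EndTrackingEvent:
    # Emitted when the push platform hands the notification to APNS/GCM.
    pushId: str
    emittedAtMs: int

def to_kafka_payload(event) -> bytes:
    """Serialize an event to the bytes we'd publish to a Kafka topic."""
    return json.dumps(asdict(event)).encode("utf-8")

start = StartTrackingEvent(pushId="push-123", emittedAtMs=int(time.time() * 1000))
payload = to_kafka_payload(start)
```

The key property is that both events share the same identifier (here, `pushId`) and record when they were emitted, since those two fields are all the downstream jobs need.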
Stream processing with Samza
Now that we’ve chosen our starting and ending events and are emitting them to Kafka, we need to be able to determine the latency between them. To do that, we’ll use Samza, a stream processing framework built to consume events from Kafka and perform computations over them. In this case, we’ll be using Samza to determine the time difference between our StartTrackingEvent and our EndTrackingEvent: the latency of our flow.
Joining events using Samza
In order to determine the latency for a particular push notification, we need to join a StartTrackingEvent for a single push notification with its corresponding EndTrackingEvent. To do so, we’ve written two jobs: a Partitioner, which partitions the events to ensure that the events we’re tracking as a single push notification end up being processed by the same instance, and a Joiner, which calculates the latency of those matched events.
The first thing we want to do is ensure that we can match the StartTrackingEvent and EndTrackingEvent that make up a particular push notification within a single host. We do this with a Partitioner step. All events are initially consumed by Aeon with no placement guarantee, meaning there’s no assurance that a StartTrackingEvent and EndTrackingEvent for one push notification end up being processed by the same machine.
Luckily, both of the events share some data in common, typically a key that a StartTrackingEvent will share with only one other EndTrackingEvent. We can use this key to partition our events so that they ultimately end up being processed by the same Joiner. In this case, each event has a pushId, so we can publish all events with the same pushId to the same message queue. By doing this, we ensure that the StartTrackingEvent with pushId A will be consumed by the same Joiner which consumes the EndTrackingEvent with pushId A.
This looks like the following:
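A minimal sketch of the Partitioner’s core idea: hash the shared key so that both events for one push notification land on the same output partition. The hash function and partition count here are illustrative assumptions, not the real job’s configuration:

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; the real topic's partition count may differ

def partition_for(push_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a pushId to a partition deterministically, so the Start and End
    events for the same push notification go to the same Joiner instance."""
    digest = hashlib.md5(push_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Both events for pushId "A" are routed to the same partition:
assert partition_for("A") == partition_for("A")
```

A stable hash (here MD5, rather than a process-seeded hash like Python’s built-in `hash`) matters: the routing must be deterministic across hosts and restarts, or the two halves of a flow could land on different Joiners.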
The Joiner is responsible for calculating the latency of a single flow and publishing it to our monitoring solution, InGraphs. It does so by maintaining an in-memory cache of all the StartTrackingEvents and EndTrackingEvents it has received. Every time it consumes an event, it checks the cache for the matching counterpart. If it finds one, it calculates the latency by subtracting the StartTrackingEvent’s emission time from the EndTrackingEvent’s emission time, then emits that latency to InGraphs for real-time monitoring. To avoid filling up memory, we use a timed expiry to automatically remove cache entries after a few minutes. Consider the following example flow:
First event arrives:
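A simplified, single-threaded sketch of the Joiner’s join-and-expire logic. The cache structure, the TTL value, and the `emit_latency` hook are assumptions for illustration; the real job runs inside Samza and publishes to InGraphs:

```python
import time

CACHE_TTL_SECONDS = 180  # illustrative: expire unmatched entries after a few minutes

class Joiner:
    def __init__(self, emit_latency, clock=time.monotonic):
        self.emit_latency = emit_latency  # callback, e.g. publish to a metrics system
        self.clock = clock
        self.starts = {}  # pushId -> (emittedAtMs, cachedAt)
        self.ends = {}    # pushId -> (emittedAtMs, cachedAt)

    def on_start(self, push_id, emitted_at_ms):
        self._expire()
        end = self.ends.pop(push_id, None)
        if end is not None:
            # The End event arrived first: join immediately.
            self.emit_latency(end[0] - emitted_at_ms)
        else:
            self.starts[push_id] = (emitted_at_ms, self.clock())

    def on_end(self, push_id, emitted_at_ms):
        self._expire()
        start = self.starts.pop(push_id, None)
        if start is not None:
            # Latency = End emission time - Start emission time.
            self.emit_latency(emitted_at_ms - start[0])
        else:
            self.ends[push_id] = (emitted_at_ms, self.clock())

    def _expire(self):
        # Timed expiry keeps the in-memory cache bounded.
        cutoff = self.clock() - CACHE_TTL_SECONDS
        for cache in (self.starts, self.ends):
            for key in [k for k, (_, at) in cache.items() if at < cutoff]:
                del cache[key]

latencies = []
joiner = Joiner(latencies.append)
joiner.on_start("push-123", emitted_at_ms=1000)  # first event arrives: cached
joiner.on_end("push-123", emitted_at_ms=1450)    # match found: 450 ms emitted
```

Note that the join is symmetric: because events can arrive in either order, each handler checks the opposite cache before caching its own event.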
Making events actionable using InGraphs
Thanks to InGraphs, when we emit these latency metrics, we’re able to overlay a week-over-week comparison to understand whether latencies have gone up or down in the last week. Because these graphs are updated in real time, teams that onboard onto Samza Aeon have a real-time monitoring solution for any asynchronous flow.
Below is an InGraph for an example flow that has been onboarded to Samza Aeon.
How is Samza Aeon used to monitor systems?
Samza Aeon has given us much-needed visibility into the performance of some of the most critical systems at LinkedIn.
Push notifications: As detailed in this blog post, all emails and push notifications at LinkedIn go through a centralized notification service called ATC. Late last year, we occasionally experienced poor push notification performance. Because there were many portions of the pipeline that could have been contributing to the poor performance, we onboarded each portion of our push notification flow to Samza Aeon. Once we had the data, it was clear which portion was contributing the most to the latency. We quickly resolved the issue and implemented alerting to make sure that our internal SLAs were respected.
Real-time platform: Our first use case was to understand the latency of our new real-time delivery system. All events through the real-time platform are monitored, but we pay special attention to typing indicators, seen receipts, and message delivery to be sure that the messaging experience is meeting members’ expectations.
Making it easy
An internal monitoring tool is only useful if people actually use it. One of the lessons we learned from this project is that the easier a tool is to onboard onto, the more likely it is to be used widely. This platform had a few benefits going for it:
It solves an existing need.
It leverages systems which are already in use, namely, tracking at LinkedIn.
It has a simple onboarding procedure to pick up the events.
It’s maintained by the Realtime Team at LinkedIn, so partner teams don’t take on any operational overhead.
In this post, we presented a method for measuring the latency of actions beyond the standard request-response model. We demonstrated how we use tracking events to create our starting and ending markers, and how those events are processed and joined by Samza Aeon to calculate latencies. Finally, we showed how InGraphs makes those latencies actionable, and we shared some of the products that have been onboarded onto Samza Aeon. We hope you can apply some of these ideas in your own systems.
This project was possible thanks to the support of several teams and many engineers. We’d like to thank the following people for their contributions to this project: Swapnil Ghike, Yingkai Hu, Félix Pageau, Cameron Lee, and Haoyu Wang.