InGraphs: Monitoring and Unexpected Artwork
August 3, 2017
Editor's note: This blog has been updated.
At LinkedIn, we have an internal tool for visualizing operational metrics that we call inGraphs. Since I started working for LinkedIn almost four years ago, I’ve been snapshotting inGraphs that I thought were interesting—the ones that had helped to solve a problem, demonstrated a particular pattern, told an interesting story, or just plain looked cool. In January 2016, my “stash” reached critical mass, and I decided to start publishing these inGraphs to an internal blog called “inGraph of the Week” (igotw). As the name suggests, once a week I post one or more inGraphs that I find noteworthy for any (or all) of the reasons noted above.
Given the widespread familiarity with inGraphs across LinkedIn Engineering, this internal blog has become incredibly popular. One of the best parts of running igotw is that the posts not only show off cool things observed with our monitoring systems, but also encourage familiarity with the tool and knowledge sharing within LinkedIn.
From the first post, the response I’ve received has been overwhelmingly positive. Comments often take the form of “Hey, this is awesome, you should post these publicly.” My response is generally, “Yeah, I should! Maybe tomorrow…” Well, today is tomorrow! Below are a handful of examples of the kinds of things I typically post. They have some sensitive information stripped out, but should still be representative of some of the interesting aspects of inGraphs.
Solving problems using patterns
The primary function of inGraphs is to provide operational insight into LinkedIn services. Alongside logs, it is the go-to tool for resolving incidents "in the moment" as well as understanding what happened after the fact. One way of going about this is to look for deviations from a historical norm, and, frequently—in fact, so commonly that it's a recurring theme of igotw posts—there are specific patterns to look out for. One such pattern is The Plateau.
The Plateau is typically a negative pattern—one that you do not want to see. Let's explore an example of when you don't want to see it by taking a look at a couple of inGraphs:
These both depict metrics for a single service at LinkedIn over the same time period. The majority of LinkedIn's stack is written in Java, and in a Java world, garbage collection (GC) is a fact of life, so we have a fair bit of instrumentation around GC.
The first inGraph represents the single longest "stop-the-world" GC pause (in seconds) seen within a one-minute period. The second inGraph shows the total amount of time spent in stop-the-world GC within a one-minute period. The second metric will sometimes jump above 60, but this is just jitter introduced by time-boxing to one-minute intervals: a single GC pause may last longer than 60 seconds, and it gets attributed entirely to the minute in which it was logged. Logically, though, 60 seconds is the most time a service can actually spend in garbage collection in any given minute, since there are only 60 seconds in a minute.
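The arithmetic behind these two metrics can be sketched in a few lines. This is a minimal illustration, not inGraphs' actual implementation; the event format and function name are assumptions.

```python
from collections import defaultdict

def bucket_gc_pauses(events):
    """Aggregate stop-the-world pauses into one-minute buckets.

    events: iterable of (epoch_seconds, pause_seconds) tuples.
    Returns {minute_start: {"max_pause": ..., "total_pause": ...}}.
    """
    buckets = defaultdict(lambda: {"max_pause": 0.0, "total_pause": 0.0})
    for ts, pause in events:
        minute = int(ts // 60) * 60  # time-box to the minute the pause was logged in
        b = buckets[minute]
        b["max_pause"] = max(b["max_pause"], pause)
        b["total_pause"] += pause
    return dict(buckets)

# A 75-second pause logged at one timestamp lands entirely in one bucket,
# which is the time-boxing "jitter" that lets a per-minute total exceed 60.
stats = bucket_gc_pauses([(0, 1.2), (30, 2.5), (65, 75.0)])
```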
There are three things that I really like about these inGraphs:
1. While they may not describe how things broke, these inGraphs provide a clear and nearly immediate signal that something is broken (and badly). We can (and do) use this signal to trigger alerts and escalate to someone who can take a look.
2. They are significantly more straightforward to interpret than most of the other GC/heap metrics that we emit. Those metrics can be valuable, but they often require some mental arithmetic to interpret: "Okay, so I did N GCs, and on average they took M seconds, so that means..." You shouldn't need a calculator to figure out what is going on with your service; that rather defeats the purpose.
3. They are created from out-of-band data. A script running on each host discovers the services running, parses each service's GC log, and pumps out metrics. This can be super helpful if a service is so hosed that it's not properly firing GC sensors but is continuing to write to its GC log. As a colleague (Ben Weir) pointed out, this can also be super helpful when you change the garbage collector you're using. Switching from CMS to G1 may mean that inGraphs metric names change, and you’ll need to update your dashboards to reflect that, but these metrics will continue to give a consistent representation of the amount of time spent in garbage collection, irrespective of the garbage collection strategy you've chosen.
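The out-of-band approach might look something like the sketch below. This is not inGraphs' actual collection script; it assumes a JDK 8-era HotSpot log with `-XX:+PrintGCApplicationStoppedTime` enabled, whose "application threads were stopped" lines are emitted regardless of which collector is in use.

```python
import re

# Matches the collector-independent safepoint lines written with
# -XX:+PrintGCApplicationStoppedTime (JDK 8 log format; illustrative).
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<secs>\d+\.\d+) seconds"
)

def stw_seconds(log_lines):
    """Yield each stop-the-world duration, in seconds, found in a GC log."""
    for line in log_lines:
        m = STOPPED_RE.search(line)
        if m:
            yield float(m.group("secs"))

# Hypothetical log excerpt: the G1 pause detail line is skipped, but the
# collector-independent "stopped" lines are picked up either way.
log = [
    "2017-08-03T10:00:01.123: Total time for which application threads "
    "were stopped: 0.0034567 seconds",
    "2017-08-03T10:00:05.456: [GC pause (G1 Evacuation Pause) (young), 0.0123 secs]",
    "2017-08-03T10:00:05.470: Total time for which application threads "
    "were stopped: 0.0129876 seconds",
]
pauses = list(stw_seconds(log))
```

Because the parser keys on the safepoint summary line rather than collector-specific output, a switch from CMS to G1 would not change what it reports.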
Other interesting observations
Sometimes inGraphs can expose an interesting characteristic of a service that may not be directly related to a site issue/incident. One such example:
This inGraph shows latencies from one service to another as reported by the client in three different data centers: one on the West Coast (U.S.), one on the East Coast, and one (roughly) in the middle of the country. A key that might be handy for reference:
There is an interesting point of inflection around the middle of the inGraph, and that’s what this particular post was about. Some relevant information here helps to explain that point of inflection. During the time period shown, LinkedIn still had single-data-center services—services whose source of truth resided in a single data center. At the point of inflection in the graph, a failover of a single-data-center service occurred; the source of truth was moved from the East Coast data center to one in the middle of the country.
The reason that this is interesting is that this graph can be thought of as a basic demonstration of the speed of electricity and, by extension, the speed of light. In a vacuum, electromagnetic waves propagate at the speed of light. Through a cable, they propagate at an appreciable fraction of that—say, 0.5c to 0.9c. This is very fast, to be sure...but not infinitely fast. It takes some time to move all the way across the country, and as might be expected, it takes less time to travel halfway across the country.
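A quick back-of-the-envelope calculation shows the scale involved. The distances and the 0.66c fiber propagation speed below are illustrative assumptions, not LinkedIn's actual data center geometry.

```python
C = 299_792          # speed of light in vacuum, km/s
FIBER_FRACTION = 0.66  # assumed propagation speed through fiber, as a fraction of c

def round_trip_ms(distance_km):
    """One network round trip over fiber at ~0.66c, in milliseconds."""
    return 2 * distance_km / (C * FIBER_FRACTION) * 1000

coast_to_coast = round_trip_ms(4000)  # roughly West Coast to East Coast
half_way = round_trip_ms(2000)        # roughly a coast to the middle of the country
```

Even ignoring routing, switching, and processing overhead, a cross-country round trip costs tens of milliseconds in propagation delay alone, and halving the distance halves that floor.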
Prior to the migration of the source of truth, latencies were low on the East Coast, the physical location of the data at the time; a little higher in the middle of the country; and higher still on the West Coast. After the migration, latencies dropped in the middle of the country and converged on a similar, higher value on both coasts.
igotw posts aren’t just about solving site issues or finding interesting ways to demonstrate the laws of physics. Sometimes, they’re just about inGraphs that caught my eye and reminded me of something other than inGraphs. For instance, this one reminded me of a Chinese dragon:
This inGraph looks a little different because it’s an older one from a previous version of inGraphs. It represents Kafka consumer lag—how many messages the client needs to consume before it has “caught up”—but really the main reason I posted it is that it looks like a rainbow.
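For reference, consumer lag is just the gap between the newest offset in each partition and the offset the consumer has committed. The helper below is a sketch of that arithmetic with hypothetical offset maps, not a Kafka client API.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: messages the consumer still has to read.

    Both arguments map partition -> offset; a partition with no committed
    offset is treated as fully lagging from offset 0.
    """
    return {
        p: end - committed_offsets.get(p, 0)
        for p, end in log_end_offsets.items()
    }

# Partition 0 is 50 messages behind; partition 1 has caught up.
lag = consumer_lag({0: 1500, 1: 980}, {0: 1450, 1: 980})
```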
Visualization tools like inGraphs are useful for many practical purposes within an engineering organization, particularly when they are robust and widely adopted. However, their usefulness doesn’t have to end there! Sharing and discussing visualizations, even ones that “just look cool,” can go a long way towards encouraging adoption, discussion, and, yes, enjoyment of monitoring operational metrics.