Building the Activity Graph, Part I
June 12, 2017
Serving a feed of relevant, personalized content to 500 million members is a massive undertaking. Accordingly, our feed infrastructure is constantly evolving to take advantage of new relevance models, new features, and more efficient ways of scaling our infrastructure. In this post, we describe the Activity Graph, a new system that allows us to understand deep relationships between members’ content.
The origin of the Activity Graph
The story of the almost year-long project behind LinkedIn’s Activity Graph begins with a bug report, as things usually go. We noticed that sometimes, sponsored content (i.e., an ad) would show up in the first position in a member’s feed. This is against our internal best practices and something we actively try to avoid; we want the most interesting organic content to be the first thing a member sees, not an ad.
Speaking of “organic content,” let’s define it: it consists of the pieces of member-generated content in the feed, which we call “Activities.” An Activity is defined by three main components: Actor, Verb, and Object. An example in prose would be “Val shared a text post,” or “Vivek liked a comment.” We present these Activities as cards in the feed UI.
The root cause of an ad showing up as the first piece of content was that we did actually prepare an organic Activity for the first slot, but that Activity was dropped later on (during a process we call “decoration”) because it was marked by our systems as spam. So the problem stemmed from spam being removed at the last moment before the feed was displayed to the member.
But it wasn’t just spam organic content we had to worry about. Our policies around spam were changing to introduce the concept of “low-quality” (LQ)—content that’s not quite spam, but that most LinkedIn members wouldn’t want to see. You can read more about how we classify this kind of content in another post about various strategies for keeping the feed relevant for our members. We needed a way to support these new business rules, and dropping LQ content during decoration (in addition to dropping spam) made the ad-in-first-place issue worse.
This also created other problems beyond a poor member experience; our relevance teams train their machine learning models based on member interactions with the feed, and those models need accurate data. If an Activity is dropped during the decoration step, what the member actually sees when the page is rendered does not match what the model thinks they saw.
The decoration step
It’s important to understand what decoration means for the feed before we continue. The lower layers of the feed stack (the FollowFeed system) deal with identifiers for Activities; an example would be urn:li:activity:123. So when FollowFeed recommends a list of Activities, it will send back a list of URNs. Near the top of the stack, the URNs need to be resolved so that the full data for each Activity is available; but the Activity data itself could be referencing yet more URNs, and those too need to be resolved as a result. This step of recursive URN resolution is called decoration (“deco”).
If the decoration system sees that any URN that can be transitively reached during decoration has some spam/LQ state attached, it refuses to decorate the top-level record. This is meant to be a safety net.
The solution seems obvious: don’t serve spam/LQ content up the stack in the first place!
Making FollowFeed understand spam and LQ content
LinkedIn’s organic feed is served by FollowFeed. Historically, it has relied on decoration to drop spam and LQ content. To help keep spam/LQ content from even reaching the decoration step, FollowFeed’s indexing system needed to know when an Activity was spam/LQ by ingesting events from LinkedIn’s spam classifiers. But there’s a little bit of nuance here: since decoration drops the top record if it sees any transitive URN is spam/LQ, we need to do the same thing, but at the indexing layer too.
And now we reach the really hard part: we need to build a graph of all the Activities and how they relate to each other and all the other URNs that they reference. Without this, we just aren’t aware of the transitive relationships between URNs.
Let’s start with an example subgraph:
If article:1 is marked as spam, we want to ensure that FollowFeed won’t serve any of the Activities in the subgraph. There are multiple components needed to make that happen, not the least of which is building the graph in the first place.
Storing the nodes and the edges
At LinkedIn, our main storage technology is called Espresso, a distributed NoSQL document database with a very convenient REST API. We needed to model the graph on top of this document store.
How do we store the nodes? Since Espresso talks REST, we take a URN and hash it with MurmurHash3 (128 bits to avoid collisions), which, when converted to hexadecimal, produces a nice ASCII resource key we can use in URLs. It’s also of a consistent length, which is very useful because Espresso has limits on key lengths and some content URNs can be very long.
But how do we store the edges between nodes? Initially, we wanted to go with the simple and obvious approach of just storing a list of child URNs inside each node. That didn’t work because Espresso has limits on record sizes and some nodes have tens of thousands of children (if not more) which makes the records too big. So this approach doesn’t scale.
To solve this, we use a nifty Espresso feature: subresources. If you create records for keys, like so:
you can then query /parent1 and get a list of all subresources through pagination. This enables the listing of all the children for popular nodes without overwhelming the storage layer. An additional (and rather critical) benefit is that all the subresources are stored in the same Espresso shard as the parent resource, giving us consistency guarantees.
This is what we use to store the graph edges. Here’s an example that pretends the keys are raw URNs instead of hashes:
The RECORD_DATA record is where we store the data for urn:li:article:1. The other subresource records store only the raw child URN; the mere existence of the subresource signifies a parent-child relationship.
Note that the record data for, say, urn:li:activity:1 is stored in its own RECORD_DATA record.
Preventing cycles in the graph
The above data representation makes the algorithm that uses it rather obvious; when we need to propagate data down the graph, we do so in a breadth-first manner. We process all the children for a node, then all the children for those children and so on until we exhaust the subgraph.
But what happens when we get a cycle in the graph? The answer is “a major production issue” because the propagation system gets stuck in a loop and the feed indexing system grinds to a halt. This wasn’t very fun at all.
It should be pretty clear that no cycles should exist in the feed content graph in the first place. Why would there be? It’s not as if a Like on a Comment on a Share of an Article can somehow have that Article list the Like as a parent! But alas, these things do happen with real data; not because such a relationship makes any sense whatsoever, but because the systems that build the underlying data from which we build the graph aren’t perfect. Bugs happen.
So cycles, while rare, do exist. It’s not feasible to clean them all up (identifying all of them in the first place is very hard), and even if we could, that doesn’t protect us from future bugs. It’s much easier to build strong validation checks. We may not be able to enforce sanity on all the data coming into the graph-building system, but we can check whether adding a new edge would create a cycle and then abort the operation.
Such a check requires another piece of data.
PathsToRoots and why we need it
To be able to detect that adding child B to parent A will produce a cycle, we need to know what all the ancestors of B are. If A is already an ancestor of B, then adding such an edge creates a cycle.
If we had a fancy graph database which we could efficiently query for all the ancestors of a node, we would just use that. But we don’t have that—we have a document store. So we need to pre-compute and store this data. In database terms, we need a materialized view.
We’ve implemented this with a field called PathsToRoots, present in each graph node. It stores a list of lists of URNs, where each list stores a path of URNs from that node to a single reachable root node. There’s a list for every reachable root.
For instance, the PathsToRoots for activity:7 in the above graph is:
[[activity:5, activity:2, article:1],
[activity:5, activity:2, image:1]]
We of course need infrastructure to ensure this field is always correct, which isn’t easy in a distributed system. Building a consistent, atomic, and race-condition free graph storage layer on top of a distributed document database that doesn’t support cross-shard transactions or row-level locks is…challenging. This took months of work.
Propagation with materialized views
Everything up to now has been about building the graph of connections between Activities and URNs they reference (directly or indirectly). But we built the graph to propagate information through it.
Consider what happens when activity:2 (above) gets marked as spam. We want that information to reach all the nodes in the subgraph beneath that node. That’s the high-level view, but the reality is more complex.
There are two approaches here: the first one is to store this transitive state inside the graph node itself, thus in a way building another materialized view for all the “inherited” data. We could then update this inherited data when changes happen and also inform anyone interested in this with events on a Kafka feed.
We actually implemented this! Unfortunately, it came with several issues, not the least of which was that we ended up needing a copy of these graph relationships in the serving layer too (for various reasons), so all this materialization work was rather pointless.
A better design is to leverage the above PathsToRoots data at serving time. To explain this, we need a little detour into how FollowFeed works.
FollowFeed stores Activities and gets queries like “for a given feed-viewing member X who is connected to members A, B, and C, what should the feed for member X look like?”
FollowFeed uses a standard broker-searcher architecture. Stateless broker nodes receive requests and fan them out to stateful searcher nodes, where each searcher has a segment of the feed index. The index itself is a mapping of member ID to a time-ordered list of Activities that a member has created through their actions. The searcher nodes then rank recent Activities created by all the members the feed-viewing member is connected to and send a subset of those Activities back to the broker handling the feed request.
It’s important to understand the consequence of every searcher node hosting a segment of the index: a searcher node knows only about the Activities that are part of its segment! So the searcher node storing activity:7 may not know anything about activity:5, activity:2 or any other URN “above” it in the graph; all of that data could be part of a different index segment, served by a different searcher.
These searcher nodes are updated with information from the feed indexing system, which sends them data over Kafka. It is this same indexing system that is building and using the Activity Graph.
Propagation, now with PathsToRoots
Propagation for the Activity Graph has a very specific meaning: given an Activity X and some feature data associated with it or any ancestor of it, that feature data needs to be available to all the machines interested in Activity X.
That’s a mouthful, so let’s unpack it into two specific use-cases. Going back to our example graph, what happens when activity:2 gets new feature data that says the Activity is spam? We need to make sure that all the searcher nodes hosting any of the Activities in this subgraph have that piece of information. You can think of this as walking “down” the graph.
We implement this by recursively walking through the graph, using a Kafka work queue for durability. We discover all the descendent nodes, and for each one, we map that node to a searcher machine. We then send the machine the feature data for the ancestor over Kafka.
The second use case is when activity:2 (which is already marked as spam) gets a new Activity for comment:4. The searcher node hosting the Activity for comment:4 needs to have the feature data for activity:2 which says that it is spam. The same goes for any other ancestor node of the Activity for comment:4. You can think of this as walking “up” the graph.
Since we only store child pointers, we use our PathsToRoots field to figure out all the ancestor nodes of the Activity for comment:4. The feed indexing system collects the feature data for all the ancestors and sends it (over Kafka) to the searcher node hosting the new Activity.
The extra nuance is that aside from feature data, we also send each node’s PathsToRoots to the searcher machine.
Filtering and ranking
Now that we know every searcher node has all the information it needs, we can correctly implement Activity filtering and ranking. Whenever we need to filter or rank an Activity, we locally fetch the PathsToRoots for it. With that, we can locally fetch the feature data for every ancestor and know that it will be present on the same searcher machine (the propagation system ensures this).
We can now drop an Activity in the searcher nodes if any of its ancestors have a spam state. We can also use this data during ranking for better feed relevance. No more Activity drops at decoration!
This system is fully deployed to production. We have seen Activity rejection at decoration time because of spam/LQ drop to zero. We’ve also enabled more flexibility around handling of LQ during ranking.
Part 2 in this series will talk about several things: the changes we made to our searcher nodes to efficiently store and query the new feature data and how we intend to leverage the new Activity Graph for other use-cases at LinkedIn.