Auditing content features in FollowFeed

Banu Muthukumar

Staff Software Engineer at LinkedIn

August 27, 2019

The LinkedIn feed relies on a ranked list of the most relevant content for a member. More than 80% of the feed is organic content created by people, companies, or groups that a member is following; the rest consists of recommendations such as jobs, articles, or ads. All organic content in the Linkedin feed is powered by FollowFeed. FollowFeed has two main responsibilities: the efficient indexing of content and the optimal storage of the attributes and metadata of the content (also known as features in this blog) to be used for ranking in proximity to the data for efficient relevance scoring.

FollowFeed ingests activities (content created by members) and features through Kafka. When a LinkedIn member creates an activity, a Kafka event is consumed by FollowFeed with various details about the activity to be indexed. Simultaneously, features are generated for the activity from various sources. Examples include language, hash tags, author geolocation, etc. These features are ingested by FollowFeed through Kafka from various sources.

On the serving side, FollowFeed uses broker nodes to fan out the request to multiple storage nodes using a scatter-gather pattern. The storage nodes are partitioned by member/author ID and store a time-sorted list of activities created by them. The storage nodes also store features relevant to the activities for ranking. The storage nodes rank the activities and send back the top activities along with their features to a federation layer for the next stage of ranking. It is important to note that each storage node only contains the features related to the activities indexed by it.

FollowFeed flow for ingesting Activities and Features

Typically, it is not enough to store just the features associated with a particular activity. We also need the features associated with the activities referenced by it. Let’s take the example of the language feature that describes the language of an activity. As seen in the example above, if Activity A has a language feature, this feature will already be stored in the Storage Node that contains Activity A, Storage Node 1. If we receive a new activity that comments on Activity A, we will need to propagate the language feature for A to the storage node that the comment is mapped to, Storage Node 2. Apart from this, the language feature for the comment will also be stored in Storage Node 2. To propagate the features to all the referenced content, i.e. the various ancestors, we use the Activity Graph, which stores all the relationships between activities.

Auditing challenges in feed ranking

Occasionally, we used to receive reports from our AI team partners about features for a few activities missing during feed ranking. There could be two main reasons for this:

A delay in feature ingestion if the feature consumer is not able to keep up with the feature production rate. This could happen due to unexpected bursts of traffic.
An issue with the feature propagation mechanism that may result in the feature not being sent to all the storage nodes that are supposed to receive it.

For instance, we may have missed the onboarding of “Like” activities (created when a member Likes any content displayed in their Feed) onto the Activity Graph and hence, causing ancestor features to not be propagated to storage nodes indexing the “like” activities. This issue was present for a while before we were able to pinpoint that the features were not correctly propagated.

While it is quite hard to debug why the features are missing, the bigger challenge is understanding how widespread a problem is because FollowFeed did not have a mechanism to audit the features ingested into the system.

When features were missing, we wouldn’t notice it unless there was a drop in metrics. We also did not have the metrics for feature propagation to various entities. For example, if a feature was expected to be propagated to 50 hosts, we did not have a way to tell if the features made it to all of them. We needed a long-term solution.

Solution and implementation

To get a better understanding of the features flowing into FollowFeed along with the potential propagation issues that could arise, we decided to measure feature coverage: the percentage of features that were successfully ingested and available during scoring. To understand the coverage of our features at the time of retrieval, we built an auditing pipeline that would compare the features that were ingested by the system to the features that were successfully retrieved at the time of scoring. The feature coverage can thus be calculated as the percentage of entities that have the features present in ingestion events, as well as in the response events.

All features are ingested into FollowFeed through Kafka from a variety of sources. To assess the coverage of features that have been successfully propagated, we leveraged Kafka and LinkedIn’s offline infrastructure.

Kafka provides a way to automatically ETL events into HDFS via Gobblin. The features ingested by FollowFeed are ETLed into HDFS. The response (along with the features used for scoring) sent by the dark-canary broker nodes was also ETLed. The dark-canary broker nodes essentially run the service in a dark mode without serving live traffic. They are typically used in stress testing the system. Using the dark-canary node to collect response events ensures that we do not add any overhead to nodes serving live traffic by firing sampled response events for audit purposes. We also only ETL 0.1% of the responses. We use a sampling rate because FollowFeed sends back millions of responses per second; 0.1% is enough to get a statistically significant result.

We set up a Spark job to load both ingested events and response events. The response events have all the relevant features and contain a list of entities referenced by the response. If an entity does not have features in the response, the ingestion data is checked for the feature. If the ingested data has this feature, we count this as a coverage miss.

The job processes the response events from the past day so that features coverage is computed on a day-by-day basis. The response data is typically verified against feature ingestion events from the past 30 days. The feed typically consists of activities from the past month and features required for the various activities in the response are typically ingested close to the time of creation of activities. For computing the coverage without reasonable accuracy, 30 days of feature ingestion data is used.

Metrics and monitoring

Once the coverage was computed, there was a need to observe the trend overtime to monitor and understand the pattern of coverage. This was done using Raptor, LinkedIn’s internal reporting and dashboarding platform which allows visualization of multi-dimensional data. Raptor reads the daily features coverage metrics stored in HDFS and allows filtering the data based on different dimensions for visualization. To ensure that the feed is always served leveraging all the features data, it is useful to understand the pattern of features coverage over time. To help identify anomalies, the coverage metrics were onboarded onto ThirdEye. This has turned out to be very useful in learning the coverage pattern over time and detecting outliers.

The audit tool has already proved to be useful in quickly identifying gaps in our features propagation logic or any other undesired changes. With a little tweaking, it can be extended to analyze various input and output formats of data to gather coverage-related metrics.

Acknowledgements

We would like to thank Jeet Mehta for his contributions in building the preliminary version of this system and Hongyi Zhang for his guidance and support throughout. We also received a ton of help and feedback from our amazing partner teams: Spark, Azkaban, and Feed AI. In no particular order: Hassan Khan, Parin Shah, Vivek Nelamangala, Zheng Li, Ying Xuan, Saurabh Kataria, Walaa Eldin Moustafa and Fangshi Li.

Topics: Feed Open Source Data