Strategies for Keeping the LinkedIn Feed Relevant
March 23, 2017
LinkedIn’s home feed is the starting point of a member’s journey on LinkedIn. Millions of members see the feed as their first impression. The feed is also instrumental to delivering on LinkedIn’s core value propositions: it allows members to stay connected with their network by following network activities, to build their professional identity by posting content, to stay informed by consuming recommended content, and to discover new economic opportunities.
Keeping the LinkedIn feed relevant by identifying unprofessional and spammy content is critical to maintaining the quality our members’ content consumption experiences. In this post, we describe the various processes and algorithms that keep our feed cleared of spam and relevant to our members.
Details on how we build and display the feed to members can be found in the blog series on Feed Personalization and in the FollowFeed blog post. Here, we will describe the specific components that drive content filtering and content quality-based ranking on the feed.
Challenges for content quality assessment
The role of the LinkedIn feed is to provide timely, professional content. What may pass as acceptable content on a general social network may not be a pleasing experience for a professional social network like LinkedIn. We would like to eliminate as much low-quality content from the site as possible. At the same time, we do not want to be overzealous about filtering content from the site, because that could lead to more false positives and user dissatisfaction. In other words, we need to strive for high precision and recall for our classification and labeling.
An additional challenge to content quality assessment on the feed is the scale: millions of posts and updates get shared on the feed, and those posts generate viral actions, like comments and likes, that spread this content on the social graph through the FollowFeed mechanism. To compound this particular challenge, having a few low-quality shares go hyper-viral can cause dissatisfaction for a very large number of members.
A man + machine approach
No classifier is perfect. Often, our content also falls into a gray zone where even two humans can have difficulty agreeing about whether or not it’s appropriate. Over the last year and a half, we have converged on a combined man+machine solution for content quality assessment. We describe our general approach below.
One of the key ways we increase our ability to detect spam and low-quality content is by treating content at different points in its lifecycle.
At creation: Our online and nearline classifiers label every image, text, or long form post as “spam,” “low-quality,” or “clear” in near real time. Our Support Vector Machine-based linear text classifiers classify content synchronously with a latency of 200 ms. Some of our deep neural networks take longer and are run asynchronously.
As it gathers audience: We continuously predict whether a share is likely to go viral. For this prediction, we monitor the network reach of the original poster, members interacting with the content, and the temporal signals like the velocity of likes, shares, and comments, in addition to the computed content quality scores. Our classifiers run every few hours on our Hadoop clusters to identify shares that are likely to go viral and are likely to be of a lower quality. We utilize random forest classifiers and gradient boosted trees for this prediction.
Our virality predictors are a critical piece of our spam/low-quality content fighting machinery. These algorithms select content for manual review before it has caused its potential damage. In online A/B tests, we have seen a reduction of 48% in spam and low-quality content impressions due to these predictors.
Most of the easy to catch content is filtered out at creation time, leaving behind hard to classify instances in the feed. This makes the classification task more difficult for these classifiers. Nevertheless, due to other information that augments content classification features, like the characteristics of members who have engaged with the content, the classifiers operate at a precision where automatic action can be taken. We have recently seen a 6 times improvement in volume of content classified at high precision using a combination of velocity, content, and member features.
As member-reported feedback accumulates: The LinkedIn feed also enables members to report low-quality content to us. We monitor these member flags in real time, and content that receives flags with high intensity is reviewed.
The next step in the detection process is to subject suspicious content to human review.
Our human labeling team, run by the Trust and Safety organization, is the centerpiece of our spam-fighting efforts.
We regularly send a sample of suspicious content identified by each classifier for human review so that our team can confirm whether or not content is indeed spam/low-quality. This allows us to continuously measure classifier performance. Human review of highly-viral content and highly-flagged content allows us to discover new spam patterns and augments our training data. We are now researching whether this continuous evaluation and measurement can be done efficiently using more sophisticated statistical estimation techniques.
Since our team of human evaluators is highly trained in assessing whether a piece of content conforms to LinkedIn’s standards, it also allows us to walk the fine line between freedom of expression and filtering out spam or low-quality content.
Finally, we apply varying degrees of treatment to content that has been identified by these prior steps. We have learned from experience that no single treatment suits all content. To ensure that our actions meet member expectations, the intensity of our action on content is proportional to the intensity of the content. We have heavily invested in our platforms to allow for a spectrum of treatments, including: demoting content in the feed ranking; restricting content to the immediate neighborhood of the poster; limiting places where content can surface (e.g., it can be seen on a member’s profile page but not on the feed); making it undiscoverable on the whole site; and, in extreme cases, also disabling the poster.
The LinkedIn spam-fighting strategy
Our spam-fighting algorithms and platforms heavily utilize open-sourced technologies. Our platforms are built entirely on Kafka and Rest.li. Our text classifiers are built primarily using liblinear, and our deep convolutional nets are trained using TensorFlow.
We are constantly improving our automatic and manual classification processes as more data become available. We strive to provide all LinkedIn members with clean, valuable feeds that help keep them informed so that they can stay ahead in their careers. Already, we’ve seen a 40 percent increase year-over-year in weekly engaged feed sessions, a great testament to the improved relevance of content in the LinkedIn feed.
This work was a huge undertaking, so we’d like to thank: The UCF team (Chandramouli Mahadevan, Sachin Kakkar, Prashanth Nimmagadda), the Feed and Content Relevance team (Souvik Ghosh, Tim Jurka, Mohamed El-Geish, Hui Wang, Alex Patry), the Trust and Safety Team (Vivek Nandakumar, Arun Kumar, Colin Bogart), the Feed and Security Product (Sara Todd, Ali Khodaei, Anindita Gupta, Madhu Gupta, Karthik Vishwanathan), the Feed Platform Team (Parin Shah, Vivek Nelamangala), the Machine Learning Algorithms team (Paul Ogilvie), and the NLP team (Dan Bikel)