Feed

Homepage feed multi-task learning using TensorFlow

Co-authors: Ian Ackerman and Saurabh Kataria

Editor’s Note: Multi-objective optimization (MOO) is used for many products at LinkedIn (such as the homepage feed) to help balance different behaviors in our ecosystem. There are two parts to how we work with multiple objectives: the first is about training high-fidelity models to predict member behavior (e.g., probability a member will click an article). The second is around trading off different objectives for a unified member experience based on utility to the LinkedIn ecosystem (e.g., a comment is much more valuable than a click). This post will focus on the first part of multi-objective optimization, where we utilize a multi-task, deep learning model to create higher fidelity consumption models; for more information on the second part, objective tradeoffs, see this article from KDnuggets about automatically tuning this tradeoff for faster model iteration.

LinkedIn’s members rely on the homepage feed for a variety of content, including updates from their network, industry articles, and new job opportunities. Members can engage with this content in multiple ways, including commenting on or resharing posts. It is the job of our relevance system to understand the different needs of these members and present a personalized experience optimized to foster productive professional development.

We use MOO to rank the many different updates on a member’s homepage feed, balancing many tradeoffs for a personalized experience. There are two main parts to MOO: creating models to predict the likelihood of a member taking a certain action on a piece of content, and balancing the many objectives to create a final ranking. For the first part, a standard approach is to create many individual models, each optimized for a different user behavior; however, this approach limits the learning capability of the overall modeling platform, both because inference latency grows with the number of objectives to compute and because related behaviors cannot share what they learn. We have recently overhauled our feed modeling platform to jointly train models for multiple objectives in a unified, multi-task learning setup in TensorFlow, using deep learning to rank the content seen during each homepage visit. This post describes the techniques we used to accomplish this at LinkedIn’s scale and the benefits to our members that this holistic setup creates.

Feed ranking overview

Recommending the right content to LinkedIn members on the feed involves both understanding the different types of content available and personalizing for the member based on many different goals. The LinkedIn homepage serves members both as passive consumers who read about the latest professional news and conversations and as active participants in those conversations (by commenting, reacting, voting, or resharing updates). The relative likelihoods of these actions can be quite different; for instance, someone may be a hundred times more likely to read an article than to reshare it with their network. It’s important to ensure that more likely actions don’t overrule these valuable, infrequent actions.

The LinkedIn homepage feed ranking system consists of two passes, where the first pass is responsible for accumulating content results from a member’s network (such as career changes, stories, news, etc.) and the second pass is responsible for ranking the accumulated content results. We leverage a multi-objective optimization framework for both passes, combining the probabilities from other models that predict various behaviors of LinkedIn members on the homepage feed. Overall, the ranking score from the utility function is calculated as follows:

ranking score = α · F_p(passive consumption) + (1 − α) · F_a(active consumption) + λ · F_o(other)

F_p(passive consumption) refers to passive consumption related objectives (member behaviors) such as clicks and dwell time. F_a(active consumption) refers to active contribution objectives such as comments and reshares. F_o(other) accounts for objectives that do not fall into these two categories, such as creator-side feedback. α and λ are tuning parameters that balance a healthy ecosystem of increased active contribution against the interests of more passive members. The main intuition behind separating active and passive objectives is the distinct nature of the corresponding member behaviors, as we discuss in the next section.
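
To make the combination concrete, here is a minimal Python sketch of how such a utility could be computed from per-objective predicted probabilities; the objective names and the α, λ values are illustrative placeholders, not our production configuration.

```python
# A minimal sketch of the utility function above; the objective names and the
# alpha/lambda values are illustrative placeholders, not production settings.
def ranking_score(p, alpha=0.7, lam=0.1):
    """Combine per-objective predicted probabilities `p` into one ranking score."""
    f_passive = p["click"] + p["dwell"]        # F_p: passive consumption objectives
    f_active = p["comment"] + p["reshare"]     # F_a: active contribution objectives
    f_other = p["creator_feedback"]            # F_o: everything else
    return alpha * f_passive + (1 - alpha) * f_active + lam * f_other

# Example: score one candidate update given its predicted probabilities.
score = ranking_score({"click": 0.30, "dwell": 0.25, "comment": 0.02,
                       "reshare": 0.01, "creator_feedback": 0.05})
```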

Previous modeling setup
In tackling this problem, we have historically trained separate models for each of the member behaviors grouped under passive and active consumption. This approach addresses issues like label skew and allows for negative sampling techniques when training for an infrequent objective. Although a read is much more likely than a reshare, LinkedIn values contributions like resharing because they lead to a more engaged network. This influences our modeling, which assigns different values to the probability of a passive consumption than to the probability of an active contribution.

The model platforms we have used to predict these member actions have been separate logistic regression models built with Photon-ML, with XGBoost generating derived features. Modeling these behaviors individually also allowed us to quickly optimize and personalize tradeoffs between objectives by leveraging Bayesian processes in an online system.

While separate models offer flexibility and independence for data sampling and feature selection, they are inefficient at transferring learnings between objectives. For example, a comment model could transfer learnings from like behavior and focus only on what is unique to commenters, rather than learning from scratch. Separate models also limited how far we could scale our objectives, because each homepage visit requires computing each objective in real time for every update from the member’s network. As the feed has introduced more fine-grained objectives, now numbering in the double digits, latency has continued to grow. This is especially wasteful for very similar objectives, such as reshare and comment.

An illustration of the separate modeling setup that was historically used for feed ranking. Logistic regression models in yellow determine each objective’s probability. XGBoost in red and manually-derived features in green are inputs to these separate models.

New multi-task deep learning setup

We’ll first describe our deep learning setup at a high level before detailing the challenges that shaped these design choices. We overhauled our homepage feed modeling framework to jointly train several of the related objectives. This multi-task setup not only facilitates sharing across related objectives, but also simplifies a modeling framework that otherwise grows linearly, with one model per objective.

An illustration showing the new multi-task setup. Two groups of objectives are trained with two separate neural networks with common inputs.

Below are a few salient features of the setup.

Multi-task setup: Multi-task formulations of machine learning problems allow for joint learning of multiple objectives, exploiting differences and commonalities between objectives to improve performance over separate formulations. We designed our deep learning architecture to jointly train on both passive and active consumption objectives, with a specific focus on community sharing objectives, via our XGBoost-based feature design, described later in this post. In addition to facilitating shared learning among different objectives, this setup also simplifies the modeling paradigm, with just a single model to output response probability for our overall utility function. 

Response grouping for shared learning: For optimal transfer learning among objectives, we group objectives into two categories: (1) passive consumption oriented and (2) active consumption oriented. Accordingly, we split our deep learning network into two towers, one per category, shown as the two differently colored towers in the figure above. We call this the “two-tower multi-task deep learning setup.”

Data sampling and model calibration: For data sampling, we no longer perform custom downsampling per objective, but instead utilize all recorded interactions with updates across objectives. Despite fears that this could cause very uncommon objectives to be ignored, with enough data and normalization of features, all of our objectives were well trained and calibrated. This allowed our new deep modeling setup to continue interacting well with our generalized linear models, serving as a well-calibrated fixed effects model on top of which our random effects models train.

Feature space: Our multi-tower architecture consists of multiple fully connected layers, starting with an embedding lookup for sparse input features. The majority of features we have historically built are sparse, both categorical and numeric. We pass all of these sparse features into a single XGBoost model and use its leaves as input categorical features, which we embed. For high-dimensional features, such as text and images, we use dense embeddings that we feed directly into TensorFlow, relying on batch normalization or projection layers to ensure smooth training.
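
To make the setup concrete, below is a rough Keras sketch of the two-tower architecture described above; it is not our production code, and the layer sizes, vocabulary sizes, and objective names are assumptions for illustration.

```python
import tensorflow as tf

# Illustrative sizes only; real dimensions and objectives differ.
NUM_TREES, LEAVES_PER_TREE, EMB_DIM = 100, 64, 8
DENSE_EMB_DIM = 128  # e.g., pre-trained text/image embeddings

# Inputs: one (already offset) leaf id per tree from the shared XGBoost forest,
# plus dense embedding features passed straight into the network.
leaf_ids = tf.keras.Input(shape=(NUM_TREES,), dtype=tf.int32, name="xgb_leaves")
dense_feats = tf.keras.Input(shape=(DENSE_EMB_DIM,), name="dense_embeddings")

# Shared input layer: embed the leaf ids and concatenate with normalized dense features.
leaf_emb = tf.keras.layers.Embedding(NUM_TREES * LEAVES_PER_TREE, EMB_DIM)(leaf_ids)
leaf_emb = tf.keras.layers.Flatten()(leaf_emb)
shared = tf.keras.layers.Concatenate()(
    [leaf_emb, tf.keras.layers.BatchNormalization()(dense_feats)])

def tower(x, name, objectives):
    """One tower: a small MLP with a sigmoid head per objective."""
    for units in (256, 128):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    return [tf.keras.layers.Dense(1, activation="sigmoid", name=f"{name}_{o}")(x)
            for o in objectives]

# Two towers, one per response group.
passive_heads = tower(shared, "passive", ["click", "dwell"])
active_heads = tower(shared, "active", ["comment", "reshare"])

model = tf.keras.Model([leaf_ids, dense_feats], passive_heads + active_heads)
# Cross-entropy loss per objective, as described in the next section.
model.compile(optimizer="adam", loss="binary_crossentropy")
```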

Multi-task modeling details
As mentioned previously, we had historically been training a logistic regression model for each of the above objectives separately, effectively limiting the shared learning among related objectives. In the spirit of training a joint model, we started with a single model to output probabilities for all the objectives in our utility function. However, we discovered that approach performed suboptimally compared to separating the parameter space between passive consumption and active consumption oriented objectives, hence the two-tower setup.

We optimize a cross-entropy loss per objective to train a multi-layer network for both towers. We identified the following key challenges and learnings in the model training process:

  • Model variance: We observed significant variance in model performance, especially for sparse objectives (e.g., reshare), in both our offline evaluation metrics and online A/B tests. We identified that initialization and the choice of optimizer (such as Adam) contribute significantly to variance in the early stage of training. A warm start routine that gradually increases the learning rate helped overcome the majority of the variance problem; a sketch follows this list.

  • Model calibration: Our feature and model score monitoring infrastructure (such as ThirdEye) helped identify several model calibration challenges, especially where the model interacts with components external to the deep learning setup. A mismatch in observed/expected (O/E) ratios among different objectives (compared to our previous setup) was one such challenge, and we identified several negative-response sampling schemes that affected O/E ratios.

  • Feature normalization: Our XGBoost-based feature design provides the model with an embedding lookup layer that avoids feature normalization issues during training. However, as we expanded into embedding-based features, we realized that normalization would play a major role in the training process. Batch normalization and/or a translation layer helped alleviate some of these problems.
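
As a concrete illustration of the warm start mentioned in the first item above, here is a minimal sketch of a linear learning-rate warmup for Adam; the base rate and warmup length are assumed values, not our production schedule.

```python
import tensorflow as tf

class LinearWarmup(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Ramp the learning rate linearly over `warmup_steps`, then hold it flat.

    The base rate and warmup length below are illustrative assumptions.
    """

    def __init__(self, base_lr=1e-3, warmup_steps=5000):
        self.base_lr = base_lr
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        warmup_frac = tf.minimum(1.0, tf.cast(step, tf.float32) / self.warmup_steps)
        return self.base_lr * warmup_frac

# Hand the schedule to the optimizer used for both towers.
optimizer = tf.keras.optimizers.Adam(learning_rate=LinearWarmup())
```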

Analyzing the problem space of efficient scoring

In order to score a deep neural network model across hundreds of features and many objectives, it was important to instrument and measure all aspects of our model inference. For this, we did JVM profiling for our standard modeling stack on Java, TensorFlow profiling using tf-profiler, and instrumentation of system latencies and metrics.

When we investigated the performance of our previous multi-objective models, it became clear that evaluating each model independently added significant cost. Because the models were very mature, they had accumulated manually crafted historical features that were no longer optimal years later. This manual feature engineering has been superseded by more powerful techniques like XGBoost or the neural networks described in this post. Profiling made it clear that while XGBoost feature derivation accounted for a minority of computation time, it was overwhelmingly important in our model compared to hand-crafted features.

Looking into the inference costs of deep learning, the major standouts were the conversion costs from sparse string features to integer-backed tensors, the serialization costs of sending features through gRPC to TensorFlow Serving, and the costly sparse embedding lookups within TensorFlow. Luckily, the LinkedIn feed has recently adopted tensors as its standard feature representation, which removed conversion costs by making tensors first-class citizens for training and evaluation. Inference batch size was an important configuration for us to tune as well. TensorFlow scaled much better with larger batch sizes than our previous setup because it explicitly batches features together for efficient linear algebra within a batch. Tuning the batch size higher allowed for better parallelization within a mini-batch while still letting separate mini-batches in a feed session be scored in parallel.
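
As a simple sketch of that last point (the batch size and the model call are illustrative assumptions, not our serving code), a feed session’s candidate updates can be split into tuned mini-batches that are each scored as one vectorized call, with separate mini-batches evaluated concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

BATCH_SIZE = 200  # assumed value; the right size is found by profiling

def score_session(model, features: np.ndarray):
    """Score all candidate updates in one session, mini-batch by mini-batch."""
    batches = [features[i:i + BATCH_SIZE]
               for i in range(0, len(features), BATCH_SIZE)]
    # Each mini-batch is one vectorized model call; separate mini-batches
    # can be scored concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(model, batches))
```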

XGBoost as feature encoder

An illustration of how we utilize the leaves of the XGBoost trees as lookups into an embedding table to form the input layer to the neural network.

Simplification: For representing features, there was a clear benefit to pushing more work onto XGBoost while eliminating manual feature interactions. Scaling the computation of more trees was well optimized in our serving system, while many old, custom feature derivations were quite costly. To take advantage of the boosting nature of XGBoost, we eliminated separately trained XGBoost models and instead tripled the forest size of a superset objective.

Feature embedding: XGBoost can encode a derived feature by outputting a selected leaf node. These can be enumerated across the whole forest to set up a contiguous, dense space that fits naturally into embedding lookup. This direct embedding lookup avoids any issue of normalization of numeric features, as the embedding space is chosen by the training process itself regardless of eccentricities of input features. These embeddings are a natural fit for feeding into further layers of a neural network. While this limits TensorFlow to only consider the feature splits that XGBoost chose, by having sufficiently deep trees and a large forest ensemble, we were able to get good coverage of our feature space.
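
A minimal sketch of this encoding, assuming the standard xgboost and TensorFlow Python APIs (the forest shape and the synthetic data are purely illustrative): each tree reports the leaf an example lands in, and offsetting those per-tree indices by a fixed stride yields one contiguous id space for an embedding lookup.

```python
import numpy as np
import tensorflow as tf
import xgboost as xgb

# Illustrative forest shape; the stride must be at least the largest
# per-tree leaf (node) index plus one.
NUM_TREES, NODES_PER_TREE, EMB_DIM = 100, 64, 8

# Tiny synthetic forest purely for illustration.
X = np.random.rand(1000, 20)
y = (X[:, 0] > 0.5).astype(int)
booster = xgb.train({"max_depth": 4, "objective": "binary:logistic"},
                    xgb.DMatrix(X, label=y), num_boost_round=NUM_TREES)

# pred_leaf=True returns, for each example, the index of the leaf it lands in
# within each tree: shape (num_examples, NUM_TREES).
leaves = booster.predict(xgb.DMatrix(X), pred_leaf=True)

# Enumerate leaves across the whole forest into one contiguous, dense id space.
offsets = np.arange(NUM_TREES) * NODES_PER_TREE
leaf_ids = (leaves + offsets).astype(np.int64)

# The contiguous ids fit naturally into an embedding lookup that feeds the
# first layer of the neural network.
embedding = tf.keras.layers.Embedding(NUM_TREES * NODES_PER_TREE, EMB_DIM)
input_layer = tf.keras.layers.Flatten()(embedding(tf.constant(leaf_ids)))
```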

Serialization: Since we use TensorFlow Serving for our TensorFlow models, we have to make an extra gRPC call to send all features to the external process. We can pass the XGBoost-encoded leaves in a dense tensor, because the number of trees and leaves per tree can be fixed. A dense tensor shrinks the data sent over gRPC to a third of the size of an equivalent sparse tensor representation (as no indices are needed).
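
For illustration, here is a sketch of such a request using the standard TensorFlow Serving gRPC client; the endpoint, model name, signature input name, batch contents, and timeout are hypothetical placeholders.

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Hypothetical serving endpoint and model/input names.
channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# One dense row of encoded leaf ids per candidate update: shape (batch, NUM_TREES).
leaf_ids = np.zeros((200, 100), dtype=np.int32)  # placeholder batch

request = predict_pb2.PredictRequest()
request.model_spec.name = "feed_ranker"
request.inputs["xgb_leaves"].CopyFrom(tf.make_tensor_proto(leaf_ids))

# A dense int32 tensor carries no index lists, keeping the payload small.
response = stub.Predict(request, timeout=0.1)
```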

Progression of relative costs after optimizing embedding lookups and sparse operations (from left to right). Dense matrix operations remain constant in absolute terms throughout iterations.

The diagram above details the progression of optimizations we made to reduce sparse concatenations of many separate features, from using only a tree input as a sparse feature to the dense tensor described above. The yellow dense operations remained a constant cost, but better feature encoding vastly sped up training and evaluation by utilizing XGBoost leaf nodes intelligently.

Converting feature space: Another advantage of using XGBoost to encode input for TensorFlow embeddings is that it avoids expensive conversions through vocabulary maps, which translate human-readable features into a space suited for embedding and matrix multiplication. Previously, adopting these conversions in models required expensive lookups on the critical path, at one point measured at around 20% of feed model latency during the tensor migration. While not needed for the second pass ranking system thanks to the above-mentioned migration, XGBoost normalization allowed for easy experimentation with TensorFlow in models that hadn’t yet fully migrated to tensor features.
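
As a hypothetical illustration of the lookup step this sidesteps, a string-keyed feature would normally pass through a vocabulary table (sketched below with tf.lookup) before it could index an embedding, whereas XGBoost leaf ids arrive as small contiguous integers and can be embedded directly.

```python
import tensorflow as tf

# Hypothetical vocabulary lookup that string-keyed features would require;
# XGBoost leaf ids skip this step entirely.
keys = tf.constant(["industry:software", "industry:finance"])
ids = tf.constant([0, 1], dtype=tf.int64)
vocab = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(keys, ids), num_oov_buckets=100)

feature_ids = vocab.lookup(tf.constant(["industry:finance", "industry:unknown"]))
embedding = tf.keras.layers.Embedding(input_dim=2 + 100, output_dim=8)
vectors = embedding(feature_ids)
```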

Looking forward

Thanks to the efficiencies of feature encoding with XGBoost, along with the multi-task architecture, we were able to ramp the new feed experience to LinkedIn members with overall faster inference time. The strengths of transfer learning between different objectives and better feature representation allowed for a richer, more personalized feed. Thanks to the holistic modeling setup, this approach led to an increase in member engagement across both passive and active consumption. The Feed AI team is very excited about the future avenues we can explore on feed now that we have a flexible training platform like TensorFlow. One of the first steps is onboarding embeddings directly into our neural network for rich information about content. We are also exploring different architectures, including normalization, so we can reduce reliance on XGBoost by onboarding the most important features directly into TensorFlow.

Acknowledgements

We had lots of help from across LinkedIn in making this project possible. Thank you to Jason Zhu and Qi Xu from Feed AI; Shunlin Liang, Virurpaksha Swamy, Siva Popuri, Lance Wall, Aliaksei Dubrouski, and Jimmy Guo for helping scale online inference; and Pei Lun, Ann Yan, Abin Shahab, and Logan Bruns for collaborating on offline training infrastructure.