Enhancing homepage feed relevance by harnessing the power of large corpus sparse ID embeddings
August 23, 2023
At LinkedIn, we strive to provide our members with valuable content that can help them build professional networks, learn new skills, and discover exciting job opportunities. To ensure this content is engaging and relevant, we aim to understand each member's specific goals and preferences. This may include interests such as keeping up with the latest news and industry trends, participating in discussions by commenting or reacting, contributing to collaborative articles, or sharing career updates.
To achieve this, we are continually working to modernize our architecture, and we have recently made further improvements that simplify the process while maintaining excellent performance. While we are still exploring the possibilities, we believe that the infrastructure we have built has the potential to benefit other large-scale modeling efforts at LinkedIn.
In this blog post, we are delighted to introduce a significant upgrade to our model's capabilities. The model can now handle a larger number of parameters, resulting in higher-quality content delivery. This development is a game-changer that taps into the full power of deep learning with large data, allowing us to offer even more personalized feed content for our members.
Transforming large corpus sparse ID features
Our Homepage Feed produces billion-record datasets over millions of sparse IDs on a daily basis. To improve the performance and personalization of the feed, we have added the representation of sparse IDs as features to the recommendation algorithms which power these products.
Our focus is on transforming large corpus sparse ID features (such as hashtag ID or item/post ID) into embedding space using embedding lookup tables with hundreds of millions of parameters trained on multi-billions of records. Embeddings represent high-dimensional categorical data in a lower-dimensional continuous space, capturing essential relationships and patterns within the data while reducing computational complexity. For example, members who share preferences or often interact with the same type of content or a similar group of other members tend to have similar embeddings, resulting in a smaller distance in the embedding space. This capability enables the system to identify and recommend content that is contextually relevant or aligns with member preferences.
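As a minimal sketch of this idea (vocabulary, IDs, and dimensions are invented for illustration; production tables hold hundreds of millions of parameters), converting a sparse string ID to a dense vector is a row lookup into a trained table, and similarity between IDs becomes distance in the embedding space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny lookup table: a vocabulary of 5 hashtag IDs, 4-dim embeddings.
vocab = {"#GAI": 0, "#ML": 1, "#Hiring": 2, "#Design": 3, "#News": 4}
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(hashtag_id: str) -> np.ndarray:
    """Map a sparse string ID to its dense embedding row."""
    return embedding_table[vocab[hashtag_id]]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity in embedding space; related IDs end up closer together."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(embed("#GAI"), embed("#ML"))
```

In a trained model, rows like these are learned jointly with the ranking task, so frequently co-interacted IDs drift toward each other.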
Additionally, embeddings can address the 'cold start' problem, which arises when there is limited information about new items or members. By mapping new items or users to the existing embedding space, the system can generate meaningful recommendations based on similarities with known items or members. This process enhances the content quality for new members, contributing to a more engaging experience.
The efficacy of embedding mentioned above helps to deliver a more personalized feed to our members that better serves the goals of our vision.
We also recognize the importance of upleveling our machine learning infrastructure stack to ensure swift and dependable training, offline inference, and online serving for large models. Infrastructure is the backbone that determines how large and complex the models we operate can be, which translates directly into business impact. Reliability is equally critical to ensuring that our site grows in a healthy and sustainable way.
Figure 1. String ID to embedding conversion
(Table size and examples are only for demonstration, not reflective of the real production numbers)
Model Architecture Setup
Our current ranking model is a mixed effect model (GDMix), where a global model captures the global trend and a random effect model accounts for heterogeneity. The random effect model operates at the per-item/post level, meaning each item or post has its own coefficients to capture its popularity among certain members. Random effect models are trained separately from the global model at a more frequent cadence to adapt to the rapidly evolving ecosystem.
Following a similar recipe of assigning unique coefficients to capture finer-grained signals, we aimed to enhance the global model's predictive power by onboarding a new group of features that encompass learned embedding tables of member and hashtag IDs. Each row of an embedding table encodes the unique representation of an ID seen during training, jointly trained with the ranking prediction task. Embedding representations of these features interact thoroughly with the existing dense Multilayer Perceptron (MLP) layers and are updated through backpropagation.
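To make the global/random effect split concrete, here is a toy scoring sketch (all names and sizes invented, and real GDMix coefficients are learned, not sampled): the final score combines a shared global model with a small per-item correction vector, which is zero for items that have no random effect model yet.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Hypothetical GDMix-style scorer: one shared global coefficient vector plus
# per-item ("random effect") coefficients learned separately for each post.
global_w = rng.normal(size=dim)
random_effects = {"post_123": rng.normal(scale=0.1, size=dim)}

def score(features: np.ndarray, item_id: str) -> float:
    """Global trend plus an item-specific correction (zero for unseen items)."""
    item_w = random_effects.get(item_id, np.zeros(dim))
    return float((global_w + item_w) @ features)

x = rng.normal(size=dim)
s = score(x, "post_123")
```

Because the correction defaults to zero, unseen posts fall back to the global model's prediction.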
Figure 2. Feed second pass ranker architecture overview
In Figure 2, we define an actor as a member (Jane) who creates or interacts with a LinkedIn post; a root actor as the originator of the post (Jane) when another member (Rick) likes, comments on, or reshares Jane's post (Rick then becomes the actor); and a hashtag as a word or phrase in a LinkedIn post preceded by the # symbol that classifies or categorizes the accompanying text, e.g., #GAI.
Among these features, we also incorporate members' past history by aggregating the embeddings of members they have interacted with into a final, comprehensive representation. For example, to represent a particular member, we use not only that member's single ID embedding: we also collect the top members or hashtags the member has interacted with over the past several months and pool their embeddings before passing them to the MLP layer. As a result, the model parameter size increased to several hundred million, dominated by multiple large embedding lookup tables.
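A minimal sketch of this history pooling (table size, pooling choice, and feature dimensions are illustrative assumptions): embeddings of recently interacted IDs are mean-pooled and concatenated with the flattened dense features to form the MLP input.

```python
import numpy as np

rng = np.random.default_rng(2)
table = rng.normal(size=(100, 8))  # hypothetical hashtag embedding table

def member_representation(history_ids, dense_features):
    """Mean-pool embeddings of the IDs a member recently interacted with,
    then concatenate with flattened dense features as the MLP input."""
    pooled = table[history_ids].mean(axis=0)         # (8,)
    return np.concatenate([pooled, dense_features])  # fed to the MLP layers

rep = member_representation([3, 17, 42], np.ones(5))
```

Pooling keeps the input width fixed no matter how long the interaction history is.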
Dense gating with larger MLP layer
As mentioned above, one of the benefits of introducing personalized embeddings to the global model is the opportunity to interact with existing dense features, most of which are multi-dimensional count-based and categorical features. We flatten these multi-dimensional features into one dense vector and concatenate it with the embeddings before passing everything to the MLP layers for implicit interactions. We found that one of the most straightforward ways to introduce gains is to enlarge the width of each MLP layer for more thorough interactions, and these gains translated into online lift only when personalized embeddings were present. However, this comes at the cost of extra scoring latency from the additional matrix computations. To address this, we found a sweet spot that maximizes our gains within the latency budget.
Later, inspired by GateNet, we introduced the widely used gating mechanism to the hidden layers. This mechanism regulates the flow of information to the next stage within the neural network, enhancing the learning process. We found the approach most cost effective when applied to hidden layers, where only negligible extra matrix computation is introduced while consistently producing online lift.
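A minimal NumPy sketch of a hidden gate in the spirit of GateNet (shapes and weights are invented; a real layer would be trained): one extra matrix multiply produces a per-unit sigmoid gate that scales the hidden activation element-wise, which is why the added cost is negligible relative to the layer itself.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_hidden_layer(x, w, w_gate):
    """Hidden-gate sketch: the layer's ReLU activation is modulated
    element-wise by a learned sigmoid gate in (0, 1), regulating how much
    of each hidden unit flows to the next stage."""
    hidden = np.maximum(x @ w, 0.0)  # ordinary hidden layer
    gate = sigmoid(x @ w_gate)       # one cheap extra matmul
    return hidden * gate

x = rng.normal(size=(2, 16))
out = gated_hidden_layer(x, rng.normal(size=(16, 32)), rng.normal(size=(16, 32)))
```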
Figure 3. The hidden gate layer, from GateNet paper
All session training data
Data is the key to unleashing the full potential of personalized embeddings. Our model was originally trained with impression data from a segment of sessions, with the understanding that limited impressions or samples are not sufficient to have a good representation for millions of members and hashtags.
To mitigate sparsity challenges, we incorporate data from all feed sessions. Models may favor specific items, members, and creators (e.g., popular ones versus ones overrepresented in training data), which can be reinforced when training on data generated by other models. We determined that we can reduce this effect by reweighting our training data with an inverse propensity score (Horvitz & Thompson, 1952; Doubly Robust Policy Evaluation and Optimization):
inverse propensity score(position, response) = RandomSession CTR(position, response) / NonRandomSession CTR(position, response)

where RandomSession corresponds to sessions in which a viewer sees randomly selected items, and NonRandomSession corresponds to sessions served by an ML-trained model.
These weights are subsequently utilized in the cross entropy loss calculation, enabling a more accurate and balanced training process.
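As a minimal sketch of how these weights enter the loss (the CTR numbers are purely illustrative): each (position, response) bucket gets the ratio of random-session CTR to model-session CTR, and that weight scales the per-example cross entropy.

```python
import numpy as np

def ips_weight(random_ctr: float, nonrandom_ctr: float) -> float:
    """Inverse propensity score for a (position, response) bucket:
    CTR observed in random sessions over CTR in model-served sessions."""
    return random_ctr / nonrandom_ctr

def weighted_cross_entropy(y, p, w):
    """Binary cross entropy with each example scaled by its IPS weight."""
    y, p, w = map(np.asarray, (y, p, w))
    return float(-(w * (y * np.log(p) + (1 - y) * np.log(1 - p))).mean())

# Illustrative numbers: a bucket the model over-serves gets weight < 1,
# shrinking its influence on the gradient.
w = ips_weight(random_ctr=0.02, nonrandom_ctr=0.05)
loss = weighted_cross_entropy([1, 0], [0.8, 0.3], [w, w])
```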
Flow footprint optimization
We process data with Spark in multiple stages composed of both row-wise and column-wise manipulations such as sampling, feature join, and record reweighting. While our original approach was to materialize the output at each step (TB-sized data) for ease of further analysis, this didn’t scale with dozens of flows producing giant footprints in parallel. We adapted our in-house ML pipeline to output only a virtual view of data at each step, which stores computation information for generating the intermediate data in memory without materializing it, and then triggers the whole DAG to materialize the data right before the trainer step, reducing the storage footprint by 100x.
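A toy sketch of the "virtual view" idea (plain Python standing in for our in-house pipeline; Spark achieves the same via lazy transformations): each step records how to compute its output instead of writing TB-sized intermediates, and only the final trainer-side call executes the whole chain.

```python
class VirtualView:
    """Deferred pipeline step: stores the computation, not the data."""

    def __init__(self, compute):
        self._compute = compute  # nothing materialized yet

    def then(self, step):
        # Compose another step onto the DAG without running anything.
        return VirtualView(lambda: step(self._compute()))

    def materialize(self):
        # Trigger the whole DAG once, right before the trainer step.
        return self._compute()

view = (VirtualView(lambda: list(range(10)))
        .then(lambda rows: [r for r in rows if r % 2 == 0])   # sampling
        .then(lambda rows: [(r, r * 0.5) for r in rows]))     # reweighting
rows = view.materialize()
```

Only the final `materialize()` touches storage, which is what drives the 100x footprint reduction.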
Training speed and scalability
We solve the problem of efficient training by adopting and extending Horovod on our Kubernetes cluster, the first major use case of Horovod at LinkedIn. Horovod uses MPI and offers finer-grained control for communicating parameters between multiple GPUs. We used careful profiling to identify and tune the most inefficient ops, batched parameter sharing, and evaluated the tradeoffs between sparse and dense embedding communication. After these optimizations, we saw a 40x speedup and were able to train on multi-billion-record datasets, with parameter sizes in the hundreds of millions, on a single node in a few hours. Here are some optimizations with prominent training speedups:
Op device placement: We’ve identified that data transfer between GPU and CPU is time consuming for certain TensorFlow ops, so we placed the corresponding kernels on the GPU to avoid GPU/CPU data copy, resulting in a 2x speedup.
I/O tuning: I/O and gradient communication are huge overheads when operating with large models trained on extensive records encompassing historical interaction features. We tuned the I/O read buffer size, number of reader parsing threads (which convert on-disk format to TFRecord format), and dataset interleave to prevent data reading from being a bottleneck. We also chose the proper format for efficient passing over the network (Sparse vs Dense), which further enhanced the training speed by 4x.
Gradient accumulation: From profiling TensorFlow, we observed that there is a training bottleneck with the all-reduce. So, we accumulate the gradients for each batch in-process and perform the all-reduce for multiple batches together. After accumulating five batches we are able to see a 33% increase in the number of training examples processed per hour.
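The accumulation step above can be sketched as follows (plain NumPy standing in for the framework; the all-reduce is simulated, and gradient values are illustrative): gradients are summed in-process and the collective communication fires once per five batches instead of every batch.

```python
import numpy as np

def train_with_accumulation(grads_per_batch, accumulate=5):
    """Accumulate per-batch gradients locally and 'all-reduce' once per
    `accumulate` batches, cutting communication rounds by that factor."""
    allreduce_calls = 0
    buffer = np.zeros_like(grads_per_batch[0])
    applied = []
    for i, g in enumerate(grads_per_batch, start=1):
        buffer += g  # in-process accumulation, no communication
        if i % accumulate == 0:
            allreduce_calls += 1         # stand-in for the collective op
            applied.append(buffer.copy())  # then the optimizer step
            buffer[:] = 0.0
    return allreduce_calls, applied

grads = [np.ones(4)] * 10
calls, applied = train_with_accumulation(grads, accumulate=5)
```

With 10 batches and `accumulate=5`, only two communication rounds occur, each carrying the sum of five batches' gradients.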
By further incorporating model parallelism for our DLRM-style architecture, we achieved comparable training time with the model size enlarged 20x to 3 billion parameters. This effectively addressed two critical challenges. First, we tackled the issue of individual processes running out of memory by sharding embedding tables across different GPUs. This ensures efficient memory utilization, allowing for smoother and more streamlined operations. Second, we solved the latency problem associated with synchronizing gradients across processes. Previously, gradients of all embeddings were passed around at each gradient update step; for large embedding tables, this leads to significant delays. With model parallelism, we eliminated this bottleneck, resulting in faster and more efficient model synchronization.
Moreover, the initial implementation restricted the number of usable GPUs to the number of embedding tables in the architecture. By implementing 4-D model parallelism (embedding table-wise split, row-wise split, column-wise split, and regular data parallelism), we unleashed the full computational power, empowering modelers to fully exploit the hardware without being constrained by the model's architecture. For example, with only table-wise splitting, an architecture with two embedding tables can leverage only two GPUs. By using column-wise splitting to partition each table into three parts along the column dimension and placing each shard on a different GPU, we can leverage all available GPUs on a six-GPU node and achieve about a 3x training speedup, reducing training time to roughly a third.
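A minimal sketch of the column-wise split (table size, shard count, and device placement are illustrative; real shards live on separate GPUs): each device owns a slice of the embedding dimensions, and concatenating the per-device lookups restores the full vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
table = rng.normal(size=(1000, 48))  # one logical embedding table

# Column-wise split: each of three hypothetical GPUs owns 16 of the 48
# embedding dimensions, so a single table can span more devices than
# table-wise splitting alone would allow.
shards = np.split(table, 3, axis=1)  # three (1000, 16) shards

def lookup(ids):
    """Each device looks up its columns; concatenation restores full rows."""
    return np.concatenate([s[ids] for s in shards], axis=1)

ids = np.array([5, 42, 999])
vecs = lookup(ids)
```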
We implemented this in TensorFlow and contributed the code to the open-source Horovod community as an easy-to-use gradient tape, hoping to foster collaboration and inspire further breakthroughs in the field.
External serving vs in-memory serving
In this multi-quarter effort, we progressively developed large models. Memory on the host used for serving was initially a bottleneck to serve multiple giant models in parallel. To deliver good relevance metrics to members earlier, under the existing infrastructure, we externalized these features by partitioning the trained model graph, precomputing embeddings offline, and storing them in an efficient key value store for online fetching. As a result, the parameters hosted in the service were limited to only the MLP layer. However, this serving strategy limited us on several aspects:
Iteration flexibility: Individual modelers could not train their MLP jointly with embedding tables because they were consuming ID embeddings as features.
Feature fidelity: Features are precomputed offline on a daily basis and pushed to the online store, causing potential delays in terms of feature delivery.
While this was not the final state we envisioned, it worked because member and hashtag ID features are relatively long-lived.
Figure 4. External serving strategy (earlier architecture)
Next, the team prioritized making our online serving infrastructure capable of serving multiple billion-parameter models in parallel. The biggest challenge we faced was the in-heap and off-heap memory demanded by large models: we were serving dozens of models in parallel within each host, each of which could take gigabytes of memory. To address this, we invested in more advanced hardware across our serving clusters, with more powerful AMD CPUs and much larger memory headroom. We made heavy use of memory profiling tools to understand memory usage patterns before and after taking traffic, along with the necessary garbage collection tuning. We also carefully chose the underlying data representations for model parameters, applying quantization, and for vocabulary transformation artifacts, and pruned unnecessary parts from the checkpoint to further reduce the materialized model size.
As a result, we were able to move all variants from external serving mode to in-memory serving at scale. We observed boosted online engagement metrics from this transition due to enhanced feature delivery and model fidelity. It also opened up modeling flexibilities by allowing modelers to train and serve their own large embedding tables while lowering the operational cost of maintaining separate model splitting and feature preparation pipelines.
As demonstrated in Figure 1, the final served model artifact size is dominated by TensorFlow trainable parameters and vocabulary transformation artifacts, which convert string ID representations to integers. With a traditional static hashtable approach, these artifacts would account for 30% of the total model size. In this work we adopted minimal perfect hashing, a collision-free hashing method that trades latency for memory, yielding a consistent 6x memory usage reduction across multiple variants. It was implemented as an optimized C++ custom op in TensorFlow and proven to operate within our latency budget.
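To illustrate why minimal perfect hashing saves memory, here is a tiny CHD-style sketch in Python (illustrative only; our production version is an optimized C++ TensorFlow op, and the hash and bucket parameters here are invented): n string IDs map collision-free onto exactly n integer slots, and only one small displacement per bucket is stored instead of the full string keys.

```python
import zlib
from collections import defaultdict

def _h(key: str, seed: int) -> int:
    """Deterministic seeded hash (CRC32 used for simplicity)."""
    return zlib.crc32(f"{seed}:{key}".encode())

def build_mph(keys):
    """Build a minimal perfect hash: assign each bucket a displacement seed
    under which its keys land in distinct, still-free slots."""
    n = len(keys)
    buckets = defaultdict(list)
    for k in keys:
        buckets[_h(k, 0) % n].append(k)
    disp = [0] * n
    taken = [False] * n
    for b in sorted(buckets, key=lambda b: -len(buckets[b])):  # big buckets first
        d = 1
        while True:
            slots = [_h(k, d) % n for k in buckets[b]]
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                for s in slots:
                    taken[s] = True
                disp[b] = d
                break
            d += 1
    return disp

def mph_lookup(key, disp):
    """Two hashes and one small-array read: string ID -> unique integer slot."""
    n = len(disp)
    return _h(key, disp[_h(key, 0) % n]) % n

ids = [f"member_{i}" for i in range(50)]
disp = build_mph(ids)
slots = {mph_lookup(k, disp) for k in ids}
```

The lookup costs two hash evaluations instead of a direct table probe, which is the latency-for-memory tradeoff described above.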
Conclusion and future plans
We have been able to scale model size by 500x with training and serving infrastructure built for billion-parameter models. This infrastructure also benefits other workloads, such as graph neural network (GNN) and large language model (LLM) training. We are thrilled by the tremendous opportunities this breakthrough unlocks, as it unleashes the full power of deep learning on large data, delivering enhanced personalization to our members. We are committed to further nurturing this work and driving it in exciting new directions.
Flexible continuous training and incremental training
We aspire to refresh our models at a much faster cadence on the newest data. Incremental training and online learning will enable us to keep pace with the rapid evolution of the community. Recent developments that combine stateless hashing with compositional embeddings, introducing controlled ID collisions while allowing rapid iteration at significantly reduced operational cost, also seem promising.
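A minimal sketch of one such compositional scheme, the quotient-remainder trick (sizes and the element-wise combination are illustrative assumptions): two small tables cover a large ID space because each ID gets a distinct (quotient, remainder) row pair, at the cost of partial parameter sharing between IDs.

```python
import numpy as np

rng = np.random.default_rng(5)

# 200 rows of parameters stand in for a 10,000-row table.
num_ids, m, dim = 10_000, 100, 8
quotient_table = rng.normal(size=(num_ids // m, dim))  # 100 rows
remainder_table = rng.normal(size=(m, dim))            # 100 rows

def compositional_embed(item_id: int) -> np.ndarray:
    """Combine the quotient and remainder rows; every ID in [0, 10000) gets
    a distinct (quotient, remainder) pair without a dedicated row."""
    return quotient_table[item_id // m] * remainder_table[item_id % m]

v = compositional_embed(4242)
```

No vocabulary table needs to be rebuilt when new IDs appear, which is what makes fast incremental refreshes cheaper to operate.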
The significance and benefits of employing DCNV2 for feature crossing and training with longer data periods have already been validated. In our pursuit of operationalizing these advancements, we are now focusing on creating a more robust serving infrastructure that incorporates GPU serving and intelligent model routing. This aims to further reduce latency costs and alleviate memory pressure for complex architectures that deliver more gains.
We find ourselves at a fairly early stage on this rapidly evolving journey and are excited about what lies ahead. The advancements we've achieved and the enjoyable process of overcoming challenges across multiple teams highlight the collaborative and innovative engineering culture at LinkedIn. This culture is relentless in tackling even the toughest challenges, driven by our ambitious goal of pushing boundaries to deliver exceptional member value, and we firmly believe that the integration of large models will drive substantial performance enhancements in AI at LinkedIn. With exciting developments on the horizon, our dedication to enriching member experiences remains central to our path forward.
This collaboration spans multiple organizations across LinkedIn with contributions from various teammates. Thank you to Amol Ghoting, Fedor Borisyuk, Jason (Siyu) Zhu, Ganesh Parameswaran, Qingquan Song, Lars Hertel, Mingzhou Zhou, Yunbo Ouyang, Sheallika Singh, Haichao Wei, Zhifei Song, Aman Gupta, and Chengming Jiang from the Data and AI Foundation Team; Birjodh Tiwana, Siddharth Dangi, Rakshita Nagalla, Xun Luan, Mohit Kothari, Wei Lu, Samaneh Moghaddam, and Ying Xuan from the Feed AI Team; Jonathan Huang, Pei-Lun Liao, Chen Zhu, Ata Fatahi Baarzi, Zhewei Shi, Jenny Zhang, Maneesh Varshney, Rajeev Kumar, and Keqiu Hu from the Machine Learning Infra Team; Shunlin Liang, Nishant Gupta, and Jeff Zhou from the Feed Infra Team; Our TPMs Sandeep Jha and Seema Arkalgud made all of this possible.