Member/Customer Experience

GDMix: A deep ranking personalization framework

Our logo is inspired by the chameleon: You can enable personalization on your ranking model with GDMix, bringing a personalized experience to every user, like a chameleon that can match its surroundings.

Co-authors: Jun Shi, Chengming Jiang, Aman Gupta, Mingzhou Zhou, Alice Wu, Yunbo Ouyang, Charles Xiao, Jun Jia, Haichao Wei, Huiji Gao, and Bo Long

July 2021: This post was updated with additional coauthors to reflect their contributions to the GDMix project. 

Millions of members come to LinkedIn every day to search for other members, look for job postings, read content, and learn new skills. With over 706 million members, hundreds of billions of feed updates, and more than 16,400 courses to choose from, none of these tasks would be possible without artificial intelligence (AI). By analyzing the interactions between members and LinkedIn webpages, we can provide the most relevant information and best experience possible for our members. Personalized ranking for search and recommender systems is one of the key technologies to achieve that goal.

In general, a ranking model considers four types of features: request features (e.g., member features), document features (e.g., item features), context features, and interaction features. A fully personalized ranking algorithm includes all of these features, and in particular a large number of categorical ID features. The interactions between a member and an item, such as a member profile, a job posting, or a blog post, usually result in a large number of features. For example, the interactions between more than 700 million members and millions of items on LinkedIn result in a model with tens of billions to a trillion features. It is very difficult, if not impossible, to train models of this size efficiently. Training them may require specialized processors, extraordinarily large system memory, and ultra-fast network connections, among other challenges.

GDMix (Generalized Deep Mixed model) is a solution created at LinkedIn to train these kinds of models efficiently. It breaks down a large model into a global model (a.k.a. “fixed effect”) and a large number of small models (a.k.a. “random effects”), then solves them individually. This divide-and-conquer approach allows for efficient training of large personalization models on commodity hardware. An improvement on its predecessor, Photon-ML, GDMix expands support to deep learning models. It can be applied to a variety of search and recommendation tasks, as listed in Figure 1.

Some key features of GDMix: 

  1. Model scalability. GDMix splits the model into a fixed effect and many random effects. This separation enables you to train models with hundreds of millions of entities and tens of billions of parameters.
  2. Model flexibility. Both the fixed effect and random effects are designed to support various model types. The fixed effect natively supports linear models and deep learning models. The random effect natively supports linear models. It is easy to add custom models, such as support vector machines (SVM), decision trees, and gradient boosting algorithms, to GDMix.
  3. Training efficiency. GDMix is designed to train large models fast. With large-scale parallelism, it takes less than an hour to train models with millions of entities and billions of parameters.
[Figure: personalization at the center, with arrows out to job recommendation, movie recommendation, content search, e-commerce search, app store recommendation, and ads ranking]

Figure 1: Examples of personalization tasks that can be solved by GDMix

Introduction to personalization

Personalization in the context of ranking modeling for search and recommender systems means ranking items according to the interests of an individual or a specific group. This technique is widely used in the social network and e-commerce industry to improve user satisfaction.

Personalization exists in many LinkedIn products. When a member does a search on linkedin.com, the results are generated by considering the member’s features, such as network connections, past interactions with other members, past or current companies and colleagues, etc. These personal signals help to retrieve relevant documents and rank them in the right order. The value of personalization is more obvious when the member’s intent is implicit, e.g., no query is given. For example, LinkedIn’s People You May Know and Jobs You May Be Interested In are two products where the member’s profile, networks, and activities on LinkedIn are used to generate a list of relevant member profiles or job postings.

One method to create personalized models is to include features that reflect individuality. Let’s consider job recommendations for two of our members, Alice and Bob. They both live in the San Francisco Bay Area and have similar profiles. Both are recent college graduates with a bachelor's degree in computer science. Alice wants to stay close to home, while Bob isn’t opposed to relocation. If the geo-matching between profile location and job location is a feature in the job ranking model, then this feature should carry a different weight for Alice than for Bob. This, however, is impossible because we can only assign one weight to a feature. Personalized features come to the rescue in this instance.

We can achieve personalized features by crossing the existing features with entity IDs, resulting in a set of new features specific to the entity IDs. In the example above, if we cross the geo-matching feature with member ID, we arrive at two features: “Alice_geo-matching” and “Bob_geo-matching.” It is now possible to assign two different weights for this feature.
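
To make the idea concrete, here is a minimal sketch (not GDMix code) of crossing an existing feature with an entity ID; the feature names and values are purely illustrative.

```python
def cross_with_entity(entity_id, features):
    """Prefix each feature name with the entity ID, producing per-entity copies."""
    return {f"{entity_id}_{name}": value for name, value in features.items()}

shared = {"geo-matching": 1.0, "skill-overlap": 0.7}

alice = cross_with_entity("Alice", shared)  # {'Alice_geo-matching': 1.0, 'Alice_skill-overlap': 0.7}
bob = cross_with_entity("Bob", shared)      # {'Bob_geo-matching': 1.0, 'Bob_skill-overlap': 0.7}

# A model can now learn a different weight for 'Alice_geo-matching' than for
# 'Bob_geo-matching', which is exactly the personalization we want.
```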

On the surface, personalization at the finest granularity is simply solved by crossing the entity IDs with existing features, such that each entity gets a copy of all the features. This approach, however, does not scale. For the job recommendation example, with more than 700 million members and 100 features per job, we end up with a model of 70 billion features. A model of this size cannot be easily trained, despite recent advances in computer hardware. GDMix provides a solution to train these models efficiently.

Mixed model: Fixed effects and random effects

Before we dive into the details of GDMix, let’s first understand what a mixed model is and how it is related to personalization.

A mixed model is a statistical model containing both fixed effects and random effects. The fixed effects set the global trend and the random effects account for the individuality. Let’s go back to the job recommendation example for Alice and Bob. Both of them have “Tensorflow” and “machine learning” listed in their skills. A fixed effect model predicts “machine learning software engineering” jobs are good matches for them. It prevents us from sending irrelevant recommendations such as “sales” jobs to them. The random effect models learn from their past activities that Alice clicked local job postings exclusively, while Bob was not concerned with the job location. These models identify that difference and rank local jobs higher in recommendations to Alice while discounting the importance of job location in recommendations to Bob. It is the combination of fixed effects and random effects that ensures high quality, personalized results.

GDMix

In the job recommendation example above, we arrived at a model with 70 billion features. GDMix offers an efficient way to train this model by taking a parallel blockwise coordinate descent approach (Figure 2). The fixed effects and random effects can be regarded as “coordinates.” During each optimization step, we optimize one coordinate at a time and keep the rest constant. By iterating over all the coordinates a few times, we arrive at a solution that closely approximates the solution of the original problem. The models belonging to each random effect are independent of each other, so we can train them in parallel. In the end, we break down the 70-billion-feature model into 700 million small models that are much easier to tackle individually.
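
To make the alternation concrete, below is a toy, single-machine sketch of blockwise coordinate descent, assuming logistic regression for both the fixed effect and the per-member random effects; the function and variable names are illustrative and not the GDMix API.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, offset, l2=1.0):
    """Fit a logistic regression on top of a fixed per-example offset with L-BFGS."""
    def loss(w):
        p = sigmoid(offset + X @ w)
        nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        return nll + 0.5 * l2 * np.dot(w, w) / len(y)
    return minimize(loss, np.zeros(X.shape[1]), method="L-BFGS-B").x

def train_mixed_model(X_fixed, X_re, member_ids, y, n_sweeps=3):
    """Alternate between the fixed effect 'coordinate' and the per-member ones."""
    w_fixed = np.zeros(X_fixed.shape[1])
    w_re = {m: np.zeros(X_re.shape[1]) for m in np.unique(member_ids)}
    for _ in range(n_sweeps):
        # Coordinate 1: fixed effect, holding random-effect scores fixed as an offset.
        re_offset = np.array([X_re[i] @ w_re[m] for i, m in enumerate(member_ids)])
        w_fixed = fit_logistic(X_fixed, y, re_offset)
        # Coordinate 2: random effects -- one small, independent model per member.
        # These per-member fits are embarrassingly parallel in the real system.
        fixed_offset = X_fixed @ w_fixed
        for m in w_re:
            rows = member_ids == m
            w_re[m] = fit_logistic(X_re[rows], y[rows], fixed_offset[rows])
    return w_fixed, w_re

# Tiny synthetic example: 2 members, 3 fixed-effect features, 1 random-effect feature.
rng = np.random.default_rng(0)
X_fixed = rng.normal(size=(200, 3))
X_re = rng.normal(size=(200, 1))
member_ids = np.array(["Alice", "Bob"] * 100)
y = (rng.random(200) > 0.5).astype(float)
w_fixed, w_re = train_mixed_model(X_fixed, X_re, member_ids, y)
```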

Besides per-entity random effects, GDMix also supports training per-cohort random effects. A cohort is a group of entities that share certain characteristics. For example, all members in a geographical location can be regarded as one cohort. The difficulty with per-cohort random effects is that the number of training examples is usually fairly large compared to per-entity random effects. GDMix can combine multiple cohorts together and solve for the appropriate models for them by using the fixed effect solver.

[Figure: flow from data preparation to training the fixed effect, then random effect 1 and random effect 2, looping back for the next pass]

Figure 2: Parallel blockwise coordinate descent

Until recently, the fixed effects and random effects were modeled by logistic regression and solved by Photon-ML. A logistic regression model for the combination of fixed effect, per-member random effect, and per-job random effect can be represented by Eq. 1.

$$g\left(\mathbb{E}[y_{mjt}]\right) = x'_{mjt}\, b + s'_{j}\, \alpha_m + q'_{m}\, \beta_j$$

Eq. 1: Logistic regression model, where g() on the left side is the logistic function. The three terms on the right side are the fixed effect, per-member, and per-job random effect, respectively.

The linear terms on the right side of Eq. 1 can be expanded into arbitrary functions, often represented by deep neural networks.

$$g\left(\mathbb{E}[y_{mjt}]\right) = f_g(x_{mjt}, b) + f_m(s_j, \alpha_m) + f_j(q_m, \beta_j)$$

Eq. 2: Non-linear representations of the models in Eq. 1 

GDMix expands the modeling capacity to include deep learning models. In particular, GDMix leverages DeText, a deep learning ranking framework for text understanding, as its native deep learning model trainer. A user can use the rich deep neural network architectures provided by DeText to model the relationships between the source (e.g., query, member profile) and target (e.g., job posts). In addition, per-entity random effect models are readily available to provide further personalization.

Currently, GDMix supports three different operation modes (Figure 3).

  • Fixed effect model: logistic regression; random effect model: logistic regression.
  • Fixed effect model: deep NLP models supported by DeText; random effect model: logistic regression.
  • Fixed effect model: arbitrary model provided by a user; random effect model: logistic regression.

In the last mode, the fixed effect is trained by the users with their own model, then the scores from that model are treated as input to GDMix random effect training.

[Figure: chart of fixed effect models (logistic regression and DeText natively supported; custom models optional) and random effect models (logistic regression natively supported)]

Figure 3: GDMix operation modes

Implementation

Both the fixed effect and random effect model training processes consist of multiple stages. The fixed effect training and scoring are done in a Python job, followed by a Spark job that computes the relevance metric on the validation dataset. The random effect training starts with a Spark job that joins the fixed effect model scores with the training data. It then partitions the joined data so that each worker is responsible for a subset of the entities. The training and scoring are again done by a Python job, followed by a Spark metric calculation job. These stages are illustrated in Figure 4 and Figure 5.
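
For concreteness, the random effect data-preparation step can be sketched roughly as follows; the paths and column names are hypothetical rather than the actual GDMix schema.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical inputs: training examples keyed by a unique example id 'uid'
# and per-example scores produced by the fixed effect model.
train = spark.read.parquet("/data/train")            # uid, memberId, label, features
fixed_scores = spark.read.parquet("/scores/fixed")   # uid, fixedEffectScore

# Attach the fixed effect score as an offset, then group examples by entity
# so each worker can train its per-member models locally.
with_offset = train.join(fixed_scores, on="uid", how="inner")
with_offset.repartition("memberId").write.parquet("/data/train_with_offset")
```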

Logistic regression models are optimized by the L-BFGS solver from SciPy, while deep models are solved by gradient-descent-based optimizers native to TensorFlow. Fixed effects are usually represented by a large model trained on massive datasets with hundreds of millions, or even more than a billion, examples. Data parallelism is a natural fit for training these models. The entire training dataset is partitioned into many shards and consumed by hundreds of workers. The workers compute local gradients and share them via all-reduce ops or send them to a parameter server. The communication between workers is implemented with TensorFlow distributed training APIs.
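
The data-parallel setup for a deep fixed effect model can be sketched with TensorFlow's distributed strategy API; the two-layer network and random data below are placeholders standing in for the DeText architectures and real training shards.

```python
import numpy as np
import tensorflow as tf

# Each worker runs this script; with a TF_CONFIG cluster spec, gradients are
# synchronized across workers via all-reduce after every step.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])

# Placeholder data standing in for one worker's shard of the training set.
X = np.random.rand(1024, 20).astype("float32")
y = (np.random.rand(1024) > 0.5).astype("float32")
model.fit(X, y, batch_size=128, epochs=1)
```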

The metric computation stage computes area under curve (AUC) on the validation dataset. The computation is designed as a separate Spark job so that it is easy to add more metrics and to process very large datasets.
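
A minimal version of that metric job might look like the following, using a tiny stand-in for the scored validation data and hypothetical column names.

```python
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the scored validation dataset.
scored = spark.createDataFrame(
    [(0.9, 1.0), (0.2, 0.0), (0.65, 1.0), (0.4, 0.0)],
    ["score", "label"],
)

auc = BinaryClassificationEvaluator(
    rawPredictionCol="score", labelCol="label", metricName="areaUnderROC"
).evaluate(scored)
print(f"validation AUC = {auc:.3f}")
```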

The generated models are stored in the same name-term-value format (a sparse format) as Photon-ML for backward compatibility. The models can be used for warm start training or inference.
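
For reference, a name-term-value entry conceptually pairs a feature name, an optional term (for example, the entity or categorical value of a crossed feature), and a learned weight; the rendering below is illustrative rather than the exact Photon-ML schema.

```python
# Illustrative (not the exact Avro schema) rendering of sparse model
# coefficients in a name-term-value style.
coefficients = [
    {"name": "geo-matching", "term": "", "value": 0.83},
    {"name": "skill-overlap", "term": "", "value": 1.12},
    {"name": "title", "term": "software_engineer", "value": 0.41},
]
```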

[Figure: train/score fixed effect model (TensorFlow), followed by compute metrics (Spark)]

Figure 4: Steps in fixed effect model training

Random effects, on the other hand, contain millions of small independent models. A simple parallel algorithm works well, where each worker gets a subset of the models and trains them independently. Data preprocessing is done efficiently by Spark to group the samples according to the entity ID (e.g., member ID). The divide-and-conquer method achieves faster training speed than Photon-ML due to heavily localized processing. The random effect model training steps are illustrated in Figure 5.

[Figure: partition data (Spark), then train/score random effect model (TensorFlow), then compute metrics (Spark)]

Figure 5: Steps in random effect model training

Results

We have evaluated GDMix with logistic regression fixed effect models and with DeText fixed effect models on various internal LinkedIn datasets. We have seen a 10% to 40% decrease in linear model training time compared to Photon-ML, without any loss of relevance metrics. Additionally, the combination of a DeText fixed effect model and linear random effects showed a 0.5% to 3% relevance metric lift over pure linear models. These improvements are part of an overall redesign of the search experience for LinkedIn members, which includes numerous design changes to improve the overall member experience. Efforts are underway to integrate GDMix into LinkedIn’s AI production workflows. LinkedIn Learning has ramped GDMix to production in all five channels, showing a 50%-70% speedup in training time with 34%-88% fewer resources. Initial results from the LinkedIn Feed show that GDMix with logistic regression reduces end-to-end training time by 45% with 55% fewer resources. In LinkedIn Ads CTR offline evaluation, GDMix with logistic regression reduced training time by 15% with 43% fewer resources; GDMix with DeText fixed effect models showed a 1.6% AUC lift on the fixed effect and a 0.76% overall AUC lift.

Getting started with GDMix

If you are curious about how personalization models work, or if you want to add a natural language understanding unit in your modeling pipeline, or if you just want to do a quick test to see if adding random effects can improve your current model performance, visit our GitHub repository and follow the instructions there. GDMix is still under active development; feedback and contributions from the community are welcome.

Acknowledgements

GDMix is developed by the AI Foundations team at LinkedIn: Jun Shi, Chengming Jiang, Mingzhou Zhou, Alice Wu, Lin Guo, Yunbo Ouyang, Charles Xiao, Jun Jia, Haichao Wei, Huiji Gao, and Bo Long, and the Flagship AI team: Aman Gupta and Xue Xia. We thank the following colleagues for their advice, support, and collaboration: Liang Zhang, Bee-Chung Chen, Keerthi Selvaraj, Yazhi Gao, Rohan Ramanath, Ruoyan Wang, Kinjal Basu, Wei Lu, Ying Xuan, Romer Rosales, Arjun Kulothungun, Onkar Dalal, Deepak Kumar, Joojay Huyn, Konstantin Salomatin, Kai Yang, Mahesh Joshi, Gungor Polatkan, and Deepak Agarwal.