A closer look at the AI behind course recommendations on LinkedIn Learning, Part 2

Sneha Chaudhari

Machine Learning Engineering Manager @ LinkedIn | CMU | IBM Research | IISc

July 1, 2020

Co-authors: Sneha Chaudhari, Mahesh Joshi, and Gungor Polatkan

In part 1 of this series, we shared a high-level overview of our course recommendation engine for LinkedIn Learning. First, we provided details on the offline and online components of the system design. Later on, we discussed the three main components of the recommendation engine that are key for generating personalized course recommendations: Response Prediction, Collaborative Filtering, and Blending. We saw that both the Response Prediction and the Collaborative Filtering models play a crucial role in our recommendation engine due to their complementary nature. The Response Prediction model uses learner and course features, as well as explicit one-time engagement (clicks, bookmarks, etc.), as labels to generate recommendations. On the other hand, Collaborative Filtering relies only on long-duration engagement, i.e., course watch data, to recommend relevant courses.

For part 2, we’ll start by sharing an inside look at our Deep Neural Network-based Collaborative Filtering approach. We’ll also look at visualizations of course embeddings to demonstrate some interesting trends in the course watch data. Lastly, we’ll discuss the Response Prediction model, including the fundamental algorithmic framework for Response Prediction and our recent work to incorporate course watch data into the model in an effective way.

Collaborative Filtering

There are three main types of algorithms for Collaborative Filtering (CF) that have been developed over time: 1) User/Item-based CF, 2) Matrix Factorization techniques, and 3) Deep Neural Network-based CF approaches. With recent advances in the field, Deep Neural Network-based CF methods have gained popularity in the AI community, becoming the standard approach for CF. They have consistently outperformed earlier techniques, resulting in superior recommendation quality. They mainly benefit from the multiple layers in the neural network, which can discover complex relationships within a learner’s engagement data that linear methods do not capture. They can also be seen as “non-linear generalizations of factorization techniques.”

We adopted Neural Collaborative Filtering for LinkedIn Learning, as depicted below. We used TensorFlow for implementation because it provides a production-ready, flexible, and scalable framework for deep learning. Next, we describe the model in detail, covering input data, architecture, model training, recommendations, and embedding visualizations.

Figure 1. Neural Collaborative Filtering architecture

Input data
This approach uses only historical, long-duration engagement data, like course watch history, as input to the model. We include course watch data not just from the LinkedIn Learning Home Page, but also from other places in the LinkedIn ecosystem where courses are displayed (e.g., the LinkedIn news feed) to capture a complete picture of learner preference in the model. To reduce noise, we curate the data in a pre-processing step by filtering the learner’s course watch history using the recency of the course watch and the depth of the engagement, like total course watch time. Based on a distribution analysis of the course watch time, we apply a watch-time based threshold to decide whether a course watch counts as relevant for a learner. This means that if a learner only watches the first three seconds of a course, that engagement does not impact our model in the same way as viewing a full course session.
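As a minimal sketch of this pre-processing step, the snippet below filters watch events by recency and watch time. The `WatchEvent` record, the 60-second threshold, and the 365-day window are illustrative assumptions; the actual thresholds come from the distribution analysis mentioned above and are not stated in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class WatchEvent:
    learner_id: str
    course_id: str
    watch_seconds: float
    watched_at: datetime


# Hypothetical values; the post does not state the real thresholds.
MIN_WATCH_SECONDS = 60.0
MAX_AGE = timedelta(days=365)


def filter_watch_history(events, now,
                         min_watch_seconds=MIN_WATCH_SECONDS,
                         max_age=MAX_AGE):
    """Keep only watches that are both recent enough and deep enough."""
    return [
        e for e in events
        if now - e.watched_at <= max_age
        and e.watch_seconds >= min_watch_seconds
    ]


now = datetime(2020, 7, 1)
events = [
    WatchEvent("learner_a", "course_1", 3.0, datetime(2020, 6, 1)),     # too short
    WatchEvent("learner_a", "course_2", 1800.0, datetime(2020, 5, 1)),  # kept
    WatchEvent("learner_a", "course_3", 2400.0, datetime(2018, 1, 1)),  # too old
]
kept = filter_watch_history(events, now)
print([e.course_id for e in kept])
```

Only the second event survives both filters, which mirrors the three-second example above: a very short watch is treated as noise rather than as a signal of interest.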

Architecture
The Neural CF architecture depicted in Figure 1 consists of two multi-layer neural networks: one for the learner and the other one for the course. Each multi-layer neural network consists of three types of layers: input layer, fully connected layers, and embedding layer. The topmost layer of the model is the Output Layer, which computes the final score for any (learner, course) pair, used for generating the final set of course recommendations. Now, let’s look at the function of each layer in the Neural CF model.

The input layer is used to provide input data to the model. The input to a learner’s multi-layer neural network is a sparse vector of all the courses watched in the past one-year period. For example, if the learner watched two courses in the past year, the learner input vector has a non-zero value for those two courses, and the rest are all zeroes. The input to the course multi-layer neural network is a sparse vector that encodes the similarity of the course to all the other courses (see Figure 2 below). These similarities are pre-computed based on co-watching patterns of the courses. So, both the learner and course inputs have a dimensionality/size equal to the total number of courses available on the platform. The following figure shows an example of learner and course inputs with 3 learners and 4 courses.

Figure 2. Input layer of neural CF with 3 learners and 4 courses
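The two kinds of input vectors can be sketched for a toy catalog of 3 learners and 4 courses. The watch sets below are made up, and Jaccard overlap of watcher sets is only one assumed way to turn co-watching patterns into a similarity; the post does not specify the measure used in production.

```python
# Toy data: which courses each learner watched (illustrative only).
courses = ["c1", "c2", "c3", "c4"]
watched = {
    "l1": {"c1", "c2"},
    "l2": {"c2", "c3"},
    "l3": {"c1", "c2", "c4"},
}


def learner_input(learner_id):
    """Multi-hot vector over all courses: 1.0 where the learner watched the course."""
    return [1.0 if c in watched[learner_id] else 0.0 for c in courses]


def watchers(course_id):
    """Set of learners who watched the given course."""
    return {lid for lid, cs in watched.items() if course_id in cs}


def course_input(course_id):
    """Co-watch similarity of one course to every course (Jaccard, an assumption)."""
    a = watchers(course_id)
    row = []
    for other in courses:
        b = watchers(other)
        union = a | b
        row.append(len(a & b) / len(union) if union else 0.0)
    return row


print(learner_input("l1"))
print(course_input("c1"))
```

Both vectors have length equal to the number of courses, matching the dimensionality described above.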

Fully connected layers (also called “hidden layers”) follow a “tower” pattern, in which the bottom of the network is widest and each successive layer reduces the number of hidden units. These layers are responsible for giving deep neural network based methods their generalization capabilities.

The embedding layer outputs learner and course embeddings, which are low-dimensional, continuous vector representations learned by the model. These learned representations are unique for each learner and course, capturing relationships and patterns within the engagement data. For example, two learners with similar learning interests will have similar representations. Note that the two-tower architecture has specifically been chosen to compute the learner and course embeddings separately, so that they can be reused for other tasks (e.g., finding related courses based on the similarity score between course embeddings).

The output layer computes a ranking score between the learner and course embeddings. This ranking score is then used to generate a list of course recommendations for each learner.
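To make the layer roles concrete, here is a small dependency-free sketch of a two-tower forward pass. It is not the TensorFlow implementation described above: the layer sizes, the random untrained weights, the ReLU activation, the omission of bias terms, and the dot-product output score are all assumptions chosen for brevity.

```python
import random


def make_tower(layer_sizes, rng):
    """Random weight matrices for consecutive layer sizes, e.g. [4, 3, 2]."""
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]


def forward(tower, x):
    """ReLU on hidden layers; the final layer is linear and yields the embedding."""
    for i, weights in enumerate(tower):
        x = [sum(w * v for w, v in zip(row, x)) for row in weights]
        if i < len(tower) - 1:
            x = [max(0.0, v) for v in x]
    return x


def score(learner_emb, course_emb):
    """Output layer, assumed here to be a dot product of the two embeddings."""
    return sum(a * b for a, b in zip(learner_emb, course_emb))


rng = random.Random(0)
NUM_COURSES = 4
SIZES = [NUM_COURSES, 3, 2]  # "tower" pattern: each layer is narrower than the last

learner_tower = make_tower(SIZES, rng)
course_tower = make_tower(SIZES, rng)

learner_emb = forward(learner_tower, [1.0, 1.0, 0.0, 0.0])
course_emb = forward(course_tower, [1.0, 0.5, 0.0, 0.5])
print(learner_emb, course_emb, score(learner_emb, course_emb))
```

Because each tower is evaluated independently, course embeddings can be computed once and reused, which is the property the two-tower design is chosen for.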

Model training
Apart from learner and course inputs, we also need labels to train the Neural CF model. In our approach, we compute a binary (1/0) label for all (learner, course) pairs in the training data to quantify the learner’s interest in that course. It's important to note that the modeling objective is to predict future engagement (i.e., course watches) using past engagement data. Hence, it is imperative to avoid leakage of future information: only course watches that occurred before the label event may be used as context for the learner and course inputs. Once we have the training data in the required format with (learner, course) inputs and labels, the model is trained iteratively until convergence using backpropagation (a standard technique for neural network parameter learning).
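One way to picture the leakage constraint is a time-based split, sketched below. The `build_example` helper, the cutoff date, and the rule "label is 1 if the learner watched the course on or after the cutoff" are hypothetical simplifications, not the exact labeling scheme used in production.

```python
from datetime import datetime

# (learner_id, course_id, watched_at) tuples; illustrative data only.
watches = [
    ("l1", "c1", datetime(2020, 1, 10)),
    ("l1", "c2", datetime(2020, 2, 15)),
    ("l1", "c3", datetime(2020, 4, 5)),
]


def build_example(learner_id, course_id, cutoff, watches):
    """Context uses watches strictly before the cutoff; the label looks only at or after it."""
    context = sorted(
        {cid for lid, cid, t in watches if lid == learner_id and t < cutoff}
    )
    label = int(any(
        lid == learner_id and cid == course_id and t >= cutoff
        for lid, cid, t in watches
    ))
    return {"context": context, "course": course_id, "label": label}


cutoff = datetime(2020, 3, 1)
positive = build_example("l1", "c3", cutoff, watches)
negative = build_example("l1", "c4", cutoff, watches)
print(positive)
print(negative)
```

The watch of `c3` happens after the cutoff, so it appears only as a label and never in the context that feeds the input vectors.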

Recommendations
Once the model is trained, it can be used to generate personalized course recommendations for a given learner as follows:

1. Compute the learner embedding and course embeddings for all candidate courses (we currently score all courses for each learner).
2. Use the learner and course embeddings to compute a ranking score in the output layer for each of the candidate courses.
3. Compute a ranked list of courses for the learner using the ranking score.
4. Generate a final set of recommendations by taking the top K courses.
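The steps above can be sketched as follows, assuming the embeddings are already computed and that the ranking score is a dot product (an assumption; the post does not state the scoring function). The embedding values are made up.

```python
import heapq


def recommend_top_k(learner_emb, course_embs, k):
    """Score every candidate course for one learner and return the k best course ids."""
    scored = (
        (sum(a * b for a, b in zip(learner_emb, emb)), course_id)
        for course_id, emb in course_embs.items()
    )
    # nlargest ranks by score first, then by course id to break ties.
    return [course_id for _, course_id in heapq.nlargest(k, scored)]


learner_emb = [1.0, 0.5]
course_embs = {
    "c1": [0.9, 0.1],
    "c2": [0.1, 0.9],
    "c3": [0.8, 0.8],
    "c4": [-0.5, 0.2],
}
print(recommend_top_k(learner_emb, course_embs, k=2))
```

Using `heapq.nlargest` avoids sorting the full candidate list when K is much smaller than the catalog.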

Visualization of course embeddings
Next, let’s look at two visualizations of course embeddings given by the neural CF algorithm. Figure 3 below shows course embeddings that are color-coded according to the language of the courses. We can clearly see clusters forming in the embedding space for each of the languages. This means that courses in the same language are similar to each other in the embedding space, which is intuitive because courses delivered in the same language tend to have similar viewing patterns among learners. In other words, two courses delivered in Japanese are more likely to both be viewed by the same learner than a course in Japanese and one in Portuguese. We also see multiple clusters for each language, which is mainly due to topic-based clustering within a language.

Figure 3. Course embedding visualization: language-based encoding

In Figure 4, we can see the 50 nearest neighbors (i.e., the 50 most similar courses) for an example course: “Machine Learning and AI Foundations: Classification Modeling.” Again, we can see that the course embeddings are able to capture the content and topical similarity by using only course watch data. If you observe the nearest neighbors, they are all related to the overall topic of this course—courses related to AI, big data, and data science. But we also see some interesting courses which are not similar in content, but still somewhat related, like “Predictive Customer Analytics.”

Figure 4. Course embedding visualization: 50 nearest neighbors for the course “Machine Learning and AI Foundations: Classification Modeling”
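A nearest-neighbor lookup of this kind can be sketched with cosine similarity over course embeddings. The course ids, the two-dimensional embedding values, and the choice of cosine similarity are illustrative assumptions; real embeddings are learned by the model and have more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_courses(query_id, embeddings, n):
    """Return the n courses most similar to the query course, excluding itself."""
    query = embeddings[query_id]
    scored = [
        (cosine(query, emb), course_id)
        for course_id, emb in embeddings.items()
        if course_id != query_id
    ]
    return [course_id for _, course_id in sorted(scored, reverse=True)[:n]]


# Made-up embeddings for hypothetical course ids.
embeddings = {
    "ml_classification": [0.9, 0.8],
    "data_science_basics": [0.8, 0.9],
    "customer_analytics": [0.7, 0.3],
    "watercolor_painting": [-0.6, 0.2],
}
print(nearest_courses("ml_classification", embeddings, n=2))
```

The same lookup is what makes the separately computed course embeddings reusable for tasks such as finding related courses.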

Response prediction

In this section, we’ll describe the Response Prediction component of the recommendation engine in detail. To recap, this model predicts learner-course relevance using the learner’s profile features (such as skills and industry) and course metadata (such as course difficulty, course category, and course skills). It uses the historical explicit engagement (clicks, bookmarks, etc.) as the target response/label to train the model.

The fundamental algorithm used by Response Prediction is a Generalized Linear Mixed Model (GLMix), shown in Figure 5 below (Shivani Rao et al., CIKM 2019). The objective of GLMix is to learn per-learner model coefficients based on the engagement actions of a learner and per-course model coefficients based on the engagement actions on a course, in addition to the fixed effect/global model coefficients. The per-learner and per-course model coefficients are essential for personalization, as they can capture a learner’s unique interests as well as course-specific patterns such as popularity and affinity towards specific learner segments, e.g., job seekers. The predicted score in GLMix is then expressed as a sum of the three components: global model, per-learner model, and per-course model.

Figure 5. Response Prediction (GLMix) model, which consists of a fixed effect model and per-learner as well as per-course random effect models
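The additive structure of the score can be sketched as below. This is a simplified illustration, not the production model: the feature names and coefficient values are hypothetical, all three components read from one shared feature dictionary, and the logistic link is an assumption suited to a click/no-click response.

```python
import math


def dot(coefficients, features):
    """Sparse dot product; features without a coefficient contribute nothing."""
    return sum(coefficients.get(name, 0.0) * value
               for name, value in features.items())


def glmix_score(features, learner_id, course_id,
                global_w, per_learner_w, per_course_w):
    """Sum of global, per-learner, and per-course components, passed through a sigmoid."""
    logit = (
        dot(global_w, features)
        + dot(per_learner_w.get(learner_id, {}), features)
        + dot(per_course_w.get(course_id, {}), features)
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical features and coefficients.
features = {"skill_match": 1.0, "difficulty_beginner": 1.0}
global_w = {"skill_match": 0.8}
per_learner_w = {"l1": {"difficulty_beginner": 0.5}}
per_course_w = {"c1": {"skill_match": -0.3}}

known = glmix_score(features, "l1", "c1", global_w, per_learner_w, per_course_w)
unseen = glmix_score(features, "l9", "c9", global_w, per_learner_w, per_course_w)
print(known, unseen)
```

In this sketch, a learner or course with no random-effect coefficients simply falls back to the global component.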

Recently, we adopted a methodology to incorporate the course-level watch time of learners into our Response Prediction model. We selected the approach of using the course watch time as an appropriate weight for click-based training instances. We adopted this method for the following reasons:

(learner, course) click-instances with corresponding watch times should be given appropriate importance while training the model, and this can be accomplished by assigning a higher weight to instances that led to longer watch times.
Similar prior work observed that using watch time as a weight, rather than as the label/response, yielded the best performance.

Hence, we can think of the course watch time as an importance weight given to each click instance. The training instances without any watch time are given a unit weight by default. While training the Response Prediction model, these watch-time based weights factor into the loss function that is being optimized during training. As a result, this importance weight helps to promote courses with higher watch times and creates a model that can optimize for course watches, not just clicks.
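A weighted log loss of this kind can be sketched as follows. The `instance_weight` mapping from watch time to weight (logarithmic in watch minutes) is a hypothetical choice for illustration; only the unit default weight for instances without watch time comes from the description above.

```python
import math


def instance_weight(watch_seconds):
    """Unit weight without watch time; otherwise grows with watch time (assumed form)."""
    if not watch_seconds:
        return 1.0
    return 1.0 + math.log1p(watch_seconds / 60.0)


def weighted_log_loss(examples):
    """examples: (predicted_probability in (0, 1), label 0 or 1, watch_seconds or None)."""
    total, weight_sum = 0.0, 0.0
    for p, y, watch_seconds in examples:
        w = instance_weight(watch_seconds)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        weight_sum += w
    return total / weight_sum


# The same under-predicted click, with and without a long watch attached.
clicked_and_watched = [(0.2, 1, 1800.0), (0.1, 0, None)]
clicked_only = [(0.2, 1, None), (0.1, 0, None)]
print(weighted_log_loss(clicked_and_watched), weighted_log_loss(clicked_only))
```

The under-predicted click contributes more to the loss when it led to a long watch, so training is pushed harder to rank such courses higher.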

The same concept is demonstrated in Figure 6 for a simple linear classifier in a two-dimensional space, which can be easily extended for GLMix models. Figure 6(a) shows an unweighted model, whereas Figure 6(b) shows a weighted model. In Figure 6(b), the two enlarged click instances are weighted instances with course watch time. As a result, the weighted model learns a different classifier, compared to the one with no weighting scheme.

Figure 6. Linear Classifier (a) Unweighted (b) Weighted

So far, we have trained a weighted Response Prediction model for the Learning Homepage and LinkedIn Homepage channels. We have found that assigning weights to even just ~20% of click instances makes a significant difference in the model. Another interesting finding was that different learner segments, such as job seekers, enterprise learners, etc., have distinctively different course watch behavior. Based on this learning, we plan to incorporate segmentation in the model training to capture the watch-behavior of various learner segments.

What's next?

Specifically modeling the course watch behavior has given us significant improvements in user engagement, and we’ll continue to invest in this area with a couple of upcoming initiatives. First, we are currently working on a model ensemble that can perform personalized blending of the Response Prediction and Neural CF models to improve overall performance on the final recommendation task. Second, we plan to adopt Attention Models in our Neural CF framework for learner profiling, i.e., assigning attention weights to a learner’s course watch history to capture long-term and short-term interests more effectively.

Acknowledgments

We acknowledge the entire Learning AI team (Chris Lloyd, Fares Hedayati, Gautam Borooah, Joojay Huyn, Kai Yang, Konstantin Salomatin, Vladislav Tcheprasov, and Young Jin Yun) for their instrumental support and contribution, as well as the LinkedIn communications team and partners for helping us improve the quality of this blog post. We also thank Ananth Sankar, Kinjal Basu, and Varun Mithal for their valuable feedback during the development phase of this work.
