Improving job matching with machine-learned activity features
May 11, 2022
Co-Authors: Alex Tsun, Bo Ling, Nikita Zhiltsov, Declan Boyd, Benjamin Le, Aman Grover, and Daniel Hewlett
One major goal of the LinkedIn Talent Solutions team is to match job seekers and job posters, leading to mutually beneficial outcomes. A service that any LinkedIn member can use is JYMBII (Jobs You May Be Interested In), which uses information from a member’s profile to personalize job recommendations. The following is an image of JYMBII (navigated to by clicking the “Jobs” tab at the top), which automatically recommends jobs based on your profile:
Understanding and utilizing a member’s job-seeking activity on the LinkedIn platform is crucial to making relevant recommendations. For example, a member could apply to a job or save it for later, giving us information about what kind of jobs they would like to see. On the other hand, they could dismiss a job to tell us they aren’t interested. Given a member’s job-seeking activity history, we propose a technique to construct an “activity embedding” feature (a vector of numbers) for that member, which we then use in our job recommendation model. This becomes important when a member is looking for a career change that cannot be captured using profile data. For example, during the pandemic, the job market changed a lot, and many people began looking for more flexibility through remote work. Some of them may start dismissing jobs similar to their current one and applying to more remote-friendly jobs; it is crucial for us to adapt their job recommendations based on their recent job-seeking activity and not their past experience alone.
The de facto industry standard is to aggregate features over a member’s job activity, which improves personalization beyond what content-based recommendation alone provides. For example, suppose that member 499 applied to three jobs: job 33, job 11, and job 54. Each job has a sparse feature vector that indicates which skills it requires (1 if a skill is required, 0 if it isn’t). Summing (or averaging) those job skill vectors elementwise gives a “member activity skills” feature for the member, which represents a typical job they like.
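As a minimal sketch of this aggregation, the snippet below sums (or averages) sparse skill vectors for member 499. The skill names are made up for illustration, and each sparse 0/1 vector is stored as the set of skills whose entry is 1.

```python
from collections import Counter

# Hypothetical skills required by the three jobs member 499 applied to.
job_skills = {
    33: {"python", "sql", "machine learning"},
    11: {"python", "statistics"},
    54: {"python", "sql", "spark"},
}

def member_activity_skills(job_ids, job_skills, average=False):
    """Sum (or average) the sparse 0/1 skill vectors of the given jobs."""
    counts = Counter()
    for job_id in job_ids:
        counts.update(job_skills[job_id])  # each required skill adds 1
    if average:
        n = len(job_ids)
        return {skill: count / n for skill, count in counts.items()}
    return dict(counts)

summed = member_activity_skills([33, 11, 54], job_skills)
print(summed["python"], summed["sql"], summed["spark"])  # 3 2 1
```

Skills that appear in many of the member’s applied jobs get larger values, which is what makes the aggregated vector a rough picture of a typical job they like.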
However, this approach has various limitations such as:
High dimensionality: A sparse/bag-of-words type feature such as skills will surely be high dimensional (e.g., probably tens of thousands of unique skills).
Having too many entities to aggregate upon: There could be other important sparse features to aggregate over as well (e.g., educational institution, industry, company, etc.). It would be great to have a dense, more compact representation of activity (i.e., an embedding) that requires only one aggregation.
Aggregation function is too simple: There are infinitely many ways to aggregate features, and summing and averaging are among the simplest. Can we instead use our data to learn the aggregation function?
In this post, we describe how we leveraged our members’ job seeking activities for better personalization of their job recommendations. We begin with a baseline of averaging as our aggregation, then describe how we machine-learn a significantly more complex and expressive aggregation function.
Member activity embedding platform
We first describe the dataset we use. For a member, suppose they interacted with n jobs over the last 28 days. Then, for this member, we collect a sequence of their job activities as follows:
$(j_1, e_1, a_1, t_1), (j_2, e_2, a_2, t_2), \dots, (j_n, e_n, a_n, t_n)$
$j_i$ is the job ID of the i-th job they interacted with
$e_i$ is the (concatenated) feature vector representing the i-th job they interacted with. This could contain information such as job industry, skills, title, etc. We have a model that allows us to represent each job as a low-dimensional vector of numbers (see our earlier blog post for details). Our $e_i$ will hence be the job embedding for the i-th job.
$a_i$ is the action performed by the member on the i-th job (e.g., apply, save, or dismiss)
$t_i$ is the timestamp of the i-th job interaction
where $t_1 < t_2 < \dots < t_n$ (ordered by time). We also truncate to keep only the most recent 32 actions if a member had more than 32, or pad to 32 actions if they had fewer.
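The truncate-or-pad step can be sketched as follows. The `Activity` tuple, the `PAD` action name, the padding job ID of -1, and the choice to pad at the front (so the most recent activity stays last) are all assumptions made for illustration; the post only states that sequences are truncated or padded to 32 actions.

```python
from typing import List, NamedTuple

MAX_LEN = 32        # fixed sequence length used in the post
PAD_ACTION = "PAD"  # assumed name for the padding action

class Activity(NamedTuple):
    job_id: int
    embedding: List[float]
    action: str      # e.g. "APPLY", "SAVE", "DISMISS"
    timestamp: int

def to_fixed_length(activities, dim, max_len=MAX_LEN):
    """Sort by time, keep the most recent max_len, pad shorter histories."""
    ordered = sorted(activities, key=lambda a: a.timestamp)
    recent = ordered[-max_len:]
    pad = Activity(job_id=-1, embedding=[0.0] * dim,
                   action=PAD_ACTION, timestamp=0)
    # Assumption: padding goes in front so the latest activity is last.
    return [pad] * (max_len - len(recent)) + recent

history = [Activity(i, [float(i), 0.0], "APPLY", i) for i in range(40)]
fixed = to_fixed_length(history, dim=2)
short = to_fixed_length(history[:3], dim=2)
print(len(fixed), fixed[0].job_id, fixed[-1].job_id)  # 32 8 39
print(len(short), short[0].action, short[-1].job_id)  # 32 PAD 2
```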
Our goal is to utilize this information to produce an embedding (vector of numbers) representing the job activity of each member.
Baseline: Unweighted average embedding
We first tried a simple model that produced three separate activity embeddings per member by averaging the job embeddings within each action type. That is, we have an APPLY, a SAVE, and a DISMISS activity embedding for each member, each of which is the (unweighted) average of the corresponding Pensieve job embeddings.
Suppose the job embeddings for the APPLY action are vectors $e_1, e_2, \dots, e_n$, sorted by increasing timestamp (most recent last). Then our APPLY unweighted average activity embedding is defined by:
$$\bar{e}_{\text{APPLY}} = \frac{1}{n} \sum_{i=1}^{n} e_i$$
Adding these activity embeddings to the job recommendation model did show promising offline and online metric gains. However, one of the biggest flaws of this baseline model is that it treats all actions in the time period with equal importance, rather than giving more focus to the more recent actions.
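A minimal sketch of this baseline, assuming each activity is given as an (embedding, action) pair and embeddings are plain lists of floats:

```python
def mean_embedding(embeddings):
    """Elementwise unweighted average of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

def average_by_action(activities):
    """Return one averaged embedding per action type that occurs."""
    grouped = {}
    for embedding, action in activities:
        grouped.setdefault(action, []).append(embedding)
    return {action: mean_embedding(embs) for action, embs in grouped.items()}

activities = [
    ([1.0, 0.0], "APPLY"),
    ([0.0, 4.0], "DISMISS"),
    ([3.0, 2.0], "APPLY"),
]
embeddings = average_by_action(activities)
print(embeddings["APPLY"])    # [2.0, 1.0]
print(embeddings["DISMISS"])  # [0.0, 4.0]
```

In this sketch, an action type the member never performed (SAVE here) simply has no entry; how the production feature represents that case is not described in the post. Note that reordering the APPLY activities leaves the result unchanged, which is exactly the lack of recency awareness discussed above.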
To improve on this baseline approach for activity embeddings, we performed experiments to answer the following research questions:
Does a simple geometrically-decaying average do better than our current unweighted averaging?
Does a machine-learned aggregation function give us statistically significant improvements over geometrically decaying averaging?
The first idea would allow us to take order into account in a naive way, and the second would allow us to machine-learn the aggregation function instead of applying a fixed formula.
Model 1: Geometrically-decaying average embedding
With simple averaging of job embeddings in an ordered sequence, we don’t take recency into account. One simple way to include recency bias is to apply weighted averaging instead, where the weights are higher for more recent activities than for older ones. While there are infinitely many possible weightings, we use a geometric one controlled by a single free parameter r, the decay factor.
Suppose the job embeddings for the APPLY action are vectors $e_1, e_2, \dots, e_n$, sorted by increasing timestamp (most recent last). Then our APPLY weighted activity embedding, parametrized by decay rate $0 < r < 1$, is defined by:
$$e_{\text{APPLY}}(r) = \frac{1 - r}{1 - r^n} \sum_{i=1}^{n} r^{\,n-i} e_i$$
where the first scalar is a normalizing constant that ensures the weights sum to one. Visually (without the normalization), we have:
$$e_n + r\,e_{n-1} + r^2 e_{n-2} + \dots + r^{n-1} e_1$$
Within each activity type, we apply this geometrically decaying average instead of simple averaging. We used grid search to find the optimal decay parameter r, resulting in the activity embeddings with the best offline lift when added as features to the job recommendation model. The job recommendation model was then ramped online and after a successful A/B test, was deployed to all members.
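The geometrically decaying average for one action type can be sketched as below, assuming embeddings are plain lists ordered oldest to newest. The value r = 0.5 is only an illustrative choice; the post selects r by grid search against the downstream model, which is not reproduced here.

```python
def geometric_decay_average(embeddings, r):
    """Weighted average where the most recent embedding has the largest weight.

    embeddings: list of equal-length vectors, oldest first, newest last.
    r: decay factor with 0 < r < 1.
    """
    n = len(embeddings)
    # 0-indexed position i gets weight r^(n-1-i); the newest gets r^0 = 1.
    weights = [r ** (n - 1 - i) for i in range(n)]
    total = sum(weights)  # equals (1 - r^n) / (1 - r)
    dim = len(embeddings[0])
    return [
        sum(w * e[d] for w, e in zip(weights, embeddings)) / total
        for d in range(dim)
    ]

applies = [[1.0, 0.0], [0.0, 1.0]]  # older job first, newer job last
result = geometric_decay_average(applies, r=0.5)
print([round(x, 3) for x in result])  # [0.333, 0.667]
```

With r = 0.5 the newer job contributes twice as much as the older one; as r approaches 1 the result approaches the unweighted average.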
Model 2: Machine-learned activity embedding
The geometric decay model was simple but very successful at improving the performance of our job recommendation models. However, there were shortcomings to this approach:
The aggregation is limited to a weighted average (and a parametrized one at that), whereas we could use an arbitrarily complex aggregation function.
We do not take into account the “mixture” of the different activity types (we look at 3 independent action sequences rather than one interleaved one).
This led us to the idea of using sequence-based deep learning models, such as LSTM, CNN, and Transformer as our aggregation function. We can take as input a sequence of Pensieve job embeddings concatenated with their three-dimensional one-hot label (APPLY/SAVE/DISMISS), ordered by ascending timestamp. This is exactly how most NLP models expect their input (e.g., a sequence of word embeddings). In the large blue highlighted region in the following diagram, you can see the input sequence and the resulting learned activity embedding after applying a CNN and retrieving the last hidden layer. This CNN sequence model serving as our aggregation function has much more power compared to our fixed geometric decay formula. However, to learn the weights of the model to derive the actual embedding, we need a proxy training task. We proposed the following model as a first proof-of-concept:
For a member with n activities in the last 28 days, we feed the first n-1 job embeddings, each concatenated with its one-hot label, into our sequence model and hold out the last (job embedding, label) pair. The sequence model maps these n-1 input vectors to a single embedding in its last output layer. We force this embedding to be the same size as the input job embedding so that we can take its Hadamard (pointwise) product with the held-out job embedding, followed by a fully-connected layer and softmax to predict whether the held-out action was positive (APPLY/SAVE) or negative (DISMISS). This gives us a standard classification task with cross-entropy loss for training the sequence model weights, which are what we care about. We also fix the CNN’s input sequence length at 32 by truncating any member with over 32 activities to keep only the latest 32, and by padding any member with under 32 activities with zero embeddings and a special PAD label.
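The prediction head of this proxy task can be sketched as a forward pass only. The sketch takes the activity embedding as given (standing in for the output of the sequence model) and uses hand-picked `weights` and `bias` values that are purely illustrative, not trained parameters.

```python
import math

def hadamard(u, v):
    """Pointwise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]

def predict_positive_probability(activity_emb, held_out_job_emb, weights, bias):
    """weights has two rows: index 0 = negative class, index 1 = positive."""
    x = hadamard(activity_emb, held_out_job_emb)
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)[1]

def cross_entropy(p_positive, label):
    """label is 1 for APPLY/SAVE and 0 for DISMISS."""
    p = p_positive if label == 1 else 1.0 - p_positive
    return -math.log(p)

activity_emb = [1.0, 2.0]        # stand-in for the sequence model output
held_out_job_emb = [0.5, -1.0]   # embedding of the held-out job
weights = [[0.0, 0.0], [1.0, 0.0]]  # illustrative fully-connected layer
bias = [0.0, 0.0]

p = predict_positive_probability(activity_emb, held_out_job_emb, weights, bias)
print(round(p, 4))  # 0.6225
print(cross_entropy(p, label=1))
```

The Hadamard product is why the activity embedding must have the same dimension as the job embedding: the two vectors are multiplied entry by entry before the fully-connected layer.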
During our initial training experiment, we noticed poor results with our trained embeddings in our downstream job recommendation task, compared to our previous simple geometric decay model. This was strange given that this model theoretically could learn a complex aggregation resulting in more information-rich embeddings, but not a surprise given the data distribution. The ratio of positive to negative labels was approximately 97% to 3%, and the median sequence length of active members (number of activities per member in the last 28 days) was not that large. To solve these problems, we implemented two modifications to our data preparation:
Random Negatives: First, we generated random negatives as follows: for each positive row in our training data such as (member_id, job_id, APPLY, timestamp), we generated a negative row of the form (member_id, job_id_neg, RAND_NEG_FOR_APPLY, timestamp + Delta) where:
job_id_neg is a job_id chosen uniformly at random from the set of all jobs interacted with by any member on the same day. We note there is a small probability that the member would actually have been interested in this job_id_neg, in which case it should have been a positive.
RAND_NEG_FOR_APPLY is just a new fake negative label that would have lower weight than the hard negative label DISMISS.
Delta is a uniformly random small perturbation in either direction since we do have to collect the member activities in order of timestamp.
This resulted in our data having a very close distribution to 50/50, and helped our model learn from more negative examples.
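The random-negative step can be sketched as follows. Several details are assumptions for illustration: timestamps are taken to be seconds, `jobs_by_day` is a hypothetical lookup from day index to the jobs any member interacted with that day, `max_delta` is a hypothetical bound on the perturbation, and SAVE rows get an analogous `RAND_NEG_FOR_SAVE` label (the post only names `RAND_NEG_FOR_APPLY`).

```python
import random

SECONDS_PER_DAY = 86_400

def add_random_negatives(rows, jobs_by_day, max_delta=60, seed=0):
    """rows: (member_id, job_id, action, timestamp) tuples."""
    rng = random.Random(seed)
    augmented = []
    for member_id, job_id, action, ts in rows:
        augmented.append((member_id, job_id, action, ts))
        if action in ("APPLY", "SAVE"):
            day = ts // SECONDS_PER_DAY
            job_id_neg = rng.choice(jobs_by_day[day])    # uniform over the day's jobs
            delta = rng.randint(-max_delta, max_delta)   # small shift, either direction
            augmented.append(
                (member_id, job_id_neg, "RAND_NEG_FOR_" + action, ts + delta)
            )
    # Activities are consumed in timestamp order downstream.
    return sorted(augmented, key=lambda row: row[3])

rows = [
    (499, 33, "APPLY", 100_000),
    (499, 11, "DISMISS", 100_500),
    (499, 54, "SAVE", 101_000),
]
jobs_by_day = {1: [33, 11, 54, 70, 71]}
augmented = add_random_negatives(rows, jobs_by_day)
print(len(augmented))  # 5
```

Each positive row gains exactly one random negative, which is what moves the label distribution toward 50/50; the hard DISMISS negatives are left untouched.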
Sliding Window: Given the small median number of activities per member in a 28 day window, we decided to:
Filter out any members who had fewer than four activities. This is because in our training setup, we use the first n-1 activities to predict the last one, and having too little information makes it hard to predict the next activity.
On the other end of our spectrum, 1% of members had over 100 activities. We were losing a lot of training data by truncating, so instead, we applied a sliding window of size 32 to create multiple rows of training data for these members.
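Both data-preparation rules can be sketched in one helper. The stride of 1 is an assumption; the post gives the window size (32) and the minimum length (four) but not how far the window moves between rows.

```python
def sliding_windows(activities, window=32, min_len=4, stride=1):
    """Turn one member's time-ordered activities into training sequences.

    Returns [] for members with fewer than min_len activities, a single
    sequence when the history fits in one window, and otherwise every
    window of the given size taken at the given stride.
    """
    n = len(activities)
    if n < min_len:
        return []
    if n <= window:
        return [activities]
    return [activities[i:i + window]
            for i in range(0, n - window + 1, stride)]

print(len(sliding_windows(list(range(3)))))    # 0
print(len(sliding_windows(list(range(10)))))   # 1
print(len(sliding_windows(list(range(100)))))  # 69
```

With a stride of 1, a member with 100 activities yields 69 training rows instead of the single row that truncation would keep. A stride larger than 1 would need extra care so that the final window still ends at the latest activity.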
After cleaning our training data with these two techniques and retraining with some hyperparameter tuning, our machine-learned activity embedding resulted in improved offline metrics over the three geometric decay embeddings in JYMBII with a 3x reduction in storage space (one embedding vs three embeddings).
Once the training is complete, we “package” the TensorFlow sub-model highlighted in blue and use it for embedding precomputation. We have set up a regularly scheduled inference pipeline that publishes the precomputed activity embeddings for each member to our Feature Marketplace for consumption by other AI models (e.g., JYMBII) during both offline training and online inference. With activity being a time-sensitive feature, it is important to recompute these embeddings often to give the most relevant results.
We are constantly experimenting with new approaches to designing activity embeddings, but they all take the same raw input: for each member, a sequence of job embeddings corresponding to their job activity history. Hence, when designing our inference pipeline, we kept experimentation velocity in mind. When a new model (an aggregation function over a job embedding sequence) is ready, it only takes one line of code to add the model to our daily serving flow. Our only requirement is that the model is packaged as a TensorFlow model (e.g., when computing our geometrically decaying average embeddings, we had a dummy loss function and the forward pass of the model just computed the embedding via the formula). Our inference pipeline loads all activity embedding models, prepares each member’s data, and then infers each version of the activity embedding offline and pushes a copy online. These features are then easily consumed by downstream models, such as JYMBII and Job Search, to produce higher quality recommendations.
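One way to picture a pipeline where adding a model costs a single line is a registry of aggregation functions. The sketch below is hypothetical and is not LinkedIn’s pipeline: the registry, the decorator, and the toy `most_recent` model are invented for illustration, and plain Python functions stand in for packaged TensorFlow models.

```python
# Hypothetical registry: model name -> aggregation function.
ACTIVITY_EMBEDDING_MODELS = {}

def register(name):
    """Decorator that adds an aggregation function to the serving flow."""
    def decorator(fn):
        ACTIVITY_EMBEDDING_MODELS[name] = fn
        return fn
    return decorator

@register("unweighted_average")
def unweighted_average(embeddings):
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

@register("most_recent")  # adding a model is this one extra line
def most_recent(embeddings):
    return list(embeddings[-1])

def run_inference(member_sequences):
    """Compute every registered embedding version for every member."""
    return {
        name: {member_id: model(sequence)
               for member_id, sequence in member_sequences.items()}
        for name, model in ACTIVITY_EMBEDDING_MODELS.items()
    }

result = run_inference({499: [[1.0, 0.0], [3.0, 2.0]]})
print(result["unweighted_average"][499])  # [2.0, 1.0]
print(result["most_recent"][499])         # [3.0, 2.0]
```

Because every model consumes the same input (a sequence of job embeddings per member), the data preparation is shared and only the aggregation function varies between versions.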
Impact and future work
Because experimentation is so easy, we have ramped four iterations of member activity embeddings within six months. Each iteration of integrating our embeddings into new products has resulted in statistically significant single-digit percentage improvements in each of their key metrics. Across the four iterations, the number of applies in JYMBII increased by over 10% and confirmed hires by 5% during online A/B testing. Currently, further improving embedding quality is all about modeling; once we decide on a model, it requires very little time to productionize and ramp the feature in a product to A/B test.
Even though our above CNN model with pointwise cross-entropy resulted in great success, there is plenty of room to iterate further.
One ongoing effort is to make our model more complex and the training scenario closer to reality. LinkedIn recommendations usually come as a list of ranked jobs, and the resulting interactions form a session (a group of activities that happen close together in time and whose ordering within the group is considered indistinguishable). A session-based recommendation model may give a more accurate representation of the situation and perform better than the current (pointwise) model.
Another planned effort is to extend the use case from job recommendations to course recommendations on the LinkedIn Learning platform. We can change the training objective from predicting the interest score of future jobs to predicting the interest score of future courses, and build activity embeddings that are optimized for members’ interests in LinkedIn courses or for other use cases.
Additionally, we find that among 830M+ LinkedIn members, only ~20M are active job seekers with more than four monthly job activities. To understand the remaining LinkedIn members, we need to bring more types of activity into our feature space, such as members’ search queries, status changes, and feed posting and reading.
Lastly, activity embedding serving is done through a regularly scheduled pipeline. A natural improvement is to replace regular refreshes with real-time computation at request time (see realtime activity features blog) and test the metric gains in the downstream tasks. This would reduce the delay between a member’s activity and the point at which we can use it to update their job recommendations down to seconds.
The success of this work would not have been possible without the help of the LinkedIn Talent Solutions AI, Applications, and Horizontal ML infrastructure teams. In particular, we would like to thank the following folks for their direct contributions to activity embeddings (in alphabetical order): Alex Patry, Ann Yan (alumni), Girish Kathalagiri, Huichao Xue, Jason Katz, Kevin Kao, Kunal Punera, Li Lu, Meng Meng (alumni), Minhtu Nguyen, Muchen Wu, Pei-Lun Liao, Qingyun Wan, Raveesh Bhalla, Sriram Vasudevan, Srividya Krishnamurthy, Yu Gong, and Zhewei Shi.