Applying multitask learning to AI models at LinkedIn
August 19, 2022
Multitask learning, a subfield of machine learning, aims to accomplish multiple learning tasks at the same time by exploiting commonalities and differences across tasks. Traditionally, engineers have built separate AI models for each task at hand. While this creates simple boundaries of ownership, the isolated models cannot share anything they learn during training. For example, new machine learning (ML) tasks often suffer from the cold start problem due to a lack of training data.
At LinkedIn, we try to improve the performance of AI models by training them in groups of related tasks, which has led to better model performance in our experiments. As ML models power important functions such as job recommendation, job search, and many other products on LinkedIn, the improvements from these models lead to a more effective and relevant jobs marketplace for our members.
To address the fundamental challenges in applying multitask learning to general modeling problems, we have designed a multitask training framework. Unlike confined modeling domains in which tasks share the same training data, model architecture, or loss function, LinkedIn's heterogeneous use cases posed several challenges for building a multitask learning framework.
The first challenge we faced was that different tasks may have vastly different label distributions, which often happens between upstream and downstream tasks. For example, we may have one task that predicts job views and another that predicts job applications. While both tasks come from job search, perhaps only 10% of job views lead to a final job application, meaning the job application signal could easily be overwhelmed by job views. To solve this, we relax the constraints in training and allow different tasks to have separate datasets as input. Continuing the job search example, the job application task could use training data down-sampled from the job view training data, to prevent its sparsely populated signals from being overwhelmed.
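The down-sampling step above can be sketched as follows. This is an illustrative helper, not LinkedIn's actual pipeline; the function name, data shapes, and the `ratio` parameter are assumptions for the example.

```python
import random

def downsample_views(view_examples, apply_examples, ratio=1.0, seed=7):
    """Down-sample the dense job-view examples so they do not overwhelm
    the sparse job-application examples (hypothetical helper).

    ratio: number of view examples to keep per application example.
    """
    rng = random.Random(seed)
    keep = min(len(view_examples), int(len(apply_examples) * ratio))
    return rng.sample(view_examples, keep)

# 1,000 job views but only 100 job applications (~10% conversion)
views = [{"id": i, "label": "view"} for i in range(1000)]
applies = [{"id": i, "label": "apply"} for i in range(100)]

# Keep one view example per application example, yielding a balanced set
balanced_views = downsample_views(views, applies, ratio=1.0)
```

The application task would then train on `balanced_views + applies` instead of the raw, heavily skewed view data.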
With the relaxation of the input datasets, another challenge we faced was that different sets of training data may not share the same set of features, as different tasks may have different input features. This is especially the case when the tasks to be trained together come from different domains or products. At LinkedIn, we have a job search feature for members to find jobs, and a recruiter search feature for recruiters to find candidates for a job. These two tasks are related but require different features. By unifying the input data schema across tasks, we were able to train tasks with different feature sets together.
As shown in Figure 1, task A has two unique features (in red), task B has two unique features (in blue), and both tasks share one feature. Our framework identifies the shared and task-specific features, unifies the input data schema of both tasks, and fills the missing fields with a predefined default value (0 in Figure 1).
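The schema unification in Figure 1 can be sketched in a few lines. This is a minimal illustration of the idea, assuming feature-name/value dictionaries as the row format; the real framework operates on production data schemas.

```python
def unify_schema(task_rows, default=0.0):
    """Unify input schemas across tasks: take the union of all feature
    names and fill each task's missing fields with a default value
    (0, as in Figure 1). Illustrative sketch, not the production code."""
    all_features = sorted(
        {f for rows in task_rows.values() for row in rows for f in row}
    )
    unified = {
        task: [{f: row.get(f, default) for f in all_features} for row in rows]
        for task, rows in task_rows.items()
    }
    return all_features, unified

# Task A and task B each have two unique features plus one shared feature
task_a = [{"shared": 1.0, "a1": 0.5, "a2": 0.2}]
task_b = [{"shared": 2.0, "b1": 0.9, "b2": 0.1}]

features, data = unify_schema({"A": task_a, "B": task_b})
```

After unification, both tasks emit rows over the same five-feature schema, with the other task's features defaulted to 0.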
The last challenge we faced was that different tasks may use different model architectures, with different input layers or loss functions. Our solution allows model developers to specify shared and task-specific model architectures in the same way as in single-task learning models.
As shown in Figure 2, model developers define the model architecture of each task just as they would for single-task learning. We automatically identify the model components that are shareable across tasks; as a result, there is a shared model for the overlapping parts between tasks and task-specific models for the non-overlapping parts.
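The shared/task-specific split can be sketched as a classic shared-bottom network. The layer sizes, task names, and manual weight matrices below are assumptions for illustration; the real framework infers the sharing from the developer-specified architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared-bottom sketch: one shared hidden layer plus one head per task.
W_shared = rng.normal(size=(5, 8))      # shared layer: 5 input features -> 8 hidden units
heads = {
    "task_a": rng.normal(size=(8, 1)),  # task-specific output layer for task A
    "task_b": rng.normal(size=(8, 1)),  # task-specific output layer for task B
}

def forward(x, task):
    h = np.maximum(x @ W_shared, 0.0)    # shared representation (ReLU)
    logit = h @ heads[task]              # task-specific head
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid score in (0, 1)

x = rng.normal(size=(3, 5))              # batch of 3 examples
scores_a = forward(x, "task_a")
```

Gradients from every task flow into `W_shared`, while each head is updated only by its own task's loss.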
How training works
As explained earlier, we may have different model architectures for different tasks, with different loss functions and different hyper-parameters such as learning rate. We support two mechanisms for training: 1) iterative training, where each task and its associated model are trained separately in an alternating fashion, and 2) joint training, where training data from different tasks are combined, and the task losses are combined and backpropagated to the shared model.
We demonstrate the difference between the two through a classical shared-bottom model structure in Figure 3.
Fig. 3 Left: Joint Training; Right: Iterative Training. Red arrows indicate task 1 data, blue arrows indicate task 2 data, black arrows indicate combined data
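The two training mechanisms can be contrasted with a toy shared-bottom logistic model on synthetic data. This is a simplified sketch under assumed shapes and learning rates, not the framework's actual training loop: iterative training updates the shared weights once per task in alternation, while joint training sums the tasks' gradients into a single shared update.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: two tasks whose label is the sign of the first feature
X = {t: rng.normal(size=(64, 4)) for t in ("t1", "t2")}
y = {t: (X[t][:, 0] > 0).astype(float) for t in ("t1", "t2")}

def init():
    # Shared weight vector plus a task-specific bias ("head") per task
    return rng.normal(size=4) * 0.1, {t: 0.0 for t in ("t1", "t2")}

def grads(w, b, x, target):
    p = 1 / (1 + np.exp(-(x @ w + b)))   # sigmoid prediction
    err = p - target                      # dLoss/dlogit for log loss
    return x.T @ err / len(x), err.mean()

def iterative(steps=200, lr=0.5):
    w, b = init()
    for _ in range(steps):
        for t in ("t1", "t2"):            # alternate: each task trains separately
            gw, gb = grads(w, b[t], X[t], y[t])
            w -= lr * gw                  # shared weights updated per task
            b[t] -= lr * gb
    return w, b

def joint(steps=200, lr=0.5):
    w, b = init()
    for _ in range(steps):
        gw_total = np.zeros_like(w)
        for t in ("t1", "t2"):            # combine the task losses
            gw, gb = grads(w, b[t], X[t], y[t])
            gw_total += gw                # summed gradient hits the shared layer once
            b[t] -= lr * gb
        w -= lr * gw_total
    return w, b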
In summary, here are the key differences between the two approaches:
Joint training, because of the combined input batches, requires a single pre-defined shared sub-model among the tasks. Iterative training, on the other hand, can be flexible with different kinds of hard or soft sharing.
The relative weighting of different tasks is a difficult hyper-parameter in multitask training. In joint training, it can be learned by regularizing or balancing the losses/gradients from the combined loss and shared layers. In iterative training, because there is no combined loss, only the tasks' gradients from the shared layers can be leveraged.
Joint training is more suitable when task losses interact. One example comes from knowledge distillation use cases: in addition to a teacher loss and a student loss, a third hinge loss that depends on the two can be included. Calculating the hinge loss requires the teacher and the student to be trained on the same training batches.
Either one of the two approaches may be selected depending on the use case.
Cross-domain skill understanding
For job skill extraction, we rely heavily on a contextual skill model. We first tag skill terms using a skill entity tagger, and then use the contextual skill model to consider each tagged term in the context of the job posting, to determine whether it is a valid skill entity. For example, Spark is a data analytics tool, but in the context of the phrase “electric wires give off spark,” it is not a skill.
The contextual skill model takes in <skill entity, origin sentence> and outputs a confidence score (0~1). Over several quarters, we collected a considerable amount of annotated training data from job postings to improve the quality of the contextual skill model. When we first started to extract skills from resumes, however, little training data had been collected. Because skill extraction from resumes is similar to skill extraction from jobs, we were able to adapt the job contextual skill model for resumes by changing it to support multitask training. In the new model, a shared component is trained on combined training data from jobs and resumes, while domain-specific components optimize for each domain separately.
The training data from the job domain is in seven different languages, and in production the model needs to handle the same seven languages well in both the job and resume domains. As a result, for sentence context, we chose an in-house trained multilingual word embedding.
For skill entities, we chose to use an in-house trained skill entity embedding instead of skill raw term text. Skills are conceptual entities and are language-agnostic in nature. If we treat skill terms as text, the same skill concept will have different embedding representations in different languages, which introduces more noise to the model. We decided to leverage the entity embedding as it is a more stable representation across languages.
The new shared model outperformed the single-task models on both the job and resume tasks and resulted in more relevant job recommendation matches between jobs and members. In production, it has led to statistically significant improvements in user engagement, such as more weekly active users (WAU), more sessions, and a higher click-through rate (CTR).
Cross-domain member-company affinity learning
For the LinkedIn recruiter search service, we want to surface relevant members in the search results to garner more InMail acceptances and gains in confirmed hires. These metric gains are driven by a two-way selection between job candidates and recruiters, and benefit from a better characterization of recruiter-candidate affinity, informed by user activities across different but related sites in the LinkedIn ecosystem. Using only user history on the recruiter search service has limitations: for example, a potential candidate may not have interacted with any recruiters in the past few days, yet has been actively applying to certain jobs. Incorporating user activities on LinkedIn’s job search and recommendation sites helps recruiters reach a broader pool of interested candidates, further driving the key InMail and hiring metrics. Compared to pure activity count features, user representations trained on multitask deep neural networks offer greater flexibility for encoding diverse activity signals optimized for given targets. Additionally, they provide an improved ability to generalize to relevant items that have not appeared in the search history.
For the LinkedIn recruiter search product, embeddings that represent the similarity between recruiters and general LinkedIn members, who are potential job candidates, would help improve search relevance. Besides data from recruiter search itself, members’ job preferences also affect key metrics such as InMail acceptance. We co-trained member and company embeddings using a multitask deep neural network with a shared bottom for transfer learning between recruiter InMail accept, job apply, and job view tasks.
The output member and company embeddings represent members’ affinity for recruiters’ companies. The label data is obtained from user activity tracking and is split at a particular historical time for training and validation. We use a cross-entropy loss weighted over the different tasks. We experimented with two model variants: one with only member profile features, and one with a host of additional member activity features that represent members’ job search intentions. With the added member activity features, we observe more positive transfer, and the output embeddings bring more lift in the recruiter search model. The member and company embeddings have been ramped to Recruiter Search production, bringing significant gains to key metrics such as InMail accepts and predicted confirmed hires.
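The per-task weighted cross-entropy loss can be sketched as below. The task names match the three tasks described above, but the predictions, labels, and weight values are hypothetical; in practice the weights are tuned or learned as discussed in the training section.

```python
import numpy as np

def cross_entropy(p, y):
    """Binary cross-entropy averaged over a batch of predictions."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def weighted_task_loss(task_losses, task_weights):
    """Combine per-task cross-entropy losses with per-task weights
    (weight values here are hypothetical)."""
    return sum(task_weights[t] * task_losses[t] for t in task_losses)

# Hypothetical predictions and labels for the three co-trained tasks
losses = {
    "inmail_accept": cross_entropy(np.array([0.8, 0.3]), np.array([1.0, 0.0])),
    "job_apply":     cross_entropy(np.array([0.6, 0.9]), np.array([1.0, 1.0])),
    "job_view":      cross_entropy(np.array([0.7, 0.2]), np.array([1.0, 0.0])),
}
total = weighted_task_loss(
    losses, {"inmail_accept": 1.0, "job_apply": 0.5, "job_view": 0.25}
)
```

The combined `total` is what gets backpropagated through the shared bottom, so the weights control how much each task's signal shapes the shared embeddings.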
In this post, we have introduced our approach to a multitask learning framework and how we applied it to heterogeneous tasks in various product domains. The application of multitask learning has been shown to improve model performance, with significant product impact. We find that the success of applying multitask learning relies on choosing relevant tasks, which currently depends on domain knowledge and intuition. In the future, we’d like to explore how to scale multitask learning to a large number of modeling tasks, creating a way to automatically identify tasks that can be effectively learned together.
Sen Zhou and Anastasiya Karpovich from the AI Foundation Team contributed to developing Translearn, providing foundational support for different training paradigms and for iterative training. Ji Yan, Peide Zhong, and Xu Dan from the Enterprise Standardization team added support for joint training. Dansong Zhang and Tong Zhou from the AI Foundation Team and Sang Wook Park from Hirer AI developed the AI models and conducted A/B experiments for cross-domain member-company affinity learning. Raochuan Fan and Ji Yan developed the AI model and conducted A/B experiments for cross-domain skill understanding. Zhewei Shi and the ProML team provided support for Translearn integration with LinkedIn infrastructure.
Thanks for the management support from Tie Wang, Lei Zhang, Jimmy Guo, Zheng Li, Souvik Ghosh and technical discussion with Jaewon Yang and Amol Ghoting.