Resume Assistant: Finding High-Quality Work Experience Examples
August 31, 2018
The Microsoft acquisition of LinkedIn brought many opportunities for combining technologies to do something new and useful. One such endeavor is the Resume Assistant, designed to help people write better resumes.
In our Part 1 blog on Resume Assistant, we shared the engineering and design challenges we faced during product creation and how we solved them together with the team at Microsoft. This blog focuses on the AI/relevance component, where the team faced the challenge of finding the best work experience descriptions to display in the product from millions of possible examples, which was no easy feat.
We needed to identify the general characteristics that define a good resume work experience description and, given a job title, find the best examples for that title. More precisely, our task was: given a title query and the LinkedIn Knowledge Graph (which contains, among other things, member profiles with titles and associated work experience descriptions), select the best k examples of work experience descriptions for the input query title.
Processing takes place offline and consists of two main steps: candidate generation, to generate an initial list of candidate work experience descriptions for each title in the Knowledge Graph, and candidate ranking, which ranks these descriptions according to their quality. Our definition of quality is based on features of the description such as its content and structure, independent of the title.
Based on LinkedIn member profiles, we generated an initial list of candidate work experience descriptions. This step consists of applying various hard filters, including the application of privacy preferences, and results in a cleaner set of viable candidate work experience descriptions. It’s important to note that LinkedIn only considers profiles where privacy settings are set to public. Additionally, members can opt out and their data will be excluded in this filtering step.
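The candidate generation step can be pictured as a chain of hard filters followed by grouping by title. The following is a minimal sketch under stated assumptions, not the production pipeline: the `Candidate` fields, the `min_words` filter, and the title normalization are all hypothetical names and choices introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A work experience description plus the fields the filters need (hypothetical schema)."""
    title: str
    description: str
    profile_is_public: bool
    member_opted_out: bool


def passes_hard_filters(c: Candidate, min_words: int = 5) -> bool:
    """Return True only if the candidate survives every hard filter."""
    if not c.profile_is_public:   # only profiles with public privacy settings are considered
        return False
    if c.member_opted_out:        # members who opt out are excluded
        return False
    # Illustrative extra filter (an assumption): drop near-empty descriptions.
    return len(c.description.split()) >= min_words


def generate_candidates(profiles: list[Candidate]) -> dict[str, list[str]]:
    """Group the surviving descriptions by a lightly normalized title."""
    by_title: dict[str, list[str]] = {}
    for c in profiles:
        if passes_hard_filters(c):
            by_title.setdefault(c.title.strip().lower(), []).append(c.description)
    return by_title
```

Putting the privacy checks first means that no later stage ever sees data from non-public or opted-out profiles.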
An early error analysis revealed that about 8% of the errors we were seeing occurred when work experience descriptions did not describe someone’s work experience but instead described a company or product. It wouldn’t be useful to surface these descriptions in the Resume Assistant product, so we trained a binary text classifier on LinkedIn company descriptions (which were given the “company” label) and LinkedIn member work experience descriptions (which were given the “work experience description” label). Even though the member work experience description data was noisy, in that it contained a significant number of “company” descriptions, the model was able to generalize successfully. All work experience examples predicted as “company” were then filtered out of the data.
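To make the filtering idea concrete, here is a small sketch of a binary text classifier used as a filter. The post does not say which model type was used, so a multinomial Naive Bayes classifier is swapped in purely as a stand-in; the class name, label strings, and any training texts passed to it are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (a stand-in, not the production model)."""

    def fit(self, texts: list[str], labels: list[str]) -> "NaiveBayesTextClassifier":
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[c][tok]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = c, score
        return best_label


def drop_company_descriptions(clf: NaiveBayesTextClassifier, descriptions: list[str]) -> list[str]:
    """Keep only the descriptions that are not predicted as "company"."""
    return [d for d in descriptions if clf.predict(d) != "company"]
```

In use, the classifier would be fit on company descriptions labeled `"company"` and member descriptions labeled, say, `"work_experience"`, and `drop_company_descriptions` would then be applied to the candidate list.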
Once we removed the examples that we definitely did not want to surface in the product, we needed to rank the remaining descriptions. We used a gradient-boosted decision tree classifier that predicted "good" or "bad" labels given a work experience description. To obtain a ranking score, we simply used the distribution over the classes returned by the classifier: the score returned by the ranker is the probability of the "good" label being assigned, given the input text.
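The scoring rule described above (rank by the probability of the “good” class) can be sketched as follows. The `toy_predict_proba` function is a made-up placeholder standing in for the trained gradient-boosted classifier; its formula and the function names are assumptions, and only `rank_descriptions` reflects the idea in the text.

```python
import math
from typing import Callable

# Any function mapping a description to a class distribution such as {"good": p, "bad": 1 - p}.
ProbaFn = Callable[[str], dict[str, float]]


def toy_predict_proba(text: str) -> dict[str, float]:
    """Placeholder for a trained classifier (hypothetical formula, for illustration only)."""
    num_digits = sum(ch.isdigit() for ch in text)
    z = 0.5 * num_digits + 0.1 * len(text.split()) - 1.0
    p_good = 1.0 / (1.0 + math.exp(-z))
    return {"good": p_good, "bad": 1.0 - p_good}


def rank_descriptions(descriptions: list[str], predict_proba: ProbaFn, k: int) -> list[tuple[float, str]]:
    """Score each description by P("good") and return the top k, best first."""
    scored = [(predict_proba(d)["good"], d) for d in descriptions]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Using the class probability as the score means a classifier trained only on good/bad labels can still produce a full ordering of candidates for each title.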
Data for training the model was created by an in-house linguist team (details of the annotation task are described below) and consisted of a label (good/bad) derived from human judgements about the quality of the work experience descriptions. Because we had a very small training set, we avoided overfitting by training a very simple model with only a few features, based mostly on the structural characteristics of the text.
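The post does not list the actual features, so the following is only a guess at what "structural" features of a description might look like. Every feature name here is hypothetical and chosen for illustration.

```python
import re


def structural_features(text: str) -> dict[str, float]:
    """Compute a few simple, content-agnostic features of a description (hypothetical set)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    words = text.split()
    # Lines that start with a bullet character or a numbered-list marker such as "1." or "2)".
    bullet_lines = [ln for ln in lines if re.match(r"^([-*•]|\d+[.)])\s+", ln)]
    return {
        "num_words": float(len(words)),
        "num_lines": float(len(lines)),
        "bullet_ratio": len(bullet_lines) / len(lines) if lines else 0.0,
        "has_number": 1.0 if re.search(r"\d", text) else 0.0,
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }
```

A handful of features like these keeps the model small relative to the training set, which is the overfitting concern raised above.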
Much of the effort in this project was dedicated to coming up with effective ways to evaluate the model output. Once launched in production, we would have data from users that we could use to evaluate and improve our model. Pre-launch, however, we needed a way to evaluate how our models were doing, so we devised a task for human annotators to judge work experience descriptions. Even post-launch, we continue to use this human evaluation task because it provides a complementary validation of model output to pair with what we get from tracking data and other user feedback.
We needed to establish what constituted high-quality text in the Resume Assistant context. As well as being asked to give an overall judgement (on a four-point scale), annotators were asked a number of additional questions relating to particular elements of a work experience description (for example: “Does the description contain examples of achievements? Does the description contain any quantification of results?”). We aggregated the answers to produce a quality score for each example. We found that the additional questions helped to make the final annotations more consistent and resulted in a more useful, fine-grained ranking. They also allowed us to tune the final quality score based on what was deemed most important in terms of quality from a product perspective.
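One way to aggregate such annotations into a single score is a weighted combination of the overall judgement and the yes/no answers. This is a sketch under assumptions: the question keys, the weights, the 50/50 split, and the normalization to the range 0 to 1 are all illustrative choices, not the formula the team used.

```python
def quality_score(
    overall: int,
    answers: dict[str, bool],
    weights: dict[str, float],
    overall_weight: float = 0.5,
) -> float:
    """Combine a four-point overall judgement with weighted yes/no answers into [0, 1]."""
    if not 1 <= overall <= 4:
        raise ValueError("overall judgement must be on the four-point scale 1..4")
    overall_norm = (overall - 1) / 3  # map 1..4 onto 0..1
    total_weight = sum(weights.values())
    if total_weight > 0:
        # Weighted fraction of questions answered "yes"; unanswered questions count as "no".
        checklist = sum(w for q, w in weights.items() if answers.get(q, False)) / total_weight
    else:
        checklist = 0.0
    return overall_weight * overall_norm + (1 - overall_weight) * checklist
```

Raising the weight of a question such as a hypothetical `"has_quantification"` is how a score like this could be tuned toward what matters most from a product perspective.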
Training data generation for ranking model
Early versions of our data pipeline used heuristic methods for ranking work experience descriptions. We were able to use the annotated data created during the evaluation of these pipelines to create a small training set, which we used to train the ranking model. The training set was augmented with additional randomly selected data, annotated by human annotators in the same fashion as described above.
The manual evaluations by our linguist team resulted in a quality score for each work experience description in a sample. Quality scores range from 0 to 1: a score of 0 indicates a very poor-quality description, and 1 is the best possible score. For each evaluation, we selected a number of job titles and ran the model on all the data in the knowledge base (i.e., the LinkedIn Economic Graph) to retrieve the top k work experience examples for each title. These top k results were then evaluated. Each histogram below displays the results for one such evaluation run. Improvements in data quality over time can be observed in the gradually improving distribution of quality scores in the histograms for the three models displayed here.
[Histogram: Model 2 (final ranking model trained on human-annotated data)]
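A histogram of this kind can be computed from the annotated quality scores of one evaluation run with a few lines of code. This is a minimal sketch; the bin count and function name are assumptions, and the input values in any usage are made up rather than taken from the actual evaluations.

```python
def score_histogram(scores: list[float], num_bins: int = 5) -> list[int]:
    """Count quality scores in [0, 1] into equal-width bins; a score of 1.0 goes in the last bin."""
    counts = [0] * num_bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"quality score out of range: {s}")
        idx = min(int(s * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


def render(counts: list[int]) -> str:
    """Render the bin counts as a simple text bar chart, one bin per line."""
    width = 1.0 / len(counts)
    rows = []
    for i, n in enumerate(counts):
        rows.append(f"{i * width:.1f}-{(i + 1) * width:.1f} | " + "#" * n)
    return "\n".join(rows)
```

Comparing the bin counts from successive evaluation runs shows whether more of the top k results are landing in the higher-score bins.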
We continue to improve the ranking of work experience descriptions. Our current focus is on scaling to multiple languages, using transfer learning to adapt our English models to other languages, and on exploring techniques for improving liquidity for job titles for which we have too few examples. Watch this space!
Thanks to product manager Kylan Nieh for his enthusiasm and leadership pulling everything together to make this happen. I would also like to extend thanks to Morgan Zhang and Hang Zhang from the apps team for building the backend infrastructure and for their work on the data pipeline; to Haoran Wang for training the multi-lingual classifiers and setting up the A/B testing environment; and to Jeff Pasternack for his excellent machine learning framework. Last but not least, thanks to linguists Lauren Gage and Ana Garotti for leading the human evaluation and data collection tasks and for error analyses, and to the rest of the linguist team for their careful work annotating the data.