Using Economic Graph Data to Power the LinkedIn Salary Product
December 14, 2018
Online professional social networks and job platforms, such as LinkedIn, play a key role in ensuring an efficient labor marketplace by connecting talent (job seekers) with opportunities (jobs). Studies show that salary is an important factor when looking for new opportunities, but salary information isn’t always as readily apparent as, say, the job location. Products such as LinkedIn Salary have the potential to reduce asymmetry of compensation knowledge, and to serve as market-perfecting tools for job seekers and job providers.
The LinkedIn Salary product, launched in Nov. 2016, allows members to explore compensation insights by searching for different titles and regions. For each (title, region) combination, we present the distribution of base salary, bonus, and other types of compensation, the variation of pay based on factors such as experience, education, company size, and industry, and the highest-paying regions, industries, and companies. These insights are generated based on data collected from LinkedIn members using a combination of techniques to preserve user privacy (such as encryption, access control, de-identification, aggregation, and thresholding) and modeling techniques (such as outlier detection and Bayesian hierarchical smoothing) for ensuring robust, reliable insights.
A key challenge in this application is the simultaneous need for ensuring sufficient breadth of product coverage (having insights to satisfy as many job seekers as possible) while having the depth of data to provide robust, reliable compensation insights. At the time of launch, the product only allowed LinkedIn members to discover compensation insights by searching for a title and a region. However, user feedback indicated that a large number of LinkedIn members were interested in learning about compensation insights at the company level. Consequently, there was a strong desire to generate compensation insights for as many (title, region, company) cohorts as possible, and to make such insight pages available as part of the product user experience. While the existing system used statistical modeling techniques to compute robust insights, a crucial limitation was that the insights were provided only for cohorts with multiple member submissions. As a result, the existing system could provide insights for only about 30K (title, region, company) cohorts, covering only a small fraction of LinkedIn's monthly active users. Such low coverage created a poor user experience and made it impossible to include company-level insight pages as part of the product.
In this blog post, we describe our approach to solving the problem of reliably inferring compensation insights at the company level—in other words, predicting insights for (title, region, company) cohorts even when there is no member-submitted data for the cohort. The intuition underlying our approach is that two companies can be considered similar if employees are very likely to transition from one company to the other and vice versa. In the context of computing compensation insights, this assumption is rooted in the observation that job transitions typically result in higher pay: in a study of over 5,000 job moves, 63% resulted in the same or higher base pay, with a 2.1% average pay raise for those who moved to a different company.
Our solution mines the rich information present in the LinkedIn Economic Graph to generate a novel, semantic representation (embedding) of companies. We designed an algorithm for learning company embeddings from LinkedIn members' company transition data (Company2vec). We then computed pairwise similarity values between companies based on these embeddings, and subsequently defined a peer company group for each company as the set of most similar companies. Finally, we incorporated company similarities as part of a Bayesian statistical model to predict insights at the company level.
By employing this model, we have significantly increased the coverage of insights while even slightly improving the quality of the obtained insights. As an example, our techniques enable the computation of base salary insights for 35 times as many (title, region, company) combinations in the U.S. as compared to previous work. We have integrated this system as part of the existing LinkedIn Salary modeling architecture, and also created a standalone interface for other LinkedIn applications to access and benefit from these insights and the peer company group information (e.g., for improved job seniority filtering in the job recommendation application).
The LinkedIn Salary product enables users to explore compensation insights (e.g., percentiles and histograms) for different titles, locations, and companies. The insights are based on the compensation data that we have been collecting from corresponding LinkedIn members (using a give-to-get model), which are then processed using techniques such as encryption, access control, de-identification, thresholding, aggregation, and outlier detection to ensure member privacy and data quality. Due to privacy requirements, the salary modeling system can only access cohort-level data containing aggregated compensation submissions (e.g., compensation entries for Software Engineers working at LinkedIn in the San Francisco Bay Area), limited to those cohorts that contain a minimum number of entries.

Since the empirical percentile estimates are not reliable for cohorts with very little data, the existing system used a Bayesian hierarchical smoothing methodology, which exploited the hierarchical structure amongst the cohorts and “borrowed strength” from the ancestral cohorts to derive estimates for smaller cohorts. First, the ancestral cohort that could “best explain” the observed entries in the given cohort was chosen as the “best ancestor.” The data from this ancestral cohort was used as the prior, and the posterior of the cohort of interest was obtained from the prior and the observed entries. However, this methodology was designed only for cohorts with member-submitted entries and could not be used to provide reliable insights for (title, region, company) cohorts with no data at all. The number of (title, region, company) cohorts with member-provided data is quite small, which meant that the “borrowing strength”-based method was not well suited to increasing the product coverage. Since demand was high for this feature, we needed a different solution.
To address this goal, we have built a modeling system consisting of two components: (1) Computation of pairwise company similarity and peer company groups based on LinkedIn's Economic Graph data containing company transitions by LinkedIn members, and (2) Inference of compensation insights for (title, region, company) cohorts with no data using a Bayesian statistical model that utilizes the company similarity and peer company group information. We next highlight the key modeling challenges that are addressed by our framework.
Model and algorithms
Peer company computation for salary modeling
We next present our approach for computing peer company groups, which can serve as an intermediary level between (title, region) and (title, region, company) in the hierarchy, and thereby help obtain better prior compensation estimates for a company-level cohort. We consider two companies to be similar if employees are very likely to move from one company to the other and vice versa. We assume that in the absence of any other information, keeping the title and the location fixed, for a given company, the set of companies whose employees have transitioned to and from this company can provide reasonable guidance on the compensation at this company. This assumption is based on the observation that job changes typically result in the same or higher pay.
Member company transitions: At LinkedIn, we have rich company transition data from member profiles. As part of a given member’s LinkedIn profile, the “Experience” section collects their work experiences, and each entry contains information such as company, position, and start and end dates. For each member, we arrange work experiences into a list of company transitions in time order, and use the transitions as positive samples in the training data. For example, if a member lists consecutive work experiences in Companies A, B, C in time order with no overlap, then Company A to B transition and B to C transition are marked as positive in training.
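The construction of positive samples can be sketched as follows, assuming a simplified profile representation (the field names and data layout here are hypothetical, for illustration only):

```python
def build_transition_pairs(profiles):
    """Turn each member's time-ordered work experiences into consecutive
    (origin_company, destination_company) pairs, used as positive samples."""
    pairs = []
    for experiences in profiles:
        # Sort each member's entries by start date.
        ordered = sorted(experiences, key=lambda e: e["start"])
        for prev, nxt in zip(ordered, ordered[1:]):
            if prev["company"] != nxt["company"]:
                pairs.append((prev["company"], nxt["company"]))
    return pairs

# A member who worked at Companies A, B, C in time order:
profiles = [
    [{"company": "A", "start": 2012}, {"company": "B", "start": 2015},
     {"company": "C", "start": 2018}],
]
print(build_transition_pairs(profiles))  # [('A', 'B'), ('B', 'C')]
```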
Definitions (peer company and peer score): Two companies A and B are peer companies if company A is among the top choices for employees in company B to transition to, and vice versa. This similarity between companies A and B is measured via a peer score, defined as the product of normalized transition probabilities of both transition directions (A to B and B to A). Based on the assumption that a transition leads to an increase in salary, a mutual transition tendency indicates a similarity in compensations between the two companies.
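For intuition, with raw transition counts the peer score could be computed as the product of the two normalized transition probabilities, roughly as follows (an illustrative sketch; the production system derives these probabilities from learned embeddings, and the exact normalization is given in the KDD paper):

```python
from collections import Counter

def peer_score(transitions, a, b):
    """Product of normalized transition probabilities in both
    directions: P(move to b | left a) * P(move to a | left b)."""
    out_counts = Counter()   # total outgoing transitions per company
    pair_counts = Counter()  # transitions per (origin, destination) pair
    for origin, dest in transitions:
        out_counts[origin] += 1
        pair_counts[(origin, dest)] += 1
    p_ab = pair_counts[(a, b)] / out_counts[a] if out_counts[a] else 0.0
    p_ba = pair_counts[(b, a)] / out_counts[b] if out_counts[b] else 0.0
    return p_ab * p_ba

transitions = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
print(peer_score(transitions, "A", "B"))  # (2/3) * (1/1) ≈ 0.667
```

The normalization by each company's total outgoing transitions keeps a very large company from dominating the score purely by volume.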
Calculate peer company score and generate peer company group: We developed an algorithm (Company2vec) for learning company embeddings from LinkedIn members' company transition data that uses techniques such as negative sampling and stochastic gradient descent to map each company to its latent representations.
Since our definition of peer companies considers directed transitions in both directions, and since a company may behave differently as a transition origin than as a destination, we learn separate origin and destination embeddings for each company.
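A minimal numpy sketch of this idea, assuming a skip-gram-style objective with negative sampling over transition pairs (the hyperparameters and sampling scheme are illustrative, not the production configuration):

```python
import numpy as np

def train_embeddings(pairs, n_companies, dim=8, lr=0.05,
                     epochs=200, n_neg=3, seed=0):
    """Learn separate origin (U) and destination (V) embeddings from
    (origin, destination) transition pairs via SGD with negative sampling."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_companies, dim))  # origin vectors
    V = rng.normal(scale=0.1, size=(n_companies, dim))  # destination vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for a, b in pairs:
            # Positive pair: pull U[a] and V[b] together.
            ua, vb = U[a].copy(), V[b].copy()
            g = sigmoid(ua @ vb) - 1.0
            U[a] -= lr * g * vb
            V[b] -= lr * g * ua
            # Negative samples: push U[a] away from random destinations.
            for n in rng.integers(0, n_companies, size=n_neg):
                if n == b:
                    continue
                ua, vn = U[a].copy(), V[n].copy()
                g = sigmoid(ua @ vn)
                U[a] -= lr * g * vn
                V[n] -= lr * g * ua
    return U, V
```

Keeping U and V distinct matters because a company that many people join from a given peer is not necessarily one that many people leave for that same peer.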
From these embeddings, we compute the peer scores as defined. With peer score as a similarity measure, we then generate for each company a list of its peer companies ranked by peer score in descending order and filtered by a minimum score value. This set of most similar companies is used as part of the Bayesian statistical model for smoothing and inferring company-level insights. Note that we analyze (company A to company B) transitions instead of ((title A, region A, company A) to (title B, region B, company B)) transitions due to insufficient data at finer granularities, and also for simplicity of modeling. Please refer to our ACM KDD 2018 paper for more technical details.
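Given learned origin and destination embeddings, one plausible way to turn them into ranked, filtered peer groups is sketched below (the softmax form of the transition probabilities is an assumption here; the exact normalization used in production is described in the paper):

```python
import numpy as np

def peer_groups(U, V, names, min_score=0.01):
    """Peer score = P(b | left a) * P(a | left b), with transition
    probabilities taken as a softmax over destination affinities; return
    each company's peers ranked by score and filtered by a minimum value."""
    scores = U @ V.T                      # scores[a, b]: a -> b affinity
    np.fill_diagonal(scores, -np.inf)     # exclude self-transitions
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    P = e / e.sum(axis=1, keepdims=True)  # P[a, b] = P(move to b | left a)
    peer = P * P.T                        # product of both directions
    groups = {}
    for i, name in enumerate(names):
        ranked = sorted(
            ((names[j], peer[i, j]) for j in range(len(names)) if j != i),
            key=lambda t: -t[1])
        groups[name] = [(n, s) for n, s in ranked if s >= min_score]
    return groups

rng = np.random.default_rng(0)
U, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(peer_groups(U, V, ["A", "B", "C"], min_score=0.0))
```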
Bayesian model for inferring insights for empty title-region-company cohorts
We next present a flexible Bayesian statistical model for predicting the compensation range for empty (title, region, company) cohorts, utilizing both the company-related information present in member-submitted compensation data and company similarities mined from LinkedIn members' company transition data using the Company2vec technique.
Decoupling/Recoupling: The main idea is to decouple the submitted (title, region, company) compensation data into two components: 1) (title, region)-wise compensation and 2) company-wise compensation adjustments, study them separately, and then integrate the inferences from both models together to obtain predictions for (title, region, company) compensation. There is a lot of heterogeneity both in compensation for the same title for different regions (e.g., Software Engineers in San Francisco vs. New York), and in compensation for the same region for different titles (e.g., Software Engineers vs. Nurses in New York). Therefore, instead of using a title-only or region-only component, we chose (title, region) as an integrated component in decoupling. The (title, region) component leverages a regression-model-based prediction approach from previous work, while the company component is modeled via a Bayesian model where a company is smoothed with peer company compensation data if there are enough submissions to its peer companies, regardless of which (title, region) the submissions are from, and smoothed by global information of all submitted compensation data otherwise. We then recouple results from both the (title, region) and company components to generate predictions for (title, region, company) compensation insights using statistical tools.
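The decoupling and recoupling steps can be sketched as an additive decomposition (a deliberate simplification of the actual model, which is specified in the KDD paper; a log-salary scale would typically be used in practice):

```python
from collections import defaultdict

def decouple(submissions):
    """Split each submission's salary into a (title, region) component
    and a residual company-wise adjustment."""
    by_tr = defaultdict(list)
    for (title, region, company), salary in submissions:
        by_tr[(title, region)].append(salary)
    tr_mean = {k: sum(v) / len(v) for k, v in by_tr.items()}
    adjustments = defaultdict(list)
    for (title, region, company), salary in submissions:
        adjustments[company].append(salary - tr_mean[(title, region)])
    return tr_mean, adjustments

def recouple(tr_mean, company_adj, title, region, company):
    """Predict an empty cohort by recombining the two components."""
    return tr_mean[(title, region)] + company_adj.get(company, 0.0)

submissions = [(("SWE", "SF", "X"), 150.0), (("SWE", "SF", "Y"), 130.0),
               (("SWE", "NY", "X"), 140.0)]
tr_mean, adj = decouple(submissions)
company_adj = {c: sum(v) / len(v) for c, v in adj.items()}
# No one from Company Y submitted data for (SWE, NY), yet we can predict it:
print(recouple(tr_mean, company_adj, "SWE", "NY", "Y"))  # 130.0
```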
Bayesian smoothing with peer company: We use a Bayesian model because it provides a flexible structure for incorporating external knowledge in the form of a prior. For a given company, we pool all of its company-adjustment data, regardless of the title or region each entry belongs to. The prior mean and variance of the company component are centered either at peer company information, estimated from the company adjustments of all of that company's peer companies, or at global information, estimated from all compensation submission entries across all companies. In our application, a company's prior is centered at peer company information when the size of its peer company group is at least a chosen threshold, and at global information otherwise. The data are modeled as normally distributed with a conjugate Normal-Inverse-Gamma prior, which yields a marginal estimate as a t-distributed random variable. Please refer to our ACM KDD 2018 paper for more technical details.
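As a sketch of this conjugate update: with a Normal likelihood and a Normal-Inverse-Gamma prior (mu0, kappa0, alpha0, beta0), the posterior parameters have closed form and the posterior predictive is a Student's t distribution. These are standard conjugacy results, shown here for illustration rather than as the exact production model:

```python
import math

def nig_posterior(data, mu0, kappa0, alpha0, beta0):
    """Conjugate update of a Normal-Inverse-Gamma prior with observed data.
    Returns the posterior parameters (mu_n, kappa_n, alpha_n, beta_n)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)  # within-sample sum of squares
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def posterior_predictive(mu_n, kappa_n, alpha_n, beta_n):
    """Posterior predictive for a new observation: Student's t with
    2 * alpha_n degrees of freedom, location mu_n, and the scale below."""
    scale = math.sqrt(beta_n * (kappa_n + 1) / (alpha_n * kappa_n))
    return 2 * alpha_n, mu_n, scale

# Prior centered at peer company information (values hypothetical):
post = nig_posterior([100.0, 110.0, 120.0], mu0=105.0, kappa0=1.0,
                     alpha0=2.0, beta0=50.0)
print(post)              # (108.75, 4.0, 3.5, 159.375)
print(posterior_predictive(*post))
```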
We next present the challenges encountered and the lessons learned through the production deployment of our computation system as part of the LinkedIn Salary platform for more than one year.
Similarity score for peer company modeling
We first explored using a Word2vec-based similarity measure (cosine similarity) between two companies for peer company modeling. However, we noticed that this measure does not differentiate between moving from Company A to Company B and moving in the opposite direction, and further does not model the combined transition probability in both directions. Hence, instead of adopting existing similarity measures, we introduced the new notion of peer score, which is specifically designed for our application scenario. For each pair of companies, the peer score jointly models both transition directions, and uses appropriate normalization to eliminate influence from the scale of a company.
Filtering by LinkedIn member information
Although we can predict the compensation range for a large number of (title, region, company) cohorts by taking the cross product between all (title, region) cohorts and all company components, many of these combinations may not even exist. For example, in our initial results, our model predicted the compensation range for software engineer positions at LinkedIn in many regions where LinkedIn did not have any presence. Since such cohorts do not provide meaningful value to LinkedIn members and can be thought of as adding noise to the ecosystem, we decided to keep only those cohorts that map to a sufficient number of LinkedIn members. We experimented with different thresholds on the minimum number of members needed for a cohort to be considered valid, and computed the product coverage for each threshold. We then chose a threshold that eliminates spurious cohorts while still retaining a significant increase in product coverage.
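The threshold sweep described above can be sketched as follows (an illustrative example with made-up counts, not production data):

```python
def coverage_by_threshold(cohort_member_counts, thresholds):
    """For each candidate minimum-member threshold, count how many
    predicted cohorts would be kept (i.e., the resulting coverage)."""
    return {t: sum(1 for c in cohort_member_counts if c >= t)
            for t in thresholds}

# Number of LinkedIn members mapping to each predicted cohort:
counts = [0, 1, 1, 3, 5, 8, 12]
print(coverage_by_threshold(counts, [1, 3, 5]))  # {1: 6, 3: 4, 5: 3}
```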
Given the desire to improve the quality as well as the coverage of compensation insights to benefit more LinkedIn members, we have been pursuing several modeling and engineering directions to extend this work. We plan to develop tools to collect inputs and feedback from recruiters, who typically have better knowledge of the compensation range for their function and industry. Such inputs could be useful for diagnosing cohorts with incorrectly-predicted compensation ranges, and potentially for correcting them. We would also like to incorporate richer features, such as years of experience and skills, as part of the prediction model, and detect and correct sample selection bias, response bias, and other biases using statistical techniques. Moreover, as compensation can change over time due to supply/demand changes, inflation, and other economic factors, we would like to take time into consideration when computing salary insights and explore approaches such as discounting and/or appropriately scaling old salary submissions, and building time series models. Finally, we would like to provide personalized compensation estimates for LinkedIn members, by taking into consideration each member's work experience, education background, skills, and other relevant attributes.
This blog post is based on an ACM KDD 2018 paper co-authored by Xi Chen, Yiqun Liu, Liang Zhang, and Krishnaram Kenthapadi. We would like to thank all other members of the LinkedIn Salary team for their collaboration in deploying our system as part of the launched product, and Stuart Ambler, Keren Baruch, Kinjal Basu, Rupesh Gupta, Santosh Kumar Kancha, Myunghwan Kim, Ram Swaminathan, and Ganesh Venkataraman for insightful feedback and discussions.