Using the LinkedIn Fairness Toolkit in large-scale AI systems
February 8, 2021
LinkedIn’s vision to create economic opportunity for every member of the global workforce would be impossible to realize without leveraging AI at scale. We use AI in our core product offerings to highlight new job openings for job seekers, surface relevant news articles for members, and help workers grow their network by recommending new connections.
Accompanying our use of AI, we have a company-wide commitment to responsible product design that is embodied in our guiding principles, company culture, and the work of our Engineering teams. Having a positive intent is not enough; we also need to know that we are having a positive impact. To that end, our engineering teams have committed to designing or updating the AI systems that power LinkedIn to be as fair as possible.
To scale these efforts, we recently developed and open-sourced the LinkedIn Fairness Toolkit (LiFT). LiFT uses commonly considered fairness definitions to enable the measurement of fairness in large-scale machine learning workflows. It can be deployed in training and scoring workflows to measure biases in training data, evaluate different fairness notions for ML models, and detect statistically significant differences in model performance across different subgroups. Since its initial release, we have been hard at work incorporating LiFT into our AI systems at LinkedIn and adding more functionality. In this blog post, we describe some of these recent advancements and provide updates on new implementations of our fairness work in our products.
How LiFT is being used at scale in LinkedIn
At LinkedIn, we provide LiFT as a tool for engineering teams to evaluate the fairness of training data and AI models. Designing such a generic tool for multiple products requires that it include the fairness definitions that are most applicable to the real-world challenges that our coworkers are trying to address. We discuss some of these fairness definitions and other important considerations for the use of LiFT in production in the case study below.
LiFTing unfairness out of large-scale AI systems
A guiding tenet for LinkedIn’s fairness efforts is ensuring that “two members who are equally qualified should have equal access to opportunity.” While the interpretation of “equally qualified” can differ greatly by product, and assessing the degree to which a member is qualified can be incredibly challenging, the outcomes experienced by members can be helpful in auditing whether a product is having a fair or unfair impact.
For example, People You May Know (PYMK) provides suggestions for members to connect with others, in order to build their network for professional activities like mentoring and job-seeking. In this system, a member sending an invitation to connect with a recommended match can be viewed as a positive outcome and a “qualified” recommendation.
The AI algorithms that underpin our PYMK recommendations learn from recommendations that result in successful matches. Frequent members (FMs; members who are more engaged on LinkedIn) tend to have greater representation in the data used to train these algorithms than their less active counterparts, infrequent members (IMs). As a result, the algorithms can become biased against infrequent members, and, more alarmingly, this behavior can worsen over time. Frequent members, who have better representation, are typically placed at the top of recommendations. Subsequently, these members can make even more connections, giving them further representation in the training data. Figure 1 illustrates the cycle causing this “rich-get-richer” phenomenon.
Figure 1: Bias in recommendation systems can come in through different mechanisms and be potentially reinforced over time.
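The cycle described above can be illustrated with a deliberately simple toy model. This is a sketch under stated assumptions, not a description of how PYMK actually behaves: the function name, the `amplification` exponent, and the update rule are all hypothetical, chosen only to show how a group that is over-represented in training data can become more over-represented with each retraining round.

```python
def simulate_feedback(fm_share=0.6, amplification=1.5, rounds=5):
    """Toy model of the "rich-get-richer" loop (all parameters are assumptions).

    fm_share      -- starting fraction of training examples involving frequent members
    amplification -- exponent > 1 models a ranker that over-rewards the majority group
    rounds        -- number of retraining rounds to simulate
    """
    shares = [fm_share]
    for _ in range(rounds):
        f = shares[-1]
        # Exposure each group receives grows faster than its data share.
        fm_weight = f ** amplification
        im_weight = (1.0 - f) ** amplification
        # Next round's training data mirrors the exposure from this round.
        shares.append(fm_weight / (fm_weight + im_weight))
    return shares


if __name__ == "__main__":
    for round_index, share in enumerate(simulate_feedback()):
        print(f"round {round_index}: frequent-member share = {share:.3f}")
```

In this toy model, any starting share above one half drifts upward every round, while a perfectly balanced starting point stays balanced.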
So how can we ensure that PYMK is fairly representing members from both groups and avoid reinforcing existing biases in networking behavior?
The first step is choosing a robust definition of fairness. Three of the most widely used fairness definitions are equality of opportunity, equalized odds, and predictive rate parity. Equality of opportunity requires that randomly chosen “qualified” candidates be represented equally regardless of which group they belong to; in other words, qualified candidates from every group should receive equal exposure. Equalized odds takes this definition a step further and requires that both “qualified” and “unqualified” candidates be treated similarly across groups, so that exposure is equal for qualified members and, separately, for unqualified members. Predictive rate parity requires that the score from the algorithm predict a candidate’s “quality” with equal precision across groups. Because these definitions can conflict with one another, the right definition to choose is often use-case specific. A more complete discussion of the considerations to make when choosing a fairness metric is given here.
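To make the three definitions concrete, the sketch below computes the quantity each one compares across groups, using binary labels (“qualified” or not) and binary predictions (shown or not). It is a minimal illustration in plain Python and does not use the LiFT API; the function names and the toy data are assumptions. Equality of opportunity compares true positive rates, equalized odds compares both true and false positive rates, and predictive rate parity compares precision.

```python
from collections import defaultdict


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def group_rates(labels, preds, groups):
    """Per-group true positive rate, false positive rate, and precision."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, p, g in zip(labels, preds, groups):
        # "t"/"f": prediction matches the label or not; "p"/"n": predicted class.
        key = ("t" if y == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    return {
        g: {
            "tpr": _ratio(c["tp"], c["tp"] + c["fn"]),
            "fpr": _ratio(c["fp"], c["fp"] + c["tn"]),
            "precision": _ratio(c["tp"], c["tp"] + c["fp"]),
        }
        for g, c in counts.items()
    }


def fairness_gaps(rates, group_a, group_b):
    """Absolute gaps between two groups under each fairness definition."""
    a, b = rates[group_a], rates[group_b]
    tpr_gap = abs(a["tpr"] - b["tpr"])
    fpr_gap = abs(a["fpr"] - b["fpr"])
    return {
        "equality_of_opportunity": tpr_gap,
        "equalized_odds": max(tpr_gap, fpr_gap),
        "predictive_rate_parity": abs(a["precision"] - b["precision"]),
    }


if __name__ == "__main__":
    # Toy data: four frequent members (FM) followed by four infrequent members (IM).
    labels = [1, 1, 0, 0, 1, 1, 0, 0]
    preds = [1, 1, 1, 0, 1, 0, 0, 0]
    groups = ["FM"] * 4 + ["IM"] * 4
    print(fairness_gaps(group_rates(labels, preds, groups), "FM", "IM"))
```

A gap of zero means the two groups are treated identically under that definition. In practice the gaps are rarely all zero at once, which is one way the definitions come into conflict.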
We have been working towards mitigation techniques for each of these definitions, hoping to provide practitioners with the tools they need for their applications. Fairness mitigation strategies commonly fall in one of three categories: pre-, in-, or post-processing.
Pre-processing techniques adjust the training data used to develop models, in the hope that reducing bias at this stage will lead to a fair model.
In-processing involves modifying model training algorithms to produce models that yield unbiased results.
Post-processing methods, of which our re-rankers are examples, transform the scores produced by a model so that the final output satisfies the chosen fairness definition.
Post-processing mitigation approaches have the advantage that they are model-agnostic, in the sense that they depend only on scores provided by a model. This flexibility lets engineers adjust the output of virtually any model to be fair, whereas other approaches are more application-specific. In 2018, we used post-processing re-ranking in LinkedIn Recruiter search to ensure that each page of results is gender-representative. Since this initial foray into post-processing re-ranking based on exposure, we have developed and tested methods to re-rank according to equality of opportunity and equalized odds, which we have applied to the PYMK problem of fairly representing infrequent and frequent members.
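As an illustration of why post-processing is model-agnostic, the sketch below re-ranks a scored candidate list using only each candidate’s score and group. It is a simple greedy, exposure-based re-ranker written for this post’s example, not the method used in production and not the equality-of-opportunity or equalized-odds re-rankers mentioned above; the function name, the tuple layout, and the `targets` parameter are assumptions.

```python
import math


def rerank(candidates, targets):
    """Greedy exposure-based re-ranking (illustrative sketch only).

    candidates -- list of (candidate_id, score, group) tuples from any model
    targets    -- dict mapping group -> minimum share of every prefix of the list
    """
    # Build one queue per group, each sorted by descending score.
    queues = {}
    for cand in sorted(candidates, key=lambda c: c[1], reverse=True):
        queues.setdefault(cand[2], []).append(cand)

    counts = {group: 0 for group in queues}
    ranked = []
    for k in range(1, len(candidates) + 1):
        # Groups with candidates left that are below their floor for a prefix of length k.
        behind = [
            g for g, q in queues.items()
            if q and counts[g] < math.floor(targets.get(g, 0.0) * k)
        ]
        # Serve a lagging group if there is one; otherwise any non-empty group.
        pool = behind if behind else [g for g, q in queues.items() if q]
        best = max(pool, key=lambda g: queues[g][0][1])
        ranked.append(queues[best].pop(0))
        counts[best] += 1
    return ranked


if __name__ == "__main__":
    scored = [
        ("a", 0.9, "FM"), ("b", 0.8, "FM"), ("c", 0.7, "FM"),
        ("d", 0.6, "IM"), ("e", 0.5, "IM"), ("f", 0.4, "FM"),
    ]
    for cand in rerank(scored, {"FM": 0.5, "IM": 0.5}):
        print(cand)
```

Within each group the model’s original ordering is preserved; only the interleaving between groups changes. That is the property that lets this kind of step sit behind virtually any scoring model.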
In PYMK, we chose to actively prevent the “rich-get-richer” phenomenon by giving qualified IMs and FMs equal representation in recommendations. In doing so, we saw more invites sent to IMs without any adverse impact on FMs: +5.44% in invitations sent to IMs and +4.8% in connections made by IMs, while remaining neutral on the respective metrics for FMs. This is an interesting result because typically, when invites are shifted from the FM group to the IM group, we would expect to see a metric increase for the latter and a decrease for the former. However, we observed neutral metrics for FMs and positive metrics for IMs, which indicates that recommendation quality has improved overall.
Of course, our fairness detection and mitigation efforts have extended beyond this illustrative example of fairness for frequent and infrequent members. A primary function of LiFT as used within LinkedIn will be to measure fairness to groups identified through protected attributes (such as age and gender). These applications come with additional privacy concerns, and we discuss one method for improving the anonymity of our system below.
Keeping demographic data anonymous
When dealing with any aspect of member data, it is of utmost importance to maintain the privacy of our members, especially their protected attributes. A core consideration for using LiFT internally has been developing a system that can provide all of our AI teams with insight into the fairness of their models without allowing each individual team access to protected attribute data.
To solve this problem, we employ a simple client-server architecture, in which the fairness evaluation is performed on a server that has access to Personally Identifiable Information (PII) containing protected attribute data. Each AI team (the client side) is provided the tool as a pluggable component, which an AI engineer can configure to submit a model evaluation request. The server processes the request and returns the analysis result to the client, without exposing the protected attribute information to the client. The server runs the fair analyzer library that supports LiFT. With this setup, member privacy is respected, in keeping with our ongoing commitment to responsible behavior.
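The separation of responsibilities described above can be sketched in a few lines. This is a hypothetical, in-process illustration and not the LiFT Service implementation: the class name, method name, and request format are assumptions, and the small-group suppression threshold is an extra safeguard added for the example rather than something stated in this post. The point it shows is that the client submits member IDs, labels, and predictions, and receives only per-group aggregates in return.

```python
from collections import defaultdict


class FairnessServer:
    """Hypothetical server side: the only component that holds protected attributes."""

    def __init__(self, protected_attributes, min_group_size=2):
        # member_id -> group; this mapping is never returned to callers.
        self._attrs = dict(protected_attributes)
        # Assumption for this sketch: suppress groups too small to report safely.
        self._min_group_size = min_group_size

    def evaluate(self, rows):
        """Return the per-group true positive rate for a client's model output.

        rows -- iterable of (member_id, label, prediction) supplied by the client
        """
        positives = defaultdict(int)
        hits = defaultdict(int)
        for member_id, label, prediction in rows:
            group = self._attrs.get(member_id)
            if group is None or label != 1:
                continue  # unknown member, or not a "qualified" example
            positives[group] += 1
            hits[group] += int(prediction == 1)
        # Only aggregates leave the server; no row-level attribute is exposed.
        return {
            group: hits[group] / total
            for group, total in positives.items()
            if total >= self._min_group_size
        }


if __name__ == "__main__":
    server = FairnessServer({1: "A", 2: "A", 3: "B", 4: "B", 5: "C"})
    # The client knows member IDs, labels, and predictions, but not the groups.
    request = [(1, 1, 1), (2, 1, 0), (3, 1, 1), (4, 1, 1), (5, 1, 1), (6, 1, 1)]
    print(server.evaluate(request))
```

In a real deployment the call would cross a network boundary with authentication and access controls; those details are omitted here.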
As we continue to develop and experiment with fairness mitigation approaches, we are working towards open-sourcing the most successful methodologies, including the post-processing techniques for equality of opportunity and equalized odds discussed above, as a new module of LiFT. We plan to leverage these techniques across all AI products to make LinkedIn a fairer platform. Finally, we also plan to continue adding large-scale fairness metrics into LiFT, including some stemming from our recent research.
We would like to thank Baolei Li, who contributed to the development of the LiFT Service framework for easy adoption within LinkedIn, and Ram Swaminathan, Romer Rosales, and Igor Perisic for their continued support of this project. We would also like to thank Sriram Vasudevan, Guillaume Saint-Jacques, YinYin Yu, Allison Liu, Kexin Fei, Albert Cui, and other members of the equity data group for the deep discussions about this work and related topics.