Addressing bias in large-scale AI applications: The LinkedIn Fairness Toolkit

Sriram Vasudevan

Machine Learning at LinkedIn

August 25, 2020

Co-authors: Sriram Vasudevan, Cyrus DiCiccio, and Kinjal Basu

At LinkedIn, our imperative is to create economic opportunity for every member of the global workforce, something that would be impossible to accomplish without leveraging AI at scale. We help members and customers make decisions by providing them with the most relevant insights based on the available data (e.g., the job listings that might be a good fit for their skills, or content that might be most relevant to their career). Along with the rest of the industry, our AI models use both implicit and explicit feedback in order to make these predictions.

News headlines and academic research have emphasized that widespread societal injustice based on human biases can be reflected both in the data that is used to train AI models and the models themselves. Research has also shown that models affected by these societal biases can ultimately serve to reinforce those biases and perpetuate discrimination against certain groups. Sadly, these examples persist even in models used to inform high-stakes decisions in high-risk fields, such as criminal justice and health care, owing to a range of complex historical and social factors.

At LinkedIn, we are working toward creating a more equitable platform by avoiding harmful biases in our models and ensuring that people with equal talent have equal access to job opportunities. In this post, we share the methodology we’ve developed to detect and monitor bias in our AI-driven products as part of our product design lifecycle.

Today’s announcement is the latest in a series of broader R&D efforts to avoid harmful bias on our platform, including Project Every Member. It is also a logical extension of our earlier efforts in fairness, privacy, and transparency in our AI systems, as well as “diversity by design” in LinkedIn Recruiter. Furthermore, there are additional company-wide efforts that extend beyond the scope of product design to help address these issues and close the network gap.

Towards fairness in AI-driven product design

There are numerous definitions of fairness for AI models, including disparate impact, disparate treatment, and demographic parity, each of which captures a different aspect of fairness to the users. Continuously monitoring deployed models and determining whether the performance is fair along these definitions is an essential first step towards providing a fair member experience.
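To make one of these definitions concrete, demographic parity asks whether a model produces positive predictions at the same rate for every group. The following is a minimal Python sketch of that check, not LiFT's API; the function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def positive_rates(predictions, groups):
    """Return the fraction of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = positive_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: binary predictions and one protected attribute.
preds = [1, 0, 1, 1, 0, 1, 0, 0]
grps = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(positive_rates(preds, grps))
print(demographic_parity_difference(preds, grps))
```

A difference of zero means the groups receive positive predictions at the same rate. How large a gap is acceptable depends on the application and is not something this sketch decides.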

Although several open source libraries tackle such fairness-related problems (FairLearn, IBM Fairness 360 Toolkit, ML-Fairness-Gym, FAT-Forensics), these either do not specifically address large-scale problems (and the inherent challenges that come with such scale) or they are tied to a specific cloud environment. To this end, we developed and are now open sourcing the LinkedIn Fairness Toolkit (LiFT), a Scala/Spark library that enables the measurement of fairness, according to a multitude of fairness definitions, in large-scale machine learning workflows.

Introducing the LinkedIn Fairness Toolkit (LiFT)

The LinkedIn Fairness Toolkit (LiFT) library has broad utility for organizations that wish to conduct regular analyses of the fairness of their own models and data.

It can be deployed in training and scoring workflows to measure biases in training data, evaluate different fairness notions for ML models, and detect statistically significant differences in their performance across different subgroups. It can also be used for ad hoc fairness analysis or as part of a large-scale A/B testing system.
Currently supported metrics include: different kinds of distances between observed and expected probability distributions; traditional fairness metrics (e.g., demographic parity, equalized odds); and fairness measures that capture a notion of skew, such as the Generalized Entropy Index, Theil’s Indices, and Atkinson’s Index.
LiFT also introduces a novel metric-agnostic permutation testing framework that detects statistically significant differences in model performance (as measured according to any given assessment metric) across different subgroups. This testing methodology will appear at KDD 2020.
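As an illustration of the skew-style measures named above, the Generalized Entropy Index summarizes how unevenly a vector of per-member "benefits" is distributed, and the two Theil indices are its special cases at alpha = 1 and alpha = 0. The sketch below uses the standard textbook formula in plain Python. It is not LiFT's implementation, and the benefit values are hypothetical.

```python
import math


def generalized_entropy_index(benefits, alpha=2.0):
    """Generalized Entropy Index of a vector of strictly positive benefits.

    alpha=1 gives the Theil T index and alpha=0 gives the Theil L index
    (mean log deviation). A value of 0 means benefits are perfectly equal.
    """
    n = len(benefits)
    mean = sum(benefits) / n
    ratios = [b / mean for b in benefits]
    if alpha == 1:
        return sum(r * math.log(r) for r in ratios) / n
    if alpha == 0:
        return -sum(math.log(r) for r in ratios) / n
    return sum(r ** alpha - 1 for r in ratios) / (n * alpha * (alpha - 1))


# Hypothetical benefit vectors: one perfectly equal, one skewed.
equal = [1.0, 1.0, 1.0, 1.0]
skewed = [1.0, 1.0, 1.0, 5.0]

print(generalized_entropy_index(equal))
print(generalized_entropy_index(skewed, alpha=2.0))
print(generalized_entropy_index(skewed, alpha=1))  # Theil T
print(generalized_entropy_index(skewed, alpha=0))  # Theil L
```

How a benefit is derived from labels and predictions is a modeling choice that this sketch leaves open.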

In the remainder of this post, we will provide a high-level overview of various aspects of LiFT’s design, then delve into the details of our permutation testing methodology and discuss how it overcomes the limitations of conventional permutation tests (and other fairness metrics). Finally, we’ll share some of our thoughts around future work.

The LinkedIn Fairness Toolkit (LiFT)

To enable deployments in web-scale ML systems, we built LiFT to be:

Flexible: It is usable for exploratory analyses (e.g., with Jupyter notebooks) and can be deployed in production ML workflows as well. The library comprises bias measurement components that can be integrated into different stages of an ML training and serving system.
Scalable: Computation can be distributed over several nodes to scale bias measurement to large datasets. It leverages Apache Spark to ensure that it can operate on datasets stored on distributed file systems while achieving data parallelism and fault tolerance. Utilizing Spark also provides compatibility with a variety of offline compute systems, ML frameworks, and cloud providers, for maximum flexibility.

Flexibility
To enable its use in ad hoc exploratory settings as well as in production workflows and ML pipelines, LiFT is designed as a reusable library at its core, with wrappers and a configuration language meant for deployment. This provides users with multiple interfaces to interact with the library, depending on their use case.

Figure 1: Interaction of the LinkedIn Fairness Toolkit (LiFT) with external systems. The flowchart shows how configuration-driven Spark jobs and ML plugins enable fairness metric computation in user workflows and ML systems, respectively.

As shown in Figure 1, for users who prefer to interact with LiFT at its highest level, the library provides a basic driver program powered by a simple configuration, allowing quick and easy deployment in production workflows. This enables fairness measurement for datasets and models without the need to write code and related unit tests. At LinkedIn, LiFT also integrates with our in-house ML training system, and this integrated version accepts the same user-provided configurations. This way, teams can move between custom workflows and LinkedIn’s ML framework without having to rewrite their LiFT configurations (Figure 2).

Figure 2: Example of a user-provided configuration

To enable use cases where developers need to interact directly with the library, LiFT also provides access to higher-level and lower-level APIs that can be used to compute fairness metrics at various levels of granularity, with the ability to extend key classes to enable custom computation. Figure 3 provides an overview of how these APIs and classes are organized within the library. The high-level APIs can be used to compute the whole host of metrics available, with parameterization handled by appropriate configuration classes. The low-level APIs enable users to integrate just a few metrics into their applications, or extend the provided classes to compute custom metrics.

As mentioned earlier, LiFT also provides a permutation test that is performance metric agnostic. Metrics available out of the box (like Precision, Recall, False Positive Rate (FPR), and Area Under the ROC Curve (AUC)) can be used with this test, and a CustomMetric class is also provided for users to extend, so that they can define their own User Defined Functions (UDFs) to plug into this test. While LiFT natively supports classification metrics, the CustomMetric class also enables users to define metrics that allow it to be used in ranking scenarios.

Figure 3: Design of the LinkedIn Fairness Toolkit (LiFT), showing the interaction between high- and low-level APIs and classes

Scalability
LiFT leverages Apache Spark to scale the computation of fairness metrics over large-scale datasets. Specifically, it uses the following techniques:

Single job to load data files into in-memory, fault-tolerant, and scalable data structures.
Strategic caching of datasets and any pre-computation performed.
Balancing distributed computation with single-system execution to obtain a good mix of scalability and speed.

The datasets are loaded into Spark DataFrames, with only the primary key, labels, predictions, and protected attributes being projected and cached. Data distributions and benefit vectors (see Figure 3) are computed in a distributed manner across a given set of dimensions (such as age, gender, and race), and the results are stored in memory on a single system, so that subsequent fairness metric computations can be performed quickly and efficiently. Users can operate on these precomputed distributions for quick and easy computation, or work with the projected and cached dataset for more involved metrics.
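The idea of computing a distribution once and then answering many metric queries from the small cached result can be sketched without Spark. The plain-Python example below is a single-machine stand-in for the distributed aggregation described above, not LiFT code; the function names, the `label` field, and the rows are all hypothetical.

```python
from collections import Counter


def joint_distribution(rows, dimensions, label_key="label"):
    """Count rows for every (protected attribute values..., label) combination."""
    return Counter(
        tuple(row[d] for d in dimensions) + (row[label_key],) for row in rows
    )


def positive_label_rate(dist, group):
    """Fraction of positive labels for one group, read from the cached counts."""
    pos = dist[group + (1,)]
    neg = dist[group + (0,)]
    return pos / (pos + neg)


# Hypothetical rows, already projected down to the fields that matter.
rows = [
    {"gender": "f", "label": 1},
    {"gender": "f", "label": 0},
    {"gender": "f", "label": 1},
    {"gender": "f", "label": 1},
    {"gender": "m", "label": 1},
    {"gender": "m", "label": 0},
]

# One pass over the data; later metrics only touch the small Counter.
dist = joint_distribution(rows, ["gender"])
print(dist)
print(positive_label_rate(dist, ("f",)))
print(positive_label_rate(dist, ("m",)))
```

In a Spark setting, the single pass would typically be a grouped count over the cached DataFrame, with the much smaller result collected to one machine.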

For more detail on how we optimize LiFT for more efficient computation of different kinds of tests and analyses, see the preprint of our CIKM 2020 paper on computing fairness metrics at scale.

Example outputs
In order to accommodate the variety of metrics measured, LiFT makes use of a generic FairnessResult case class to capture results. Shown in Figure 4 is the corresponding Avro schema used to write out these results when the included driver program is used. Figures 5 and 6 show examples of dataset and model metrics computed using an example dataset.

Figure 4: Schema used to write out the LinkedIn Fairness Toolkit (LiFT)’s metrics

Figure 5: Examples of dataset metrics computed

Figure 6: Examples of model metrics computed

Evaluating fairness using permutation tests

Permutation tests are a classic statistical tool for comparing populations. They are a type of significance test of the hypothesis that two populations are identical. They are performed by computing a test statistic across all “permutations” of the dataset attained by shuffling the population labels, and determining how extreme the original test statistic is among all such recomputed test statistics.
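For readers unfamiliar with the procedure, here is a minimal sketch of a classic two-sample permutation test in plain Python. It is the conventional test, not LiFT's modified version, and it samples random shuffles rather than enumerating every permutation. The function names, the choice of a difference in means as the test statistic, and the data are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Test statistic: mean of group 'a' minus mean of group 'b'."""
    a = [v for v, g in zip(values, labels) if g == "a"]
    b = [v for v, g in zip(values, labels) if g == "b"]
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_test(values, labels, num_permutations=2000, seed=0):
    """Two-sided Monte Carlo permutation test; returns an estimated p-value."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(shuffled)  # reassign group labels at random
        if abs(mean_difference(values, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (extreme + 1) / (num_permutations + 1)


# Hypothetical per-member scores for two clearly separated groups.
values = [0.90, 0.80, 0.85, 0.95, 0.90, 0.88,
          0.50, 0.55, 0.60, 0.45, 0.52, 0.58]
labels = ["a"] * 6 + ["b"] * 6

print(permutation_test(values, labels))
```

A small p-value says only that the two groups are unlikely to be identically distributed.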

While a seemingly obvious choice for comparing groups of members, permutation tests can fail to provide accurate directional decisions regarding fairness. That is, when rejecting the hypothesis that two populations are identical, the practitioner cannot necessarily conclude that a model is performing better for one population compared with another. LiFT implements a modified version of permutation tests that is appropriate for assessing the fairness of a machine learning model across groups of users, allowing practitioners to draw meaningful conclusions. For a further discussion of this topic, we recommend reviewing the KDD 2020 paper describing these tests in detail.

Ongoing work

The LinkedIn Fairness Toolkit (LiFT) has already been used to measure the fairness metrics of training datasets for several models prior to their training at LinkedIn. Although we only have a limited subset of member demographic data, our use of the tool has been encouraging so far.

LiFT’s permutation tests determined, after an exploratory analysis, that our Job Search model provides equally valid results with regard to gender. That is, the model in production showed no significant difference in the probability of ranking a positive/relevant result above an irrelevant result between men and women.
We have also been using it for the last year in the development of our prototype anti-harassment classification systems. We looked at the model’s performance across geographic regions and gender. While the model showed the highest precision when identifying harassing content in messages from men, across all regions, we found that the model was slightly more precise among English-speaking female members in the U.S. versus those in the U.K. The system relies on human-in-the-loop verification of flagged content, so this was determined to be an acceptable design trade-off.
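The quantity described for the Job Search analysis, the probability of ranking a relevant result above an irrelevant one, is the pairwise form of AUC. The sketch below shows how that quantity could be computed per group in plain Python. It is an illustration only: the function names are hypothetical, the data is made up and unrelated to any LinkedIn model, and it does not perform the significance test itself.

```python
def pairwise_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win; this equals the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """Compute pairwise AUC separately for each value of a protected attribute."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, x in enumerate(groups) if x == g]
        result[g] = pairwise_auc(
            [scores[i] for i in idx], [labels[i] for i in idx]
        )
    return result


# Hypothetical scores, relevance labels, and group membership.
scores = [0.9, 0.8, 0.3, 0.2, 0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(auc_by_group(scores, labels, groups))
```

An observed gap between groups is only a starting point; whether it is statistically significant is the question the permutation testing framework is designed to answer.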

We aim to increase the number of pipelines where we’re measuring and mitigating bias on an ongoing basis through deeper integration of LiFT with our Pro-ML stack. There is also foundational work being done by the AI team at LinkedIn to develop new techniques that measure fairness and mitigate bias at scale for recommendation systems and ranking problems (arXiv preprint). Special focus is also being placed on two-sided marketplace problems (arXiv preprint). We hope to also open source these efforts as part of the LinkedIn Fairness Toolkit (LiFT) library.

The LinkedIn Fairness Toolkit (LiFT) is now open source

LiFT has been open sourced and is now available on GitHub. For more details on the library and the various metrics supported, we invite you to take a look at our GitHub documentation for the most up-to-date information. We welcome contributions to grow the repertoire of algorithms supported.

Acknowledgments

We would like to thank Krishnaram Kenthapadi, who contributed to the development of the LinkedIn Fairness Toolkit (LiFT) while at LinkedIn, and Ram Swaminathan, Deepak Agarwal, and Igor Perisic for their support of this project. We would also like to thank the Core Fairness AI Team members, Noureddine El Karoui, Preetam Nandy, Amir Sepehri, Heloise Logan, Yunsong Meng, and Souvik Ghosh, for the deep discussions about this work and related topics. And finally, thanks to Carlos Faham, Divya Gadde, Girish Kathalagiri, Grace Tang, Romil Bansal, and Xin Wang for working with us through the development and deployment of LiFT.
