Fairness, Privacy, and Transparency by Design in AI/ML Systems
July 26, 2019
Co-authors: Stuart Ambler, Ahsan Chudhary, Mark Dietz, Sahin Cem Geyik, Krishnaram Kenthapadi, Ian Koeppe, Varun Mithal, Guillaume Saint-Jacques, Amir Sepehri, Thanh Tran, and Sriram Vasudevan
Editor’s note: A shorter version of this article was originally posted by Krishnaram Kenthapadi on LinkedIn.
How do we take fairness and transparency into account while developing machine-learned models and systems? How do we protect the privacy of users when building large-scale, AI-based systems? Algorithmic models and techniques are now increasingly used as part of decision making in a variety of internet applications due to factors such as ubiquitous connectivity, the ability to collect, aggregate, and process large amounts of fine-grained data using cloud computing, and the ease of applying sophisticated machine learning models. Accordingly, the fairness, privacy, and security implications of these systems have become a mainstream topic in contemporary cultural discourse, in addition to a long-standing and growing body of academic work.
Recent news headlines and studies about AI systems and broader social inequalities illustrate the need not only for privacy rights, but also for consideration of related dimensions such as fairness, accountability, and explainability of AI/ML systems. By ensuring fairness, accountability, confidentiality, and transparency for users in such applications, we can help to enhance trust and long-term engagement.
AI at LinkedIn
LinkedIn aspires to connect talent with opportunity at scale by making use of LinkedIn’s Economic Graph, and the AI team at LinkedIn focuses on a mission of delivering the right information to the right user at the right time through the right channel. In terms of scale, we process petabytes of data every day, both offline and nearline, have several billion parameters across our ML models, and perform hundreds of A/B experiments every week. In terms of our philosophical approach to the field, members of the LinkedIn Data organization have published several blog posts over the years that discuss the ethics of the data science process in general, specific challenges in AI ethics, and general philosophical thoughts on data-augmented intelligence and decision making.
We emphasize these past efforts to show that building fair, secure, and privacy-preserving AI is in strong alignment with LinkedIn’s company mission and vision, as well as our ethical principles. In this post, we’ll elaborate on our efforts along each of these dimensions.
Fairness and accountability by design
By fairness, we mean that the AI/ML models used for making decisions or predictions are not biased with respect to protected attributes such as gender, race, and age. By accountability, we mean that it should be possible to identify and assign responsibility for a decision made by the AI system. Implicit in the formulation of both of these definitions is the concept of “harm reduction” for end users.
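To make the notion of "not biased with respect to a protected attribute" concrete, here is a minimal sketch (the metric choice and function name are ours for illustration, not a description of LinkedIn's production checks) that computes the demographic-parity difference of a binary classifier, i.e., the largest gap in positive-decision rates across groups:

```python
from collections import defaultdict

def demographic_parity_difference(predictions, groups):
    """Max difference in positive-prediction rate across protected groups.

    predictions: iterable of 0/1 model decisions
    groups: iterable of protected-attribute values, aligned with predictions
    A value near 0 suggests the decision rate is similar across groups.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)

# Toy example: group "a" is selected at rate 0.75, group "b" at 0.25.
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(preds, groups))  # 0.5
```

Demographic parity is only one of several formal fairness criteria (equal opportunity and equalized odds are common alternatives), and the right choice depends on the application.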
If you would like to learn about the application of fairness-aware machine learning techniques in practice, we invite you to attend the upcoming tutorial at KDD'19, where we will discuss industry best practices and case studies, share the lessons learned from working on fairness in ML at companies such as Google, LinkedIn, and Microsoft, and present open problems and research directions. We also invite interested researchers and industry practitioners to attend the KDD Social Impact Workshop session on “Building features which benefit every member: Measuring inequality in the individual treatment effects in online experiments,” wherein we showcase our approach to measuring not just the average impact of interventions, but also their inequality impact amongst our members, as a building block to ensuring economic fairness.
Please refer to these excellent resources on fairness (related: tutorial, course) and on accountability to obtain an overview of these concepts as well as learn about key papers on these topics. Also, please refer to our earlier blog post and our KDD’19 paper for the fairness-aware ranking methodology and the technical architecture of our representative talent-search system that has been deployed to all users of the LinkedIn Recruiter product worldwide.
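As a flavor of what fairness-aware ranking can look like, the following is a simplified greedy re-ranking sketch in the spirit of the approach described in our paper: at each rank, prefer the highest-scoring candidate, but promote a candidate from an under-represented attribute group whenever that group has fallen below its minimum target count. All names and the specific rule here are illustrative; the deployed algorithm and its parameters differ.

```python
import math

def rerank_with_targets(candidates, target_props):
    """Greedy re-ranking sketch.

    candidates: list of (score, attribute) tuples.
    target_props: dict mapping attribute value -> desired proportion.
    At rank k, each attribute value should appear at least
    floor(target proportion * k) times; if one is short, the best-scoring
    candidate with that attribute is promoted.
    """
    remaining = list(candidates)
    ranked = []
    counts = {a: 0 for a in target_props}
    while remaining:
        k = len(ranked) + 1
        # Attribute values currently below their minimum required count.
        below = [a for a, p in target_props.items()
                 if counts[a] < math.floor(p * k)]
        pick = None
        if below:
            eligible = [c for c in remaining if c[1] in below]
            if eligible:
                pick = max(eligible, key=lambda c: c[0])
        if pick is None:
            pick = max(remaining, key=lambda c: c[0])
        remaining.remove(pick)
        ranked.append(pick)
        counts[pick[1]] = counts.get(pick[1], 0) + 1
    return ranked

# Without targets, the top two slots would both go to "m" candidates;
# with 50/50 targets, the ranking alternates.
ranking = rerank_with_targets(
    [(0.9, "m"), (0.8, "m"), (0.7, "f"), (0.6, "f")],
    {"f": 0.5, "m": 0.5})
print([attr for _, attr in ranking])  # ['m', 'f', 'm', 'f']
```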
Privacy and security by design
Protecting the privacy of users and confidentiality of user data is a key requirement of web-scale applications and systems such as web search, recommender systems, crowdsourced platforms, and analytics applications, and, as discussed earlier, has witnessed a renewed focus in light of recent data breaches and new regulations.
If you would like to learn about the application of privacy techniques in practice, we invite you to view the slides for our past tutorial at The Web Conference 2019. It includes an overview of privacy concerns that have come to light over the last two decades, the lessons learned, and the evolution of privacy techniques leading up to the definition and techniques of differential privacy. In these slides, you can see the case studies presented on this topic, such as Apple's differential privacy deployment for iOS/macOS, Google's RAPPOR, LinkedIn Salary and LinkedIn's PriPeARL framework for privacy-preserving analytics and reporting, and Microsoft's differential privacy deployment for collecting Windows telemetry.
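At the core of several of these deployments is the Laplace mechanism: adding calibrated noise to a query answer so that any one individual's presence is statistically masked. The sketch below (illustrative only, not LinkedIn's production PriPeARL implementation) releases a count with epsilon-differential privacy:

```python
import math
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon):
    """Release a count with epsilon-differential privacy.

    Adding or removing one individual changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    Rounding and clamping keep the released value a plausible count.
    """
    noisy = true_count + laplace_noise(1.0 / epsilon)
    return max(0, round(noisy))
```

Smaller values of epsilon give stronger privacy but noisier answers; choosing epsilon, and accounting for repeated queries over the same data, is a central design question in any real deployment.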
On this blog, we have written previously about LinkedIn’s approach to tackling the challenge of delivering robust, granular analytics while protecting member privacy. Please refer to our earlier blog post and our ACM CIKM 2018 paper for a description of the key privacy and use requirements, system design and architecture, key modeling components, experimental results, and more lessons learned from the production deployment of our system at LinkedIn.
Privacy and security design for LinkedIn Salary
Screenshot of “User Experience Designer” search.
The LinkedIn Salary product, launched in Nov. 2016, allows members to explore compensation insights by searching for different titles, companies, and regions. For each (title, region) combination, we present the distribution of base salary, bonus, and other types of compensation, the variation of pay based on factors such as experience, education, company size, and industry, and the highest-paying regions, industries, and companies. These insights are generated based on data collected from LinkedIn members using a combination of techniques to protect user privacy (such as encryption, access control, de-identification, and thresholding) and modeling techniques (such as outlier detection, Bayesian hierarchical smoothing, and inference) for ensuring robust, reliable insights.
Considering the sensitive nature of compensation data and the importance of protecting user privacy, a key requirement is to design our system such that it is protected against a data breach and that no individual’s compensation data can be inferred by observing the outputs of the system. Further, we require the compensation insights to be generated based on only cohort-level data containing de-identified compensation submissions (e.g., salaries for UX Designers in San Francisco Bay Area), limited to those cohorts having at least a minimum number of entries. Our problem can thus be stated as follows: How do we design the LinkedIn Salary system to meet the immediate and future needs of LinkedIn Salary and other LinkedIn products? How do we design our system taking into account the unique privacy and security challenges, while also addressing the product requirements?
High-level system architecture, consisting of components/services pertaining to collection & storage, de-identification & grouping, and insights & modeling.
Our system uses a service-oriented architecture and consists of the following three key components: a collection and storage component (corresponding to steps 1-5 in the above diagram), a de-identification and grouping component (steps 6-11), and an insights and modeling component (steps 12-14). The collection and storage component of our system is responsible for allowing members to submit their compensation information, collecting different member attributes, and securely and privately storing the member attributes as well as the submitted compensation data. The de-identification and grouping component is responsible for constructing de-identified collections (“cohorts”) of member-submitted compensation information and ensuring that each cohort contains at least a certain minimum number of entries (inspired by k-Anonymity), before it is made available for offline data processing and analysis. For example, the cohort, “UX Designers at Google in San Francisco Bay Area” would be made available once there are enough entries, while the cohort, “CEOs at LinkedIn in San Francisco Bay Area” would never be made available since it can contain at most one entry. Please refer to our IEEE PAC 2017 paper for a detailed description of the architecture, as well as the security and de-identification mechanisms.
We illustrate our design through a simplified example, wherein we do not show details associated with encryption, access control, service and data separation, and other security mechanisms. Suppose that a hypothetical member, Charlotte, working as a UX Designer at Google in San Francisco Bay Area, submits her salary to LinkedIn. Her salary is then associated with corresponding de-identified cohorts as shown below. Out of these, only cohorts having at least a minimum number of entries are made available for analysis and processing as part of insights computation.
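The cohort mapping and thresholding in this example can be sketched as follows. The cohort keys and the threshold value of 10 below are illustrative assumptions for the sketch; the real system derives many more attribute combinations, and we do not disclose the actual threshold here.

```python
def cohorts_for_member(title, company, region, industry=None):
    """Generate the de-identified cohorts a salary submission maps to.

    Only attribute combinations are kept; the member's identity is not
    part of the cohort key.
    """
    cohorts = [
        (title, region),
        (title, company, region),
    ]
    if industry:
        cohorts.append((title, industry, region))
    return cohorts

def releasable_cohorts(submissions_by_cohort, min_entries=10):
    """Keep only cohorts with at least `min_entries` submissions
    (k-anonymity-style thresholding); smaller cohorts are withheld."""
    return {cohort: entries
            for cohort, entries in submissions_by_cohort.items()
            if len(entries) >= min_entries}

# Charlotte's submission joins the "UX Designer" cohorts; only cohorts
# that clear the threshold ever reach insights computation.
print(cohorts_for_member("UX Designer", "Google", "SF Bay Area",
                         industry="Internet"))
```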
You can also read our prior blog post that discusses the architecture of LinkedIn Salary, as well as some of the security and de-identification mechanisms for that product.
Transparency and explainability by design
Transparency refers to the requirement that the end user can understand how a decision or prediction is made by an AI system. Closely related to transparency is explainability, which corresponds to articulating why a decision or prediction is made. As a result of the increasing influence of AI systems in our day-to-day experiences and recent regulatory “right to explanation” provisions, which focus on the transparency of data-driven automated decision-making, we have witnessed a growing demand for model transparency and interpretability. In addition, model explainability is a prerequisite for building trust and adoption of AI systems in high-stakes domains requiring reliability and safety, such as healthcare and automated transportation, and in critical industrial applications with significant economic implications, such as predictive maintenance, exploration of natural resources, and climate change modeling.
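The simplest setting in which "why was this prediction made" has an exact answer is a linear model, where the score decomposes additively into per-feature contributions; this is the special case that attribution methods such as SHAP generalize to complex models. A minimal sketch (feature names are hypothetical):

```python
def linear_contributions(weights, bias, features):
    """Per-feature contributions of a linear model's score.

    For a linear model, score = bias + sum(w_i * x_i), so each term
    w_i * x_i is an additive explanation of the prediction.
    """
    contribs = {name: weights[name] * value
                for name, value in features.items()}
    score = bias + sum(contribs.values())
    return score, contribs

# Hypothetical model: score driven mostly by years of experience.
score, contribs = linear_contributions(
    weights={"experience": 2.0, "skills": 1.0},
    bias=0.5,
    features={"experience": 3, "skills": 4})
print(score, contribs)  # 10.5 {'experience': 6.0, 'skills': 4.0}
```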
If you would like to learn more about explainable AI in industry, we invite you to attend our upcoming tutorial at KDD'19. We will first give an overview of model interpretability and explainability in AI, along with techniques and tools for providing explainability as part of AI/ML systems. Then, we will focus on the application of explainability techniques in industry, wherein we will present practical challenges and guidelines for effectively using explainability techniques, along with case studies spanning application domains such as hiring, sales, lending, and fraud detection.
In previous blog posts, we highlighted the technical aspects of “fairness, accountability, confidentiality, and transparency by design” implementations. An important lesson we have learned is that building consensus and achieving collaboration across key stakeholders (such as product, legal, PR, engineering, and AI/ML teams) is a prerequisite for successful adoption of this approach in practice. Another key lesson is that we need to focus on these dimensions during all phases of the AI Lifecycle (Problem Formation, Dataset Construction, Algorithm Selection, Training Process, Testing Process, Deployment, and Monitoring/Feedback).
Please join us at KDD’19 and share your own experiences implementing fairness, privacy, and transparency by design.
Work of this nature is not possible without the efforts of several cross-functional, company-wide teams. In particular, we would like to thank all members of the LinkedIn Talent Solutions Diversity team, the LinkedIn Ad Analytics and Reporting team, and the LinkedIn Salary team for their collaboration in deploying our systems as part of the respective launched products. We would also like to thank Deepak Agarwal, Parvez Ahammad, Kinjal Basu, Erik Buchanan, Bee-Chung Chen, Patrick Cheung, Stephanie Chou, Tim Converse, Gil Cottle, Tushar Dalvi, Cyrus DiCiccio, Patrick Driscoll, Anthony Duerr, David Durfee, Nadia Fawaz, Joseph Florencio, David Freeman, Meg Garlinghouse, Taylor Greason, Gurwinder Gulati, Ashish Gupta, David Hardtke, Sara Harrington, Joshua Hartman, Parul Jain, Prateek Janardhan, Santosh Kumar Kancha, Nicolas Kim, Rachel Kumar, Sharon Lee, Heloise Logan, Divyakumar Menghani, Preetam Nandy, Lei Ni, Igor Perisic, Rohit Pitke, Hema Raghavan, Ryan Rogers, Rong Rong, Ryan Sandler, Badrul Sarwar, Cory Scott, Arun Swami, Ram Swaminathan, Ketan Thakkar, Janardhanan Vembunarayanan, Ganesh Venkataraman, Hinkmond Wong, Ya Xu, Lin Yang, Yang Yang, Chenhui Zhai, Liang Zhang, Yani Zhang, Lu Zheng, and Yang Zhou for their direct contributions and/or insightful feedback and discussions. Finally, we would like to thank collaborators from other companies/organizations on our tutorials: Sarah Bird, Krishna Gade, Ben Hutchinson, Emre Kiciman, Ilya Mironov, Margaret Mitchell, Ben Parker, Ankur Taly, and Abhradeep Guha Thakurta.