A/B Testing/Experimentation

Building inclusive products through A/B testing

Co-authors: Guillaume Saint-Jacques, Amir Sepehri, Nicole Li, and Igor Perisic

Introduction

Previously on this blog, we’ve shared information on best practices in data science, particularly in areas such as A/B testing. We’ve also discussed the importance of ethics in fields such as data science, early implementations of “fairness by design” principles in our products, and our commitment to sharing our research in order to further the industry conversation about designing systems that spread economic opportunity. These findings are shared with the aim of highlighting the real-world positive impacts of data science and encouraging further industry discussion around best practices in responsible product design.

In this post, we discuss a novel approach to integrating product A/B testing with inequality measurement concepts from the field of economics. We also discuss the methodology we have adopted for lowering barriers to economic opportunity in the ways different groups of members use our products. Finally, we provide examples of how it is helping to reshape research and design practices at LinkedIn, through a few selected case studies from the thousands of network A/B tests that we have already analyzed.

It is worth emphasizing that the term “inequality” is used throughout this blog post in the following ways:

  • To establish inequality baselines, we use the Atkinson inequality index, which can be applied to any metric, and captures how unequally it is distributed (if everyone has the same amount of that metric, inequality is 0; if some people have a large amount and others nothing, inequality is high). It is routinely applied to income or wealth by economists. Here, we are applying it to metrics that capture economic opportunity for our members on LinkedIn.
  • To measure the impact of our experiments on these baselines, we use inequality impact, which measures the effect an experiment has on baseline inequality in our metrics. For example, if job applications are very unequally distributed, and an intervention makes them more equally distributed (e.g., by helping people who normally apply to few jobs apply to more of them), we say that it has an inequality-reducing impact on job applications.

Product design, fairness in AI, and A/B testing

In recent years, researchers and industry experts have devoted a great deal of time to exploring the unintended consequences of applied technologies. Three primary areas of concern to many of us in the technology industry include:

  1. The tendency for algorithmic systems to “learn” or otherwise encode real-world biases in their operation (and then further amplify/reinforce those biases);
  2. The potential for product design to benefit some groups of users more than others;
  3. Sparse or poor-quality data that leads to objective-setting errors and system designs that produce suboptimal outcomes for many groups of end users.

While there are many well-documented examples of these and other types of problems in the technology industry, developing a data-driven solution is not a straightforward task (see recent publications from SafeAI@AAAI, FAccT, and others).

Towards a framework for addressing fairness issues in products
Given the complexity of this topic, there are likely many ways we could go about ensuring that our members benefit as equally as possible from our products. Before showing our solution to the problem, we also want to advance a set of principles that underpin our thinking:

  • First, the end result after people have engaged with a product should be considered as important as whether an algorithm is “intrinsically” representative or fair. For example, any system that seems to treat men and women similarly, but still results in women becoming disengaged over time, is generally undesirable. These kinds of outcomes may be due to a host of reasons that exhibit patterns of structural inequality in the real world, such as social biases, cultural norms, etc.
  • Second, collecting the demographic data needed to monitor such outcomes may itself be problematic. Tracking all protected categories for discrimination would require collecting sensitive data, potentially at odds with members’ expectations of privacy and with data security best practices (e.g., data minimization) that are subject to complex, overlapping privacy laws and regulations. Currently, LinkedIn does not use sensitive demographic data (as defined by GDPR, e.g., race, ethnicity, religion, political preferences, etc.) for our Recruiter product or for marketing services; members in some regions can opt in to provide limited demographic data for aggregate reporting purposes.
  • Finally, existing demographic categories may not map to or directly reflect all kinds of inequality. It is possible that we are overlooking many opportunities to improve our products if we look solely at the categories we explicitly monitor. We would like a way to identify, during the product testing process, functional inequalities that do not map to existing categories of users.

In summary, even if data on members’ demographic categories is available, it is not a panacea for identifying inequality impact. Even if a product may seem to have been designed in a “responsible” or “fair” manner based on assumptions of demographic parity, it can still drive a wedge between different groups of users. For instance, an app update that improves overall engagement but runs slowly on older mobile devices might dramatically affect members across many demographic categories in a manner that does not appear in a typical product A/B test. 

Traditional A/B testing looks at averages, focusing on an idealized “average user.” However, people may respond to new products in ways that a designer never intended. In order to be inclusive, we need to look beyond the average. The approach that we’ve developed, outlined below, instead empowers leaders to design products that are more inclusive and equitable, regardless of the causes of an underlying disparity. This helps to overcome the “average user” problem of traditional A/B testing. Building more equitable products is also good business, as making sure no one is inadvertently left behind is key to long-term growth.


A stylized example of traditional A/B testing assumptions vs. results that show inequality impact

An A/B testing approach to measuring fairness and inequality impact
For several years at LinkedIn, we have used a series of scalable experimentation platforms to analyze product changes, AI model revisions, and many of the business decisions at our company. Instead of characterizing a feature in a vacuum, experimentation measures the effect it has on real users. As we explored the implications of social network composition on economic outcomes, our team identified a unique opportunity to apply this methodology to LinkedIn products. We then started measuring the inequality impact of all new features and product changes (algorithms, UI tweaks, or infrastructure changes, for example) and flagging experiments that have a notably positive or negative inequality impact for users who are more or less well-connected on LinkedIn. Subsequently, we applied this method to every experiment that LinkedIn has conducted over the past year and analyzed those with the largest inequality-increasing or inequality-reducing effects.


An example of A/B testing within the product development lifecycle

Inequality measures using the Atkinson index
The Atkinson index is a standard measure of economic inequality that is useful in determining which end of the distribution contributed most to the observed inequality. It is often used by economists to compare income distributions. For example, the United Nations Development Programme leverages the Atkinson index to allow for meaningful comparisons of policy impacts across countries with widely differing income distributions, while the U.S. Census Bureau uses the index for income comparisons within American society.

The Atkinson index is defined as follows: given a sample x1,…,xn representing a metric (such as sessions, page views, or connections accepted) for a group of users indexed by i, and an inequality-aversion parameter ϵ, the index for ϵ≠1 is computed as:

$$A_\epsilon = 1 - \frac{1}{\mu}\left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,1-\epsilon}\right)^{\frac{1}{1-\epsilon}}, \qquad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Intuitively, for a fixed ϵ, a low value of Aϵ is obtained when all individuals have nearly the same metric value; when the values are exactly equal, the Atkinson index is zero. A higher value reflects more inequality. The ϵ parameter can be tuned to reflect the inequality preferences of a decision maker, as will be shown in the next section.
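
As a concrete illustration, here is a minimal NumPy sketch of the index as defined above. It is not LinkedIn’s production implementation: it assumes strictly positive metric values, and the ϵ=1 branch uses the standard geometric-mean limit, which the formula above does not cover.

```python
import numpy as np

def atkinson_index(x, epsilon=0.5):
    """Atkinson inequality index of a metric vector (sketch only).

    Assumes strictly positive values. Returns 0 when all values are
    equal; higher values indicate more inequality.
    """
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    if epsilon == 1.0:
        # epsilon = 1 limit: one minus the ratio of geometric to arithmetic mean.
        return 1.0 - np.exp(np.log(x).mean()) / mu
    return 1.0 - np.mean(x ** (1.0 - epsilon)) ** (1.0 / (1.0 - epsilon)) / mu

print(atkinson_index([5, 5, 5, 5]))      # 0.0: perfect equality
print(atkinson_index([20, 1, 1, 1, 1]))  # > 0: metric concentrated at the top
```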

The Atkinson index has several desirable properties for our application, namely:

  • It is zero only if all individual metrics are the same. This is helpful in comparing distributions to the “pure equality” baseline, as shown in the above figure.
  • It satisfies the population replication axiom. If we create a new population by replicating the existing population any number of times, the inequality remains the same. This is particularly helpful because our member base continues to increase, and the simple fact that the number of users is growing should not change the inequality index, unless actual inequality increases are happening at the same time.
  • It satisfies the principle of transfers. Any redistribution (i.e., reducing one user’s metrics to increase the metrics of another, lower-ranking user) results in a decrease of the Atkinson index, as long as this intervention does not change their rank in the distribution. This can be useful to specifically assess the impact of “redistribution” experiments, such as redistributing attention on the feed.
  • If all metrics are multiplied by a positive constant, the index remains the same. This is useful because it allows us to compare the inequality of distributions across different time horizons meaningfully. For example, if all users have the same number of sessions every day, the inequality will be the same, whether measured as daily, weekly, or monthly sessions.
  • It is scalable. Looking at the formula above, one can see that the Atkinson index depends on the data only through simple per-member aggregates, so it lends itself to distributed computation, using MapReduce or Spark, for example (see the sketch after this list).
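
Concretely, for ϵ≠1 the index reduces to two aggregations over the population: the mean of x and the mean of x^(1−ϵ). A hedged PySpark sketch, assuming a hypothetical member_metrics table with a strictly positive sessions column (the table and column names are illustrative, not LinkedIn’s actual schema):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("atkinson-example").getOrCreate()
df = spark.table("member_metrics")  # hypothetical table: one row per member

epsilon = 0.5
row = df.agg(
    F.mean("sessions").alias("mu"),                                  # arithmetic mean
    F.mean(F.pow(F.col("sessions"), 1.0 - epsilon)).alias("m_pow"),  # mean of x^(1 - eps)
).first()

atkinson = 1.0 - row["m_pow"] ** (1.0 / (1.0 - epsilon)) / row["mu"]
print(f"Atkinson index (epsilon={epsilon}): {atkinson:.4f}")
```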

Scaling this approach with existing design processes
As mentioned previously, experimentation is deeply embedded in LinkedIn’s decision-making processes and company culture. Every day, we run hundreds of A/B tests tracking thousands of metrics—everything from minor visual changes in an app to improvements to our AI-powered recommendation algorithms. Starting last year, we began tracking the inequality impact of our experiments on core business and member value metrics. Since adding this analysis to the typical A/B testing process, we have also created a special multidisciplinary team at the company that reviews notable experiments and invites their owners to a working session to discuss the impact.

In essence, along with asking “What would be the total number of sessions or contributions on our platform if feature A (vs. B) were rolled out?” we can also ask “If feature A were to be rolled out, what would be the share of contributions from the top 1% of members, in terms of engagement and contributions? Would inequality impact go up or down between our most and least engaged members?” After the meeting, experiment owners are encouraged to do a deep dive to better understand why the change may be having a disproportionate impact among different subgroups, and to share their learnings. The findings from these experiments have also gradually made their way into regular updates that are shared throughout the product and data science organizations at LinkedIn.
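
The “top 1% share” question above is straightforward to compute from a metric’s distribution. A minimal sketch, using synthetic heavy-tailed data in place of real member metrics:

```python
import numpy as np

def top_share(values, top_fraction=0.01):
    """Fraction of a metric's total accounted for by the top members."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    k = max(1, int(np.ceil(top_fraction * v.size)))
    return v[:k].sum() / v.sum()

# Synthetic "contributions" under two hypothetical feature variants.
rng = np.random.default_rng(0)
variant_a = rng.pareto(1.5, size=100_000)
variant_b = rng.pareto(2.5, size=100_000)
print(top_share(variant_a), top_share(variant_b))  # B is less concentrated
```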

Combining measures of inequality (drawn from the economics literature) with A/B testing gives us two distinct advantages.

  • First, instead of only measuring inequality impact, we can also trace it back to its causes: a specific set of features and product decisions. We can ask, “Are our products responsive to the needs of every LinkedIn member?” 
  • Second, unlike classical algorithmic fairness approaches, it helps us identify features that increase inequality impact without having to rely only on explicitly protected categories. A feature that increases the gap between any two groups of members is likely to be detected.
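
To make the treatment-control comparison concrete, here is a hedged sketch of how an experiment’s inequality impact could be estimated as the difference in Atkinson index between the two arms, with a bootstrap confidence interval. It reuses the atkinson_index function sketched earlier; the approach, and any flagging threshold built on it, is illustrative rather than LinkedIn’s actual criterion.

```python
import numpy as np

def inequality_impact(treatment, control, epsilon=0.5, n_boot=2000, seed=0):
    """Difference in Atkinson index (treatment minus control), with a
    95% bootstrap confidence interval. Negative values suggest an
    inequality-reducing feature. Assumes atkinson_index() from the
    earlier sketch is in scope.
    """
    rng = np.random.default_rng(seed)
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    point = atkinson_index(t, epsilon) - atkinson_index(c, epsilon)
    boot = np.array([
        atkinson_index(rng.choice(t, t.size), epsilon)
        - atkinson_index(rng.choice(c, c.size), epsilon)
        for _ in range(n_boot)
    ])
    low, high = np.percentile(boot, [2.5, 97.5])
    return point, (low, high)
```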

Example of an experiment: Instant job notifications

The effect that this methodological change has had at LinkedIn can be seen in our products. Earlier this year, we rolled out a new feature called “Instant Job Notifications” that sends a push notification to active job seekers as soon as a new, relevant job is posted. These kinds of features can be contentious, as we need to be selective about the kinds of events on our platform that trigger a notification. During A/B testing, our review team found that this new feature had a significant equalizing effect—it matched the right people to the right job, regardless of how new they were to the site, where they were in their career, or the relative strength of their online network.

We were able to identify two main reasons for this change. First, prior to this feature being rolled out, members with less well-connected networks (“social capital” in the image below, a measure of how many connections a member has and how likely those connections are to bridge different clusters of members) were not as likely to be referred to these jobs or to hear about them through their network. Second, sharing the job opportunity before it had received a high number of applicants made members more likely to apply; there is a wealth of research on how self-censorship and other socially normative behaviors can disproportionately impact some groups more than others.


An example feedback panel from the inequality impact testing process, using obfuscated data

Our analysis found that Instant Job Notifications had several positive effects for members. In terms of our core metrics, it increased the number of job applications we saw for a given posting and also increased the chance of an application receiving interaction from the prospective employer. Finally, the illustration above shows that engagement from members with “low social capital” also increased. 

Analyzing thousands of experiments: What we’ve learned so far

Fairness through experimentation should not only be about setting guardrails to detect new features that could be potentially damaging to our members; it should also be about finding inspiration to create a more inclusive product. Rather than imposing “top-down” fairness criteria or ex-ante requirements, we try to learn from our colleagues and engineers as they experiment on the features they are building. Through this, we are compiling a knowledge base of the types of interventions that seem to reduce inequality impact, so that it can serve as a guide for the development of our next products. Below, we share a few of the things we have learned so far.

  • Metric-neutral interventions are often not neutral for everyone: In many situations, teams may try to implement “neutral” interventions—for example, when performing a back-end infrastructure change, or when trying to “boost” or promote a specific product on the platform while making sure no other product suffers. The principal method of monitoring this neutrality is to look at a treatment-control comparison of the average impact (often called lift) in the experimentation platform. However, even when no average impact is detected, we found many instances where there was an inequality impact. In other words, metrics were not affected on average, but some members were. This makes inequality impact a critical aspect to monitor in such experiments, as “neutral” impact should mean neutral for everyone (see the toy simulation after this list).
  • Notifications are a powerful tool: Notifications have a strong impact on inequality of engagement, as they are a powerful tool to help orient less-engaged members towards useful features on the site. Prior work has also shown that our efforts to strategically batch notifications for highly engaged members resulted in a qualitatively better user experience for that group. In short, an inequality-aware design pattern may be to handle notifications differently for different groups of members, based on their engagement levels.
  • New member onboarding is extremely important: Making sure more people benefit from LinkedIn requires helping new members familiarize themselves with the platform and its value proposition. Providing a richer onboarding experience had a positive impact on average engagement, but also on equality of engagement, since it primarily helped members who were at the highest risk of dropping off or overlooking useful features on the site.
  • Site speed and availability matter to inclusiveness: We found that many interventions relating to site speed and reliability had a disproportionately positive impact on the least engaged members, and reduced inequality. This makes sense, as a suboptimal experience could lead to lower engagement, and members who only have access to slower devices and connections may also experience other structural inequalities that limit their opportunity. 
  • The low-bandwidth LinkedIn app: A corollary to the above point is that inclusive product design needs to account for members with slower devices or limited data plans. Several experiments on the low-bandwidth optimized LinkedIn app showed a strong positive impact, both on average engagement and on equality in engagement. Adding features that brought the low-bandwidth app experience closer to the experience of members using the default app or the desktop experience had positive inclusiveness effects.
  • On social network platforms, social capital matters for inclusiveness: Once an inequality impact (positive or negative) is found, we seek to understand it, in particular by asking whether we can identify two or more groups that are being affected differently by the experiment. We have repeatedly found that a member’s “social capital” often has an impact on how much value she can get out of a social network.
  • When it comes to inequality, unintended consequences are the norm: Throughout over a year of experiment review meetings and learning from experiment owners, we have often found that both negative and positive inequality impacts are unintended by product managers. Designers and product managers may often think about members in an idealized fashion: as a representative, average user that does not actually exist. This may pose inclusiveness challenges, as it runs the risk of leaving members who do not resemble the idealized average behind.
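
The first learning above is easy to reproduce in a toy simulation: an intervention that shifts a fixed amount of engagement from more-engaged to less-engaged members shows zero average lift while still changing the Atkinson index. A hedged sketch on synthetic data, reusing atkinson_index from the earlier example:

```python
import numpy as np

rng = np.random.default_rng(1)
control = rng.pareto(2.0, size=50_000) + 1.0  # synthetic sessions metric, all >= 1

# "Neutral" intervention: move 0.5 sessions from each member above the
# median to one below it, leaving the average exactly unchanged.
treatment = control.copy()
below = treatment < np.median(treatment)
treatment[below] += 0.5
treatment[~below] -= 0.5

print(treatment.mean() - control.mean())  # ~0.0: no detectable average lift
print(atkinson_index(control, 0.5))       # baseline inequality
print(atkinson_index(treatment, 0.5))     # lower: a real inequality impact
```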

A/B testing at scale for large-scale change

Our approach to fairness is to use A/B testing as a complement to, not a substitute for, other approaches to creating more equitable products that are already implemented at LinkedIn, such as evaluating bias in datasets, various approaches to eliminating biases in AI systems, and qualitative member research. The most important thing in any design process (AI modeling, product design, experimental design, etc.) is to properly define the problem you are trying to solve. To do this, we need to collaborate with experts across domains and look for new ideas outside of the typical best practices in the technology industry. Similarly, it takes a variety of approaches and viewpoints to understand how these systems will fully impact an end user. In the end, this requires an investment in tools, culture, and processes that get our engineers talking to their product/design partners—the domain knowledge experts.

As the world’s largest professional network, LinkedIn has a unique opportunity to help close opportunity gaps, such as the skills gap and the network gap. Since neither skills nor networks are the basis for legally protected categories, utilizing a new, inequality impact-based approach to A/B testing that helps us detect the unintended consequences of new products and features is just one of the ways that we are making this happen.

Using a number of LinkedIn examples, we have argued that instead of just looking at the average effect of decisions, leaders should also consider the inequality impact when designing products. For each potential innovation, this means asking: “Does this proposal increase or decrease inequality among our members?” or, more specifically, “Does this proposal benefit members with low social capital as much as it benefits well-connected members?” We hope that increased understanding of the underlying causes of inequality can lead to similar approaches to ethical product design across several different industries.

Read the full paper on arXiv.

Updated April 2, 2020 to add a reference to opt-in data on the personal settings page.