Bringing Project Every Member to life: Open sourcing our Spark inequality A/B testing library

Guillaume Saint-Jacques

Sr Engineering Manager, Apple

May 26, 2020

Co-authors: Guillaume Saint-Jacques, Stuart Ambler

Last month on this blog, we introduced how we are building inclusive products through A/B testing, something we internally call Project Every Member, and discussed the novel application of inequality A/B testing using the Atkinson index and the use of this technique in the product design process at LinkedIn. Today, we are excited to open source spark-inequality-impact, an Apache Spark library that can be used by other organizations in any domain where measuring and reducing inequality, or avoiding unintended consequences for inequality, may be desirable. This work furthers our commitment to closing the network gap and making sure everyone has a fair shot at finding and accessing opportunities, regardless of their background or connections. Economists, data scientists, and software engineers interested in learning more can access the code on GitHub.

In this post, we will briefly summarize how we have been using measures such as the Atkinson index to evaluate proposed product changes, describe the results from recent A/B tests that had some interesting equalizing effects, and discuss our future plans for work in this space. It's worth noting that some of the concepts we touch on below are described in more detail in our previous post, which also covers the underlying intent with which we use the word "inequality" in this post.

[Figure: A/B testing in the product development lifecycle]

A/B testing in product design

LinkedIn’s culture is data-driven. Almost any change or new feature on our platform is subjected to a series of testing and analysis processes to help ensure that it achieves our intended product goals and business objectives, and that it proves useful to our members.

To do this, we typically leverage A/B testing. While there are many interesting approaches to A/B testing, the typical best practice is to start by giving a preview of the change or feature to a few members for a limited time, and then measure the results.

As an example, imagine an A/B test that, for a limited time, provides members with the suggestion that they might explore new courses on LinkedIn Learning. Most members will not get this prompt in their feed, but a small set of randomly-selected LinkedIn members will, and we will collect aggregate data on whether they engage with the Learning product or not.

However, A/B testing traditionally measures averages: it only tells us whether the average engagement with the new feature we’re offering improves. Imagine, for instance, that the prompt is worded in a way that encourages only people who are already heavy users of LinkedIn Learning to engage with the product even more frequently, while members who have never heard of it become even less likely to engage than before. In that case, there might be an aggregate increase in our top-line business metric, but the product team might mistakenly assume that the change will lead to new members adopting the platform. Over time, these kinds of assumptions could lead to product designs that alienate the same members who stand to benefit the most from online learning.

Applying the Atkinson index and inequality A/B testing

We wanted to go beyond thinking about the average member and measure whether our features provided benefits for the members who needed them most. To do this, we turned to the Atkinson index, a standard measure of economic inequality that economists often use to compare income distributions. This index is useful in determining which end of the distribution contributed most to the observed inequality, and it allows us to encode other information about the population being measured into our analysis (e.g., whether product engagement among the people who use Learning the least is specifically impacted in a positive or negative manner by a given test).
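To make the index concrete, here is a minimal Python sketch of the standard textbook definition of the Atkinson index. It is not the code in spark-inequality-impact: the function name atkinson_index and the restriction to strictly positive values are assumptions made for illustration. The parameter epsilon is the inequality-aversion parameter; larger values put more weight on the lower end of the distribution.

```python
import math
from typing import Sequence


def atkinson_index(values: Sequence[float], epsilon: float = 1.0) -> float:
    """Atkinson index of `values` for inequality aversion `epsilon` >= 0.

    Illustrative only: assumes a non-empty sequence of strictly positive values.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    if len(values) == 0 or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")

    n = len(values)
    mean = sum(values) / n

    if math.isclose(epsilon, 1.0):
        # Limit case epsilon = 1: the geometric mean of the values.
        equivalent = math.exp(sum(math.log(v) for v in values) / n)
    else:
        # General case: the power mean with exponent (1 - epsilon).
        power_mean = sum(v ** (1.0 - epsilon) for v in values) / n
        equivalent = power_mean ** (1.0 / (1.0 - epsilon))

    # 0 means perfect equality; values closer to 1 mean more inequality.
    return 1.0 - equivalent / mean


# Hypothetical engagement counts for two small groups with the same mean.
equal_group = [5.0, 5.0, 5.0, 5.0]
unequal_group = [1.0, 1.0, 1.0, 17.0]

index_equal = atkinson_index(equal_group)      # approximately 0
index_unequal = atkinson_index(unequal_group)  # strictly between 0 and 1
```

Both groups have the same average, so a comparison of means cannot tell them apart, while the index separates them clearly.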

For a more thorough formulation and analysis of the Atkinson index, including a discussion of why it lends itself to large-scale A/B testing scenarios such as product testing at LinkedIn, please see our prior blog post and our paper.

Atkinson on Spark
We decided to implement Atkinson index computations using Apache Spark due to scalability considerations. In particular, we needed to scale with respect to:

The size of the data over which to compute inequality (i.e., the number of individuals who are part of specific A/B tests)
The number of times inequality needs to be computed (for each experiment, each feature needs to be compared against every other feature, so we need to measure inequality separately for each feature in each population segment).

While inequality metrics can already be computed in R and Python, the existing implementations typically require all the data to fit in memory on a single machine.

We are releasing a package that leverages the fact that the Atkinson index can be decomposed as a sum, which means the data does not need to be held in memory all at once. We then use it as part of a larger pipeline that applies it to many A/B tests at once, as pictured below.

[Figure: Diagram showing the Spark package leveraging the Atkinson index]
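The decomposition into sums can be sketched in plain Python. The idea is that each chunk of data (for example, one partition) only needs to contribute three running totals, and those totals can be added together in any order before the index is computed. The names AtkinsonPartial, summarize, and finalize are hypothetical and are not the spark-inequality-impact API; the sketch also assumes strictly positive values.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AtkinsonPartial:
    """Hypothetical mergeable summary of one chunk of data."""
    count: int = 0
    total: float = 0.0        # sum of x
    power_total: float = 0.0  # sum of x**(1 - epsilon), or of ln(x) if epsilon == 1

    def merge(self, other: "AtkinsonPartial") -> "AtkinsonPartial":
        # Merging is plain addition, so chunks can be combined in any order.
        return AtkinsonPartial(
            self.count + other.count,
            self.total + other.total,
            self.power_total + other.power_total,
        )


def summarize(chunk: Iterable[float], epsilon: float) -> AtkinsonPartial:
    """Stream over one chunk, keeping only three running totals."""
    count, total, power_total = 0, 0.0, 0.0
    for x in chunk:
        count += 1
        total += x
        power_total += math.log(x) if epsilon == 1.0 else x ** (1.0 - epsilon)
    return AtkinsonPartial(count, total, power_total)


def finalize(partial: AtkinsonPartial, epsilon: float) -> float:
    """Turn the merged totals into the Atkinson index."""
    mean = partial.total / partial.count
    if epsilon == 1.0:
        equivalent = math.exp(partial.power_total / partial.count)
    else:
        equivalent = (partial.power_total / partial.count) ** (1.0 / (1.0 - epsilon))
    return 1.0 - equivalent / mean


# Three chunks stand in for data spread across machines (hypothetical values).
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
epsilon = 0.5

merged = AtkinsonPartial()
for chunk in chunks:
    merged = merged.merge(summarize(chunk, epsilon))
index_from_chunks = finalize(merged, epsilon)

# Same data processed as a single chunk, for comparison.
all_values = [x for chunk in chunks for x in chunk]
index_single_pass = finalize(summarize(all_values, epsilon), epsilon)
```

The two results agree up to floating-point rounding. Because the summary is a small fixed-size record that merges by addition, the same pattern fits a per-group aggregation over many experiments, metrics, and segments.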

Recent inequality A/B testing results

Our previous post discussed how new LinkedIn features and products, like Instant Job Notifications, were identified as reducing inequality in areas like engagement and job applications. Among other results, we found:

Metric-neutral interventions are often not neutral for everyone
In many situations, teams may try to implement metric-neutral product changes: for example, when performing a back-end infrastructure change, or when trying to “boost” or promote a specific product on the platform while making sure no other product suffers. The principal method of monitoring this neutrality is looking at a treatment-control comparison of the average impact (often called lift) in the A/B testing platform. However, even if no average impact is detected, there is often an inequality impact. In other words, metrics are not affected on average, but some users are. This makes inequality a critical aspect to monitor in such tests, as a “neutral” result should mean neutral for everyone. We have seen examples of this while trying to promote a specific feature: it had no negative impact on other elements of the site on average, but increased engagement inequality, and was eventually canceled.
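A toy example shows how a test can be neutral on average and still shift inequality. The numbers below are made up for illustration and are not LinkedIn data; the helper atkinson is a hypothetical, minimal implementation for epsilon = 1 that assumes strictly positive values. A real analysis would also need a statistical test on both differences, which this sketch leaves out.

```python
import math
from statistics import fmean


def atkinson(values, epsilon=1.0):
    """Minimal Atkinson index for strictly positive values (illustrative)."""
    mean = fmean(values)
    if epsilon == 1.0:
        equivalent = math.exp(fmean(math.log(v) for v in values))
    else:
        equivalent = fmean(v ** (1.0 - epsilon) for v in values) ** (1.0 / (1.0 - epsilon))
    return 1.0 - equivalent / mean


# Made-up per-member engagement in each variant; both variants average 5.0.
control = [2.0, 4.0, 6.0, 8.0]
treatment = [1.0, 2.0, 6.0, 11.0]

# Lift: relative change in the average metric, treatment versus control.
lift = fmean(treatment) / fmean(control) - 1.0

# Inequality impact: change in the Atkinson index, treatment versus control.
inequality_change = atkinson(treatment) - atkinson(control)

print(f"lift: {lift:+.1%}")
print(f"change in Atkinson index: {inequality_change:+.3f}")
```

Here the lift is zero, yet the Atkinson index is higher in treatment: the least-engaged members lost engagement while the most-engaged member gained it.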

Notifications are a powerful tool
Notifications have a strong impact on inequality of engagement, as they are a powerful tool to bring the least-engaged members to the site. Conversely, reducing notifications to the most-engaged members can improve their overall site experience while also reducing engagement inequality across the site.

New member onboarding
Making sure more people benefit from LinkedIn requires adding new members, and helping them familiarize themselves with the platform and its value proposition for their long-term career growth. To that end, the quality of the onboarding process is of paramount importance, as new members historically have a high chance of dropping off a new service or platform in the first few days. We analyzed an A/B test that tried to accommodate new, onboarding members by assisting them through the use of nudges and notifications. Before that experiment, many new members did not receive any notifications. In this experiment, new members now received 1-2 push notifications during their first week, which encouraged them to take key actions to build their network (or similar actions that could help them find new opportunities). This had a positive impact on average engagement, but also on inequality of engagement, since it primarily helped members who were at the highest risk of dropping off the site.

Site speed and availability matter to inclusiveness
We found that many interventions relating to site speed and reliability had a disproportionately positive impact on the least engaged members, and reduced inequality. This makes sense, as those may be members with slower devices and connections.

The low-bandwidth LinkedIn app
Several A/B experiments on the low-bandwidth LinkedIn app showed a strong positive impact both on average engagement and on inequality in engagement. The LinkedIn Lite app targets members with slower network speeds or slower devices. Enabling or adding features that brought the low-bandwidth experience closer to the main experience (for example, enabling hashtags in the feed) had positive inclusiveness effects.

On social network platforms, social capital matters for inclusiveness
Once an inequality impact (positive or negative) is found, we seek to understand it, in particular by asking whether we can identify two or more groups that are being affected differently by a proposed new feature or product. Note that we care about inequality whether or not it can be summarized as a differential impact on different groups, but identifying groups helps with interpretation. Using this inequality measurement technique as a detection mechanism, and then proceeding to a deep-dive, we have repeatedly found that a member's network strength (i.e., their social capital, how well-connected they are) often has an impact on how much value a member can get out of a social network.

Both positive and negative inequality impacts are often unintended
Over more than a year of product review meetings and of learning from product owners, we often found that inequality impacts, including the inequality-reducing benefits we surfaced, were unintended. Designers may often think about users in an idealized fashion: as a representative, average user who does not actually exist. This may pose inclusiveness challenges, as it runs the risk of leaving behind users who do not resemble the idealized average. Again, looking at inequality impacts helps emphasize to product designers and leads that members are diverse in many ways, and that no one is “average.”

LinkedIn Learning’s benefits are broad-based
Two recent initiatives to increase awareness of LinkedIn Learning triggered a surprisingly large reduction in engagement inequality: while engagement with LinkedIn Learning went up overall, people with previously low engagement were the ones whose engagement increased the most.

Future work

In our prior blog post, we mentioned that A/B testing is a complement to, not a substitute for, the other approaches to creating more equitable products that are already implemented at LinkedIn, such as evaluating bias in datasets, various approaches to eliminating biases in AI systems, and qualitative member research.

In an upcoming series of posts, we plan to share more about the design of our work in these areas, as well as details about how we visualize and share the results of inequality A/B tests.

We would like to thank Igor Perisic, Nicole Li, James Sorenson, Patrick Driscoll, Parvez Ahammad, Sofus Macskassy, and Ya Xu.
