A/B testing at LinkedIn: Assigning variants at scale

December 16, 2020

Co-authors: Alexander Ivaniuk and Weitao Duan

Editor’s note: This blog post is the second in a series providing an overview and history of LinkedIn’s experimentation platform. The previous post on the history of LinkedIn’s experimentation infrastructure can be found here.

Introducing variant assignment

Previously on the blog, we’ve shared a look into how experimentation works at LinkedIn with an overview of our experimentation platform, “T-REX,” and a deep dive into the architecture of the Lix engine. However, no matter how well-designed an infrastructure is, the ability to scale depends on an efficient and scientifically-valid variant assignment method. In this post, we will discuss a practical approach of making variant assignment blazing fast to handle trillions of invocations per day, as well as invariants and tricky parts of using such methods for online experiments.

The validity of A/B testing is rooted in the assumption that each variant is assigned to a random member. That’s why it’s critical to find a good randomization algorithm. There are three main characteristics of a good variant assignment function:

Assignment of variants to members happens according to the desired split (i.e., no sample size ratio mismatch)
Variant assignment of a single member is deterministic (i.e., repeatable)
No correlation between the assignments when there are multiple experiments running

In particular, the third characteristic is critical because it suggests that a member’s assignment in one experiment has no effect on the probability of being assigned to a variant in other experiments. This is important because we want to analyze each experiment separately. We will talk more about this characteristic when we go through an example below.

General approach

Suppose we have a test T, its population P, and variants V₀, V₁, V₂, …, V_N-1, where each of the variants is assigned a fraction of P equal to A₀, A₁, …, A_N-1, and A₀ + A₁ + … + A_N-1 = 1. Let’s assume that P is a member population and that all members are assigned with integer identifiers.

Let’s represent the variants in a one-dimensional coordinate system on a line segment [0, 1). All the variants’ subpopulations can be represented as subsegments within the [0, 1) segment: V₀’s population will correspond to [0, A₀), V₁‘s population to [A₀, A₀ + A₁), …, V_N-1‘s population to [A₀ + A₁ + … + A_N-2, 1]. The problem of assigning a variant to a member is now reduced to selecting a point on the line segment [0, 1).

In order to assign a variant, the following algorithm can be used:

Project the entire test’s population onto [0, 1)
Shuffle it in a pseudo-random way along the segment
Find a variant subsegment that owns the shuffled projection point of a member and return the corresponding variant

In a more concrete example, let’s assume the following:

Population P is 100,000 members large.
The experiment of interest has 3 variants, A, B, C, that are assigned to 20%, 40%, and 40% of P, respectively.
We have members with IDs, as shown in the picture below, to perform variant assignment.

For the purpose of this explanation, we are interested in how member ID #8000 gets his or her treatment:

Member ID #8000 is mapped to the point 0.08 as a result of computing <MEMBER_ID> / <NUM_MEMBERS>
Then, we apply a random transformation to point 0.08 and project it into point 0.848.
We check the variant allocation axis and see that members with ids mapped to points with coordinates between 0.6 and 1 get variant C, so member ID #8000 also is assigned with variant C.

graph-showing-variant-mapping-for-member-poopulation

Projection, shuffling, and final variant mapping for the member population

Assignment methods

There are two different ways of performing the variant assignment in practice. We will describe such methods based on the design described in our previous blog post of T-Rex, which enables an experiment to target a subpopulation of Linkedin.com based on member attributes.

Use a random number generator (RNG). The objective here is to randomly select a point in [0, 1) and then assign a corresponding variant to a member. Then the (test, variant, member) tuple is stored and retrieved from the store on all subsequent evaluations.

graph-showing-random-number-generator-based-variant-assignment

RNG-based variant assignment

Use hashing. The variant can also be retrieved via hashing. Such a method requires the defining of a family of hash functions HASH(test_id): memberId -> [0, 1) where:

each of the functions maps a space of member ids uniformly into [0, 1),
“salt” is a constant, unique value chosen for each A/B test (in the above example, the test_id),
any two distinct functions from the family should generate independent distributions of the projection points.

Such a family of hash functions could be created from a common crypto digest function FCrypt: byte_array -> [0, F_MAX], where F_MAX is the largest value returned by the function and [0, F_MAX) is an integer range. This entails the following steps:

Remap the function into [0, 1) by applying HASH(bytes) = F(bytes) / F_MAX.
Encode “salt” as a fixed-width byte prefix (e.g., a 4-byte prefix to make sure that it will be able to fit all possible values of salt).
Encode member_id as a variable-size byte array.
Then HASH(salt)(member_id) = FCrypt(concat(prefix(salt, 4), bytes(member_id)).

Hash-based variant assignment

Choosing the best method

The hash-based variant assignment approach is one of the cornerstones of LinkedIn’s experimentation system. Consider this: the RNG approach requires experiment operators to execute a remote call to a cache or an “online” data store every time a test variant is evaluated. Such an online variant assignment store can evolve into a critical dependency and bottleneck for the entire experimentation platform, which means that availability, consistency, and speed of variant assignment will be limited by the parameters of the data store. Taking into consideration the CAP theorem, the experimentation platform would not be able to achieve repeatability of the variant assignment’s result, which requires strong consistency, high availability, and network partition tolerance.

Let’s look at a possible topology of such a solution below. Such a system may experience problems in one of the following scenarios:

When the assignment service, the cache, or the store are slow or down.
When there is a network partition event between data centers or the link that is slow, thus either hurting the availability of the system or impacting the replayability of the assignments
When there is a network partition event inside the data center

Infrastructure topology for the RNG assignment method

On the other hand, the hash-based assignment approach is deterministic and enables us to use an advanced form of caching with about 99.98% requests being evaluated in application instances, and only a mere 0.02% of evaluations resulting in network requests to the experimentation backend using the following topology:

infrastructure-topology-for-the-hash-based-assignment

Such a great locality of evaluations is made possible by caching the experiment definitions within each service running A/B tests and being able to process variant evaluations rules in the experimentation engine, which is a part of the A/B client library:

graph-showing-how-the-infrastructure-from-external-request-to-experimentation-backend

Referencing back to the CAP theorem, our infrastructure is able to tolerate downtime of the experimentation backend, network partition events between or inside data centers, and any slowness of the experimentation backend.

At our scale of up to 35 trillion evaluations and 41,000 A/B tests, the difference between the assignment methods is substantial:

table
Metric/Approach	RNG (Estimated)	Hash-based
Assignment storage write QPS	1,770,000 - 25,000,000	50,000 Kafka write QPS for select assignment data
Assignment storage disk read QPS	26,000,000	Up to 80,000
Cache read Network QPS	240,000,000-260,000,000	Up to 1,000,000
Online assignment data storage size	26-390 TB/day for assignment data	200 GB for member attributes data, total
Offline data storage	26-390 TB/day for assignment data	200 GB/day for select assignment data and 200 GB/day for member attributes
Complexity of evaluations	O(num_tests * num_members)	O(num_members) Local evaluation of all tests for a member is significantly faster than a remote call and can be treated as a constant

In comparison, the RNG method would have required us to:

Handle at least 240x QPS with a distributed cache than what we handle now in the experimentation backend.
Have at least 3900x capacity in the online storage system for variant assignment, assuming a 30-day retention period for the RNG method’s data.
Handle a much higher disk read/write pressure.

Recording or replaying variant assignment

In some cases, we continue to record variant assignment data. Services with A/B tests send data to Kafka that then gets transferred to Hadoop (depicted in the diagram below). This is done to carry attributions of members’ actions to certain experiments and variants during batch A/B report generation, which is considerably less sensitive to both the latency of data synchronization and the availability of the assignment data store.

In some cases, certain tests may result in trillions of variant assignments per day, so we offer a way for these respective test owners to optimize their Kafka resource footprint: a test owner may tell the experimentation platform to assume that the experiments’ population comprises either all active members of the site or all the registered members. In that case, the experimentation client will not send experiment assignment data and the platform will approximate the variant assignment on Hadoop.

The assignment data flow: storage, ETL, and offline approximation

Independence of variant assignment

When there are multiple experiments running, we want to be able to analyze each experiment separately. This can only be done based on the assumption that the assignments in all the experiments are independent. Imagine we have two concurrent experiments: Experiment 1 tests two relevance models for LinkedIn feed and Experiment 2 tests two font sizes. Let Z₁ and Z₂ denote the assignment of a member in Experiment 1 and 2 respectively. In mathematical formulation, the third characteristic can be written as:

Prob(Z₁= z₁) = Prob(Z₁ = z₁| Z₂ = z₂), where z₁and z₂ are the possible variants in Experiment 1 and 2.

We call Experiment 1 and 2 orthogonal because the assignments are independent. If Experiment 1 and 2 were both set up with a 50-50 split between the two variants, members assigned to Model 1 have an equal chance of getting the two font sizes, and vice versa. Assume the baseline (Model 2 and Font 2) has 8 page views per member, Model 1 increases the average page views per member by 5 compared to Model 2, and Font Size 1 increases by 10 compared to Font Size 2. Let’s also assume there is no interaction between the two tests. If we analyze Experiment 1 separately, the estimated effect of Model 1 vs. Model 2 is estimated below:

formula-comparing-estimated-effects-of-model-one-versus-model-two

In other words, we can ignore that Experiment 2 is actually running and proceed to analyze Experiment 1 only.

At LinkedIn, we apply MD5 hash function to the combination of member id and experiment identifier. MD5 hash function has also been tested in Chi-squared tests to satisfy the independent assignment characteristic.

Interactions between the variants

In the previous example, we assume there is no interaction without properly defining it. A statistical interaction between two variants A and B exists if their combined effect is not the same as the sum of two individual treatment effects. In the previous example, the combined effect of Model 1 and Font Size 1 could be 17, instead of the sum 15 because the combination may work better than applying individual variants. There are also examples of negative or harmful interactions. For example, let’s say we have two tests, one of which controls the font color (blue vs. black), while the other controls the background color (blue vs. white). While blue font color may work better than black on the white background, it renders the text illegible on the blue background. We should also note that it is impossible to completely prevent interaction when operating in a large-scale experiment system. Teams working on different components of a larger product may be unaware of features tested by other teams and may not check for interactions.

To detect the interaction, we provide a tool that can detect pairwise interactions between any two experiments. Three-way or higher-order interactions, although possible, are extremely rare. Our empirical experience suggests that even the pairwise interactions are quite rare, especially between experiments from two distinct product pillars (e.g., messaging and advertising). When interactions do show up, they are often in a smaller magnitude compared to the mean effects.

Sample size ratio mismatches

Sample size ratio mismatch (SSRM) happens when the observed sample size ratio between two variants does not follow the expected ratio. For example, we may expect a 50-50 split, but observe 100k members in the treatment group and 110k members in the control group. This usually indicates some underlying issues with the experiment setup. The reports from such experiments become untrustworthy and any decisions based on them must be treated as misleading to an extent.

At LinkedIn, we leverage the Chi-squared goodness-of-fit test to monitor for ratio mismatch. For simplicity, we assume there are only two variants in the experiment, n_t and n_c are the observed sample sizes for the treatment and control group, and the expected treatment ramp ratio is r_t. The expected sample sizes for the treatment and control group are E_t = (n_t + n_c) r_t and E_c = (n_t + n_c)(1 - r_t) respectively. Under the null hypothesis that sample sizes follow the expected split, the Chi-squared statistics

chi-squared-statistics-for-E-t-and-E-c-respectively

has a χ²₁ distribution. When the observed Chi-squared statistics is too extreme, we conclude the observed sample size does not follow the expected ratio. Note that when SSRM is detected, it does not tell us where the problem is—further investigation is needed to understand the root cause.

At LinkedIn, when SSRM is detected for an experiment, we hide the report by default and show a highly visible warning on our UI.

Sample notifications when SSRM is detected

Without knowing the underlying cause, SSRM can be difficult to adjust because A/B tests operate on the assumption that members are assigned to variants randomly according to the split. This makes it critical to not only diagnose the root cause of SSRM, but also apply the corresponding fix. Here we summarize a few typical causes for SSRM.

Dynamic targeting: Targeting refers to running experiments on a specific member set according to their properties and activities. While most targeting criteria are static for our members (such as country or industry), we also use more dynamic targeting criteria that can change over time. When we run experiments with dynamic targeting criteria, we need to consider whether a treatment itself would interact with the targeting criteria and hence would result in members switching in or out of a targeting criteria at different rates in different variants. For example, let’s imagine that a marketing team wants to send emails to inactive members and test two subject lines in an A/B test. It turns out that one subject line performs better and brings more members back to the site. As the team continues to ramp up the winning subject line, the team observes SSRM. This is due to the fact that there were fewer dormant members in the winner variant because many had already converted to become active again. The correct analysis approach is to use pre-experiment member properties and fix the targeting cohort in the entire ramping process.
Residual effect: The residual effect usually refers to the contamination of a former experiment to a subsequent one on the member split. When the treatment effect is large, it may change the frequency at which members visit. As a result, the residual effect may lead to mismatched sample sizes. One such example was with our People You May Know relevance algorithm improvement. In the first ramp, the experiment showed promising results on engagement across the board. However, starting from the second ramp, sample size ratio mismatch started to occur. It turned out that this was purely due to the fact that the new algorithm was so good that it made members more active and come back more often. Such bias can be corrected with re-randomization and the inclusion of members from the very beginning of the experiment in the analysis. However, from our experience, it is rare that treatment itself would be impactful enough to change member re-trigger rate.
Biased code implementation: Biased code implementation can create similar symptoms as if there is a residual effect. For example, when we re-designed the LinkedIn homepage, the variant evaluation code was implemented in two places: 1) when members hit the router by typing ‘linkedin.com’ and 2) when members enter the new homepage directly with designated URLs. These URLs can only be accessed by members in the new homepage group (treatment group). The second code call was needed so that if the experiment was terminated, members would lose access to the new homepage with the URLs. After running the experiment, we saw more members than expected in the treatment group after the first ramp whereas including members from the first experiment would resolve the bias. Upon further investigation, we discovered that some members entered the site through the designated URLs exclusively as they had saved those URLs in bookmarks. Because only those in the treatment group would be evaluated in such cases, the treatment group had a higher members count than expected.

Because diagnosing SSRM can be time-consuming for the operators of the experiment, we created an automatic tool to run a series of diagnoses and summarize the results and potential root cause. Teams at LinkedIn can receive SSRM diagnostic reports on our experimentation platform with just a few clicks. Because the tool is self-service, it eliminates the need for the platform team to be involved in every SSRM root cause analysis. This greatly speeds up the diagnostic process and helps unblock the ongoing experiments.

An experimentation platform is a complex system with many moving parts, and variant assignment at scale is the foundation for scientific experimentation. By sharing key considerations in how we at LinkedIn have implemented the mathematical principles of A/B testing into a robust infrastructure, we hope it will help others in their own experimentation journeys.

References

[1] Chen, N., Liu, M., & Xu, Y. (2019, January). How A/B Tests Could Go Wrong: Automatic Diagnosis of Invalid Online Experiments. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (pp. 501-509).

[2] Xu, Y., Chen, N., Fernandez, A., Sinno, O., & Bhasin, A. (2015, August). From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 2227-2236).

Topics: A/B Testing/Experimentation Scalability Infrastructure