Introduction: Technical Paper on LinkedIn's A/B Testing Platform
October 1, 2015
XLNT is the end-to-end A/B testing platform used at LinkedIn. It serves not only the day-to-day A/B testing needs across the company, but also the more sophisticated use cases that are prevalent in a social network setting. With all the lessons learned from using the platform, we decided to take an in-depth look at how we approach A/B testing and write a technical paper. This post is based on our paper and shares how we built the platform, dealt with some challenging scenarios and fostered a strong experimental culture.
The XLNT Platform
XLNT was designed to encompass each of the three steps of the testing process: design, deploy and analyze.
A highlight of our design capability is flexible targeting. Not only does the platform provide 40+ built-in member attributes stored in Voldemort for experimenters to leverage, it also allows external attributes to be onboarded seamlessly and provides an integrated way for real-time attributes available only in a runtime request to be used.
In the deploy stage, we have a straightforward two-step process to implement an experiment in the application layer and have enabled centralized service configuration that is totally independent of application code release, leveraging Rest.li and Databus.
Finally, analyzing experiments is fully automated, with the pipeline consuming more than 10TB of data and producing more than 150 million summary records stored in Pinot on a daily basis. This is a large scale join and aggregate process enabled by the Cubert framework, which consumes application code logs ETLed to our HDFS clusters from Kafka topics and data for 1000+ engagement metrics, preprocessed by an independent pipeline. A highlight of our analysis pipeline is its ability to enable multi-dimensional analysis in certain scenarios for experimenters to dig deeper and get more actionable insights.
Beyond the Basics
We face several challenging A/B testing scenarios at LinkedIn, some of which are specific to experimentation on social networks.
In an organization running hundreds of experiments daily, interactions pose a serious threat to experiment trustworthiness. We use XLNT to address the three most common concerns and use cases related to interactions between experiments. While experiments are fully overlapping and orthogonal by default, XLNT offers simple ways to split traffic so that experiments run disjointly, to analyze interactions in a full factorial fashion before analyzing each factor separately, and to support fractional factorial designs, where only certain combinations from different factors are implemented and analyzed.
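To make the difference between overlapping and disjoint experiments concrete, here is a minimal sketch of hash-based assignment. It is not XLNT's implementation: the function names, the salting scheme, and the "layer" concept are assumptions chosen for illustration. Giving each experiment its own hash salt makes assignments approximately independent across experiments, while sharing one salt and carving up the bucket range keeps experiments disjoint.

```python
import hashlib


def bucket(unit_id: str, salt: str, buckets: int = 100) -> int:
    """Map a unit to a stable bucket in [0, buckets) using a salted hash."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def assign(unit_id: str, experiment: str, treatment_pct: int) -> str:
    """Overlapping, orthogonal case (hypothetical): each experiment hashes
    with its own salt, so a unit's variant in one experiment says nothing
    about its variant in another."""
    return "treatment" if bucket(unit_id, experiment) < treatment_pct else "control"


def assign_disjoint(unit_id: str, layer: str, ranges: dict):
    """Disjoint case (hypothetical): experiments in one layer share a salt
    and own non-overlapping bucket ranges, so a unit lands in at most one."""
    b = bucket(unit_id, layer)
    for name, (lo, hi) in ranges.items():
        if lo <= b < hi:
            return name
    return None  # unit is in none of the layer's experiments


# Two overlapping experiments: the pair of variants forms a full factorial cell.
cell = (assign("member-42", "exp-A", 50), assign("member-42", "exp-B", 50))

# Two disjoint experiments sharing one layer of traffic.
chosen = assign_disjoint("member-42", "layer-1", {"exp-C": (0, 30), "exp-D": (30, 60)})
```

Crossing the two independent assignments yields the four cells of a full factorial design; a fractional factorial design would implement and analyze only a subset of those cells.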
We have enabled testing on guests (based on browser IDs) as well as on other units. A challenge we have resolved is to serve a unified experience for users switching between member and guest status, while ensuring we have measurement for both. An even more interesting problem arises when there are different experimental units within the same entity type. This happens particularly in a social network setting, where the same user can play two different roles, each of which needs to be tested separately. In the paper, we highlight this problem with an example based on a “viewer/viewee” experiment and describe the bias-variance tradeoff.
Offline experiments are integrated into XLNT as well. The challenge here is to avoid selection bias when we run email experiments, experiments coupled with email campaigns, and cohort experiments. We can’t simply use active members as the population set for email experiments. We also have to correct for or avoid bias if we want to analyze the effect of the experiments and email campaigns together or separately. When running cohort analysis, there is a subtlety in dynamically updating the cohort selection during the experiment when the selection criteria and the experiment outcome are not independent.
When it comes to network A/B testing, we can’t assume that a user’s response is independent of the treatment assignment of others. Our solution is based on a sampling and estimation framework. In the sampling stage, we partition users into clusters and randomize at the cluster level. In the estimation stage, we use more sophisticated estimators, which are described in the paper. The network A/B tests we have run at LinkedIn based on this sampling and estimation framework have indicated strong network effects.
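The sampling stage can be sketched as follows. This is an illustrative sketch only: it assumes the clusters have already been computed by some graph-partitioning step (represented here by a hypothetical `user_to_cluster` mapping), and it does not attempt the estimation stage, whose estimators are described in the paper.

```python
import random


def randomize_clusters(user_to_cluster, treatment_fraction=0.5, seed=0):
    """Assign whole clusters to treatment or control, then propagate the
    cluster's variant to every user in it (hypothetical helper)."""
    rng = random.Random(seed)
    clusters = sorted(set(user_to_cluster.values()))
    rng.shuffle(clusters)
    n_treated = round(len(clusters) * treatment_fraction)
    treated = set(clusters[:n_treated])
    return {
        user: "treatment" if cluster in treated else "control"
        for user, cluster in user_to_cluster.items()
    }


# Hypothetical partition of eight users into four clusters.
user_to_cluster = {
    "u1": "c1", "u2": "c1",
    "u3": "c2", "u4": "c2",
    "u5": "c3", "u6": "c3",
    "u7": "c4", "u8": "c4",
}
assignment = randomize_clusters(user_to_cluster)
```

Because connected users tend to fall in the same cluster, most of a user's neighbors receive the same variant, which is what makes network effects measurable.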
Fostering an Experimental Culture
There are several XLNT features and concepts we introduced at LinkedIn to enable us to take education and evangelization past the "classroom".
We integrated experiment reports with business reporting by using a unified metric definition across the entire organization. This provides the foundation that enables other organizations such as Finance to bake A/B test results into business forecasting.
We also introduced site wide impact, a concept that allows us to provide not only a directional signal, but also the size of the global lift that will occur when the winning treatment is ramped to 100 percent. We designed this feature so that site wide impact can be computed from readily available summary statistics, without having to double our computation effort. A paradox is that for ratio metrics like “CTR”, local and site wide impact can disagree directionally.
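The CTR paradox is easiest to see with numbers. The figures below are invented for illustration and are not LinkedIn data, and the calculation is a plain ratio comparison rather than the summary-statistic formula from the paper. The experiment targets one segment whose CTR is lower than the rest of the site: the treatment raises CTR inside the segment, but it also doubles the segment's impressions, so the low-CTR segment carries more weight in the site-wide ratio.

```python
def ctr(clicks, impressions):
    """Click-through rate as a plain ratio."""
    return clicks / impressions


# Hypothetical (clicks, impressions) totals, scaled as if each variant
# were serving 100 percent of the targeted segment.
segment_control = (100, 1000)    # segment CTR 10%
segment_treatment = (220, 2000)  # segment CTR 11%, with twice the impressions
rest_of_site = (3000, 10000)     # untargeted traffic, CTR 30%


def site_ctr(segment):
    """Site-wide CTR when the given segment totals are combined with the rest."""
    return ctr(segment[0] + rest_of_site[0], segment[1] + rest_of_site[1])


local_lift = ctr(*segment_treatment) / ctr(*segment_control) - 1
site_lift = site_ctr(segment_treatment) / site_ctr(segment_control) - 1

print(f"local lift: {local_lift:+.1%}")     # about +10%
print(f"site wide lift: {site_lift:+.1%}")  # about -4.8%
```

For additive metrics such as total clicks this reversal cannot happen, since clicks rise both locally and globally in this example; it is the changing denominator that flips the sign for a ratio metric.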
As an effort to simplify multiple testing, we introduced a simple two-step rule of thumb for experimenters to follow that is mathematically equivalent to a Bayesian interpretation of people’s prior belief on whether a metric would be impacted.
To drive greater transparency regarding experiment launch decisions, we launched Most Impactful Experiments, a tool we built to bubble up notable impacts among all experiments for each product metric. We use a three-step algorithm to control false discovery. A couple of key lessons we learned from building the feature are shared in the paper.
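The three-step algorithm itself is described in the paper and is not reproduced here. As a stand-in that shows what controlling false discovery means when scanning many experiments for one metric, the sketch below uses the Benjamini-Hochberg procedure, a standard technique that is not claimed to be what Most Impactful Experiments uses. The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha.

    Step-up rule: find the largest rank k (1-based, in ascending p-value
    order) with p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Hypothetical p-values for one metric across five experiments.
p_values = [0.001, 0.045, 0.035, 0.2, 0.008]
notable = benjamini_hochberg(p_values)  # indices of experiments to surface
```

With these inputs only the experiments at indices 0 and 4 are surfaced. The experiments with p-values 0.035 and 0.045 would pass a naive 0.05 threshold, but they fall above their rank-adjusted thresholds.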
For a more in-depth look at LinkedIn’s A/B testing strategy and technology, read the entire paper.