Our evolution towards T-REX: The prehistory of experimentation infrastructure at LinkedIn
September 24, 2020
Editor’s note: This blog post is the first in a series providing an overview and history of LinkedIn’s experimentation platform.
At any given time, LinkedIn’s experimentation platform is serving up to 41,000 A/B tests simultaneously on a user population of over 700 million members. Operation at this scale is enabled by the LinkedIn Targeting, Ramping, and Experimentation platform, or T-REX. It started small, but growing internal demand and external forces have led us to scale and evolve the T-REX platform over the past decade. Originally conceived as an experiment management and delivery system with a UI application, the system gradually evolved into a platform that comprises targeting, dynamic configuration and experiment infrastructure, insight and reporting pipelines, a notification system, and a seamless UI experience.
Overall, three main factors heavily influenced development of the T-REX infrastructure in the past decade:
Rapid growth of the company,
Exponential growth of the data available for analysis,
Internal cultural shift, during which experimentation became an intrinsic part of the release process.
If you have been closely following the LinkedIn Engineering Blog, you may have seen numerous posts on A/B testing over the years, but this is the first time we are describing the history of the T-REX platform as a whole. We will take a look at the evolution of the platform’s infrastructure, as well as some of the foundational principles and decisions that shaped it. (Note: During the long history of the platform, it has had multiple incarnations and carried different names, so please do not be confused if it is called LiX (LinkedIn eXperimentation) or XLNT (an eXperimentation framework) in previous posts.)
What is A/B testing?
A/B testing is a scientific method of running studies that relies on randomly splitting a test population into two or more groups and providing them with different variants of some “treatment.” Such a study always includes a control group that does not receive the treatment and serves as a baseline for measuring the treatment’s effectiveness. With a sufficiently large test population and randomized assignment to variant groups, individual differences among members average out, making it possible to estimate the average effect of the treatment on a member. In this post, we will use the terms “A/B testing” and “experimentation” interchangeably.
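As a minimal illustration of this idea (a simulation sketch, not anything from LinkedIn’s implementation), randomly assigning members to groups and comparing group means recovers the average treatment effect:

```python
import random

random.seed(42)

# Simulated outcome: the treatment adds ~0.5 to a noisy baseline metric.
def observe(member_id: int, treated: bool) -> float:
    return random.gauss(10.0, 1.0) + (0.5 if treated else 0.0)

members = range(10_000)
# Randomized assignment: each member independently lands in treatment or control.
assignment = {m: random.random() < 0.5 for m in members}

treated = [observe(m, True) for m in members if assignment[m]]
control = [observe(m, False) for m in members if not assignment[m]]

# With enough members, individual differences average out, and the
# difference in group means estimates the average treatment effect (~0.5).
effect = sum(treated) / len(treated) - sum(control) / len(control)
print(round(effect, 2))
```

The point of the randomization is that no individual trait can systematically end up in one group, so the difference in means is attributable to the treatment alone.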
A schematic representation of A/B testing
Beyond scientific applications, it is possible to run A/B testing in online services and measure the impact of new features on the service’s users. Such an A/B testing system may seem simple at first, but building one for a complex ecosystem operating at large scale requires considerable engineering effort.
The Stone Age of experimentation
In order to develop an A/B testing platform, you must start by defining four main capabilities of the system, namely:
Evaluating whether a member belongs to or is eligible for an A/B test.
Assigning a member to a specific variant of the test.
Propagating information about the tests and their definition to a production environment and recording information about members’ participation in the test.
Computing A/B reports and measuring the test’s impact.
The very first version of the LinkedIn experimentation platform was quite simple and provided these capabilities in the following way:
Every member was assigned a unique identifier, also known as “member id.”
The entire member population was split into 1,000 buckets, 0-999, and a member was assigned to a bucket by using a simple formula, namely bucket_id = member_id mod 1000.
In order to run a test, a developer had to allocate a range of buckets for it and make sure that each bucket in the range was not being used by any other test.
The test’s bucket range was then divided into subranges corresponding to different variants of the test.
As you can imagine, there were numerous issues with this approach. First of all, assigning members to A/B tests this way did not follow the scientific methodology of randomized controlled trials: because a member always fell into the same bucket, the same groups of members were reused across tests, which biased the results and rendered them invalid.
Second, there was no centralized database for test bucket allocation, which was instead performed via email exchanges, a slow and unreliable method. Technically, such a method has an algorithmic complexity of O(test_count²), because each test’s owner had to coordinate with every other A/B test’s owner, which means that it did not scale with the company’s growth. To give you a better understanding of the complexity of running an A/B test in such a system, we found an email thread from July 2009:
A conversation that happened in July, 2009
Third, test definitions were scattered across the codebase, which complicated debugging and made comprehensive understanding of the state and history of A/B testing impossible.
Representation of A/B tests in LinkedIn codebase in 2009
Fourth, pushing an A/B test to production required changing the configs of every affected service and redeploying them. Essentially, it could take hours to activate a single test for a single service, and much longer if the test had to be shared between multiple services.
Fifth, reports for A/B tests had to be computed manually in a spreadsheet or in R, which was tedious, error-prone, and could yield invalid conclusions.
Manual process of A/B test creation, deployment, and analysis
The new era
Around 2010, it became apparent that we needed a new solution to satisfy the growing demands of the company, and to scale with:
The number of LinkedIn employees using experimentation for decision making,
The number of active tests,
The number of LinkedIn members.
By “scaling,” we mean adapting the system to the growth of one or more entity types constituting the system, impacting the system, or being processed by the system.
And so, a few bright minds decided not to put up with the status quo and instead radically redesigned the platform. Their decisions created the foundation for the modern T-REX infrastructure and ensured its growth for years to come.
Before we dive deeper into details, we need to define the core terms of the new system, namely the “test” and the “experiment”:
- A T-REX test is a representation of a hypothesis or a feature that we want to test. Therefore, a test is “active” if the hypothesis has not yet been confirmed or rejected, or the feature release has not completed yet, and is inactive otherwise.
- A T-REX experiment represents a single stage of hypothesis testing or a feature release, which is associated with a given allocation of tested variants. For example, if a feature is rolled out to 5% of a target population first, then gets expanded to 25%, 50%, and 100% of the population over time, then four T-REX experiments will be created and associated with four different allocations of the treatment: 5%, 25%, 50%, and 100%.
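One common way to implement such staged ramps (shown here purely as an illustration, not as T-REX’s actual assignment algorithm) is to hash the member id together with the test name into a stable point in [0, 1), so that expanding the allocation from 5% to 25% only ever adds members to treatment and never flips an already-treated member back to control:

```python
import hashlib

def ramp_bucket(test_name: str, member_id: int) -> float:
    """Map a (test, member) pair to a stable point in [0, 1)."""
    digest = hashlib.sha256(f"{test_name}:{member_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def in_treatment(test_name: str, member_id: int, ramp_pct: float) -> bool:
    return ramp_bucket(test_name, member_id) < ramp_pct / 100

# Expanding 5% -> 25% -> 50% -> 100% only ever ADDS members:
members = range(20_000)
stages = [5, 25, 50, 100]
cohorts = [{m for m in members if in_treatment("feed_redesign", m, p)}
           for p in stages]
assert all(a <= b for a, b in zip(cohorts, cohorts[1:]))  # monotone growth
print([len(c) for c in cohorts])
```

Hashing on the test name also gives each test its own independent randomization, avoiding the correlated-bucket problem of the legacy mod-1000 scheme.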
A/B tests and experiments in T-REX
T-REX tests are also important because they constitute a part of the interface between the experimentation platform and a client application. A feature release can be A/B tested by creating a T-REX test, evaluating it in the code, and then taking some action based on the returned value.
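In client code, this looks roughly like the following (the function and UI names here are hypothetical, not the actual T-REX client API):

```python
# Hypothetical client interface: the application asks for a variant by
# test name and entity URN, then branches on the returned treatment.
def get_treatment(test_name: str, entity_urn: str) -> str:
    # In production this call would go through the Lix client and its
    # caches; here it is stubbed out so the example is self-contained.
    return "control"

def render_new_search_ui() -> None:   # hypothetical application code
    print("new UI")

def render_old_search_ui() -> None:
    print("old UI")

treatment = get_treatment("new_search_ui", "urn:li:member:12345")
if treatment == "treatment":
    render_new_search_ui()
else:
    render_old_search_ui()
```

The application only ever names the test; how that test maps to experiments and allocations at any moment is entirely the platform’s concern.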
Conceptually, a T-REX experiment is designed as an immutable state entity that describes how a particular A/B test is evaluated at a given moment in time, including its segmentation rules and configuration of randomized population splits. These are defined with the LinkedIn experimentation domain-specific language, or Lix DSL (see our previous blog post “Making the LinkedIn experimentation engine 20x faster”).
Client code evaluates only the test, never its experiments. This decouples the application from the state of the release and allows for gradual rollouts without changing the application’s code.
Better A/B test management
The test management and deployment infrastructure was created from scratch with the following structure:
Test management, deployment, and evaluation at LinkedIn
The new architecture had multiple benefits. First, it became much easier to perform the essential tasks of A/B testing on the platform after the change; the engineers gained the ability to keep track of active tests, debug experiment definitions, and roll out new experiments to production—all through the UI.
Second, experiment deployment latency was reduced to under 5 minutes, and the overhead of running experiments dropped significantly, making it possible to roll out features gradually and smoothly while A/B testing at every step.
An example of using targeting and experimentation during a feature release process
Third, a typical experimentation process requires targeting specific groups of the site’s population (as in the above figure). Prior to the platform overhaul, developers handled targeting on their own, which required code changes with every new ramp and slowed the pace of experimentation. That is why we made targeting an integral part of T-REX and decoupled it from the business logic of applications.
Fourth, we built the LinkedIn experimentation domain-specific language (or the Lix DSL) as the main interface of the targeting and experimentation capabilities of the platform.
A/B test execution
By 2013, the team had built a very efficient architecture for executing A/B tests. The platform uses multiple levels of caching to minimize query latency:
The first stage of evaluation is performed by a client service that has the experimentation library integrated. The service calls the experimentation client (the Lix client) and requests a variant name for a given combination of a test name and a URN of an entity (e.g., a company or a job) that participates in the test.
The Lix client tries to perform the evaluation locally by utilizing an in-memory cache of experiment definitions and entity properties, with a hit rate of 98-99%.
If the required attributes are not available locally, the Lix client fetches them from the experimentation backend.
If the backend has the attributes in its cache (with a hit rate of 93%+), it performs the evaluation and returns the treatment.
Otherwise, the backend fetches attributes from a LinkedIn key-value store called Venice, puts them into the cache, performs the evaluations, and returns the treatment.
Evaluating a variant name for a test and an entity
Thanks to the caching system, less than 0.2% of evaluation requests hit the storage nodes, and less than 2% of evaluations require network calls.
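The evaluation path above can be sketched as a chain of lookups (a simplified model: plain dicts stand in for the real caches and for Venice, and cache invalidation is ignored):

```python
class EvaluationStack:
    """Simplified model of the multi-level lookup chain described above."""

    def __init__(self, venice_store: dict):
        self.client_cache = {}      # in-process cache inside the service
        self.backend_cache = {}     # cache on the experimentation backend
        self.venice = venice_store  # stand-in for the Venice K-V store
        self.network_calls = 0
        self.storage_reads = 0

    def get_attributes(self, entity_urn: str) -> dict:
        # 1. Local in-memory cache (hit rate of 98-99% in production).
        if entity_urn in self.client_cache:
            return self.client_cache[entity_urn]
        # 2. Miss: network call to the experimentation backend.
        self.network_calls += 1
        if entity_urn not in self.backend_cache:
            # 3. Backend miss: read from Venice and cache the result.
            self.storage_reads += 1
            self.backend_cache[entity_urn] = self.venice[entity_urn]
        attrs = self.backend_cache[entity_urn]
        self.client_cache[entity_urn] = attrs
        return attrs

venice = {"urn:li:member:1": {"country": "us"}}
stack = EvaluationStack(venice)
stack.get_attributes("urn:li:member:1")  # miss -> network call + storage read
stack.get_attributes("urn:li:member:1")  # served from the local cache
print(stack.network_calls, stack.storage_reads)  # 1 1
```

Because each level absorbs most of the misses from the level above it, only a tiny fraction of requests ever reaches the storage tier, which is where the 0.2% and 2% figures come from.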
Of course, the development of the platform did not stop there; we have made numerous iterations and improvements, including making the experimentation engine 20x faster and developing a fast variant assignment method. A lot of work also went into creating and optimizing an offline A/B analysis data pipeline, developing the experimentation UI, and refining the statistical methods. We will cover these topics in the next blog posts in this series.
You can also familiarize yourself with the comprehensive history of the T-REX platform by watching our presentation on the topic at the SF Big Analytics meetup.
We would like to thank all members of the T-REX team; without their hard work on the experimentation platform, this project would not have been possible. Big thanks to our management: Igor Perisic, Kapil Surlaker, Ya Xu, Suja Viswesan, Vish Balasubramanian, and Shaochen Huang, and T-REX alumni Shao Xie and Adam Smyczek for leading the team through many challenges and helping it become what it is today!