Ocelot: Scaling observational causal inference at LinkedIn

December 13, 2022

Co-authors: Kenneth Tay and Xiaofeng Wang

At LinkedIn, we constantly evaluate the value our products and services deliver, so that we can provide the best possible experiences for our members and customers. This includes understanding how product changes impact key metrics related to those experiences. However, simply looking at associations between product changes and key metrics can be misleading: correlation does not imply causation. When making decisions about the path forward for a product or feature, we need to know the causal impact of that change on our key metrics.

The ideal way to establish causality is through A/B testing, where we randomly split a test population into two or more groups and provide them with different variants of the product (which we call “treatments”). Due to the randomized assignment, the groups are essentially the same, except for the treatment they received, and so any difference in metrics between the groups can be attributed solely to the treatment. Our T-REX experimentation platform allows us to do A/B testing at scale, adding 2,000 new experiments on a weekly basis, serving a user population of more than 850 million members.

However, there are many situations where A/B testing is either infeasible or too costly. For these situations, we turn to the field of observational causal inference to estimate the impact of product changes. We have previously published case studies illustrating the importance of observational causal inference in the article “The Importance of Being Causal.”

In this blog post, we share more details on how LinkedIn performs observational causal inference at scale using our Ocelot platform. We also cover the other measures we put in place to hold our causal inference studies to a high standard, so that they ultimately lead to product changes that improve member and customer experiences.

What is observational causal inference?

As previously mentioned, sometimes A/B testing is not possible or too expensive, but we still might want to understand the causal effect of a change. A few examples include:

Estimating the impact of brand marketing campaigns. Most of these campaigns (e.g., TV, billboard, radio) cannot be randomized at the user level within the target region.
Estimating the impact of bugs or downtime from different sources. Quantifying the impact of bugs from different sources enables us to prioritize infrastructure resources. However, we would not want to run an experiment where we artificially randomize bad experiences between users.
Estimating the effect of exogenous shocks to the economy. We are interested in understanding how shocks to the economy (e.g., government policy changes, economic downturns) affect the labor marketplace. We cannot randomize who is impacted by the change and who is not.

In such cases, we use observational causal inference, which is a collection of methods to estimate treatment effects when the treatment is observed rather than randomly assigned. In observational causal inference, we know the treatment status of each user, but the treatment assignment is not random, so the raw metric difference between those who were treated and those who were not cannot be causally attributed to the treatment. In particular, the groups might be systematically different from each other (even after taking treatment status out of the picture), and so metric differences could be due to these underlying differences instead. This phenomenon is known as confounding (see Figure 1 for an example).

Figure 1. Example of confounding. The tables show sample data for the control and treatment groups respectively. It looks like the treatment results in larger outcome values (mean of the “Sessions” column), but this difference could be attributed to the fact that more highly active users took the treatment and these users tend to have larger outcome values, not because of the treatment itself.

Techniques in observational causal inference allow us to estimate the effect of a treatment correctly by adjusting for confounding. It is worth noting that a central difficulty of observational causal inference is that some confounding variables (“confounders”) are observed while others are not. Different observational causal methods have different assumptions and different ways to treat unobserved confounders.
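To make the confounding in Figure 1 concrete, here is a minimal sketch in Python. The rows are made up for illustration (they are not the data behind Figure 1), and the function names are our own. The outcome depends only on whether a user is highly active, so the true treatment effect is zero, yet the raw comparison suggests otherwise because highly active users are over-represented in the treatment group.

```python
from statistics import mean

# Made-up rows for illustration: (treated, highly_active, sessions).
# Sessions depend only on activity level, so the true treatment effect is zero.
rows = [
    (0, 0, 2), (0, 0, 2), (0, 0, 2), (0, 1, 10),    # control: mostly less active
    (1, 0, 2), (1, 1, 10), (1, 1, 10), (1, 1, 10),  # treatment: mostly highly active
]


def naive_difference(rows):
    """Raw difference in mean outcome between treated and untreated users."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def stratified_difference(rows):
    """Compare treated and control users within each activity stratum, then
    average the per-stratum differences weighted by stratum size.
    Assumes every stratum contains both treated and control users."""
    total = 0.0
    for level in sorted({a for _, a, _ in rows}):
        stratum = [(t, y) for t, a, y in rows if a == level]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        total += len(stratum) * (mean(treated) - mean(control))
    return total / len(rows)


print(naive_difference(rows))       # 4: driven entirely by the activity mix
print(stratified_difference(rows))  # 0.0: no effect once activity is held fixed
```

Stratifying on a single observed confounder is the simplest possible adjustment. It does nothing about confounders that were never measured, which is why the choice of method and its assumptions matter.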

Ocelot: LinkedIn’s platform for observational causal inference

Although observational causal inference is a well-studied research area, not all data scientists are fluent in the full suite of techniques. To make observational causal inference more accessible, easier to use, and faster to execute at LinkedIn, we built an internal web application that enables users to run complex causal studies with no coding effort. Ocelot aims to deliver estimates of causal relationships from observational data, along with robustness checks, to end users.

Figure 2 shows the high-level design of the Ocelot platform.

Figure 2. Ocelot High-Level Design

There are two major components in the platform. The first is the Ocelot web app (Ocelot UI + Ocelot web services). This is a web application used to run causal studies, present results and diagnostic information, organize study iterations, and share knowledge across the company. Here are some key features of the web app:

Provides a guided form to lead users through the causal study setup, including what output metrics will be measured, what the control/treatment labels are, what confounders should be included, and most importantly the time periods over which each variable (e.g., output metrics, control/treatment labels, and confounders) should be computed.
Validates the configuration at the UI layer to avoid misconfiguring the causal study. For example, metric dates for A/A robustness checks should be prior to the control/treatment label date. (We explain what these checks are later in the post.)
Presents a detailed report that highlights the key results and the status of each robustness check (e.g., A/A tests, rerandomization tests, and coverage checks; see the section “Ensuring robustness of study results” below to learn more).
Features peer-reviewed, high-quality causal studies with large business impact on the platform, so others can use them as templates to jumpstart their own causal studies.

The second component is the Ocelot pipelines: fully integrated data pipelines consisting of Java jobs, Spark jobs, and R jobs running on Azkaban (a LinkedIn open-source workflow manager), which both prepare modeling data according to the user configuration and execute causal modeling code. We chose to bundle data preparation with causal modeling for the following reasons.

First, setting the variable dates correctly is critical for the validity of the causal inference conclusions. For example, Figure 3 shows the date requirements for the fixed effects model (FEM). For the outcome metric to be the result of the treatment, it has to be measured after the treatment has been administered. Meanwhile, the covariates need to be measured before the treatment time period. For a typical FEM with four time periods, there will be 24 dates involved (in each time period, we need to set the start and end dates for each of the three variables: outcome metric, treatment label, and covariates). It is easy to accidentally overlap dates across different time periods, but because the Ocelot platform provides this functionality with UI validation, neither users nor reviewers have to worry about the correctness of the data preparation.

Figure 3. An example of a Fixed Effect Modeling Date Setup
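The ordering rule described above (covariates before the treatment window, outcome after it) is simple to state but easy to get wrong by hand. The sketch below shows one way such a check could look for a single time period. It is an illustration only: `Window` and `validate_period` are hypothetical names, not Ocelot's actual validation code, and we assume inclusive date ranges.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Window:
    """A date range for one variable; both endpoints are inclusive (assumption)."""
    start: date
    end: date


def validate_period(covariates: Window, treatment: Window, outcome: Window) -> list[str]:
    """Return a list of human-readable problems; an empty list means the period is valid."""
    errors = []
    for name, w in (("covariates", covariates), ("treatment", treatment), ("outcome", outcome)):
        if w.start > w.end:
            errors.append(f"{name}: start date is after end date")
    # Covariates must be measured strictly before the treatment window.
    if covariates.end >= treatment.start:
        errors.append("covariates must end before the treatment window starts")
    # The outcome must be measured strictly after the treatment window.
    if outcome.start <= treatment.end:
        errors.append("outcome must start after the treatment window ends")
    return errors


ok = validate_period(
    covariates=Window(date(2022, 1, 1), date(2022, 1, 7)),
    treatment=Window(date(2022, 1, 8), date(2022, 1, 14)),
    outcome=Window(date(2022, 1, 15), date(2022, 1, 21)),
)
bad = validate_period(
    covariates=Window(date(2022, 1, 1), date(2022, 1, 7)),
    treatment=Window(date(2022, 1, 8), date(2022, 1, 14)),
    outcome=Window(date(2022, 1, 14), date(2022, 1, 20)),  # overlaps the treatment window
)
print(ok)   # []
print(bad)  # ['outcome must start after the treatment window ends']
```

A full FEM setup would run a check like this for every time period and would also need rules for how the periods relate to each other, which this sketch leaves out.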

Second, LinkedIn has 875+ million members and keeps growing. Joining large-scale member data with many confounding variables requires skillful data engineering. We fine-tuned our Spark jobs to reduce data preparation time and failure rates. Ocelot is also integrated with our internal feature store, Feathr, so users only need to select covariates by name, and Ocelot handles the join logic to ensure the modeling data is correct. For example, if users wish to control for session count in the previous week, they can simply pick “macrosessions_sum_7d” as a covariate in the causal study configuration. Our pipeline maps this covariate name to the corresponding data sources and aggregates the seven-day values according to the different date configurations for both the causal modeling phase and the robustness check phase. To further boost productivity, we work with domain experts to pre-define a standard covariate set, which currently includes more than 200 commonly used covariates.
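As a rough illustration of what resolving a covariate like “macrosessions_sum_7d” involves, the sketch below parses the name and aggregates daily values over the window ending the day before a reference date. Everything here is an assumption made for the example: the metric_aggregation_window naming pattern is inferred from this one name, the functions are hypothetical, and none of this is the Feathr API or the Ocelot pipeline code.

```python
import re
from datetime import date, timedelta

# Supported aggregations in this toy example (assumption, not Ocelot's list).
AGGREGATIONS = {"sum": sum, "max": max}


def parse_covariate(name):
    """Split a name like 'macrosessions_sum_7d' into (metric, aggregation, days)."""
    match = re.fullmatch(r"(?P<metric>\w+?)_(?P<agg>sum|max)_(?P<days>\d+)d", name)
    if match is None:
        raise ValueError(f"unrecognized covariate name: {name}")
    return match["metric"], match["agg"], int(match["days"])


def compute_covariate(name, daily_values, reference_date):
    """Aggregate one member's daily values over the window that ends the day
    before reference_date. Missing days count as zero."""
    metric, agg, days = parse_covariate(name)
    window = [reference_date - timedelta(days=k) for k in range(1, days + 1)]
    values = [daily_values[metric].get(day, 0) for day in window]
    return AGGREGATIONS[agg](values)


# Toy data for one member: d sessions on January d, for d = 1..14.
daily_values = {"macrosessions": {date(2022, 1, d): d for d in range(1, 15)}}

# The same covariate name resolves to different windows in each phase.
modeling_value = compute_covariate("macrosessions_sum_7d", daily_values, date(2022, 1, 15))
robustness_value = compute_covariate("macrosessions_sum_7d", daily_values, date(2022, 1, 8))
print(modeling_value)    # 77 (January 8 through 14)
print(robustness_value)  # 28 (January 1 through 7)
```

The point of the example is the last few lines: one covariate name is computed against two reference dates, once for the causal model and once for a robustness check that uses an earlier period, without the user preparing the data twice.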

Lastly, we can enforce the best practice of always running robustness checks along with causal modeling without requiring users to prepare the data multiple times.

Figure 4 shows the five methods offered on our Ocelot platform: Coarsened Exact Matching (CEM) and the Doubly Robust (DR) estimator (also known as the augmented inverse propensity weighted estimator) for cross-sectional data, instrumental variables (IV) estimation when an instrument is available, fixed effects models (FEM) for panel data, and Bayesian Structural Time Series (BSTS) for time series data.
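To give a flavor of one of these methods, here is a minimal sketch of the doubly robust (augmented inverse propensity weighted) estimator of the average treatment effect. It is a textbook-style illustration under strong simplifying assumptions, not Ocelot's implementation: there is a single discrete confounder, and both the propensity model and the outcome models are fit as per-stratum averages. The simulated data and all function names are our own.

```python
import random
from collections import defaultdict
from statistics import mean


def aipw_ate(rows):
    """AIPW estimate of the average treatment effect.
    rows: list of (x, t, y) with discrete covariate x, binary treatment t, outcome y.
    Assumes every value of x has both treated and untreated rows (overlap)."""
    by_x = defaultdict(list)
    for x, t, y in rows:
        by_x[x].append((t, y))

    # Nuisance models, fit here as simple per-stratum averages.
    propensity = {x: mean(t for t, _ in cell) for x, cell in by_x.items()}
    mu1 = {x: mean(y for t, y in cell if t == 1) for x, cell in by_x.items()}
    mu0 = {x: mean(y for t, y in cell if t == 0) for x, cell in by_x.items()}

    scores = []
    for x, t, y in rows:
        e = propensity[x]
        # Outcome-model contrast plus an inverse-propensity-weighted residual correction.
        score = mu1[x] - mu0[x]
        if t == 1:
            score += (y - mu1[x]) / e
        else:
            score -= (y - mu0[x]) / (1 - e)
        scores.append(score)
    return mean(scores)


def simulate(n, effect, seed=0):
    """Simulated data: x raises both the chance of treatment and the outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if x else 0.2) else 0
        y = 5.0 * x + effect * t + rng.gauss(0, 1)
        rows.append((x, t, y))
    return rows


rows = simulate(n=20000, effect=2.0)
naive = mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)
print(f"naive: {naive:.2f}, AIPW: {aipw_ate(rows):.2f}")
```

In this simulation the naive difference should land near 5 while the AIPW estimate should land near the true effect of 2. In practice the nuisance models would be regression or machine learning models over many covariates, and the estimate remains valid only if there is no unobserved confounding.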

A typical Ocelot user journey is illustrated by the following screenshots. First, the user logs in to the Ocelot platform and picks a causal method (Figure 4).

Figure 4. Ocelot Landing Page

Then, the user looks at the featured analyses and/or past analyses to learn how to set up parameters for their causal study (Figure 5). Users are also free to create an analysis from scratch.

Figure 5. Methodology Landing Page & Past Analysis History

Third, the user fills in a guided form to set up the causal study, and executes the analysis with a click of the button (Figure 6).

Figure 6. Create & Execute a New Analysis Page

Next, the user reviews the results (Figure 7).

Figure 7. Results Page

Next, unlike a simple deep-dive analysis, the user usually iterates on the causal study by including different covariates or changing the study population. All past execution history is captured to guard against p-hacking (Figure 8).

Figure 8. Iterate on Causal Design

Finally, once the results are ready, the user can submit the analysis for our committee’s review (Figure 9).

Figure 9. Request Committee Review

As shown in the previous screenshots, our Ocelot platform provides the following convenient features to increase users’ productivity:

Capture causal study description, goals, and tags. These are searchable so that new users can easily learn how to run observational causal studies.
Clone function to jump start a new analysis or a new iteration by copying the past causal analysis configuration.
Data visualization to help users configure causal study dates correctly and highlight key results.

Since the launch of the Ocelot platform in 2019, we have been successful at democratizing and expediting observational causal inference within LinkedIn’s data science community. For example, prior to Ocelot, it usually took at least two data scientists (an experienced observational causal inference expert and a domain expert) up to six weeks to design the causal study, build the data pipeline to create the dataset, write ad-hoc causal modeling scripts, and validate and analyze the results. Because this work was so resource-intensive, only 10-20 observational causal studies had been produced up to that point. With the Ocelot platform, domain expert data scientists can run causal studies on their own, and it usually takes them just a couple of hours to learn the tool and execute a simple causal study. Including all the iterations and reviews, we can complete a thorough causal study with less than one week’s effort. Since the launch of Ocelot, we have run more than 50 causal studies every year, many of which provide deep insights into the LinkedIn ecosystem and influence product strategy. (A few of those studies have been discussed in the article “The Importance of Being Causal” in the Harvard Data Science Review.)

Ensuring robustness of study results

Because the estimates from observational causal studies are used to guide product decision-making, they must be reliable. Hence, the design of our studies, the methods we use, and the way we interpret the results must meet a high bar of rigor. We do this in two main ways at LinkedIn: (1) a central review committee and (2) automated robustness checks on Ocelot.

We have set up a central review committee that vets the design of observational causal studies and guides proper interpretation of study results. Study design and results are presented and discussed weekly, and treatment effect estimates can only be interpreted as causal if the committee deems the study to be rigorous enough.

The committee consists of members from the horizontal Data Science Applied Research team and data scientists from each product. Having both types of members is key to the committee’s success as they bring complementary strengths to the table. While all members of the committee have a strong understanding of observational causal inference, members from the horizontal team have deep technical expertise on the methods in experimentation and causal inference, with the ability to develop new methods when needed. Product data scientists have the domain knowledge to ensure the study’s design and interpretation of results make business sense.

The central committee also plays a vital role in raising the level of knowledge of observational causal inference across LinkedIn. Members from the horizontal team distill and share the latest advances in observational causal inference and updates to the Ocelot platform, through documents, presentations to vertical teams, as well as through the product data scientists in the central committee. Product data scientists also act as the “champion” for observational causal inference in their line of business, seeking out opportunities where observational causal inference could be helpful, and giving advice to team members who are running observational causal studies.

In addition to manual review, methods on the Ocelot platform have automated robustness checks which, if passed, increase confidence in the treatment effect estimates. If the robustness checks fail, the methodology cannot be used to estimate the causal effect, and the user is not allowed to claim that the estimates are causal.

For most of the methods on Ocelot, we have some version of the A/A test. The A/A test is easiest to explain in the A/B testing setting: in an A/A test, we randomly split the test population into two groups but give the groups the same treatment. Since the treatments are the same for both groups, we do not expect any metric differences. Statistically significant metric differences suggest that something is wrong with the study design and that the results cannot be trusted. For example, an A/A test failure can indicate that the treatment assignment is not actually random, and that confounders are contributing to differences in the outcome that should not be mistaken for the treatment effect. For observational causal inference, we find settings where the treatment effect should be zero after adjusting for confounding. The A/A test fails if the treatment effect estimate is significantly different from zero, even after adjusting for confounding.
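One common way to build such a zero-effect setting is to use an outcome measured before the treatment date, since the treatment cannot have caused it. The sketch below applies a stratified adjustment to a pre-period outcome and flags the study if the adjusted difference is significantly different from zero. It is an illustration under our own assumptions (a single discrete confounder, a normal-approximation z-test, toy data, hypothetical function name), not the check Ocelot actually runs.

```python
from math import sqrt
from statistics import mean, variance


def aa_test(rows, z_crit=1.96):
    """Placebo-style A/A check on a pre-treatment outcome.
    rows: list of (stratum, treated, pre_outcome).
    Returns (adjusted_difference, z_score, passed).
    Assumes each stratum has at least two treated and two control rows."""
    n = len(rows)
    estimate, var = 0.0, 0.0
    for s in sorted({s for s, _, _ in rows}):
        treated = [y for s2, t, y in rows if s2 == s and t == 1]
        control = [y for s2, t, y in rows if s2 == s and t == 0]
        weight = (len(treated) + len(control)) / n
        estimate += weight * (mean(treated) - mean(control))
        var += weight ** 2 * (variance(treated) / len(treated)
                              + variance(control) / len(control))
    z = estimate / sqrt(var)
    return estimate, z, abs(z) <= z_crit


# Toy data: within each stratum, treated and control look alike before treatment.
balanced = (
    [("low", 0, y) for y in (1, 2, 3)] + [("low", 1, y) for y in (1, 2, 3)]
    + [("high", 0, y) for y in (9, 10, 11)] + [("high", 1, y) for y in (9, 10, 12)]
)
# Toy data: treated users already had higher values before treatment,
# even within strata, which points to a confounder that was not adjusted for.
imbalanced = (
    [("low", 0, y) for y in (1, 2, 3)] + [("low", 1, y) for y in (4, 5, 6)]
    + [("high", 0, y) for y in (9, 10, 11)] + [("high", 1, y) for y in (13, 14, 15)]
)

print(aa_test(balanced)[2])    # True: no significant pre-period difference
print(aa_test(imbalanced)[2])  # False: the adjustment did not remove the gap
```

The samples here are far too small for the normal approximation to be trustworthy; they are kept tiny so the arithmetic can be followed by hand. Passing a check like this raises confidence but does not prove the absence of unobserved confounding.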

While robustness checks help to increase trust in study results, we note that observational causal methods often require assumptions that are impossible to verify (e.g., no unobserved confounding for the doubly robust method, the exclusion restriction for instrumental variables). Review committee members, in conjunction with study owners, use domain knowledge to assess the reasonableness of these assumptions in the study’s context, and to point out these assumptions whenever the study results are used. This also underscores the importance of sound study design.

Conclusion

Observational causal inference is an important complement to A/B testing, enabling us to measure the effect of product changes when we are not able to randomize the treatment among users. Our Ocelot platform enables us to do this at scale in a robust manner. We are continually thinking about which methods to add to the platform and how to ensure that observational causal inference is done rigorously: if you have any thoughts on the topic, we would love to hear them!

Acknowledgements

We would like to thank our colleagues Xiaonan (Kate) Ding, David Tag, Donghoon (Don) Jung, Rina Friedberg, Min Liu, Albert Chen, Vivek Agrawal, and Simon Yu for building the Ocelot platform; YinYin Yu, Weitao Duan, Dan Antzelevitch, Parvez Ahammad, Zheng Li, Sofus Macskassy, Souvik Ghosh, and Ya Xu for their continued support and leadership to advance the observational causal studies platform. We would also like to thank the many internal users who provided valuable feedback to improve the platform, especially Rose Tan, Ming Wu, and Joyce Chen. Finally, we are grateful to the LinkedIn editorial team for their comments and suggestions on earlier versions of this blog post.
