Site Speed Monitoring in A/B Testing and Feature Ramp-up
June 21, 2017
Every day, LinkedIn serves hundreds of millions of pageviews to our members, from job searches to the news feed. Our network has grown to over 500 million members, and throughout our journey, “members-first” has been a fundamental value that we’ve carried. LinkedIn’s site speed infrastructure is a critical component of our mission to continuously improve the member experience, because it provides site performance metrics to the engineers who develop and roll out features to our members.
Predicting the site speed impact of a feature rollout is a difficult engineering problem, particularly at scale: we ramp hundreds of changes simultaneously through A/B testing. When ramping a feature into production, how can developers gain visibility into its site speed impact? Likewise, when a performance optimization is enabled in production, how can the developer quantify the benefit? And how can performance degradation be detected at an early stage of feature ramp-up, before the impact spreads to a larger audience? In this post, we share our experiences and solutions here at LinkedIn.
Site speed A/B reporting
At LinkedIn, features are ramped up through an A/B testing platform called XLNT. For a feature to be deployed in the production environment, it typically goes through several ramp-up stages that span days or even weeks before the feature is fully rolled out to every member.
We collect Real User Monitoring (RUM) data from all our web pages and mobile applications. In RUM, basic metrics, such as Navigation Timing and Resource Timing, are collected. In addition, we collect detailed debugging data using markers that indicate the performance of individual components. This is the source of truth for site speed at LinkedIn.
On top of this, we built a system to slice and dice the RUM data to allow developers to visualize and understand site speed changes along the entire A/B ramping cycle.
During A/B testing, we compare site speed metrics provided by RUM, such as traffic and page load time, between two groups of users: an experimental group and a control (or “baseline”) group. By holding other variables, e.g., country, constant across both groups, we can make a fair comparison between the two sets of results and isolate the performance impact.
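As an illustration, the core comparison can be sketched in a few lines of Python. The sample values, function names, and percentile method below are hypothetical stand-ins, not LinkedIn's actual implementation:

```python
from statistics import quantiles

def p90(samples):
    """90th percentile of a list of page load times (ms)."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    return quantiles(samples, n=10)[-1]

def compare_groups(control, treatment):
    """Relative change in p90 page load time: negative means the
    treatment group is faster than the control group."""
    base = p90(control)
    return (p90(treatment) - base) / base

# Hypothetical page load times (ms), already filtered to the same page/country:
control   = [2100, 2300, 2500, 2200, 2400, 2600, 2350, 2450, 2550, 2150]
treatment = [1700, 1850, 2000, 1750, 1900, 2050, 1800, 1950, 2100, 1650]
print(f"p90 change: {compare_groups(control, treatment):+.1%}")
```

Because both groups are filtered on the same dimensions, the relative change in the percentile attributes the difference to the treatment rather than to country or page mix.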
Site speed A/B data are available in two flavors today: daily and real time. Daily data is processed in Hadoop. It is more reliable, because daily site speed variance is relatively small. Real-time data, which is processed in Apache Samza, is aggregated into 10-minute windows and is primarily used for anomaly detection and alerting purposes.
The choice between daily summarized data and real-time data is a practical tradeoff. While real-time data is available quickly and is useful for anomaly detection and alerting, its visualization is often noisy. Conversely, daily data is only available much later, but often provides a better understanding of the data. In practice, this distinction means they are applied to different problems. Real-time data is better suited for alerting: when a feature is ramped and accidentally causes a site speed degradation on a particular page, the real-time result can raise awareness within a meaningful timeframe. With an alerting system on top of the real-time data, developers and experiment owners can be notified about the issue early. Meanwhile, daily data is better suited for quantifying a performance impact that developers are already aware of. For example, when developers finish a feature that improves the speed of a web page, the daily report summarizes the precise difference comprehensively. Based on the result, owners can decide whether or not to continue ramping the feature.
Let’s use a real example at LinkedIn to show how we use this framework to monitor and improve site speed. We noticed that LinkedIn’s web profile page was slow because ads needed time to load. Engineers decided to optimize ads to solve this problem and rolled out their changes. To see the performance difference and business impact before deciding to fully ramp up this feature, engineers used our site speed A/B test report to monitor the ad click rate and site speed metrics at the 10% A/B ramping stage. From our visualization dashboard, engineers were able to check daily 90th percentile page load times and traffic trends for both control and experiment groups.
From the first chart on the UI, we can see that the page load time of the experiment group is 20% faster than that of the control group for both countries tested. In the second chart, traffic ramping is reflected by a traffic count decrease in the control group and a traffic increase in the experiment group. On the business side, this site speed optimization resulted in an ads revenue increase, demonstrating concrete business value. Based on these data, engineers were able to quantify the performance and business benefits of this optimization, even at the 10% ramping stage. After making sure business and performance metrics were good, engineers ramped this optimization up to 100%.
Overview of site speed daily A/B comparison for profile page (control vs experiment group)
Overall architecture of end-to-end pipeline
Now, let’s go into more detail on how we collect, process, and visualize the data. When a member loads a LinkedIn page from a browser or an app, a Kafka event with performance metrics is sent. If the member is part of an experiment segment, related experiment information is used to generate a Kafka experiment event. In both cases, this is done automatically via libraries embedded in the application stack, and the events are sent to their respective Kafka topics and subsequently stored in HDFS.
The offline pipeline, running on top of Hadoop, consumes the two sets of data from HDFS to process site speed A/B data daily. Another pipeline, this one online, consumes Kafka events directly and runs on top of Apache Samza to process site speed A/B data in real time.
The daily offline pipeline runs on Hadoop and Spark. It first preprocesses the two sets of raw data, massaging raw site speed data into a lighter message version containing information such as dimensions (page name, country, time, mobile device type, etc.) and metrics (page load time, TCP connection time, first byte time, etc.). A Spark job joins the experiment data with the site speed data on a common attribute ID called “memberId,” aggregating the joined data by pages, countries, dates, and experiment details (experiment, segment group), as well as computing quantiles (50th and 90th percentiles) over the metrics. The result is stored in MySQL as daily aggregated data.
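A minimal Python sketch of this join-and-aggregate step, using hypothetical records and field names (the real pipeline runs this as a distributed Spark job over HDFS data, not in-memory dictionaries):

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical preprocessed records; the real pipeline reads these from HDFS.
site_speed = [  # one record per page view
    {"memberId": 1, "page": "profile", "country": "us", "plt_ms": 2400},
    {"memberId": 2, "page": "profile", "country": "us", "plt_ms": 1900},
    {"memberId": 3, "page": "profile", "country": "us", "plt_ms": 2600},
]
experiments = [  # experiment membership per member
    {"memberId": 1, "experiment": "ads-opt", "segment": "all", "treatment": "control"},
    {"memberId": 2, "experiment": "ads-opt", "segment": "all", "treatment": "variant"},
    {"memberId": 3, "experiment": "ads-opt", "segment": "all", "treatment": "control"},
]

# Join the two data sets on memberId (a distributed join in the real Spark job).
exp_by_member = {e["memberId"]: e for e in experiments}
groups = defaultdict(list)
for r in site_speed:
    e = exp_by_member.get(r["memberId"])
    if e is None:
        continue  # page view from a member outside the experiment
    key = (r["page"], r["country"], e["experiment"], e["segment"], e["treatment"])
    groups[key].append(r["plt_ms"])

# Aggregate: median page load time per (page, country, experiment, segment, treatment).
daily = {key: quantiles(s, n=4)[1] if len(s) > 1 else s[0]
         for key, s in groups.items()}
print(daily)
```

The same grouping keys extend naturally to dates and additional dimensions, and the per-group lists feed the 50th/90th percentile computations before the results are written to MySQL.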
Real-time data processing pipeline overall architecture
The real-time pipeline is built in a similar manner. For each real-time processing pipeline, we first apply a job to transform the data to a format that simplifies processing. Then, we have a job dedicated to joining two events with the same common ID. This usually involves storing an event for some period of time until matching events arrive, or until some timeout threshold is reached. Once a match is found, an aggregated event with the site speed performance metrics and the experiment identifiers is generated.
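The buffering join described above can be sketched as follows. The class, field names, and TTL value are illustrative assumptions; the real job keeps its buffered events in Samza's local state store rather than an in-memory dict:

```python
class StreamJoiner:
    """Join RUM events with experiment events on memberId. One side is
    buffered until its match arrives or the TTL expires (a simplified
    stand-in for the Samza join job's key-value state store)."""

    def __init__(self, ttl_seconds=600):
        self.ttl = ttl_seconds
        self.buffer = {}  # memberId -> (arrival_time, side, event)

    def process(self, member_id, side, event, now):
        # Drop buffered events older than the TTL (unmatched within window).
        self.buffer = {k: v for k, v in self.buffer.items()
                       if now - v[0] <= self.ttl}
        held = self.buffer.get(member_id)
        if held is not None and held[1] != side:
            del self.buffer[member_id]
            return {**held[2], **event}  # emit the joined event
        self.buffer[member_id] = (now, side, event)
        return None

joiner = StreamJoiner(ttl_seconds=600)
joiner.process(42, "rum", {"page": "feed", "plt_ms": 2100}, now=0)
joined = joiner.process(42, "exp", {"experiment": "ads-opt", "treatment": "variant"}, now=5)
print(joined)  # both halves merged once the match arrives
```

The TTL bounds the state the job must hold: a RUM event whose experiment event never arrives (or vice versa) is eventually discarded instead of buffered forever.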
The final component is an aggregation job, which consumes joined events from the previous step. This job groups events with the same page, country, experiment, segment, treatment, and 10-minute time window into one bucket. Within each bucket, various performance calculations are made. For example, we may generate and compare the baseline and experimental values of 50th/90th percentile page load time, sample sizes, and much more.
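A simplified sketch of this windowed aggregation, with hypothetical events and field names standing in for the real joined Kafka events:

```python
from collections import defaultdict
from statistics import quantiles

WINDOW_MS = 10 * 60 * 1000  # 10-minute aggregation windows

def window_start(ts_ms):
    """Align an event timestamp to the start of its 10-minute window."""
    return ts_ms - ts_ms % WINDOW_MS

# Hypothetical joined events produced by the join job.
events = [
    {"ts": 1_200_000 + i * 1000, "page": "feed", "country": "us",
     "experiment": "ads-opt", "segment": "all", "treatment": "control",
     "plt_ms": 2000 + i * 50}
    for i in range(10)
]

buckets = defaultdict(list)
for e in events:
    key = (e["page"], e["country"], e["experiment"], e["segment"],
           e["treatment"], window_start(e["ts"]))
    buckets[key].append(e["plt_ms"])

# Per-bucket percentiles and sample size for the real-time dashboard.
stats = {key: {"p50": quantiles(s, n=4)[1],
               "p90": quantiles(s, n=10)[-1],
               "n": len(s)}
         for key, s in buckets.items()}
print(stats)
```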
Challenges and solutions
The system was built to address the following challenges:
- Scalability: Every day, there are hundreds of feature ramp-ups and millions of page visits on LinkedIn. We have a large number of members around the world actively using our sites, and hundreds of page keys. The resulting combined data set is therefore big, and for each resulting data set, we need to calculate a large number of performance metrics. The platform needs to scale to handle large volumes of data calculations. Our choice to use Spark for offline processing has proven to be a good decision, running nearly 10x faster compared to Hadoop. Spark empowers the offline pipeline to process more page keys and experiment combinations in a shorter amount of time. Apache Kafka allows a reconfigurable number of partitions per topic to enable higher levels of parallelization. In our real-time design, scaling out to a higher number of user-generated events is simply a matter of adding partitions and Samza workers.
- Anomaly detection and alerting: Results need to be available as soon as possible to be effective in minimizing any negative impacts. We added an alerting feature on top of the real-time pipeline to help achieve this goal. After an experiment begins, the anomaly detection engine waits 3.5 hours before starting A/B comparisons. The engine queries the MySQL database to fetch the latest real-time results, performs a real-time A/B comparison on the data, and then decides whether an alert should be triggered. We also built an email alerting system and dashboard to help developers monitor and debug performance issues. The figure below shows an alert we sent to notify the feature owner that there was a 5.45% site speed degradation caused by the feature ramping on the web homepage.
Alerting email when page degradation is captured
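As a rough illustration, a fixed-threshold check like the following could decide whether to trigger an alert. The production engine uses a tuned anomaly detection algorithm, so the rule, names, and numbers here are only assumptions:

```python
def check_degradation(control_p90, variant_p90, n_control, n_variant,
                      min_samples=500, threshold=0.05):
    """Illustrative alert rule: flag when the variant's p90 page load time
    regresses past the threshold and both groups have enough traffic.
    (The real engine's tuned algorithm is more sophisticated than this.)"""
    if n_control < min_samples or n_variant < min_samples:
        return None  # not enough data yet for a fair comparison
    delta = (variant_p90 - control_p90) / control_p90
    if delta > threshold:
        return f"ALERT: {delta:.2%} p90 degradation detected"
    return None

# Hypothetical 10-minute-window results pulled from the real-time tables:
print(check_degradation(2000, 2109, n_control=8000, n_variant=900))
```

Requiring a minimum sample size on both sides keeps the alert from firing on the noisy, low-traffic windows early in a ramp.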
- Usability: To make the results accessible, an intuitive UI is needed for visualization. We built a dashboard on top of our performance debugging tool, Harrier, where developers can select different dimension combinations and visualize metrics they are interested in. Below is a query selection example. The resulting A/B testing chart is shown in the “Use case” section above.
UI to allow user to select site speed A/B query
Building this entire system would not have been possible without the contributions of our engineering team. Thanks to Xiaohui Sun for implementing the Spark quantile function, Nanyu Chen for designing and tuning the real-time anomaly detection algorithm, Steven Pham for building the excellent UI, and Jimmy Zhang for helping with the alerting component. Additionally, thanks to Ya Xu and Ritesh Maheshwari for invaluable feedback and support.