How LinkedIn scales compatibility testing

Nima Dini

Staff Software Engineer at LinkedIn

October 28, 2020

Introduction

LinkedIn has 12,000 Multirepo codebases, referred to as multiproducts, which represent individual software components developed by our engineering team across the globe. Every day, thousands of code changes are pushed through our Continuous Integration (CI) pipeline. Our ecosystem is evolving rapidly, and we have seen double-digit growth in the number of multiproducts year after year.

Our source control branching model is trunk-based development, meaning that developers push code changes frequently to the main branch of a multiproduct and avoid long-lived feature branches. A successful code push to our CI pipeline publishes a new version, which can then be deployed into production (for a service) or consumed by other multiproducts (for a library), enabling code reuse.

A key feature of multiproducts that makes our development ecosystem truly scalable is that library producers and consumers can release software independently. Specifically, library producers can publish new versions without having to coordinate with consumers, who can then upgrade to a newer version when they are ready instead of having to coordinate with the producers.

The strategy that enables this autonomy is to follow the semantic versioning (semver) contract. For instance, a backwards-incompatible API change to a multiproduct must increment the multiproduct's major version as a signal for the consumers. Conversely, the major version of a multiproduct remains unchanged when code changes are introduced in a backwards-compatible manner.
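
To make the contract concrete, here is a minimal sketch of how a version bump could be derived from the kind of change being published. It follows the public semver rules (major for breaking changes, minor for backwards-compatible features, patch for backwards-compatible fixes); the function names and the change labels are hypothetical and are not part of our tooling.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def next_version(current: str, change: str) -> str:
    """Return the next version for a change labeled 'breaking', 'feature', or 'fix'.

    The labels are illustrative assumptions, not names used by any real tool.
    """
    major, minor, patch = parse_version(current)
    if change == "breaking":
        # Backwards-incompatible change: bump major, reset minor and patch.
        return f"{major + 1}.0.0"
    if change == "feature":
        # Backwards-compatible addition: major stays unchanged.
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        # Backwards-compatible bug fix: major and minor stay unchanged.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

With this sketch, a breaking change to version 2.3.1 yields 3.0.0, while a fix yields 2.3.2, so consumers pinned to major version 2 are unaffected by the latter.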

To enforce this contract for all multiproducts, we integrated semantic versioning as a required check in our CI pipeline. Specifically, we leverage the builds and tests of consumers to validate backwards-compatibility of code submissions for producers. We call this check compatibility testing.

This blog post presents a high-level overview of compatibility testing at LinkedIn and the tooling surrounding it to make it workable at scale. We explain key challenges to developer productivity faced due to our rapid growth and our strategies to address them. We also discuss the next steps we are pursuing to further enhance the developer experience with compatibility testing.

Compatibility testing

Our CI system runs two main validation workflows for every code push to a multiproduct. First is the pre-merge validation job, which runs lightweight checks (e.g., linting) before the code is merged. Then comes the post-merge validation job, which builds the multiproduct and performs more extensive checks before publishing a new version.

Compatibility testing is a required check that runs in the post-merge validation job for libraries. Specifically, once a library is successfully built, the post-merge validation queries our Dependency Backend service to retrieve the library’s direct consumers (multiproducts) that need to be tested for backwards compatibility. It then uploads the library’s build artifacts to a temporary remote storage and triggers a compatibility testing validation job for each consumer.

Each compatibility testing validation job clones the codebase of a consumer (at its latest stable version) and builds it using the library’s artifacts that were previously uploaded. The pass/fail result is then emitted using a Kafka event and consumed by the post-merge validation job to determine whether to publish a new version of that library or report a failure. A simplified view of this procedure is shown below.

Simplified workflows for post-merge validation job (left) and compatibility testing validation job (right)
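
The same procedure can be sketched in a few lines of code. This is an illustration under stated assumptions, not our implementation: the dictionary stands in for the Dependency Backend service, the `build_consumer` callable stands in for a compatibility testing validation job, and the Kafka event is replaced by a plain return value. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CompatResult:
    consumer: str
    passed: bool


def run_compatibility_testing(
    artifacts: str,
    consumers: List[str],
    build_consumer: Callable[[str, str], bool],
) -> List[CompatResult]:
    """Build every consumer against the library's freshly built artifacts."""
    results = []
    for consumer in consumers:
        # In the real system this is a separate validation job per consumer.
        passed = build_consumer(consumer, artifacts)
        results.append(CompatResult(consumer, passed))
    return results


def post_merge_validation(
    library: str,
    artifacts: str,
    direct_consumers: Dict[str, List[str]],
    build_consumer: Callable[[str, str], bool],
) -> str:
    """Decide whether to publish, using the default (zero-failure) criteria."""
    # Hypothetical stand-in for querying the Dependency Backend service.
    consumers = direct_consumers.get(library, [])
    results = run_compatibility_testing(artifacts, consumers, build_consumer)
    failed = [r.consumer for r in results if not r.passed]
    return "publish" if not failed else "fail: " + ", ".join(failed)
```

For example, calling `post_merge_validation("lib-a", "artifacts-v2", {"lib-a": ["svc-1", "svc-2"]}, lambda c, a: True)` returns `"publish"`, while a `build_consumer` that fails for `svc-2` returns `"fail: svc-2"`.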

Developer productivity challenges at scale

When we started employing this testing procedure back in 2014, it worked well for our scale. At the time, LinkedIn had 1,000 multiproducts; the most widely consumed multiproduct was a library that implemented a core Gradle plugin for Java, called gradle-jvm, which had a few hundred consumers. Today, that same library is consumed by 4,500 multiproducts in its latest major version. LinkedIn’s development workforce has seen a tenfold increase in size in the same timeframe.

In fact, a few years ago, we reached a point where a single code submission to gradle-jvm would take 14 hours to complete because of the compatibility testing performed by the CI system. Developers working with this plugin had to submit their code and wait until the next business day to get feedback from the CI system.

Furthermore, every time compatibility testing failed, library producers had to dig through the logs of each failed consumer (multiproduct), which was time-consuming. Often, a small percentage of consumers failed not because of the code changes under test, but because of their own non-deterministic (flaky) tests, producing a false positive signal that blocked the producers from publishing a new version of their library.

With these challenges in mind, we decided it was time to revisit our compatibility testing implementation and expand our tooling surrounding it to keep up with our rapid growth. In the rest of this blog post, we explain how we enhanced the debuggability, stability, and performance aspects of compatibility testing across LinkedIn.

Debuggability

Compatibility testing failures can be difficult to troubleshoot. The debugging process requires library producers to dig through build and test logs for each failed consumer (multiproduct), which is time-consuming. Reliability issues in the infrastructure may also cause failures at this scale, due to transient network issues, hardware malfunction, or a bad rollout of the CI system that introduces a software regression.

Below are some of the features that the team built in the CI system to assist developers with the debugging process:

The ability to reliably determine if a failure was caused by the infrastructure or by a legitimate issue in the validation. For each failure, actionable and fine-grained errors are displayed on the UI with links to the relevant execution logs and helpful resources (e.g., documentation) that help users better understand and fix the issue, or triage it to the right team as needed.
A comprehensive one-page failure report displaying a holistic view of compatibility testing results. For each consumer, the report shows the pass/fail result, a summary of the failure cause, and other relevant information that helps producers identify issues more quickly, without having to go through individual logs for failed consumers.
Infrastructure support to temporarily store the cloned workspace of failed consumers and make them available for developers to debug.
Guidelines and tooling to seamlessly reproduce a consumer failure locally so developers can use their favorite IDE to debug.

Stability

Compatibility testing can produce a false positive signal when dealing with consumers (multiproducts) that fail not because the code change in the library under test is backwards incompatible, but because of their own non-deterministic (flaky) tests. Such failures (even just a handful) block the publication of a new version for the library and put the burden on the producers to debug the failures in the consumer domain before re-submitting their code changes.

Often, while producers are busy debugging, newer versions of consumers will get published. Hence, by the time producers re-submit their code changes, the state of the system has changed, and a new set of flaky consumers might produce a false positive signal in compatibility testing, requiring further debugging. This loop can be endless, like a game of whack-a-mole. We realized we needed a strategy to break this cycle.

That’s why we built infrastructure support that empowers producers to make their code submissions resilient to consumer failures, while still receiving a reliable signal from compatibility testing. Specifically, the following configurable parameters can be customized by a library producer:

Failure threshold. An upper bound on the percent of consumers that are allowed to fail without blocking publication of a new version (by default, zero). Libraries with many direct consumers set this to a small non-zero value, like 4% for gradle-jvm (i.e., 180 out of 4,500 consumers).
Ignored consumers. A set of known unstable consumers that may be ignored for each code submission irrespective of the Failure threshold (by default, an empty set).
Required consumers. A set of key reliable consumers with stable builds and tests that must pass against each code submission irrespective of the Failure threshold (by default, an empty set).
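
The interplay of these three parameters can be sketched as follows. This is a minimal illustration, not our implementation: the class and function names are hypothetical, and two details are assumptions made for the sketch, namely that the threshold is computed over the non-ignored consumers and that a required consumer blocks publication even if it also appears in the ignored set.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class CompatConfig:
    # Defaults mirror the ones described above: zero threshold, empty sets.
    failure_threshold_percent: float = 0.0
    ignored_consumers: FrozenSet[str] = frozenset()
    required_consumers: FrozenSet[str] = frozenset()


def may_publish(results: Dict[str, bool], config: CompatConfig) -> bool:
    """Apply the success criteria to a map of consumer name -> passed."""
    # Required consumers must pass irrespective of the failure threshold.
    # A required consumer with no result is treated as a failure (assumption).
    for name in config.required_consumers:
        if not results.get(name, False):
            return False

    # Ignored consumers are excluded before the threshold is applied.
    counted = [
        passed
        for name, passed in results.items()
        if name not in config.ignored_consumers
    ]
    failures = sum(1 for passed in counted if not passed)

    # Compare as failures * 100 <= percent * total to avoid a division.
    return failures * 100 <= config.failure_threshold_percent * len(counted)
```

With a 4% threshold and 4,500 consumers, 180 failures still allow publication and 181 do not, matching the gradle-jvm example above.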

In certain situations when tactical code changes in a library are needed to fix an urgent issue, the library producers may take an intelligent risk and use an override to bypass compatibility testing. Such code submissions will be marked with an override stamp and audited.

Performance

We used software profiling to instrument our CI system’s post-merge validation implementation, and identified three main performance bottlenecks. Next, we discuss these bottlenecks and our approaches for addressing them, which reduced the compatibility testing execution time for gradle-jvm from 14 hours to 2 hours.

Job scheduler
Our CI system uses our internal job orchestration service, Orca, to run jobs. In particular, our post-merge validation is implemented as a job, which in turn runs the compatibility testing jobs. Given our bounded machine resources, we limit the number of concurrent compatibility testing jobs per post-merge validation to 300.

Our original implementation triggered jobs as a batch and then waited for all the jobs in that batch to complete before starting a new batch. However, we observed that a handful of long-running jobs in each batch would often create a performance bottleneck for the scheduler, preventing it from kicking off more jobs.

We enhanced the scheduler to trigger a new job as soon as a running job completes, so that at any given point in time, we maximize our resource utilization. This new scheduling approach reduced the compatibility testing execution time for gradle-jvm from 14 hours to 7 hours. The difference between the two scheduling approaches is illustrated below.

The original job scheduling approach (a) and the enhanced version (b) used by our post-merge validation
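
The effect of the two approaches can be reproduced with a small simulation. This is a sketch under simplifying assumptions (job durations are known up front, jobs start in list order, and scheduling overhead is zero); the function names are hypothetical and the code does not use Orca.

```python
import heapq
from typing import List


def batch_makespan(durations: List[float], limit: int) -> float:
    """Approach (a): start `limit` jobs, wait for all of them, then repeat."""
    total = 0.0
    for start in range(0, len(durations), limit):
        # Each batch lasts as long as its slowest job.
        total += max(durations[start:start + limit])
    return total


def rolling_makespan(durations: List[float], limit: int) -> float:
    """Approach (b): start the next job as soon as any running job completes."""
    running: List[float] = []  # min-heap of finish times of running jobs
    finish = 0.0
    for duration in durations:
        # If every slot is busy, the job starts when the earliest one frees up.
        start = heapq.heappop(running) if len(running) >= limit else 0.0
        end = start + duration
        heapq.heappush(running, end)
        finish = max(finish, end)
    return finish
```

For durations `[10, 1, 10, 1]` and a concurrency limit of 2, the batch approach takes 20 time units because each batch waits on a long job, while the rolling approach takes 11.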

Compatibility testing preparation
Before the post-merge validation job starts compatibility testing, it needs to know the consumers to test and the success criteria. Our original implementation used to query numerous backend services for each consumer, and download raw data over the network as part of this preparation, which was inefficient.

To optimize, we moved this preparation to the server-side and replaced expensive network calls with performant DB queries. The post-merge job now makes one Rest.li API call to the server and retrieves exactly what it needs (e.g., the set of tests and the success criteria) to run compatibility testing. For gradle-jvm, this optimization reduced the preparation time from 2.5 hours to 10 seconds. Consequently, the compatibility testing execution time was reduced from 7 hours to 4.5 hours.

Timely enforcement of the compatibility testing success criteria
Recall that a library can define a failure threshold, allowing a small percentage of consumers to fail. For instance, gradle-jvm allows 4% of its consumers (180 out of 4,500 multiproducts) to fail without the post-merge job blocking a new version of this library from getting published. Our post-merge job used to wait for all the consumers to complete their tests, even when the remaining consumers could no longer push the failures past the threshold.

We optimized the post-merge job to publish a new version of gradle-jvm as soon as a sufficient number of consumers have passed their validation such that even if all the remaining consumers fail, the total failures would still not exceed the failure threshold. This optimization reduced the compatibility testing execution time for gradle-jvm from 4.5 hours to 2 hours.
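
The early-publish condition boils down to one inequality: publish once the failures so far plus all still-running consumers fit within the allowed number of failures. The sketch below illustrates it with hypothetical names. The early-fail branch is an assumption added for symmetry and is not described above, and required consumers are left out for brevity.

```python
from enum import Enum


class Decision(Enum):
    PUBLISH = "publish"
    FAIL = "fail"
    WAIT = "wait"


def early_decision(total: int, passed: int, failed: int, max_failures: int) -> Decision:
    """Decide as early as possible, given the results received so far."""
    remaining = total - passed - failed
    if failed > max_failures:
        # Assumption: stop early once the threshold is already exceeded.
        return Decision.FAIL
    if failed + remaining <= max_failures:
        # Even if every remaining consumer fails, the threshold still holds.
        return Decision.PUBLISH
    return Decision.WAIT
```

With 4,500 consumers and 180 allowed failures, `early_decision(4500, 4320, 0, 180)` returns `Decision.PUBLISH`: the 180 consumers that are still running can no longer change the outcome.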

Future work

The compatibility testing developer experience has significantly improved at LinkedIn in the past few years. Yet, the team is exploring new opportunities to further enhance the experience and make our developers even more productive. We briefly discuss some potential ideas below:

Finer-grained error categories and UI enhancements. While LinkedIn’s CI system is reliable in categorizing failures and highlighting relevant logs, further breaking down certain (coarse-grained) error categories and making the UI (and logs) more intuitive and discoverable might help with debuggability.
Auto-detecting flaky tests. We introduced product-level configuration to make compatibility testing resilient to flaky consumers. While this was a good start, it only cured the symptom and not the disease (i.e., it did not eliminate flaky tests). Auto-detecting flaky tests and proactively working with consumers to reduce their numbers remains an area of improvement to further reinforce stability.
Consumer prioritization. We currently prioritize the consumers run in compatibility testing using historical execution data: consumers with slower builds run first. While this heuristic has in practice helped reduce the overall execution time for successful post-merge validations, there might be more sophisticated heuristics (e.g., taking the historical success rate of each consumer into account) to further enhance performance.
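
As an illustration of the current slowest-first heuristic, the sketch below orders consumers by their historical build time. The function name and the input format are hypothetical, and placing consumers without history first is an assumption made for the sketch.

```python
from typing import Dict, List


def prioritize(consumers: List[str], avg_build_seconds: Dict[str, float]) -> List[str]:
    """Order consumers so that historically slower builds are triggered first."""
    return sorted(
        consumers,
        # Consumers with no history sort as infinitely slow (assumption).
        key=lambda name: avg_build_seconds.get(name, float("inf")),
        reverse=True,
    )
```

Starting long builds first keeps them from becoming the tail of the run. A heuristic that also weighs historical success rate would only need a different sort key.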

Conclusion

In this blog post, we presented how we do compatibility testing at LinkedIn for our Multirepo ecosystem. We shared some of the key challenges in developer productivity that we faced due to our hypergrowth and discussed some of the tooling enhancements we made in our CI pipeline to keep up with our growing scale. If you are using a Multirepo source control model with stable versioned dependencies like us, we hope the challenges and strategies presented can help you better scale up your compatibility testing while keeping your developers productive!

Acknowledgments

A number of teams and individuals at LinkedIn supported us through various stages of this work. In particular, we would like to thank Akshay Jain and the Orca team for their help in better utilizing our job orchestration service. We thank the Dependency Engineering team for their help in using the Dependency Backend service effectively. We further thank the Multiproduct team for their technical feedback and code reviews. Last but not least, we thank Nikolai Avteniev, Brian Beck, Loren Carvalho, Szczepan Faber, David Herman, Grant Jenks, Deep Majumder, Niket Parikh, Zachary Yang, and others for their invaluable feedback on this blog post.

Topics: Developer Experience/Productivity