How LinkedIn handles merging code in high-velocity repositories
April 30, 2020
Editor's note: This blog has been updated.
The product ecosystem at LinkedIn is vast, and managing its infrastructure can be daunting. Nearly 10 years ago, we transitioned away from a single monolithic codebase toward a microservice architecture, allowing teams to manage their own repositories. Today, we have thousands of developers working on more than ten thousand active repositories, some of which have upwards of 300 commits a day flowing through them.
We practice trunk-based development, in which developers of a given repository push changes to a main development branch (e.g., "primary"), rebase frequently, and avoid long-lived feature branches. In this blog, we will focus on how our Continuous Integration (CI) system is able to work with repositories of different sizes, specifically ones with a high velocity of commits being merged into primary, to ensure timeliness and code correctness.
In any code management system (especially when trunk-based development is used), there are two main constraints to consider in relation to developer productivity. The first constraint is that primary should aim to always be “green.” This means that at any point in time, if the codebase is checked out at HEAD (the latest revision), it should be able to compile and build successfully, as well as generate any necessary artifacts. When primary becomes “red” (i.e., it is unable to build successfully at HEAD), it requires either a revert back to the last known good revision or for an engineer to step in to make a new change that fixes the problem and allows the product to build again, often referred to as “fixing forward.” Both of these options can take a significant amount of time, delaying any features or fixes that landed between the bad change and its fix.
The second constraint is that developers expect their changes to land in the repository in a timely manner. The simplest approach to merging multiple changes to a single codebase is to queue up every change that is being submitted and then, one at a time, perform validation/testing and merge them into the repository. The biggest drawback to merging changes serially in submission order like this is that it simply does not scale for high-velocity repositories. Consider a product that takes an hour or so to build, or one that requires a large amount of validation work before a change can be accepted. If there are multiple engineers each pushing multiple commits a day to this repository, changes arrive faster than they can be validated and merged, so the queue of changes will continually grow. Another approach is to batch commits together and, if any of them fail, reduce the size of the batch and try again. However, this too runs into issues at scale: when you have, say, 20 commits in a single batch, a single bad commit forces the good commits in that batch to be validated again.
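To make the scaling problem concrete, here is a minimal Python sketch of both approaches. The function names and all numbers are hypothetical, chosen only for illustration. The batching model is deliberately simplified: a failed batch is split in half and each half is rebuilt independently.

```python
def backlog_after(hours: float, commits_per_hour: float, build_minutes: float) -> float:
    """Changes still queued after `hours` when validation and merge happen one at a time."""
    merged_per_hour = 60.0 / build_minutes
    growth_per_hour = max(0.0, commits_per_hour - merged_per_hour)
    return growth_per_hour * hours


def builds_for_batch(batch: list[bool]) -> int:
    """Count builds needed to land a batch, halving it whenever a build fails.

    Each entry is True for a good commit and False for a bad one.
    """
    if not batch:
        return 0
    if all(batch) or len(batch) == 1:
        # One build either validates the whole batch or rejects a lone bad commit.
        return 1
    mid = len(batch) // 2
    # The failed build counts, then each half is retried on its own.
    return 1 + builds_for_batch(batch[:mid]) + builds_for_batch(batch[mid:])


# Hypothetical numbers: 5 commits/hour arriving, 60-minute build, 8-hour day.
print(backlog_after(hours=8, commits_per_hour=5, build_minutes=60))

all_good = [True] * 20
one_bad = [True] * 7 + [False] + [True] * 12  # a single bad commit in a batch of 20
print(builds_for_batch(all_good), builds_for_batch(one_bad))
```

Under these assumptions, the serial queue ends the day 32 changes behind, and one bad commit in a batch of 20 turns a single build into nine.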
In addition to the previous two, a third constraint exists for us due to our usage of a microservice-based architecture; the system should remain performant for both high- and low-velocity repositories. It is important that the system doesn’t optimize only for high-velocity repositories and cause developers of low-velocity repositories to suffer productivity losses. We solve this by allowing developers to customize some of the validations that they wish to run throughout the pipeline (as well as when they wish to run them), as we will discuss in the next section.
LinkedIn uses both pre-receive (pre-merge) validations and post-receive (post-merge) validations to satisfy the constraints laid out above, as is the norm for many trunk-based development CI systems. Once a developer is ready to push their change, they use an internally developed git sub-command called “git submit” rather than running “git push.” This CLI will kick off a validation job and immediately return control back to the developer. As depicted below, the job will then run pre-merge validations and do a “git push” on the user’s behalf, rebasing as necessary. Once complete, the code will be merged into the repository and post-merge validations will be kicked off through a server-side post-receive hook.
“Git submit” workflow
It is important to note that the developer’s local repository does not have to be up to date with remote HEAD in order to successfully invoke this CLI (unlike “git push” itself). This was a huge pain point for developers of high-velocity repositories and is actually one of the reasons that this system was built in the first place. Developers would end up in an endless cycle of running tests, trying to push their changes, being told their HEAD was out-of-date, rebasing, and then repeating the process. With the “git submit” workflow, the pre-merge validation job does the push for the developer (while rebasing to keep up with the remote HEAD), as depicted below.
Pre-merge validation job workflow
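The rebase-and-push loop that the validation job takes over from the developer can be sketched roughly as follows. This is an in-memory illustration, not the actual “git submit” implementation: `Remote`, `submit`, `before_push`, and `max_attempts` are hypothetical names, and the sketch assumes validations run once and are not repeated after each rebase, which the workflow above does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Remote:
    """Hypothetical in-memory stand-in for the shared repository."""
    commits: list[str] = field(default_factory=list)

    def head(self) -> int:
        return len(self.commits)

    def push(self, change: str, based_on: int) -> bool:
        """Accept the change only if it was based on the current HEAD."""
        if based_on != self.head():
            return False  # out of date, like a rejected non-fast-forward push
        self.commits.append(change)
        return True


def submit(
    remote: Remote,
    change: str,
    validate: Callable[[str], bool],
    before_push: Callable[[], None] = lambda: None,
    max_attempts: int = 5,
) -> Optional[int]:
    """Validate, then rebase and push until the push lands.

    Returns the number of push attempts used, or None if the change never landed.
    """
    if not validate(change):
        return None
    for attempt in range(1, max_attempts + 1):
        base = remote.head()  # "rebase" onto whatever HEAD is right now
        before_push()         # hook used here only to simulate concurrent pushes
        if remote.push(change, base):
            return attempt
    return None


remote = Remote(["a", "b"])
racers = ["c"]  # one competing change lands while our job is mid-submit


def race() -> None:
    if racers:
        remote.push(racers.pop(), remote.head())


print(submit(remote, "my-change", validate=lambda c: True, before_push=race))
print(remote.commits)
```

In this run the first push is rejected because a competing change landed first; the job rebases and succeeds on the second attempt, without the developer ever seeing the failure.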
In the event that the pre-merge validation job for a change submission ends with a “success,” the post-merge validations are kicked off after the “git push,” as displayed above. For most products at LinkedIn that are relatively low velocity (~5 commits a day), pre-merge testing is used for lightweight validations (linting, static security checks, etc.), and the build or compilation step occurs during post-merge testing. Engineering teams of low-velocity repositories tend to favor this quick pre-merge validation because the timeliness of their change making its way to production outweighs the effects of a bad change being merged. In the case of a bad change getting into primary, the post-merge testing will fail and the developer can simply fix forward or revert the change. Additionally, developers have the option of enabling an automated revert mechanism, which automatically reverts any changes that cause primary to be red. This can be particularly helpful for removing bad changes quickly and getting primary back into a green state; however, it can also cause churn for repositories with excessive flakiness. For these low-velocity repositories, a bad change getting into primary is not as impactful as it is to high-velocity repositories, where teams pursue continuous delivery with multiple releases per day.
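One way to picture this per-repository customization is as a small pipeline description that says which checks run before the merge, which run after, and whether automated revert is enabled. The sketch below is illustrative only; `Pipeline`, `land`, `lint`, and `build` are hypothetical names rather than LinkedIn's configuration format, and the checks are toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


@dataclass
class Pipeline:
    pre_merge: list[Check]
    post_merge: list[Check]
    auto_revert: bool = False


def land(primary: list[str], change: str, pipeline: Pipeline) -> str:
    """Run a change through the pipeline and report what happened to primary."""
    if not all(check(change) for check in pipeline.pre_merge):
        return "rejected"  # never reaches primary
    primary.append(change)
    if all(check(change) for check in pipeline.post_merge):
        return "green"
    if pipeline.auto_revert:
        primary.append(f"Revert {change}")  # revert lands as a new commit
        return "reverted"
    return "red"  # someone must fix forward or revert by hand


# Toy checks standing in for real validations.
def lint(change: str) -> bool:
    return change == change.strip()


def build(change: str) -> bool:
    return "broken" not in change


low_velocity = Pipeline(pre_merge=[lint], post_merge=[build], auto_revert=True)
high_velocity = Pipeline(pre_merge=[lint, build], post_merge=[build])

primary: list[str] = []
print(land(primary, "broken build", low_velocity), primary)
print(land([], "broken build", high_velocity))
```

The same bad change briefly reaches primary and is reverted in the low-velocity configuration, while the heavier pre-merge checks of the high-velocity configuration reject it outright.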
For high-velocity products, heavier pre-merge testing is used in order to reduce the risk of any bad changes making their way into the repository. This is because a single bad change making its way into primary can cause a lot of hassle, such as delayed rollouts and even lowered local development productivity (as a fresh checkout of the product may now result in build failures). Thus, the highest-velocity products tend to run the same validations pre-merge as they do post-merge. This makes the pipeline much slower for these products; however, we are able to mitigate this time loss by using techniques such as caching. As mentioned above, the pre-merge validation is required to limit bad changes from getting into the repository, while the post-merge validation is needed to catch soft conflicts (these occur when two commits that each build successfully and have no merge conflicts with each other nevertheless fail to build when merged together).
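A soft conflict is easiest to see with a tiny example. In the sketch below, a “repository” is a dictionary of file names to Python source, a “build” executes every file and calls its entry points, and a merge conflict means two changes touched the same file. All of these are simplifying assumptions for illustration, and the names are hypothetical.

```python
def merge(base: dict[str, str], *changes: dict[str, str]) -> dict[str, str]:
    """Apply changes to base, refusing when two changes touch the same file."""
    tree = dict(base)
    touched: set[str] = set()
    for change in changes:
        overlap = touched.intersection(change)
        if overlap:
            raise ValueError(f"merge conflict in {sorted(overlap)}")
        touched.update(change)
        tree.update(change)
    return tree


def builds(tree: dict[str, str]) -> bool:
    """Toy build: execute every file, then call the known entry points."""
    namespace: dict = {}
    try:
        for name in sorted(tree):
            exec(tree[name], namespace)
        for entry_point in ("main", "report"):
            if entry_point in namespace:
                namespace[entry_point]()
    except Exception:
        return False
    return True


base = {
    "lib": "def total(xs):\n    return sum(xs)\n",
    "app": "def main():\n    return total([1, 2])\n",
}
# Change A renames total() to add_all() and updates the only existing caller.
change_a = {
    "lib": "def add_all(xs):\n    return sum(xs)\n",
    "app": "def main():\n    return add_all([1, 2])\n",
}
# Change B adds a new file that still calls total().
change_b = {"report": "def report():\n    return total([3])\n"}

print(builds(merge(base, change_a)))            # A alone
print(builds(merge(base, change_b)))            # B alone
print(builds(merge(base, change_a, change_b)))  # A and B together
```

Each change builds on its own and the two touch different files, so neither pre-merge validation nor the merge itself complains. Only a build of the combined result exposes the failure, which is the case that post-merge validation exists to catch.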
LinkedIn follows a “first to the finish” model, and does not preserve the sequence of commits (i.e., we do not do First In First Out). Based on our analysis, First In First Out (FIFO) does not meet our scalability constraints, whether through a simple “one at a time” ordering or through a batching of commits, as discussed above. Even in the case of running validations concurrently, but only merging in order, developers are likely to run into issues where many commits are potentially stuck waiting for a commit that started ahead of them to finish validations (which may have been delayed due to a myriad of reasons, such as lower CPU resources or running extra tests). Essentially, there is no such thing as a perfect CI system at scale as it is extremely difficult to optimize for both speed and correctness. So, as with any other system, there are tradeoffs. Below are the pros and cons of our merging strategy:
|Pros|Cons|
|---|---|
|Malleability for each repository as validations can be customized in pre/post-merge|Ordering is not preserved, FIFO is not used|
|Primary is nearly always green (roughly 5 soft conflicts/year for the highest-velocity repository at LinkedIn)|Higher-velocity repositories have to run validations twice to ensure “bad code” does not get in and stay in|
|High throughput as commits don’t wait for commits pushed before them||
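The throughput difference between in-order merging and “first to the finish” can be illustrated with a small calculation. The sketch assumes validations for all changes run concurrently and ignores rebasing cost; the function names and timings (in minutes) are hypothetical.

```python
Change = tuple[int, int]  # (submit time, validation duration), both in minutes


def merge_times_first_to_finish(changes: list[Change]) -> list[int]:
    """Each change merges as soon as its own validation finishes."""
    return [start + duration for start, duration in changes]


def merge_times_in_order(changes: list[Change]) -> list[int]:
    """Validations run concurrently, but merges must follow submission order."""
    times = []
    previous_merge = 0
    for start, duration in changes:
        # A change cannot merge before the change submitted ahead of it.
        previous_merge = max(previous_merge, start + duration)
        times.append(previous_merge)
    return times


# One slow change submitted just ahead of two quick ones.
changes = [(0, 60), (1, 5), (2, 5)]
print(merge_times_first_to_finish(changes))
print(merge_times_in_order(changes))
```

With in-order merging, both quick changes wait until minute 60 behind the slow one; with first to the finish, they land at minutes 6 and 7, at the cost of not preserving submission order.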
There are other systems out there that use concepts such as machine learning to determine the likelihood of a certain commit (or set of commits) having conflicts. However, we require our workflows to be more malleable and composable so that engineering teams can choose what works best for them. In other words, we can’t use a “one-size-fits-all” solution, as our workflows must be performant for repositories of all sizes and must take into account that a single repository’s workflow may change over time (as the number of developers working on it may grow or diminish). Thus, we use our model to increase extensibility, while allowing for optimizations on code correctness and timeliness based on the needs of the repository. By sharing the pros and cons of our approach to handling merging at scale, we hope to empower organizations to think carefully and strategically about the long-term impacts of code management systems on scaling developer productivity.