
Managing Software Dependency at Scale

Co-authors: Szymon Gizecki, Yu Li, Chinmaya Dattathri, Ethan Hall, Irina Issayeva, and Deep Majumder
 

Introduction

At LinkedIn, we have more than 10,000 separate software codebases, referred to as multiproducts, which represent individual software products developed at LinkedIn. Each multiproduct is made up of various modules, which may have hundreds of transitive dependencies or more. Multiproducts can also be assembled into more complex multiproducts.

This leads to a complex graph of dependencies. Dependency resolution across these graphs is probably the most complex part of the build pipeline that LinkedIn engineers have to deal with on a daily basis. It is very important to us to make the dependency resolution and management experience better for our engineers. The experience needs to be intuitive, consistent, reliable, and fast. In this article, we present our recent efforts in building a dependency management service at LinkedIn to meet these goals.

First, we need to manage these dependencies at the level of binary artifacts instead of using the most recent version of the source code, allowing us to manage the sources specific to each multiproduct as its own Source Code Management (SCM) repository. Build systems like Gradle and Maven manage these binary dependencies at build time in an automated fashion, which makes it possible to build large-scale projects with complex dependencies. Most build systems operate by simply managing or capturing dependencies at the module level; however, here at LinkedIn, we also need to manage our dependencies at the multiproduct level, our unit of abstraction.
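As a concrete illustration of the binary-artifact model, here is a minimal sketch of how a build tool maps a declared coordinate to an artifact in a Maven-layout repository such as Artifactory; the coordinate and repository URL are hypothetical:

```java
// A minimal sketch of how a build tool maps a declared binary dependency
// ("group:name:version") to an artifact path in a Maven-layout repository
// such as Artifactory. The coordinate and base URL are hypothetical.
public final class ArtifactLocator {

    /** Resolves a coordinate to its conventional Maven-repository-layout path. */
    static String artifactUrl(String baseUrl, String group, String name, String version) {
        return String.format("%s/%s/%s/%s/%s-%s.jar",
                baseUrl, group.replace('.', '/'), name, version, name, version);
    }

    public static void main(String[] args) {
        // e.g., a module of a hypothetical "backend" multiproduct
        System.out.println(artifactUrl(
                "https://artifactory.example.com/release",
                "com.example.backend", "backend-api", "2.1.3"));
        // -> https://artifactory.example.com/release/com/example/backend/backend-api/2.1.3/backend-api-2.1.3.jar
    }
}
```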

Given this graph of binary dependencies across our products, when we make changes to a product, we need to ensure that we are not breaking any other products that depend on it. We achieve that through what we call dependency testing.

Once we build and successfully test a product and publish its constituent binary artifacts, we then need to understand the dependencies of these published artifacts across all of the products. We need to answer questions that Gradle and Maven cannot answer, like the following:

  • What is the impact if we remove an end-of-life artifact from Artifactory?
  • What dependency tests should we run after a code change is made?
  • What version of a given module ended up in a deployed war or pex file?
  • Does anyone use a version of a library with a critical bug? (A toy sketch of this kind of reverse lookup follows the list.)
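The last question is essentially a reverse, transitive lookup over the dependency graph. Here is a toy sketch of that lookup over a tiny in-memory graph; all module names are hypothetical, and the real service answers this from its stored, fully resolved graphs:

```java
import java.util.*;

// A toy sketch of a reverse ("who depends on X?") lookup over a small
// in-memory graph. All module names are hypothetical; the real service
// answers this from its pre-computed, fully resolved graphs.
public final class ReverseLookup {

    /** Returns all modules that depend on the target, directly or transitively. */
    static Set<String> dependersOf(String target, Map<String, List<String>> deps) {
        // Invert the edges: dependency -> its direct dependers.
        Map<String, List<String>> rev = new HashMap<>();
        deps.forEach((module, ds) -> ds.forEach(d ->
                rev.computeIfAbsent(d, k -> new ArrayList<>()).add(module)));
        // Breadth-first search from the target through the reversed edges.
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(rev.getOrDefault(target, List.of()));
        while (!queue.isEmpty()) {
            String m = queue.poll();
            if (seen.add(m)) {
                queue.addAll(rev.getOrDefault(m, List.of()));
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // module -> its direct dependencies
        Map<String, List<String>> deps = Map.of(
                "backend-api:1.0", List.of("libfoo:2.3", "libbar:1.1"),
                "frontend:4.2",    List.of("backend-api:1.0"),
                "libbar:1.1",      List.of("libfoo:2.3"));

        // Everything that reaches libfoo:2.3, e.g., to assess a critical bug.
        System.out.println(dependersOf("libfoo:2.3", deps));
    }
}
```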

The legacy solution

Our previous dependency management solution, known as LinkedIn Dependency Service, was built using an off-the-shelf graph database. This solution focused on managing dependencies at the product level and was initially acceptable. However, as the organization and the complexity of our products grew, we quickly discovered that this solution had several major limitations.

Firstly, the coarse granularity of managing dependency information at the product level, without adequate module-level data (where the actual dependencies are created), caused a lot of problems. Build platforms like Gradle and Maven naturally operate at the module level when it comes to resolving version conflicts in dependencies. Therefore, managing dependency data at the product level with these tools can lead to an impedance mismatch unless proper steps are taken.

Secondly, in the legacy solution, the version conflicts, which are resolved at the module level, were not reflected in the recorded product-level dependencies, so multiple potential dependency paths co-existed within the data without indicating which exact version of a dependency was actually in use in a specific situation.
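For context, Gradle's default strategy at the module level is to select the highest requested version, so a fully resolved graph contains exactly one version per module; this is exactly the information the legacy data failed to capture. A minimal sketch of that highest-wins rule (an illustration, not Gradle's actual implementation):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A minimal sketch of highest-version-wins conflict resolution, the default
// module-level strategy in Gradle. This is an illustration, not Gradle's
// actual implementation; versions are assumed to be dotted integers.
public final class ConflictResolution {

    /** Numeric comparison of dotted version strings, e.g., 2.10.1 > 2.3.0. */
    static int compareVersions(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        for (int i = 0; i < Math.max(as.length, bs.length); i++) {
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) return Integer.compare(ai, bi);
        }
        return 0;
    }

    public static void main(String[] args) {
        // Two paths in the graph request different versions of libfoo.
        List<String[]> requested = List.of(
                new String[]{"libfoo", "2.3.0"},   // requested via backend-api
                new String[]{"libfoo", "2.10.1"},  // requested via libbar
                new String[]{"libbar", "1.1.0"});

        Map<String, String> resolved = new TreeMap<>();
        for (String[] req : requested) {
            resolved.merge(req[0], req[1],
                    (current, next) -> compareVersions(current, next) >= 0 ? current : next);
        }
        // Exactly one version per module survives: {libbar=1.1.0, libfoo=2.10.1}
        System.out.println(resolved);
    }
}
```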

The module-level information that was recorded was mainly for Java purposes, because that was the predominant development ecosystem at LinkedIn. With time, though, that has changed, and the legacy system was too limited to cope with the evolving language landscape at LinkedIn. We need dependency graphs to be generated using the dependency resolution strategies best suited to each development platform.

Our preferred build platform, Gradle, works best with Java. When it comes to CocoaPods for iOS development or JS libraries managed by Yarn or npm, we need to be able to deal with dependency graphs that are natively produced by their resolvers. Furthermore, the dependency graphs in the legacy solution lacked the ability to distinguish various classes of dependencies such as build, test, deployment, and runtime.

As a consequence of the limitations of the existing service, the tools and other services relying on it were often very conservative or incorrect, leading to developer frustration due to:

  • False circular-dependency errors

  • False warnings that end-of-lifed libraries were used

  • A lack of ability to distinguish different classes of dependencies and libraries; for example, a test framework could be unnecessarily force-upgraded by the resolution strategy, leading to test failures

Enter the new dependency service

As the company grew, the existing solution for managing dependencies across all products was not scaling well at all. We needed the service to reliably and accurately answer questions about product dependencies, direct or transitive. To address this challenge, we went back to the drawing board and built a more robust dependency management service.

The new service is designed to efficiently and accurately answer questions like:

  • What modules does a product produce?

  • What are all the dependencies of a given module or a product regardless of the type (iOS, Java, Pig Job, etc.)?

  • What is the entire dependency graph for a given library?

Sometimes, we also need to know which modules or products depend on a given external library, and whether a given library dependency is used only during build and test or also at runtime in production. This is especially important when a bad or ill-behaved library needs to be deprecated as soon as possible. The resulting service implementation delivers the following improvements:

  • An accurate, fine-grained, and fully resolved dependency graph at the per-module, per-configuration (Apache Ivy-based) level. It includes toolchain, Java, and container dependencies, which can be leveraged by deployment and other tools.

  • It captures data using build tools like Gradle, accounting for the version conflict resolution done by the tool and for LinkedIn-specific dependency substitution rules. No more multiple confusing versions of the same library in one dependency graph.

  • It supports the importing of dependency graphs agnostic of programming languages and build tools.

  • Last but not least, it exposes a well-defined Rest.li API; a hypothetical query sketch follows this list.
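Below is a hedged sketch of consuming such an API from a JVM client over plain HTTP; the host, resource name, and query parameters are hypothetical and do not represent the service's actual Rest.li contract:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// A hedged sketch of querying a dependency service over HTTP from a JVM
// client. The host, resource name, and query parameters are hypothetical
// and do not reflect the service's actual Rest.li API contract.
public final class DependencyQuery {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // e.g., "which modules depend on libfoo 2.3.0?" as a finder-style query
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://dependency-service.example.com/dependers"
                        + "?q=byModule&group=com.example&name=libfoo&version=2.3.0"))
                .header("Accept", "application/json")
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```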

 


Figure 1: A simplified view of the system architecture of the unified dependency service


A scalable solution

While building this service, we wanted to ensure that the system can scale along with LinkedIn’s engineering organization. Our goal is to be able to capture the entire dependency graph representing all of the products and all of their versions in active use at LinkedIn. Therefore, scalability is needed both in terms of the amount of data we need to manage and the query performance we offer to all the users of this service.

One key consumer is LinkedIn's CI/CD pipeline, which uses the data to resolve dependency graphs while building and testing products. To meet these requirements, we moved from a graph database solution to Espresso, LinkedIn's internal NoSQL document store, which aligns better with the nature of the queries that the system is expected to serve.

In addition, the service leverages the rest of LinkedIn's standardized tech stack, such as Rest.li for the service API and Kafka for messaging, the same stack that powers our member site.

A rich graph model
Any given product or multiproduct at LinkedIn is typically made up of one or more modules, which are the entry points into the dependency graph of that product. The dependency information of a module can be shaped as a Directed Acyclic Graph (DAG) that connects the various other modules that go into building that module. Each of these modules is a node in this graph, and a direct dependency relationship between two modules is an edge between the corresponding nodes. This graph model, on which we base our management approach, is agnostic of any programming language.

We call this graph model a rich model because, besides relationships, it also includes information about configuration-level (Ivy-based) dependencies and toolchain dependencies (the tools and plugins needed to build and test).
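A minimal sketch of such a rich node/edge model, using Java records; the field and type names are illustrative, not the service's actual schema:

```java
import java.util.List;

// A minimal sketch of a rich dependency-graph model: nodes are versioned
// modules, edges carry the configuration and class of the dependency.
// Field and type names are illustrative, not the service's actual schema.
public final class GraphModel {

    /** A node: one versioned module, identified Ivy-style. */
    record Module(String organization, String name, String revision) {}

    /** The class of a dependency, so build/test edges can be told apart from runtime. */
    enum DependencyClass { BUILD, TEST, DEPLOYMENT, RUNTIME }

    /** A directed edge between two modules, qualified by configuration and class. */
    record DependencyEdge(Module from, Module to,
                          String configuration, DependencyClass type) {}

    /** A product's graph: its own modules (graph entry points) plus all edges. */
    record ProductGraph(String product, String version,
                        List<Module> modules, List<DependencyEdge> edges) {}

    public static void main(String[] args) {
        Module api = new Module("com.example.backend", "backend-api", "1.0.0");
        Module foo = new Module("com.example", "libfoo", "2.10.1");
        ProductGraph g = new ProductGraph("backend", "1.0.0",
                List.of(api),
                List.of(new DependencyEdge(api, foo, "compile", DependencyClass.BUILD)));
        System.out.println(g);
    }
}
```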

One of the bigger dependency graphs could have more than 800 nodes and 4,500 edges. Let’s look at an example of a dependency graph in Figure 2.


Figure 2: A Dependency graph example
 

This is a small subset of the dependency graph for a sample multiproduct called backend. One of its constituent modules is called backend API, which in turn depends on modules from other products (three are shown here). The data representing each node, or module, in this graph is based on the Apache Ivy representation of artifact metadata.

Managing the graph
To store this huge dataset of dependency graphs, we evaluated various storage technologies, such as Neo4j, MongoDB, and Cassandra, but concluded that none of them fit our needs. Our earlier implementation used a commercial graph database, but it could not handle the scale of our data. Moreover, because the graph of a product at a specific version is immutable once created, our use cases do not really involve chasing edges in real time, which is the big advantage of a graph database.

On the other hand, the onboarding and operational costs would be too high for MongoDB or Cassandra. Even though we deal with DAG data, we do not really need to manage the data as a graph. Finally, we decided to use Espresso, which was already being used to manage LinkedIn’s large member data.

As previously mentioned, Espresso is LinkedIn's scalable and elastic data-as-a-service infrastructure. We chose to leverage Espresso because it provides simple, document-oriented operations, including key-based access and secondary index-based retrieval. It can handle datasets of our size without a problem.

We pre-compute fully resolved graphs and their mirrors and store them in Espresso. As of today, we are managing the dependency graphs of more than 10,000 multiproducts in this database. That translates to a total of 1.3 billion nodes, 4.8 billion edges, and 5 terabytes of data.

For Java-based products, we leverage Gradle to resolve and generate a dependency graph. For the example in Figure 2, we generate a graph with backend API as the root module. The generated graph is mapped into a set of four tables: Configuration, Dependency, Depender, and DependencyEdge. The primary and secondary keys of the tables are designed for fast search and balanced partitioning.
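As a rough illustration of how such a mapping might be keyed, here is a sketch using Java records; the actual Espresso schemas and key designs are not detailed in this post, so every field below is an assumption:

```java
import java.util.List;

// A hedged sketch of key structures for the four tables. The actual Espresso
// schemas are not described here, so every field below is an assumption
// chosen to illustrate "search by module, partition by product".
public final class TableKeys {

    /** Configuration: one resolved configuration of one module of a product version. */
    record ConfigurationKey(String product, String version,
                            String module, String configuration) {}

    /** Dependency: forward lookup, "what does this module depend on?" */
    record DependencyKey(String product, String version, String module) {}

    /** Depender: the mirror of Dependency, "who depends on this module version?" */
    record DependerKey(String organization, String module, String revision) {}

    /** DependencyEdge: one edge of the graph, keyed by both of its endpoints. */
    record DependencyEdgeKey(DependencyKey from, DependerKey to) {}

    public static void main(String[] args) {
        DependencyKey from = new DependencyKey("backend", "1.0.0", "backend-api");
        DependerKey to = new DependerKey("com.example", "libfoo", "2.10.1");
        System.out.println(List.of(new DependencyEdgeKey(from, to)));
    }
}
```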

Importing the graph
A product's graph gets populated when the product's constituent modules (e.g., backend API) are successfully published to Artifactory. With our CI/CD pipeline pushing 20,000 to 30,000 jobs per day, one obvious goal here is to keep the latency between an Artifactory publish and the corresponding graph update very low.

In a binary, dependency-based system such as ours, for a product's constituent modules to build and publish successfully, all of their dependencies must already be available in Artifactory.

Most of the dependency graphs are imported in near real-time. The background import process is triggered by two separate events, as illustrated in Figure 3; a sketch of an importer's event loop follows the list.

  • The Artifactory Event Publisher emits an event when a new artifact is published, and the Artifactory Event Listener consumes such events and re-emits import events with the exact coordinates of the modules to be imported. The Dependency Importers consume these events, build the graphs, and then update Espresso.

  • An independent Event Scheduler sends cron-style events to trigger the re-scanning of Artifactory for any missing artifacts.
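Since Kafka is the messaging layer in our stack, an importer's event loop might look roughly like the following sketch; the topic name, event payload format, and the graph/storage calls are all hypothetical (the sketch also assumes the kafka-clients library):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// A hedged sketch of a Dependency Importer's event loop. The topic name,
// event payload format, and graph/storage calls are all hypothetical; the
// real importers consume LinkedIn-internal events and write to Espresso.
public final class DependencyImporter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.example.com:9092"); // hypothetical
        props.put("group.id", "dependency-importers");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("artifact-import-events")); // hypothetical topic
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // Assume the value carries exact module coordinates, e.g.,
                    // "com.example.backend:backend-api:1.0.0".
                    String[] gav = rec.value().split(":");
                    importGraph(gav[0], gav[1], gav[2]);
                }
            }
        }
    }

    /** Placeholder: resolve the module's full graph and upsert it into storage. */
    static void importGraph(String group, String name, String version) {
        System.out.printf("importing graph rooted at %s:%s:%s%n", group, name, version);
    }
}
```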


Figure 3: Dependency data ingestion workflow
 

Results to date

The new dependency service has been up and running since the beginning of 2018. At the time of writing this post, we have transitioned all queries from the legacy service to the new service, which represents the entire LinkedIn dependency graph. There are currently 12 API servers and more than 30 importers deployed in four regions. They currently serve up to one million dependency-related queries every day.

The biggest benefit of this new implementation comes from the high quality of the data, backed by a highly scalable data store that allows us to manage and serve the full product dependency graph at LinkedIn. We have already seen the number of false positives for circular-dependency errors go down by 60%, which significantly improves our developer productivity.

Moving ahead

Multiproduct-based software development at LinkedIn is built around the concept of dependency management at a very fundamental level. The dependency service is therefore a foundational piece of LinkedIn's CI/CD pipeline. The new service creates a significant opportunity for us by allowing us to implement many features that were very hard or impossible to implement before.

Some of our upcoming capabilities that leverage the high-quality rich data from this new service are as follows:

  • A Dependency Explorer UI to visualize the dependency graph and identify differences between versions

  • Improvements to dependency testing to allow more granularity in how we determine consumers

  • A better ability to automatically perform dependency updates based on actual module usage

  • Smarter validations, so that there are no more noisy warnings that a product uses an end-of-lifed library when that library would not actually end up on a classpath

Acknowledgments

A number of teams and individuals at LinkedIn have supported us through the design and implementation cycle of this project. The team would especially like to thank Kumar Pasumarthy and the Espresso team for providing all the technical support and guidance along the way.