
Who Depends On Me? Serving Dependency Queries at Scale

Co-authors: Yu Li, Szymon Gizecki, and Chinmaya Dattathri

To ensure we have significant flexibility in how our teams collaborate, our trunk-based engineering development workflow manages dependencies at the binary level instead of the source level. This requires very efficient and sophisticated management of the resulting dependency graph; we discussed our approach in last year’s blog post, Managing Software Dependency at Scale, which described our dependency service. Today, we want to provide an update on how we use this service to answer one of our most common dependency queries, “Who depends on me?”, in a scalable manner. With a new, more efficient algorithm, we’ve scaled this type of query by 1,000X to support more than 25,000 software components at LinkedIn.

Let’s start with the two most common use cases for this query:

  • Compatibility test: At LinkedIn, we implemented a compatibility test in our CI/CD pipeline to ensure that upstream libraries respect semantic versioning, meaning they do not introduce backward-incompatible changes without a major version bump. For any code change submitted to an upstream library, the compatibility test needs to find all downstream consumers that currently consume the same major version and run their tests against the submitted change. This is one of the foundational pillars of scalable software development at LinkedIn.

  • Identifying consumers of critical libraries: We often have horizontal initiatives to upgrade critical libraries across many software components. For example, if we identify that a library has a bug or security vulnerability, we need a simple and fast way to find all currently impacted consumers that still depend on the affected library.

So, how do we answer this question of “Who depends on me?” in an efficient manner? With our dependency graph, we can very quickly find all software components that a module depends on or all software components that consume a given module. However, is that sufficient to provide the real answer to the question of which other consumers are currently using a library? The answer to this question is actually more nuanced than it seems on the surface.

In this instance, “currently used” is the key and makes the query more interesting. What exactly does “current” mean?

At LinkedIn, we use the concept of Last Known Good (LKG), which refers to the latest published version of an artifact that has passed all the validation steps. In most cases, it is the same as the last commit (HEAD) of the trunk, but in the case that HEAD fails a test or is identified as “not good” after being published, LKG could be slightly older than HEAD. The concept of LKG plays an important role in how we define the semantics of “currently depends,” as described below.

Let’s say a given library A with a set of published versions (V1, V2, ..., Vn) has a consumer B with a set of published versions (U1, U2, ..., Um) that depend on certain versions of A. Now, we define B as “currently depending” on A only if the LKG of B depends on any version of A. If, on the other hand, none of the versions (Ui) of B that depend on A are the LKG (i.e., Ui ≠ LKG), then we would say that B is not “currently depending” on A and A is not currently impacting B.

Therefore, the question becomes: “Who are all the consumers whose LKG version depends on any version of A?” Queries answering this question are expensive because, unlike a query for the consumers of A at one specific version, no version of A is specified; at the same time, the result should include only the consumers at their LKG version.


Figure 1 illustrates an example of a software library A and its consumers B, C, D.

To answer such a question for A, the input of the query is only A, and the expected result is a list of tuples: ((B, lkg_b, 1.1), (C, lkg_c, 1.2)), which means B:lkg_b depends on A:1.1 and C:lkg_c depends on A:1.2. D:3.0 also depends on A:1.2, but 3.0 is not the LKG of D, so D is not in the result.

This turns out to be a very common query, needed both by the compatibility tests in our CI/CD pipeline and when identifying the consumers of critical libraries, as described at the beginning of this post.

Inefficient approach

Since our dependency service can return all consumers of software component A at a specified version, if we want to compute the result of our query on the fly, one option for the algorithm would be:

    For v in all versions of A
        For C:u in all consumers of A:v
            If u is the LKG of C
                Add (C, u, v) to the result set

The time complexity of this algorithm would be O(n×m) where n is the number of versions of A and m is the average number of consumers of every A:v.
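To make this concrete, below is a minimal Java sketch of this on-the-fly option. The DependencyService interface and all of its method names (allVersions, consumersOf, dependenciesOf, lkgOf) are hypothetical stand-ins for our dependency service’s API, not actual code; dependenciesOf is unused here but will be reused in the later sketches.

    import java.util.ArrayList;
    import java.util.List;

    // A software component at a specific version, e.g., A:1.1.
    record ModuleVersion(String name, String version) {}

    // One result tuple: consumer, its LKG version, and the consumed version of the library.
    record ConsumerAtLkg(String consumer, String consumerLkg, String dependencyVersion) {}

    // Hypothetical client for the dependency service; method names are illustrative.
    interface DependencyService {
      List<String> allVersions(String library);
      List<ModuleVersion> consumersOf(String library, String version);
      List<ModuleVersion> dependenciesOf(String component, String version);
      String lkgOf(String component);
    }

    class NaiveConsumerQuery {
      private final DependencyService service;

      NaiveConsumerQuery(DependencyService service) {
        this.service = service;
      }

      // O(n×m): n versions of library A, m consumers of each A:v on average.
      List<ConsumerAtLkg> consumersAtLkg(String library) {
        List<ConsumerAtLkg> result = new ArrayList<>();
        for (String v : service.allVersions(library)) {
          for (ModuleVersion c : service.consumersOf(library, v)) {
            // Keep C:u only if u is the LKG of C.
            if (c.version().equals(service.lkgOf(c.name()))) {
              result.add(new ConsumerAtLkg(c.name(), c.version(), v));
            }
          }
        }
        return result;
      }
    }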

If we want to precompute the results and cache them to be able to answer our query quickly, then the complexity would become O(n×m×p), where p is the number of software components. At LinkedIn, p is currently more than 25K, so n×m×p ≈ 2.2 billion. It is obvious that we need a more efficient algorithm.

Efficient approach

The inverted index technique is widely used in search engines to map a term to its locations across all documents. We already use a similar technique in the dependency backend service to map a library at a specific version to all of its consumers, as explained below:

Let’s define the result map as a mapping between:

    (dependency name, dependency version) ⇒ list of tuples (consumer name, consumer version)

When importing the dependency graph of P:u, we run an algorithm like the following:

    For dependency D:v in all dependencies of P:u
        merge (P, u) into the value of (D, v) in the result map

After the inverted index of consumers is created, to answer the question of who the consumers are for any library D at version v, we simply look up the value of (D, v) in the result map.
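A minimal Java sketch of that inverted index, reusing the illustrative ModuleVersion type from above, could look like this:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class VersionedConsumerIndex {
      // (dependency name, dependency version) ⇒ list of (consumer name, consumer version)
      private final Map<ModuleVersion, List<ModuleVersion>> index = new HashMap<>();

      // Run while importing the dependency graph of consumer P:u.
      void importDependencyGraph(ModuleVersion consumer, List<ModuleVersion> dependencies) {
        for (ModuleVersion dep : dependencies) {
          index.computeIfAbsent(dep, k -> new ArrayList<>()).add(consumer);
        }
      }

      // Consumers of library D at version v: a single map lookup.
      List<ModuleVersion> consumersOf(String library, String version) {
        return index.getOrDefault(new ModuleVersion(library, version), List.of());
      }
    }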

Now we need to go further and build the mapping from a library, without specifying a particular version, to all of its consumers at their LKG.

Let’s look at Figure 1 again. We see two types of dependencies.

    Consumers at non-LKG: B:1.0 → A:1.0, C:2.0 → A:1.1, D:3.0 → A:1.2

    Consumers at LKG: B:lkg_b → A:1.1, C:lkg_c → A:1.2

We won’t give much attention to dependencies whose consuming versions are not the LKG, but we do care about the others. By looking solely at the dependencies of each software component at its LKG, we can derive the result map: A ⇒ ((B, lkg_b, 1.1), (C, lkg_c, 1.2)), where the mapping is:

    dependency name ⇒ list of tuples (consumer name, consumer LKG version, dependency version)

Therefore, we can have a more efficient algorithm, such as the following:

    For P in all software components
        For dependency D:v in all dependencies of P:LKG
            merge (P, LKG, v) into the value of D in the result map

After the inverted index is created, to answer the question of consumers at LKG for any library D, we simply look up the value of D from the result map.
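Here is a minimal Java sketch of this LKG inverted index, reusing the illustrative types from the earlier sketches. The thread-safe structures and the removeComponent method anticipate the parallel build and the incremental updates discussed in the next section; both are our assumptions for illustration, not a description of the production code.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class LkgConsumerIndex {
      // dependency name ⇒ tuples of (consumer, consumer LKG version, dependency version)
      private final Map<String, List<ConsumerAtLkg>> index = new ConcurrentHashMap<>();

      // O(d) per component: merge the dependencies of P at its LKG into the index.
      void indexComponent(String component, String lkg, List<ModuleVersion> lkgDependencies) {
        for (ModuleVersion dep : lkgDependencies) {
          index.computeIfAbsent(dep.name(), k -> Collections.synchronizedList(new ArrayList<>()))
               .add(new ConsumerAtLkg(component, lkg, dep.version()));
        }
      }

      // Drop all tuples previously indexed for a component (used when its LKG changes).
      void removeComponent(String component, List<ModuleVersion> previousDependencies) {
        for (ModuleVersion dep : previousDependencies) {
          List<ConsumerAtLkg> tuples = index.get(dep.name());
          if (tuples != null) {
            tuples.removeIf(t -> t.consumer().equals(component));
          }
        }
      }

      // Who currently depends on library D? A single map lookup, no version needed.
      List<ConsumerAtLkg> consumersAtLkg(String library) {
        return index.getOrDefault(library, List.of());
      }
    }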

With this algorithm, the time complexity to pre-compute the results of all software components is reduced from O(n×m×p) to O(d×p), where d is the average number of dependencies of a software component at LKG, and p is the total number of software components at LinkedIn. In practice, it is a reduction from 2.2 billion to 2.4 million, almost 1,000X faster!

Implementation

To store the inverted index, we could build an in-memory cache or persist it to a distributed cache service. If the cache can be rebuilt quickly enough, we can simply build it in memory whenever the service is restarted.

The advantages of the in-memory cache are:

  1. No dependency on an external cache service

  2. Faster response time

To achieve the fast rebuilding of the entire cache from scratch when the service is restarted, we must parallelize the work.


Figure 2 illustrates the pipeline to build the cache.

We created a pipeline that consists of two query queues and two pools of parallel workers. The input to the LKG query queue is the name of a software component, while the input to the dependency query queue is a pair of (software component name, LKG version). The LKG query workers parallelize the queries to get the LKG version of each software component and pipe each result immediately into the dependency query queue, where it is picked up by the dependency query workers in parallel. The results of the dependency queries are then piped into the inverted dependency index builder, which builds the inverted dependency index cache in parallel.
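The sketch below shows one way to wire up such a pipeline in Java, reusing the hypothetical DependencyService and LkgConsumerIndex from the earlier sketches; CompletableFuture stages stand in for the two query queues, and the pool sizes are illustrative.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class CacheRebuildPipeline {
      private final DependencyService service;
      private final LkgConsumerIndex index;
      // One worker pool per stage, mirroring the two query queues in Figure 2.
      private final ExecutorService lkgWorkers = Executors.newFixedThreadPool(32);
      private final ExecutorService dependencyWorkers = Executors.newFixedThreadPool(32);

      CacheRebuildPipeline(DependencyService service, LkgConsumerIndex index) {
        this.service = service;
        this.index = index;
      }

      void rebuild(List<String> allComponents) {
        CompletableFuture<?>[] tasks = allComponents.stream()
            // Stage 1: resolve each component's LKG version in parallel.
            .map(p -> CompletableFuture
                .supplyAsync(() -> service.lkgOf(p), lkgWorkers)
                // Stage 2: fetch the dependencies at LKG and merge them into the index.
                .thenAcceptAsync(
                    lkg -> index.indexComponent(p, lkg, service.dependenciesOf(p, lkg)),
                    dependencyWorkers))
            .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(tasks).join(); // block until the full cache is built
      }
    }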

As of today, we can rebuild the entire inverted index cache in memory, consisting of 19K keys and 2.4M tuples, in about one minute. Once the cache is built, the dependency service periodically monitors the change of LKG of any software component, and updates the affected cache entries accordingly. Because the number of software components whose LKG changes between each cache update cycle is relatively small, this system of updating the cache is much faster than rebuilding the entire cache, typically taking less than 10 seconds.
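The incremental update path might look like the following sketch. Keeping each component's last indexed dependency list, so that stale tuples can be removed when its LKG moves, is our assumption about the bookkeeping required; the actual implementation may differ.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class IncrementalCacheUpdater {
      private final DependencyService service;
      private final LkgConsumerIndex index;
      // component ⇒ the dependency list indexed for its previous LKG
      private final Map<String, List<ModuleVersion>> lastIndexed = new ConcurrentHashMap<>();

      IncrementalCacheUpdater(DependencyService service, LkgConsumerIndex index) {
        this.service = service;
        this.index = index;
      }

      // Invoked by the periodic monitor for each component whose LKG has changed.
      void onLkgChange(String component, String newLkg) {
        List<ModuleVersion> previous = lastIndexed.get(component);
        if (previous != null) {
          index.removeComponent(component, previous); // drop tuples for the old LKG
        }
        List<ModuleVersion> dependencies = service.dependenciesOf(component, newLkg);
        index.indexComponent(component, newLkg, dependencies);
        lastIndexed.put(component, dependencies);
      }
    }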

Generalized framework

There are, of course, other types of consumers than those at LKG. Sometimes, we care more about the consumers that are actively deployed, which may not be the ones at LKG. Can our design of building the inverted index cache for the consumers at LKG easily adapt to such applications? The simple answer is, “yes.” 

The design can be generalized further to support any type of consumer that is defined by two functions: F() and D().

Let’s define F(p) = (v1, v2, ..., vn), where p is a software component and each vi is a version selected by the function. Examples of F(p) include LKG(p), Deployed(p), etc.

Let’s define the function to get the dependencies of a software component as D(p, v) = ((d1, v1), (d2, v2), ..., (dn, vn)), where p is a software component, v is a version of this software component, and (di, vi) is a dependency module di at version vi.

To generalize the design, the workers of the first queue need to implement the function F(p), while the workers of the second queue need to implement the function D(p, v).


Figure 3 illustrates our generalized pipeline design.
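As one illustration of the generalization, the builder below takes F and D as plugged-in functions and reuses the same merge logic as before; the names are ours, and the index type is the one from the earlier sketch.

    import java.util.List;
    import java.util.function.BiFunction;
    import java.util.function.Function;

    class GeneralizedIndexBuilder {
      private final Function<String, List<String>> versionsFn;              // F(p)
      private final BiFunction<String, String, List<ModuleVersion>> depsFn; // D(p, v)
      private final LkgConsumerIndex index; // same merge/lookup structure as before

      GeneralizedIndexBuilder(Function<String, List<String>> versionsFn,
                              BiFunction<String, String, List<ModuleVersion>> depsFn,
                              LkgConsumerIndex index) {
        this.versionsFn = versionsFn;
        this.depsFn = depsFn;
        this.index = index;
      }

      void build(List<String> allComponents) {
        for (String p : allComponents) {
          // Stage 1 workers implement F(p); stage 2 workers implement D(p, v).
          for (String v : versionsFn.apply(p)) {
            index.indexComponent(p, v, depsFn.apply(p, v));
          }
        }
      }
    }

The consumers-at-LKG cache is then one instantiation, with F(p) returning the single LKG version and D delegating to the dependency graph query; a consumers-at-deployed-versions cache would simply plug in a different F.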

Moving ahead

Finally, with the introduction of the consumers-at-LKG cache, our library owners are able to track the use of internal or external libraries with a simple and fast query to our dependency service. With the success of this implementation, we are now moving to the generalized implementation described above, to support other similar queries that our users and our CI pipeline need answered. LinkedIn’s engineering development requires very efficient and sophisticated dependency management; this work is just one part of that journey of continuous improvement.

Acknowledgments

Work on this feature would not be possible without the support from Irina Issayeva and Deep Majumder. We would also like to thank Nikolai Avteniev and Yiming Wang for their effort to review and give valuable feedback to this blog post.