Productivity at scale: How we improved build time with Gradle build cache
October 1, 2019
Editor's Note: This is the second in a series of posts describing how we improved productivity at scale—both in terms of lines of code and number of engineers—at LinkedIn. In our first post of the #ProductivityAtScale series, we shared details on how we improved build time by 400%. This post covers how we continue to improve productivity with Gradle build cache.
At LinkedIn, developer happiness and productivity are key focus areas in all tooling solutions. We strive to build solutions that improve productivity, but this can quickly become challenging given the growth rate of our product codebases. For example, LinkedIn’s largest Play application is an API service, named Voyager-API, that supports traffic from the LinkedIn website and mobile clients. It handles over 130,000 requests per second, and this scale only continues to grow—the number of lines of code for this application grew by 50% over a single year.
Enter Gradle build cache, which enabled us to significantly boost developer productivity and happiness. For LinkedIn’s two largest Java-based applications, the Voyager-API service and the LinkedIn Android application, we saw a drastic 30% reduction in build time. Voyager-API also saw an additional 40% reduction in test execution time and a 25% reduction in Integrated Development Environment (IDE) refresh time, which is the time spent integrating changes and downloading the latest libraries.
In this post, we’ll look at our journey with Gradle build cache.
What is build cache?
Per Gradle documentation, “The Gradle build cache is a cache mechanism that aims to save time by reusing outputs produced by other builds. The build cache works by storing (locally or remotely) build outputs and allowing builds to fetch these outputs from the cache when it is determined that inputs have not changed, avoiding the expensive work of regenerating them.”
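For readers unfamiliar with the mechanics, a local build cache can be enabled with a few lines of configuration. This is a generic sketch, not LinkedIn’s actual setup; the directory and retention values are illustrative:

```groovy
// settings.gradle — a minimal local build cache configuration
buildCache {
    local {
        enabled = true
        // Optional: defaults to a directory in the Gradle user home
        directory = new File(rootDir, 'build-cache')
        // Evict entries that have not been used recently
        removeUnusedEntriesAfterDays = 7
    }
}
```

Caching is then activated per invocation with `--build-cache`, or for every build by setting `org.gradle.caching=true` in `gradle.properties`.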
Build cache server: In 2018, we began using Gradle Enterprise as the cache server. Gradle Enterprise offered very useful features that helped us optimize builds and leverage caching. We’ve operationalized Gradle Enterprise by wrapping it as a LinkedIn service, which enables fully automated installation, upgrades, and recovery at scale, plus real-time service monitoring and alerting.
Enabling a cache server at LinkedIn was not sufficient by itself to reap the benefits of Gradle’s build-cache technology. We needed to optimize builds and fix Gradle plugins and tasks so that they were ready for caching. That work required significant insight into Gradle builds for all of our software projects. We used Build Scans and an in-house build metrics system to optimize builds for caching.
Build scans: A build scan is a shareable record of a build that provides insights into what happened and why. At LinkedIn, we heavily depend on build scans. A build scan is produced with every build and developers use it to understand build performance, debug build failures, and troubleshoot build-cache misses.
Build Metrics System: The build metrics system is an internal system designed to provide insight into every Gradle task across multiple builds, and all software projects at LinkedIn. Gradle Enterprise is great for visualizing and investigating individual builds. However, with more than 100,000 Gradle builds per day, we needed a “big data” system to record, track and analyze build performance and cacheability. We used Gradle Enterprise’s Export API and piped all of our build data into our metrics system.
LinkedIn’s largest Play application, Voyager-API, was one of the first services to start using Gradle build cache for every build. Last year, we shared our journey to accelerate build time for this Play application by 4X. We continued to invest in improving build speed to keep up with the growing codebase. It took us three to four quarters to get to the point where more than 1,000 software projects actively use the build cache in every build and benefit from an average speed improvement of 30% (when compared to builds that do not use the cache).
The following were some of the biggest challenges we faced during implementation:
Debugging cache misses: One of the key challenges we faced was debugging and identifying the causes behind our cache misses. Even though Gradle provides debugging tools, we had to put on our detective hats and be innovative. One interesting case was how we identified a race condition in an internal library that used the JacksonDataCodec pretty printer. Even though this internal library generated the same content every time, indentation would vary due to the race condition, producing a different cache object for every execution.
Rewriting existing plugins: LinkedIn has used Gradle since 2011, so many older Gradle plugins and tasks predated build caching and were not developed with it in mind. Per Gradle documentation, Gradle fingerprints a task’s inputs and outputs before execution to determine whether the task is up to date or its outputs can be fetched from the cache. The fingerprint contains the paths of input files, the set of output files, and a hash of the contents of each file. If the new fingerprint matches the previous one, Gradle assumes the outputs are up to date and skips the task; otherwise, it executes the task. We encountered plugins that generated files containing timestamps, SCM revision numbers, and absolute paths. Because the file contents changed on every execution, these produced a different cache object every time, and the plugins had to be rewritten.
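A minimal sketch of the pattern we moved our plugins toward, assuming Gradle’s standard task annotations (the task name and properties here are illustrative, not LinkedIn’s actual plugins; Gradle’s default build-script imports cover these types):

```groovy
// build.gradle — a task opted in to the build cache
@CacheableTask
class GenerateDocs extends DefaultTask {
    @InputFiles
    // Hash relative paths, so the cache key does not depend on
    // where the project happens to be checked out on disk
    @PathSensitive(PathSensitivity.RELATIVE)
    FileCollection sources

    @OutputDirectory
    File outputDir

    @TaskAction
    void generate() {
        // Write deterministic content only: no timestamps, SCM
        // revisions, or absolute paths, or every execution would
        // produce a different cache object
        new File(outputDir, 'index.txt').text =
            sources.files.collect { it.name }.sort().join('\n')
    }
}
```

For inputs that legitimately contain volatile files (for example, a generated properties file with a timestamp), Gradle’s `normalization { runtimeClasspath { ignore(...) } }` block can exclude them from the cache key instead of rewriting the producer.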
Limited documentation: When we started using Gradle build cache in 2018, documentation was limited, especially around the causes of cache misses and how to fix them. This meant we spent a lot of time debugging root causes instead of implementing actual fixes. Fortunately, we were able to work closely with the Gradle core team to resolve them, and have posted many of the interesting bugs/issues we encountered. Over time, Gradle (and Gradle Enterprise) has improved its documentation and it now provides very useful debug information in the build scans.
Zero-interruption development: Another interesting challenge was integrating changes without disrupting a large development team that ships to production multiple times per day. Hundreds of engineers actively work on the codebases behind LinkedIn, so we used step-by-step ramping strategies to roll out changes safely.
Overall, this was a complex, cross-organizational project that involved working with the owners of many codebases, understanding their build automation, fixing dozens of different Gradle plugins, and working with individual teams to migrate their custom build scripts.
The following are some of the key outcomes of the Gradle build-cache project:
Faster builds: Reusing the Gradle task outputs from remote and local executions meant engineers only spent time building the small delta of code changes. Our CI builds write to and read from the remote cache. Local builds read from the remote cache server and also write to and read from the local cache. This dramatically improved build times: as of this post, more than 4,500 services are using Gradle build cache, and we save more than 800 hours every day in CI and local builds.
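The read/write policy described above can be expressed in a few lines of `settings.gradle`. This is a hedged sketch, not our production configuration; the URL is a placeholder and the `CI` environment-variable check is an illustrative way of detecting a CI build:

```groovy
// settings.gradle — CI populates the shared cache; developer
// machines only read from it, but keep a local read/write cache
boolean isCiBuild = System.getenv('CI') != null   // illustrative CI detection

buildCache {
    local {
        enabled = !isCiBuild      // local cache for developer builds only
    }
    remote(HttpBuildCache) {
        url = 'https://build-cache.example.com/cache/'  // placeholder URL
        push = isCiBuild          // only CI writes to the shared cache
    }
}
```

Gating `push` on CI keeps the shared cache trustworthy, since only controlled, reproducible environments ever write to it.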
Productivity boost: Reusing the Gradle task outputs improved productivity across the board. Build times improved by 30%, test execution times by 40%, and IDE refresh times by 25% on some of our applications. These productivity gains helped us bring delightful features and fixes to our engineers more quickly. The developer happiness that resulted from this has been echoed multiple times throughout our internal surveys.
Upstream contributions: LinkedIn was able to submit upstream fixes back to the open source Gradle repository, such as making PlatformScalaCompile cacheable (gradle/gradle#3804) or caching the JavaExec task by Java version (gradle/gradle#6711). We hope these fixes will help anyone using Gradle build cache.
Reduced debugging time: We also developed Gradle plugins to assist in debugging cache performance. These plugins helped us identify the most common cacheability issues, like overlapping outputs or changing inputs/classpath. This saved us a lot of time in debugging. Recently, Gradle (and Gradle Enterprise) has included similar debug information and reporting in build scans. We also developed integration fixtures to test the cacheability of individual tasks that have helped to sustain the cacheability of tasks.
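Our integration fixtures are internal, but the general approach can be sketched with Gradle TestKit: build once to populate the cache, wipe the outputs, rebuild, and assert the task is served from the cache. The project directory and task name below are hypothetical:

```groovy
import org.gradle.testkit.runner.GradleRunner
import static org.gradle.testkit.runner.TaskOutcome.*

// First run executes the task and pushes its output to the cache
def firstRun = GradleRunner.create()
    .withProjectDir(testProjectDir)             // hypothetical fixture directory
    .withArguments('--build-cache', 'generateDocs')
    .build()
assert firstRun.task(':generateDocs').outcome == SUCCESS

// Delete the outputs so the task cannot be considered up to date
new File(testProjectDir, 'build').deleteDir()

// Second run should be resolved entirely from the cache
def secondRun = GradleRunner.create()
    .withProjectDir(testProjectDir)
    .withArguments('--build-cache', 'generateDocs')
    .build()
assert secondRun.task(':generateDocs').outcome == FROM_CACHE
```

A fixture like this catches cacheability regressions (for example, a newly introduced timestamp in an output) the moment they land, instead of weeks later as a degraded cache hit rate.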
Gradle build cache has reduced our build time tremendously. As of this post, the cache hit rate is 60% for most of our applications. Our next areas of focus are to:
Improve the build cache hit rate by further enhancing Gradle plugins.
Improve build cache shareability across OS types. Our engineers develop on Linux and macOS interchangeably, and right now we see a large difference in the cache hit rate between the two operating systems.
This project would not be possible without the support and contribution of the Build Tools and Engineering Productivity team at LinkedIn: Szczepan Faber, Yiming Wang, Chong Wang, Devi Sridharan, Vinyas Maddi, Theodore Ni, Mihir Gandhi, Deep Majumder, and Irina Issayeva. We would also like to thank Evgeny Barbashov and Jie Li for their contributions to Gradle Enterprise integration and build metrics system. And, of course, a huge shout out to the amazing folks at Gradle, Inc. for their brilliant work and continued support.
Stay tuned! Part three in this series will describe how to set up Gradle build cache for Android applications. If this sounds interesting to you, we’re looking for great engineers to join our Productivity & Build tools team.