
LinkedIn’s journey to Java 11

Introduction

At LinkedIn, we are committed to delivering a best-in-class platform experience for our members. One of the technologies we use to do that is Java, an object-oriented programming language for building software that runs on multiple platforms. We are a huge consumer of Java, running over 1,000 Java applications on 320,000+ hosts. Based on the promised performance improvements of the G1 garbage collector (G1GC) and other well-documented benefits captured in this article, we decided to begin our journey from Java 8 to Java 11.

In this blog post, we will discuss LinkedIn’s migration to Java 11 from preparation to completion analysis. We will cover the performance testing and the results received before and after the migration, the timeline used, and the lessons we learned along the way.

About Java

Java has evolved a lot over the years. Aside from the constant development of new features, there have also been major changes to the release process. Originally, a new major version was released every couple of years or so. After Java 9, the release schedule changed to a new major version every six months. There is also a Long Term Support (LTS) version of Java designated every few major versions; Java 8 was the last LTS version before Java 11 was released in September 2018.

Preparation

LinkedIn started to look into Java 11 in late 2018. At the time, Java 9, 10, and 11 were not yet widely adopted in the community. As an anecdote, speakers in some sessions at the Oracle Code One conference in late 2019 asked attendees whether their products were using Java 9 or higher, and only about 20% of the room said that they were; few major companies had adopted Java 11 either.

At LinkedIn, applications use one of four main types of frameworks: Jetty, Play, Samza, or Hadoop. Of these, only Jetty was at a version (within LinkedIn) that was compatible with Java 11, so this blog will mainly cover the migration of our Jetty apps. Thankfully, Jetty apps alone account for 1,000+ microservices and 60%+ share in production.

Preparation can be split into three parts: upgrading the build framework, doing performance testing, and automating the migration.

Build framework
The first step of preparation was to have our code built with Java 11. At LinkedIn, we use a multi-repo strategy so each application and library is built separately and has its own repository. This means that over 2,000 repos (applications + libraries) needed to have their builds changed to Java 11. 

Our build system needed to be upgraded to Gradle 5 or above to be compatible with Java 11. Thankfully, our build tools team was well aware of this requirement and had already begun migrating builds to Gradle 5.

After seeing that it was possible to get our applications up and running with Java 11, our team quickly transitioned to early adopter testing. 

Performance testing
We picked 20 applications as early adopters and tested them with Java 11, mostly with the G1 garbage collector, since G1 is used in 80% of our applications (with CMS making up the majority of the remaining 20%). Our selection criteria prioritized applications with larger heap sizes, higher host counts, and higher QPS, and we selected both stateful and stateless applications. Overall, the results were positive, with no application seeing performance degradation after upgrading to Java 11. The best cases showed performance gains (in terms of latency and throughput) of up to 200%. These gains were mostly seen with G1, Shenandoah, and ZGC; CMS performance did not change. It is one thing to hear about improved JVM performance, but another thing entirely to witness it first-hand across our wide range of applications.

Here are some results of performance tests that we conducted:

[Figure: Drastic improvements in latency and throughput of an Apache Pinot workflow]

[Figure: Performance compared to Java 8 G1 for Brooklin. EBWR, EBPR, and EPR are all throughput metrics. Java 11 G1, ZGC, and Shenandoah all perform extremely well in comparison to Java 8 G1.]

As a side note here, we tested some applications with ZGC and Shenandoah, which were experimental and not for production usage, and saw that some applications performed exceptionally well with these collectors. These results helped us confirm our desire to move to Java 11.

Automation

In addition to changing the build processes for over 2,000 repositories, we needed to change the runtime for over 1,000 applications, which is a lofty goal for a very small working group. Thankfully, automation really saved us here! We were able to automate many of the changes needed to migrate to Java 11. Although automation did not give us a 100% success rate, every repository it migrated made the remaining workload more manageable.

After some minor changes to our infrastructure, it was possible to change repository build systems to use Java 11. We then were able to trigger mass Java 11 builds in a test environment to find out what issues needed to be addressed. This was, without a doubt, one of the most important features and learnings that we had in the Java 11 migration. This testing allowed us to identify a plethora of edge cases as well as several major challenges. Here are some of the major challenges that we identified:

JDK cross-compatibility issues
The first and most pressing issue was cross-compatibility between Java 8 and Java 11. We realized it would take multiple years to complete this upgrade across the company, which meant a long transition period in which both JDK 8 and JDK 11 would be in use. LinkedIn uses multi-repo source control, so we needed to ensure that every repository could work with both Java 8 and Java 11 upstreams. We call this cross-compatibility rather than backwards-compatibility because the problem went beyond bytecode level: we also found cases where code compiled on Java 8 failed to run properly on Java 11. These cases include the removal of the JavaEE libraries, a change in the default class loader type, and stricter class casting in Java 11.

We found that there were too many of these issues to address individually, so we decided to use the “--release 8” flag, which makes the Java 11 compiler emit Java 8 level bytecode and restricts code to the Java 8 APIs. The downside is that new APIs and language features, such as Set.of() and the var keyword, cannot be used. The upside is that we were able to maintain compatibility between Java 8 and 11 much more easily, a tradeoff that the team unanimously agreed on.
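To illustrate what the flag enforces, here is a minimal sketch that uses the standard javax.tools compiler API to compile the same small snippet at different release levels. The class name ReleaseFlagSketch, the helper compiles, and the Demo snippets are illustrative assumptions, not LinkedIn's build tooling, and the sketch assumes it runs on a full JDK 11 or later.

```java
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ReleaseFlagSketch {

    /** In-memory source file, so the sketch does not need sources on disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns true if the snippet compiles with "--release <release>". */
    static boolean compiles(String release, String className, String code) {
        // getSystemJavaCompiler() returns null on a JRE-only install; a JDK is assumed here.
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        Path outDir;
        try {
            outDir = Files.createTempDirectory("release-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> options = Arrays.asList("--release", release, "-d", outDir.toString());
        List<JavaFileObject> units =
                Collections.singletonList(new StringSource(className, code));
        // The no-op listener swallows diagnostics; only success or failure is reported.
        return compiler.getTask(null, null, diagnostic -> { }, options, null, units).call();
    }

    public static void main(String[] args) {
        String usesJava9Api =
                "import java.util.Set; class Demo { Set<String> s = Set.of(\"a\"); }";
        String java8Only =
                "import java.util.Set; import java.util.Collections;"
                + " class Demo { Set<String> s = Collections.singleton(\"a\"); }";
        System.out.println("Set.of() with --release 8:  " + compiles("8", "Demo", usesJava9Api));
        System.out.println("Java 8 API with --release 8: " + compiles("8", "Demo", java8Only));
        System.out.println("Set.of() with --release 11: " + compiles("11", "Demo", usesJava9Api));
    }
}
```

Unlike setting only the source and target levels, --release also checks API usage against the chosen platform version, which is why the Set.of() snippet is expected to be rejected at release 8.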

Removal of libraries
JavaEE libraries were removed from JDK 11, but they were widely used in our codebase. Many of these libraries have open source replacements.

We had to decide whether to have repo owners manually replace usages of these libraries or to add the libraries to our build toolchain. We decided that the cost of removing these usages was too high for our workforce, so we added a fixed, final version of the JavaEE libraries to the build toolchain by default. These libraries are relatively lightweight, so it wasn’t a big deal to patch them in.
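As a rough illustration of how one might check whether the removed APIs are still reachable at runtime, the sketch below probes one representative class per module removed by JEP 320. The class name RemovedModuleCheck and the choice of probe classes are assumptions made for this example; this is not the check LinkedIn's toolchain performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RemovedModuleCheck {

    /** Removed module name -> one representative class that module used to provide. */
    static Map<String, String> defaultProbes() {
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("java.xml.bind", "javax.xml.bind.JAXBContext");
        probes.put("java.xml.ws", "javax.xml.ws.Service");
        probes.put("java.xml.ws.annotation", "javax.annotation.PostConstruct");
        probes.put("java.activation", "javax.activation.DataHandler");
        probes.put("java.transaction", "javax.transaction.TransactionRequiredException");
        probes.put("java.corba", "org.omg.CORBA.ORB");
        return probes;
    }

    /** Returns the probe keys whose class cannot be loaded by the given class loader. */
    static List<String> missing(Map<String, String> probes, ClassLoader loader) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> probe : probes.entrySet()) {
            try {
                // 'false' avoids running static initializers; we only test visibility.
                Class.forName(probe.getValue(), false, loader);
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(probe.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        ClassLoader loader = RemovedModuleCheck.class.getClassLoader();
        System.out.println("APIs not visible on this runtime: "
                + missing(defaultProbes(), loader));
    }
}
```

On Java 11 without replacement libraries on the classpath, the probes would be reported as missing; once replacement artifacts are added to the classpath, the corresponding entries should disappear from the list.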

JVM option changes
JVM options also changed quite a bit between Java 8 and 11. Several options were made obsolete and others were deprecated in favor of newer options. For most options, we used an open source service called JaCoLine to help remove obsolete options. The GC logging options are one set that received a major revamp due to JEP 271. After realizing that the logging would look completely different and that there wasn’t always a good mapping between old and new GC logging options, we decided to create a single default option and asked users to modify it if needed.

That being said, unified GC Logging is another strong reason to move past Java 8. It makes reading GC logs significantly easier and it’s a feature that can be leveraged to streamline lots of tooling.
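The "single default option" approach to GC logging can be sketched as follows: strip the Java 8 GC logging flags from an option list and append one unified logging (-Xlog) option instead. The class name GcLogOptionSketch and the particular default value (log file name, decorators, and rotation settings) are hypothetical placeholders, not LinkedIn's actual default, and the list of legacy flags is not exhaustive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GcLogOptionSketch {

    // Hypothetical default: file name, decorators, and rotation values are placeholders.
    static final String DEFAULT_GC_LOG =
            "-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=10M";

    /** Recognizes a (non-exhaustive) set of Java 8 GC logging options. */
    static boolean isLegacyGcLogOption(String option) {
        return option.startsWith("-Xloggc:")
                || option.matches("-XX:[+-]PrintGC.*")
                || option.matches(
                        "-XX:[+-](PrintTenuringDistribution|PrintHeapAtGC|UseGCLogFileRotation)")
                || option.startsWith("-XX:NumberOfGCLogFiles=")
                || option.startsWith("-XX:GCLogFileSize=");
    }

    /** Drops legacy GC logging options and, if any were present, adds the default. */
    static List<String> migrate(List<String> java8Options) {
        List<String> migrated = new ArrayList<>();
        boolean hadGcLogging = false;
        for (String option : java8Options) {
            if (isLegacyGcLogOption(option)) {
                hadGcLogging = true;
            } else {
                migrated.add(option);
            }
        }
        if (hadGcLogging) {
            migrated.add(DEFAULT_GC_LOG);
        }
        return migrated;
    }

    public static void main(String[] args) {
        List<String> java8 = Arrays.asList(
                "-Xmx4g", "-XX:+UseG1GC", "-XX:+PrintGCDetails",
                "-XX:+PrintGCDateStamps", "-Xloggc:/tmp/gc.log");
        System.out.println(migrate(java8));
    }
}
```

Replacing the whole group with one default sidesteps the problem that individual Java 8 flags do not always map cleanly onto -Xlog tags and decorators; owners who need different output can then edit the single option.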

Internal dependencies
LinkedIn runs on a microservice architecture, which means many repositories are linked to each other through a dependency graph. The challenge is that if an upstream dependency has not finished migrating, it may block the repositories that depend on it, because the dependency may need changes to be compatible with Java 11. This is not an easy problem to solve. By running graph algorithms on the dependency graph, we found that the targeted applications had more than 25 levels of dependencies. We wanted to encourage the lower levels of the graph to migrate first, but following a strict ordering would have restricted migration velocity.

In the end, we decided to use rough bucketing to basically split the migration into three parts. During each part, applications around the same level in the dependency graph would be migrated. This was the compromise we made between correctness and velocity, allowing most applications to not be blocked at all by dependencies, while maintaining a decent migration throughput. Learning about our dependency graph was certainly key in making an informed decision about how to do this.
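The level computation and rough bucketing described above can be sketched as follows, assuming the dependency graph is acyclic and is given as a map from each repository to its direct dependencies. The class name DependencyBuckets, the level definition (0 for a repository with no dependencies, otherwise one more than its deepest dependency), and the even split of levels into buckets are assumptions for illustration; the post does not specify the exact algorithm used.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DependencyBuckets {

    /** Computes the dependency level of every repository reachable from the map. */
    static Map<String, Integer> levels(Map<String, List<String>> deps) {
        Map<String, Integer> memo = new HashMap<>();
        for (String repo : deps.keySet()) {
            level(repo, deps, memo, new HashSet<>());
        }
        return memo;
    }

    private static int level(String repo, Map<String, List<String>> deps,
                             Map<String, Integer> memo, Set<String> inProgress) {
        Integer known = memo.get(repo);
        if (known != null) {
            return known;
        }
        if (!inProgress.add(repo)) {
            throw new IllegalStateException("Dependency cycle at " + repo);
        }
        int result = 0;
        for (String dep : deps.getOrDefault(repo, Collections.emptyList())) {
            result = Math.max(result, 1 + level(dep, deps, memo, inProgress));
        }
        inProgress.remove(repo);
        memo.put(repo, result);
        return result;
    }

    /** Maps a level in [0, maxLevel] to one of 'buckets' roughly equal ranges. */
    static int bucket(int level, int maxLevel, int buckets) {
        return Math.min(buckets - 1, level * buckets / (maxLevel + 1));
    }

    public static void main(String[] args) {
        // Hypothetical repositories, for illustration only.
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("app", Arrays.asList("lib-a", "lib-b"));
        deps.put("lib-a", Arrays.asList("core"));
        deps.put("lib-b", Collections.emptyList());

        Map<String, Integer> levels = levels(deps);
        int maxLevel = Collections.max(levels.values());
        for (Map.Entry<String, Integer> e : levels.entrySet()) {
            System.out.println(e.getKey() + ": level " + e.getValue()
                    + ", bucket " + bucket(e.getValue(), maxLevel, 3));
        }
    }
}
```

Migrating bucket by bucket, lowest first, keeps most repositories unblocked without forcing a strict level-by-level ordering.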

After dealing with these roadblocks and more, the infrastructure changes and automation fixes were tested iteratively using our infrastructure’s dry-run testing mechanisms until we managed to automatically migrate about half of the library repositories (~500). We applied the automation to applications as well but did not attempt to commit it as we required owners to still do runtime validation. This runtime validation included both functional and non-functional constraints. 

There were more problems than we had previously anticipated and we realized that several of these changes would need to be addressed going forward with future major Java version upgrades. Therefore, it was imperative to spend some time building quality infrastructure that we could reuse and now that Java 17 has arrived, we couldn’t be happier that we did! 

All in all, preparation for the migration including early adopter testing, infrastructure changes, building automation, and automatically upgrading 500 libraries took three quarters. 

Migration

The actual migration was planned for an additional three quarters in which 500 libraries and about 1,100 applications would be migrated to Java 11, led by a team of two engineers and one Technical Project Manager. 

Thanks to our thorough pre-migration testing and automation, we did not see too many issues throughout the migration. Preparation really does pay off! Most teams were able to finish their migration within a few hours. 

However, we did see a couple of common runtime issues:

One challenge we faced was that some applications suffered in GC performance because the Java process had fewer GC threads now that the JVM respects cgroup limits. Migrating to Java 11 exposed this issue in several applications that had been taking advantage of LinkedIn’s soft limits (cpu.shares), where CPU cycles could be “borrowed” from idle “neighbor” applications on the same host. With cgroup limits being enforced, access to these cores was lost. In some cases, manually increasing the number of GC threads was required to maintain the same performance.
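To see why a smaller visible CPU count translates into fewer GC threads, the sketch below prints the processor count the JVM reports and applies what is, to our understanding, HotSpot's usual default heuristic for the number of parallel GC threads (all CPUs up to 8, then 5/8 of each additional CPU). The class name GcThreadSketch is hypothetical and the formula is an assumption about the default, so verify it against your JVM build before relying on it.

```java
final class GcThreadSketch {

    /**
     * Assumed HotSpot default for -XX:ParallelGCThreads:
     * one thread per CPU up to 8 CPUs, then 5/8 of a thread per extra CPU.
     */
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        // In a container-aware JVM this value reflects cgroup limits, not the host's cores.
        int visibleCpus = Runtime.getRuntime().availableProcessors();
        System.out.println("CPUs visible to the JVM: " + visibleCpus);
        System.out.println("Estimated default parallel GC threads: "
                + defaultParallelGcThreads(visibleCpus));
    }
}
```

If the estimate is much lower than what the application effectively had on Java 8, the thread count can be set explicitly with the -XX:ParallelGCThreads=<n> option.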

Another issue we saw with all Java 11 versions was a stark increase in off-heap memory usage. This did not seem to map down to any specific operation and seemed more like a fragmentation issue. Switching from the glibc memory allocator to either mimalloc or jemalloc helped tremendously with these issues.

Though these issues were a bit scary at first, it was nice to be able to dig down to the root cause, find a proper resolution, and to be able to share our findings in this blog post. 

During and after the migration, we tried to measure performance as well as we could. We built automation that leveraged our metric collection system to get a rough measurement of performance before and after the Java 11 migration. In total, we collected data from 200+ applications and found that Java 11 decreased P99 latency by an average of 10% and increased maximum throughput by an average of 20%. It’s worth noting that we did not change the GC type during the migration, both to reduce the degree of disruption and to have a fairer comparison of performance. Hopefully, these numbers over a decent sample size can be helpful to readers.
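For readers who want to run a similar before/after comparison, here is a minimal sketch of the arithmetic: a nearest-rank percentile and a percent-change helper. The class name LatencyComparison and the sample values in main are synthetic placeholders for illustration; they are not LinkedIn's measurements or tooling.

```java
import java.util.Arrays;

final class LatencyComparison {

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(0, rank - 1)];
    }

    /** Relative change from 'before' to 'after', in percent (negative means a decrease). */
    static double percentChange(double before, double after) {
        return (after - before) * 100.0 / before;
    }

    public static void main(String[] args) {
        // Synthetic latencies in milliseconds, made up purely for this example.
        double[] java8 = {12, 15, 11, 40, 13, 14, 90, 12, 16, 13};
        double[] java11 = {11, 14, 10, 35, 12, 13, 80, 11, 15, 12};

        double before = percentile(java8, 99);
        double after = percentile(java11, 99);
        System.out.printf("P99 before: %.1f ms, after: %.1f ms, change: %.1f%%%n",
                before, after, percentChange(before, after));
    }
}
```

Averaging the per-application percent changes, rather than pooling raw samples across applications, keeps high-traffic services from dominating the result.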

In addition to the performance improvements, Java 11 also brings some runtime improvements like the now open-sourced JFR tool. Overall, this migration can be deemed to be very fruitful! 

Future work

There’s still a lot of work to be done. While Jetty is done, we still need to migrate the three remaining Java application tracks to Java 11. Afterwards, it will be possible to enable full Java 11 bytecode with minimal effort. In addition, features like ZGC, Shenandoah, and Project Jigsaw can all be experimented with to see if there are any benefits to be gained. CMS is also deprecated in Java 11 and removed in Java 14, which means that moving LinkedIn off of CMS will be another major initiative. Finally, Java 17 has arrived and needs to be taken into consideration going forward.

Acknowledgements

I’m very happy to be able to write this blog post and it goes without saying that this would not be possible without major support from many people at LinkedIn and Microsoft. First of all, thanks to Vivek Deshpande and Alex Dubrouski for participating heavily in the Java 11 working group. Thanks to my manager, Xialin Zhu, for being a guiding hand when needed. Thanks to our Technical Project Manager Andrew Ding for keeping everyone on track. And I’d also like to give a special thanks to LinkedIn’s Build Tools team, especially Kyle Moore and Yiming Wang, who helped consult on many of the build infrastructure changes that needed to be made for Java 11.