3x3: iOS Build Speed and Stability

Keqiu Hu

Engineering at Databricks

April 7, 2016

At the beginning of last year, alongside the development of Project Voyager, LinkedIn’s new flagship mobile application, we started practicing our 3x3 philosophy:

Release three times a day, with no more than three hours between when code is committed and when that code is available to members.

While we can’t ship to the App Store every three hours, we can deliver a new enterprise build to LinkedIn employees multiple times a day. “3x3” sounds quick and easy, but it is much easier said than done. With a three-hour release cadence, there is no room for manual testing, so we need a pipeline that is:

Completely automated through every step, from code commit to production release
Reliable enough to cover even more than what can be done in manual testing
Fast enough to fit into our three-hour window
Stable enough to ensure a legit commit can always be published

Earlier this year, Drew Hannay gave a quick overview of mobile 3x3. Following up on his post, 3x3: Speeding Up Mobile Releases, we will dive deeper into our 3x3 philosophy on iOS, with a focus on build speed and stability.

In order to achieve a pipeline that takes less than three hours from committing code to publishing it, we optimized our build pipeline by refactoring our Swift code to speed up compilation, speeding up our UI test frameworks and parallelizing our building and testing.

We chose Swift as our primary language for our new flagship application Voyager, and our project grew side by side with the evolution of Swift from the raw 1.0 era to beyond the 2.0 stage.

During the early phases of Voyager, we enjoyed the modern features and ease of use of the Swift language, but the long compilation times frustrated us. As more and more code was committed to our iOS code base (about 60 commits every day), we found that build times increased exponentially with the number of files, from seconds to around 30 minutes. Release builds were even slower than debug builds, often taking two hours. Following the mantra “test development must be concurrent with app development,” we had also written about 150 KIF UI tests, which took another two and a half hours to run. As a result, it took us four and a half hours to get from commit to publish!

To ease developer frustration, we diligently communicated with Apple engineers through Apple developer communities, the Apple Developer Forum, and company partner relationships, and received lots of valuable feedback on how to reduce our build time. One major code base refactor, led by Jacek Suliga, divided our project into sub-projects and modules and replaced implicit type references with explicit ones. This cut debug build time almost in half, reduced release build time from two hours to 35 minutes, and brought the overall commit-to-publish time down to around three hours. For local development, we also bought each iOS developer a Mac Pro to further cut debug build time.

With build time optimized, the bottleneck moved to UI testing time. In UI Automation: Keep it Functional - and Stable, Jacek discussed how we optimized the open-source KIF project and made it five to 10 times faster. This cut the UI test running time by more than 80 percent, to around 20 minutes. However, as more and more tests were checked in, the number of tests grew fourfold in one month and the time to run them rose to 80 minutes, pushing the overall commit-to-publish time back up to about two hours.

After the compiler and UI test framework were optimized, we decided to focus on an innovative technique to improve commit-to-publish significantly: distributed building and testing.

We worked closely with the build infrastructure teams to roll out distributed building and testing support for iOS and Android. After we distributed all building and testing jobs across 10 machines, the only remaining bottleneck was the release build, which took 35 minutes to compile, plus some tooling overhead. The overall commit-to-publish time dropped to around 45 minutes. Using more machines increased the build speed, but it also introduced more reliability issues, which I will address in the stability section.

In our build pipeline from committing to publishing, there are two major components affecting the overall stability: testing infrastructure stability and build tooling stability.

In my previous blog post, Test Stability - How We Make UI Test Stable, I discussed how we made our UI testing environment stable, primarily by stabilizing the testing environment, removing unpredictability from the testing frameworks, and sanitizing our test suite. Please review that blog post for details on how we made our testing infrastructure stable.

2.1 Hardware stability

Our machine pool is centrally managed via cfengine and we currently have two types of Apple hardware:

Mac Mini (Late 2012)

Processor 2.6GHz Quad-Core Intel Core i7
Memory 16GB 1600 MHz DDR3
Storage 1TB HDD
Intel HD Graphics

Mac Mini (Late 2014)

Processor 3.0GHz Dual-Core Intel Core i7
Memory 16GB 1600MHz LPDDR3 SDRAM
Storage 512GB PCIe-based Flash Storage
Intel Iris Graphics

For a while, we saw memory issues in our build pool: around 10 percent of the machines had less than 100MB of active memory and heavy swap usage. When we examined the low-memory build machines, we found more than 20 stale xcodebuild processes running in the background. We realized that they had been left behind by previous jobs whose xcodebuild runs had hung.

To resolve this problem, we took the following actions:

Added retry logic to clean up all stale xcodebuild processes before running new jobs.
Restarted, on a daily schedule, the system services that may leak memory.
Set up automatic alerts for high swap usage inside the pool.
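
As an illustration of the first action, here is a minimal sketch of a pre-job cleanup step. It is not the actual LinkedIn tooling: the function names, the three-hour age threshold, and the choice to identify stale processes by elapsed time are all assumptions made for the example. It relies only on the standard `ps -axo pid=,etime=,comm=` listing.

```python
import os
import signal
import subprocess


def parse_etime(etime):
    """Convert ps's elapsed-time format ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad a missing hours field
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale(ps_output, name="xcodebuild", max_age_seconds=3 * 60 * 60):
    """Return PIDs of processes called `name` older than the (assumed) threshold."""
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue
        pid, etime, command = fields
        is_target = os.path.basename(command.strip()) == name
        if is_target and parse_etime(etime) > max_age_seconds:
            stale.append(int(pid))
    return stale


def kill_stale_xcodebuild():
    """Hypothetical pre-job hook: kill xcodebuild runs left over from old jobs."""
    listing = subprocess.run(
        ["ps", "-axo", "pid=,etime=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in find_stale(listing):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
```

Keeping the parsing and selection in pure functions means the "what counts as stale" rule can be checked against sample `ps` output without touching a real build machine.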

2.2 Mac OS build environment stability

Besides memory issues, we occasionally saw machines run into iOS simulator issues, which are detrimental to trunk stability. Once a machine starts having simulator issues, every subsequent build on that machine is affected unless the problem is handled properly. We made tremendous efforts to solve the simulator problems (see Managing iOS Continuous Integration at Enterprise Scale), but we realized that it is more efficient to work around them than to fix them, given that the simulator tooling is not under our control.

Luckily, we found that rebooting is a remedy for most simulator issues. We started a project called PoolGuardian to monitor all of our build machines and periodically check whether a machine is experiencing simulator issues. Once an issue is found, PoolGuardian reboots the machine and runs some sanity checks before bringing it back into rotation.
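
The check, reboot, and sanity-check cycle described above can be sketched roughly as follows. This is only an illustration of the control flow, not PoolGuardian's real code: the function names are hypothetical, and the three callables (`has_simulator_issue`, `reboot`, `passes_sanity_checks`) stand in for whatever probes and remote commands the real system uses.

```python
def guard_machine(machine, has_simulator_issue, reboot, passes_sanity_checks):
    """Check one machine; return True if it may stay in (or rejoin) rotation."""
    if not has_simulator_issue(machine):
        return True  # healthy: leave it in rotation
    reboot(machine)  # rebooting clears most simulator issues
    # Only bring the machine back if it passes the post-reboot checks.
    return passes_sanity_checks(machine)


def guard_pool(machines, has_simulator_issue, reboot, passes_sanity_checks):
    """One monitoring pass over the pool; returns the machines fit for rotation."""
    return [
        machine
        for machine in machines
        if guard_machine(machine, has_simulator_issue, reboot, passes_sanity_checks)
    ]
```

A scheduler would call `guard_pool` periodically and route new jobs only to the machines it returns.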

2.3 Parallelized testing stability

Distributed testing seemed promising, as it did solve our speed puzzle, but we realized it was a double-edged sword when we experienced a severe reliability problem.

In our Mac OS tooling environment, we ran into several Apple developer tooling issues, including intermittent compiler crashes, simulator crashes, duplicate-simulator errors, and corrupted environments. Overall, we had an average tooling stability of around 95 percent per machine. This looked good on its own. With distributed testing, however, a build passes only when all of its child jobs pass, which in our case means all 10 of them. Each additional node therefore multiplies the overall reliability by another 95 percent, so tooling flakiness compounds exponentially with the number of nodes. With 10 nodes, the tooling reliability drops to $0.95^{10} \approx 60\%$.

To improve tooling reliability, as mentioned in previous sections, we optimized our build pool to make the hardware much more dependable and stabilized our Mac OS X environment through all viable means. As a result, per-machine tooling reliability increased from 95 percent to slightly above 98 percent. Compounded across 10 nodes, however, that is still only $0.98^{10} \approx 82\%$, far from our goal of 99 percent reliability. Solving $x^{10} = 99\%$ gives $x \approx 99.9\%$: we would need an average machine reliability of 99.9 percent, which would require considerably more resources and effort.
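
The arithmetic above is easy to reproduce. The short sketch below assumes, as the text does implicitly, that failures on different nodes are independent, so per-node reliabilities simply multiply; the function names are made up for the example.

```python
def pipeline_reliability(per_machine, nodes):
    """Probability that every one of `nodes` independent child jobs passes."""
    return per_machine ** nodes


def required_per_machine(target, nodes):
    """Per-machine reliability needed to reach `target` across `nodes` machines."""
    return target ** (1.0 / nodes)


for p in (0.95, 0.98):
    print(f"{p:.0%} per machine, 10 nodes -> {pipeline_reliability(p, 10):.1%}")
# 95% per machine, 10 nodes -> 59.9%
# 98% per machine, 10 nodes -> 81.7%

print(f"needed per machine for 99% overall: {required_per_machine(0.99, 10):.2%}")
# needed per machine for 99% overall: 99.90%
```

The same functions show why collapsing to fewer machines helps: at 98 percent per machine, a single node keeps the full 98 percent instead of losing ground with every extra node.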

In the meantime, we tried to think outside the box to reach high build reliability while keeping a fast commit-to-publish pipeline. Then we had a breakthrough: what if we collapsed all testing nodes into one machine and ran tests in parallel by starting multiple simulators? That was the idea behind Project Hydra. The high-level goals of Hydra are:

Run tests in parallel on multiple simulators in one machine to make build times five to 10 times faster.
Enable interactive testing between different simulators (iOS to iOS, Android to Android and iOS to Android).

In this blog post, I will focus only on the first goal. Below is a simple diagram of the architecture of a multiple-simulator test runner.

When the post-commit job starts, the test runner builds the application and bootstraps five simulators to form a simulator device pool. After the build completes, the test runner queries the test bundle to get a list of test classes. Based on the number of test cases, the test classes are dynamically allocated into 20 buckets, which form a test target queue. Once this initialization phase finishes, the device pool keeps polling the queue for test target buckets to run. In the meantime, a self-sanitizing daemon keeps the device pool healthy. At the end of a test job, artifacts and test reports are collected and uploaded to our Continuous Integration Archiver, awaiting the next release cycle.
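
The bucket-and-poll flow described above can be sketched in a few lines. This is a simplified illustration rather than Hydra's implementation: the names `make_buckets`, `run_on_pool`, and `run_bucket` are hypothetical, the buckets are filled round-robin by class (the real allocation is based on the number of test cases), and the self-sanitizing daemon is left out.

```python
import queue
import threading


def make_buckets(test_classes, bucket_count=20):
    """Spread test classes round-robin over at most `bucket_count` buckets."""
    buckets = [[] for _ in range(min(bucket_count, len(test_classes)))]
    for index, test_class in enumerate(test_classes):
        buckets[index % len(buckets)].append(test_class)
    return buckets


def run_on_pool(buckets, simulators, run_bucket):
    """Each simulator keeps pulling buckets from a shared queue until it is empty."""
    pending = queue.Queue()
    for bucket in buckets:
        pending.put(bucket)

    results = []
    results_lock = threading.Lock()

    def worker(simulator):
        while True:
            try:
                bucket = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to run on this simulator
            outcome = run_bucket(simulator, bucket)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker, args=(s,)) for s in simulators]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Using more buckets than simulators (20 versus five in the text) keeps the pool balanced: a simulator that finishes a short bucket picks up the next one instead of sitting idle while a slower simulator works through a long one.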

One question you may ask is why we chose to run five simulators. The answer is the default OS limit on processes per shell session, which is 709 on our Mac Mini boxes. Each simulator spawns around 70 processes. Once we add the roughly 300 normal Mac OS system processes, we can boot around five to six simulators at the same time. We tried raising the shell process limit to unlimited; however, with more than 700 processes running, the OS environment becomes extremely fragile. As a result, we stuck with five simulators as the best trade-off between scale and reliability.
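
As a quick check of that budget, using the approximate figures quoted above (the constant and function names are just labels for this example):

```python
PROCESS_LIMIT = 709           # default per-session process limit quoted above
SYSTEM_PROCESSES = 300        # approximate baseline of normal system processes
PROCESSES_PER_SIMULATOR = 70  # approximate processes spawned by one simulator


def max_simulators(limit=PROCESS_LIMIT,
                   system=SYSTEM_PROCESSES,
                   per_simulator=PROCESSES_PER_SIMULATOR):
    """Whole number of simulators that fit under the process limit."""
    return max(0, (limit - system) // per_simulator)


print(max_simulators())  # 5
```

With these numbers, five simulators use about 650 of the 709 available process slots, and a sixth would push the total to roughly 720, past the limit.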

With Hydra, our build stability improved from a low of zero percent to 96 percent today, and we are still improving it. Below is a demo, and we are also on our way to open sourcing this project. Stay tuned!

We could not have accomplished all of this without dedicated collaboration across LinkedIn’s teams, including Tools, Testing, Mobile Infrastructure, and more. Thank you all!
