Managing iOS Continuous Integration at Enterprise Scale

December 7, 2015

Imagine that you are a skilled iOS engineer working on one of the most popular apps in the App Store (perhaps one of LinkedIn's apps!). You start your day with coffee, confidence, and a cool feature to work on, one that will impact millions of users. You follow best practices, writing beautiful Swift code and getting excellent coverage with your carefully crafted tests. A couple of hours later, you do your last test run in Xcode and submit the code change. But then you're notified that your build or tests failed with some error that you swear is totally unrelated to the changes you made! Infuriating.

At LinkedIn, we strive to make iOS builds reliable and fast to make sure this doesn't happen. To support our large iOS development ecosystem, which has hundreds of engineers and tens of applications, the Continuous Integration (CI) process runs on a centralized build cluster with hundreds of Mac Minis. We built Gradle plugins on top of the native build tool “xcodebuild” and installed multiple versions of Xcode on each node to support different applications. Most automated operations are invoked via the command line on Hudson, rather than by a developer using the Xcode IDE directly. Builds and tests are distributed across multiple parallel jobs to achieve maximum speed.

However, the last few Xcode releases caused a significant number of failures in our automated system. Flakiness in the CI pipeline erodes trust in the system, and its impact is magnified at scale. Troubleshooting Xcode build issues is painful and tedious because little debugging information is exposed. In this post, I will share the different approaches we take when tackling these problems as well as a few workarounds for people who are experiencing the same issues.

1. The app crashes when launched in the simulator

This issue was seen in Xcode 7.0.1 with the iOS 8 and iOS 9 runtimes.

The symptom is that the xcodebuild command exits with **BUILD INTERRUPTED** and no obvious error message once compilation has finished and it is about to launch the app in the simulator. This happened very frequently, but sporadically. In the worst week, 15 to 20 percent of CI builds failed because of it. With between 150 and 200 commits per day across all iOS projects, the impact was significant, and it brought a lot of frustration to our developers.

After checking the system.log of the simulator (found at ~/Library/Logs/CoreSimulator/<Device UDID>/), we found these error messages:

Nothing in the log explained why the service was killed with signal 9, so we posted this issue to the Apple developer forum and got a helpful answer:

There is a potential race condition between install and launch, and it looks like you're reliably losing that race. SpringBoard sends SIGKILL to a running process when its app bundle is updated, but in this case, we actually launched it after the install before SpringBoard got notified of the install.

The issue should be alleviated when using the iOS 9 runtime, but it is easily hit with the iOS 8 runtimes.

Please try using Xcode 7.2 beta as it introduces a 1s stall after installation in order to help avoid losing the race with older runtimes.

When we faced this problem, adopting the Xcode 7.2 beta was not an option because it could not be rolled out quickly enough. Based on the reply, we came up with the following workarounds to address this issue:

Step 1: Start another thread along with the xcodebuild process to capture this particular error message by parsing system.log, and kill the build process once we detect it so that it does not hang any further.
Step 2: Because the race is between installation and launch, retry the command after the termination from Step 1. This relies on the assumption that the application will not need to be installed again on the second attempt, which avoids the race.

After applying these combined changes, the failure was eliminated, at the cost of longer build times due to the retry.
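
The two steps above could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the Gradle plugin code described in this post: the `marker` argument stands in for the error text found in system.log (not reproduced here), and the function names, the polling interval, and the two-attempt limit are all hypothetical.

```python
import os
import subprocess
import threading


def watch_log_and_kill(log_path, marker, process, stop_event, hit_event, poll_seconds=0.5):
    """Follow log_path from its current end and kill process if marker shows up."""
    # Assumption: the log may not exist yet on a fresh device, so wait for it.
    while not os.path.exists(log_path):
        if stop_event.wait(poll_seconds):
            return
    with open(log_path, "r", errors="replace") as log:
        log.seek(0, os.SEEK_END)  # ignore entries written by earlier attempts
        pending = ""
        while not stop_event.is_set():
            chunk = log.readline()
            if not chunk:
                stop_event.wait(poll_seconds)  # nothing new yet
                continue
            pending += chunk
            if not pending.endswith("\n"):
                continue  # the line is still being written
            if marker in pending:
                hit_event.set()
                process.kill()  # Step 1: stop the build instead of letting it hang
                return
            pending = ""


def run_once(command, log_path, marker):
    """Run command with a log watcher; return (exit_code, marker_seen)."""
    process = subprocess.Popen(command)
    stop_event = threading.Event()
    hit_event = threading.Event()
    watcher = threading.Thread(
        target=watch_log_and_kill,
        args=(log_path, marker, process, stop_event, hit_event),
        daemon=True,
    )
    watcher.start()
    exit_code = process.wait()
    stop_event.set()  # let the watcher exit once the build is done
    watcher.join()
    return exit_code, hit_event.is_set()


def run_with_retry(command, log_path, marker, max_attempts=2):
    """Step 2: rerun the command when an attempt was killed because of the marker."""
    exit_code = None
    for attempt in range(1, max_attempts + 1):
        exit_code, marker_seen = run_once(command, log_path, marker)
        if not marker_seen:
            break
    return exit_code, attempt
```

In use, `command` would be the xcodebuild test invocation and `log_path` the simulator's system.log under ~/Library/Logs/CoreSimulator/<Device UDID>/. Seeking to the end of the log before each attempt matters: otherwise the message left behind by the first attempt would immediately kill the retry. Note that this sketch kills only the top-level process; cleaning up any child processes is left out.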

We also learned of another solution, which is to run this sequence of commands rather than just invoking “xcodebuild test” directly:

With this approach, installation is done separately so we have more fine-grained control to avoid the race condition. This has not been fully adopted and verified in our build system yet, but it shows promising results in some sample builds.

2. Unable to find simulator device

We came across an error like the one below on certain machines:

At the same time, the command “xcrun simctl list” did not show any devices:

Although we could quickly identify these problematic machines and put them in quarantine, we still needed a permanent fix. We tested several ideas but none of them worked. Re-provisioning the machines might have helped, but that was a last resort.

After reading StackOverflow posts and documents, we found that this issue could be resolved by running this sequence of commands:

Step 1: Run 'launchctl list | grep CoreSimulator' and then remove any processes associated with the CoreSimulator service. For example, 'launchctl remove com.apple.CoreSimulator.CoreSimulatorService.117.15.1.dltcnwGXJZL4'. Note: the command 'killall "Simulator"' will do the work after Xcode 7.
Step 2: Delete everything under ~/Library/Developer/CoreSimulator/Devices
Step 3: Run 'xcrun simctl list' or anything else that starts the CoreSimulator service. All devices will then be re-created.

Step three depends on the first two steps. Killing only the CoreSimulator process, or deleting only the devices folder, will NOT trigger simulator re-population.

After applying these operations at the beginning of each CI job, we stopped seeing missing device issues. This turned out to be a reliable way to get a clean slate of simulator devices.
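
The three steps above could be wired into the start of a CI job roughly as follows. This Python sketch is only an illustration: the function names are hypothetical, and it assumes that 'launchctl list' prints the job label as the last whitespace-separated column of each line.

```python
import pathlib
import shutil
import subprocess


def core_simulator_labels(launchctl_list_output):
    """Return the launchd job labels that mention CoreSimulator."""
    labels = []
    for line in launchctl_list_output.splitlines():
        columns = line.split()
        # Assumption: the label is the last column of each row.
        if columns and "CoreSimulator" in columns[-1]:
            labels.append(columns[-1])
    return labels


def clear_directory(directory):
    """Delete everything under directory but keep the directory itself."""
    directory = pathlib.Path(directory)
    if not directory.is_dir():
        return
    for entry in directory.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()


def reset_simulator_devices(devices_dir=None):
    """Run the three steps in order and return the final 'simctl list' output."""
    if devices_dir is None:
        devices_dir = pathlib.Path.home() / "Library/Developer/CoreSimulator/Devices"
    # Step 1: remove every launchd job associated with CoreSimulator.
    listing = subprocess.run(
        ["launchctl", "list"], capture_output=True, text=True, check=True
    ).stdout
    for label in core_simulator_labels(listing):
        subprocess.run(["launchctl", "remove", label], check=False)
    # Step 2: delete all existing simulator devices.
    clear_directory(devices_dir)
    # Step 3: anything that starts the CoreSimulator service re-creates the devices.
    result = subprocess.run(
        ["xcrun", "simctl", "list"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Keeping the label parsing and the directory cleanup as separate functions makes them easy to check on any machine; only `reset_simulator_devices` needs a Mac with Xcode installed.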

3. Simulator timed out in test

We saw the error message “iPhoneSimulator: Timed out waiting 120 seconds for simulator to boot, current state is 1.” when running the ‘xcodebuild test’ command. It did not indicate any legitimate issue, so we simply retried the command every time we encountered it. The second attempt would successfully bring up the simulator and run through all tests.

But retrying can be expensive given our large code base. It also confuses developers who look at the log. To address this issue, we turned to StackOverflow. According to this comment, simctl is not the right way to start a simulator, and using it might cause this problem.

In Xcode 6 and Xcode 7, Simulator.app is responsible for booting the device it uses. If you use simctl to boot the device, it will not be usable by Simulator.app in that state because it will be booted to a headless state.

The suggested approach is to use the “open” command instead:

However, the “open” command did not work for us initially: it did not open the right simulator for the given target device UDID. For example, when launching an iPhone 4s under iOS 8.4 by giving the UDID “C3578B00-BA0A-48CF-AB6B-EACA6B4FFAB2” (according to the sample output below), we saw an iPhone 6 open instead.

Then we realized that the full path to Xcode was needed because we had multiple Xcode installs on each Hudson node. So we tweaked the open command into something like this and it worked perfectly:

We updated our workflow to invoke this command before running any tests, and it successfully resolved this issue and saved approximately 10 minutes for each of our builds.
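
As a rough illustration of this pre-test step, the Python sketch below looks up a device UDID in the output of 'xcrun simctl list devices' and builds an “open” command that names Simulator.app by its full path inside a specific Xcode install. Several things here are assumptions rather than details from this post: the function names, the Simulator.app location under Contents/Developer/Applications, and the '-CurrentDeviceUDID' argument, which is the commonly cited but undocumented way to pass a UDID to Simulator.app.

```python
import re

# Assumed layout of 'xcrun simctl list devices' output:
#   -- iOS 8.4 --
#       iPhone 4s (C3578B00-BA0A-48CF-AB6B-EACA6B4FFAB2) (Shutdown)
RUNTIME_HEADER = re.compile(r"^-- (?P<runtime>.+) --$")
DEVICE_LINE = re.compile(
    r"^\s+(?P<name>.+?) \((?P<udid>[0-9A-Fa-f-]{36})\) \((?P<state>[^)]+)\)"
)


def find_udid(simctl_devices_output, device_name, runtime):
    """Return the UDID of device_name under runtime, or None if it is not listed."""
    current_runtime = None
    for line in simctl_devices_output.splitlines():
        header = RUNTIME_HEADER.match(line)
        if header:
            current_runtime = header.group("runtime")
            continue
        device = DEVICE_LINE.match(line)
        if device and current_runtime == runtime and device.group("name") == device_name:
            return device.group("udid")
    return None


def simulator_open_command(xcode_app_path, udid):
    """Build an 'open' command that targets one Xcode install's Simulator.app."""
    # Assumption: Simulator.app lives here inside the Xcode bundle (Xcode 7 layout).
    simulator_app = xcode_app_path.rstrip("/") + "/Contents/Developer/Applications/Simulator.app"
    # Assumption: -CurrentDeviceUDID selects the device to boot.
    return ["open", "-a", simulator_app, "--args", "-CurrentDeviceUDID", udid]
```

The returned list can be passed to `subprocess.run` before the tests start. Passing the application by full path is what ties the command to one Xcode version when several are installed side by side.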

Conclusion

Managing automation environments for many different platforms at large scale is an intense challenge, especially given that the iOS build system is proprietary. At the same time, providing a reliable platform is essential to moving fast while maintaining high quality. We hope these scenarios and solutions will be useful to others facing the same issues. Our iOS developers should be able to commit their code after doing their due diligence, with zero worry.

Acknowledgements

The solutions and workarounds would not have been discovered without Jarek Rudzinski's contributions. A special thanks to Keqiu Hu, Jens Pillgram Larsen, Deep Majumder, Jacek Suliga and Wei Chen for reviewing and contributing to this post.

Disclaimer

We are users, not the creators, of the Xcode build system, so everything stated above comes from operational observations and experiments. The workarounds we found have not been verified with Apple.
