Test Stability - How We Make UI Tests Stable

Keqiu Hu

Engineering at Databricks

December 16, 2015


If you know anything about building user interfaces (UI), you probably know that it's very difficult to test them using automated tests. Such tests tend to be "flaky", or unreliable, because of many different factors. While developing LinkedIn's latest iOS mobile app, we figured out how to stabilize automated UI tests, and that has changed the game for us.

At LinkedIn, we aim for high-speed, continuous delivery. Flaky tests are a big obstacle blocking us from reaching this goal. There was a time when we almost gave up trying to disprove the old truism that UI tests are always flaky. During the course of our efforts to speed up continuous delivery, however, we managed to create a set of 700 UI tests and 1,000 unit tests that run stably and swiftly, testing every code change committed to our code base, up to 80 commits every day.

In this article, I am going to talk about the flaky tests we encountered while building our continuous delivery pipeline, and how we tackled the problem.

Why Do We Test?

Why do we test at all? For several reasons:

To prove to ourselves that we’re doing things right. Testing gives everyone involved the confidence that code was written correctly, in all edge cases.
When we can be confident that we didn't break anything, it allows us to make changes quickly, speeding up our deployment and development (especially in the future).
It helps us quickly find "regressions" (previously working behavior that a new change breaks, including old bugs that resurface).
When we find bugs or broken code quickly, we don't lock up "trunk" (the entire code repository we share), which allows everyone to check in code more often.

So, the benefits of having tests are obvious. But what if our only choice was to have unreliable tests?

Flaky Tests or No Tests?

When we started working on speeding up the delivery cycle of our new LinkedIn App, we found that our trunk was “red” half of the day (in other words, committed code broke one or more regression tests, and as a result, we locked up the repository to find the offending code commit). The team was working as hard as possible to fix existing flaky test cases and make them stable, but we found that this didn't scale because people were committing more tests every day than we could fix. In the end, we paused and stepped back to reconsider our strategy.

The philosophy behind our test fire-fighting strategy, and the reason we were trying to fix tests instead of just disabling them, was that flaky tests are better than no tests. If a test passes 99 percent of the time, it can guard our code base from bad commits most of the time. Why should we disable it?

This sounds reasonable, until you do the math. Let’s assume that our test suite has 200 tests and that each test has a one percent failure rate. Then the probability that the entire test suite passes is only 13.4% (0.99 raised to the 200th power), assuming the failures are independent of one another.
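The arithmetic is easy to check in a few lines of Python. This is only a sketch: it assumes test failures are independent of one another, and the function name is ours, chosen for illustration.

```python
def suite_pass_probability(per_test_failure_rate: float, test_count: int) -> float:
    """Probability that every test in the suite passes on one run.

    Assumes each test fails independently with the same probability.
    """
    return (1.0 - per_test_failure_rate) ** test_count


if __name__ == "__main__":
    # Compare a few suite sizes at the same one percent per-test failure rate.
    for count in (50, 200, 700):
        p = suite_pass_probability(0.01, count)
        print(f"{count} tests at 1% flakiness: suite passes {p:.1%} of the time")
```

The point of the exercise is that suite-level stability decays exponentially with the number of tests, so a per-test failure rate that sounds harmless becomes crippling as the suite grows.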

Having 13.4% stability was unacceptable. Most commits would get reverted due to unrelated tests (we revert a commit if it cannot pass the test), resulting in frustrated developers and low morale.

A quote from Square’s blog also highlights similar sentiments: “Failures (trunk builds) above the one percent level are very visible and frustrating – even in a group of a couple dozen developers.”

Developers on our iOS team started to complain that testing was blocking their development. "Why not just disable it?" they asked. So we disabled all tests for one week and, ironically, the only difference was that our developers were more productive and happy.

ThoughtWorks also brought up an interesting point on its blog: “Stop calling your tests flaky. Instead, call it ‘Random Success’ or ‘Sometimes Success.’” Would you want to release a product that sometimes works? Would you want to use the LinkedIn App if it sometimes works? Of course not. Similarly, people would not care about our testing results if our test suite only sometimes worked.

Based on that, we decided that flaky tests were worse than no tests. In other words, if a test wasn’t stable, we would rather eliminate it from our test suite.

How Can We Make UI Tests Stable?

So, should we not write tests at all? When we proposed to eliminate all flaky tests, the testing team replied: “No Test, No Feature!” Flaky tests are worse than no tests, but we still need to have tests, and they need to be reliable. So, we were at an impasse. How could we possibly make testers happy without driving developers crazy? This is when we realized that we needed to dive deeper into the problem. As they say, necessity is the mother of invention. We needed to understand why tests were unreliable.

If we categorize the causes for flakiness, we find three major causes:

Flaky Testing Environment

When building the previous version of the LinkedIn app, we were running tests against frontend and backend servers, which meant that every request would go out and exercise a full loop from client -> server -> client.

If there was any flakiness in the server or network connection (any arrow is broken), our test would fail. So if a test failed, you would spend a huge amount of time trying to understand whether it was caused by server/network flakiness, test flakiness, or a bad commit.

To fix this problem, we borrowed the idea of Hermetic Servers from Google. In Google’s implementation, when the frontend receives a request, it does not call the backend server; instead, it calls a mock server. This reduces dependency on the network or on the backend. The fewer dependencies there are, the less potential for things to go wrong.

However, this design doesn’t eliminate all networking flakiness. In our experience, unreliable network connections also contribute a considerable amount of flakiness to our build system. To make our client UI testing even more self-contained, we went one step further and integrated the mock server into our client.

Instead of talking to real servers (frontend and backend servers), we intercepted each outgoing request, routed it to our in-app fixture server, and returned a fixture response instantaneously. The idea behind this is modularized testing; for client-side testing, we only test client-side behaviors.
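The shape of this idea can be sketched in Python, independent of iOS. Everything here is hypothetical and for illustration only (`FixtureServer`, `ProfileClient`, the `/profiles/...` paths and the fixture data are not from our code base): the client talks to whatever transport it is handed, and in tests that transport is an in-process fixture table rather than the network.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


# A transport takes (method, path) and returns a Response.
Transport = Callable[[str, str], Response]


class FixtureServer:
    """In-process stand-in for the real frontend and backend (hypothetical)."""

    def __init__(self) -> None:
        self._fixtures: Dict[Tuple[str, str], Response] = {}

    def register(self, method: str, path: str, status: int, body: dict) -> None:
        self._fixtures[(method.upper(), path)] = Response(status, body)

    def handle(self, method: str, path: str) -> Response:
        # Unknown requests fail loudly instead of silently hitting the network.
        key = (method.upper(), path)
        if key not in self._fixtures:
            raise LookupError(f"no fixture registered for {method} {path}")
        return self._fixtures[key]


class ProfileClient:
    """Client code under test; it only knows about the transport it is given."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def display_name(self, member_id: str) -> str:
        response = self._transport("GET", f"/profiles/{member_id}")
        if response.status != 200:
            return "Unknown member"
        return f"{response.body['first']} {response.body['last']}"


fixtures = FixtureServer()
fixtures.register("GET", "/profiles/42", 200, {"first": "Ada", "last": "Lovelace"})
fixtures.register("GET", "/profiles/404", 404, {})

# The test wires the client to the fixture server; no socket is ever opened.
client = ProfileClient(transport=fixtures.handle)
print(client.display_name("42"))
```

Because the response comes from memory, a failing test can no longer be blamed on the server or the network; the only remaining suspects are the client code and the test itself.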

Flaky Testing Framework

We used KIF (Keep It Functional) as our iOS UI test framework. KIF is a great test framework for us for the following reasons:

KIF is written in Objective-C and co-exists with our client code. There is no extra layer between our app and testing framework, which greatly reduces any friction in debugging or writing tests by our iOS developers.
Fast. Compared with Selenium (our existing testing framework), KIF is lightning fast.
Easy integration with Xcode. KIF test cases subclass XCTest and fit into Xcode seamlessly. Also, this helps integrate KIF UI tests into our continuous delivery system.

When we started working on the project to rewrite the app, we had around 60 UI tests. However, when everyone started writing KIF tests, the number of tests shot up to around 250. But the ease of developing tests came at a price. The stability of tests dropped dramatically. As I mentioned before, if each test has a one percent flakiness rate, the whole test suite has little chance of passing. At one point, more than half of our commits were getting reverted (we auto-revert a commit if it cannot pass the tests), and many of them were innocent and had no reason to be reverted. So we made the call to stop running tests and focus on fixing the underlying issues, and KIF surgery was a big part of that.

We “forked” an in-house version of KIF (in the process of being open sourced) and made the following changes:

Disabled animation in KIF tests. Animation is slow and introduces lots of race conditions between app behavior and framework actions.
Hid APIs that were potentially flaky. APIs like explicit wait are potentially flaky because transition times differ from machine to machine.
Added reliable wait APIs. Instead of waiting explicitly for a period of time, we added APIs to wait reliably by posting and checking checkpoints.
Fixed other existing unreliable APIs.
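To illustrate the difference between an explicit wait and a checkpoint-based wait, here is a minimal Python sketch. It is not KIF's API or our fork's API; `Checkpoints`, `load_feed`, and the checkpoint name are hypothetical. The app posts a named checkpoint when a transition or load completes, and the test blocks until that checkpoint appears, with a timeout that serves only as an upper bound.

```python
import threading
import time


class Checkpoints:
    """Lets app code post named checkpoints and lets tests wait for them."""

    def __init__(self) -> None:
        self._posted = set()
        self._cond = threading.Condition()

    def post(self, name: str) -> None:
        with self._cond:
            self._posted.add(name)
            self._cond.notify_all()

    def wait_for(self, name: str, timeout: float = 5.0) -> None:
        # Returns as soon as the checkpoint is posted; the timeout is only an
        # upper bound so that a genuinely broken app still fails the test.
        with self._cond:
            reached = self._cond.wait_for(lambda: name in self._posted, timeout)
        if not reached:
            raise TimeoutError(f"checkpoint {name!r} not reached in {timeout}s")


def load_feed(checkpoints: Checkpoints, feed: list) -> None:
    """Simulated async work whose duration varies from machine to machine."""
    time.sleep(0.05)
    feed.extend(["post 1", "post 2"])
    checkpoints.post("feed_loaded")


checkpoints = Checkpoints()
feed: list = []
worker = threading.Thread(target=load_feed, args=(checkpoints, feed))
worker.start()

# Flaky alternative: time.sleep(0.01) and hope the feed has loaded by then.
checkpoints.wait_for("feed_loaded")
worker.join()
print(len(feed))
```

A fixed sleep is either too short on a slow machine (the test fails) or too long on a fast one (the suite crawls). Waiting on a checkpoint avoids both, which is consistent with the stability and speed gains we saw.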

After this exercise, we experimented with some tests using our new KIF test framework and achieved a tenfold speed boost. Even better, the tests became as stable as a rock.

Flaky Tests

With the flaky testing environment and the flaky test framework resolved, we next revisited our testing guidelines. We gathered representatives from the feature teams and re-emphasized our guidelines for writing future tests:

Tests should be useful: don't test trivial calls, like a simple assignment statement.
Tests should be maintainable: tests are written to be refactorable in the future.
Tests should be reliable: if you run the same test 1000 times, you should get the same result 1000 times. Add reliable wait for any transitions, async procedures, data loading and any other cases where you want to assert something after a change of app state.
Tests should be fast: each test case should be as quick as a snap. If your test depends on time-consuming procedures, mock them.
Tests should be complete: edge and negative cases should be included in our tests.

By the time we revamped our testing framework and revisited our testing guidelines, we had around 250 valuable UI tests. It took us around two weeks working together with test owners to fix existing flaky tests, which either used flaky APIs or failed to meet our guidelines, to achieve a stable test suite.

Trunk Guardian

In this section, we are going to discuss our solution to detect and manage flaky tests. We call it the Trunk Guardian because its sole job is to guard the code base from being blocked because of flaky tests.

Now, you might ask at this point, "What about the problem of abuse or negligence? You’ve got all the awesome tools and incredible guidelines set up, but what if people misused your tool or didn't remember your guidelines, and just committed flaky tests?" I’m going to lay out our strategy to guard our code base from any newly added flaky tests.

A quote from Square’s blog:

“At our most stable, we’ve measured one failure out of 1500 builds in a week. Approaching 99.9 percent stability has massively increased the trust in the iOS CI (Continuous Integration) system. We’re working on applying this approach and analysis to other areas where CI is practiced at Square.”

Driven by the premise that flaky tests are worse than no tests, we decided to have a trunk on-call team babysit the code base. Here's how it works: if a commit is reverted due to a test failure, the on-call engineer analyzes the failure. If the commit was innocent and unrelated to the test failure, he or she disables the failing test.

Our trunk monitoring strategy is as follows:

If a build fails its tests, an on-call engineer investigates the reasons for the failure. The on-call engineer has to spend a lot of time diving into the code base, checking the logs, and deciding on the root cause. If the tests failed because of a bad code change, we treat the failure as self-inflicted. Otherwise, the failed tests are regarded as flaky and the on-call engineer disables them.

The investigation is heuristic: our machines cannot deduce the cause of a test failure on their own, so we have to look into the logs manually and decide whether to keep the test or disable it.

Let's step back further and reconsider the reason why we need software testing. Are tests used for the existing code base, or are they there to prevent future regression bugs? Most times, after a feature is complete, it is the latter one. So if a test cannot pass all the time in the existing system, why should we run it?

Based on our test guideline No. 3 mentioned previously, tests should be reliable; if you run the same test 1000 times, you should get the same result 1000 times. So for the same code, if a test passed before but failed later, it is flaky and should be removed from our test suite. And this is also the philosophy of Trunk Guardian.

Based on this and the premise that flaky tests are worse than no tests (call it Theorem 1), we have the following corollary (Corollary 1): “No flaky test should exist in our test suite.”

Based on Corollary 1, we can adopt a new model:

With the same code base, if a test passed before but failed later, that test is flaky and should be disabled.

With this model, the check can be done by our machines. We mark a build LKG (last known good) if it passed all tests in our build pipeline. Since every test has already passed once on any LKG build, a test that fails when run again against the same LKG build should be regarded as flaky and disabled.

Trunk Guardian schedules testing jobs against the LKG build in our machine pool whenever there is enough capacity. If any test fails, we disable that test and report it to the test owner.
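The core loop can be sketched in a few lines of Python. This is an illustration of the rule described above, not the actual Trunk Guardian code: the function names, the runner signature, the rerun count, and the build and test identifiers are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical runner signature: (build_id, test_name) -> True if the test passed.
TestRunner = Callable[[str, str], bool]


def find_flaky_tests(
    lkg_build: str,
    tests: Iterable[str],
    run_test: TestRunner,
    reruns: int = 10,
) -> Set[str]:
    """Re-run tests against a build where all of them have already passed once.

    Any failure on that unchanged build cannot be blamed on a new commit,
    so the test is treated as flaky.
    """
    flaky: Set[str] = set()
    for test in tests:
        for _ in range(reruns):
            if not run_test(lkg_build, test):
                flaky.add(test)
                break  # one failure on LKG is enough to disable the test
    return flaky


def guard_trunk(lkg_build, enabled_tests, run_test, notify_owner, reruns=10):
    """Report each flaky test to its owner; return the tests that stay enabled."""
    flaky = find_flaky_tests(lkg_build, enabled_tests, run_test, reruns)
    for test in sorted(flaky):
        notify_owner(test)
    return [t for t in enabled_tests if t not in flaky]


# Demo with a deterministic fake runner: "testFeedScroll" fails on its 3rd run.
run_counts: Dict[str, int] = {}


def fake_runner(build_id: str, test_name: str) -> bool:
    run_counts[test_name] = run_counts.get(test_name, 0) + 1
    return not (test_name == "testFeedScroll" and run_counts[test_name] == 3)


reported: List[str] = []
still_enabled = guard_trunk(
    "build-1234", ["testLogin", "testFeedScroll"], fake_runner, reported.append
)
print(still_enabled, reported)
```

The key design choice is that the code under test is held constant. That turns the question "is this test flaky?" from a manual log investigation into a mechanical check, which is what lets machines take over from the on-call engineer.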

Here is a design graph of our Trunk Guardian service:

Trunk Guardian is developed in Python and runs on top of our continuous integration (CI) environment. After we introduced Trunk Guardian into our build system, the rate of flakiness in tests dropped from a high point of 40% down to 0.1%, with 700 UI tests and 1,100 unit tests.

Conclusion

Based on our experience, we can safely conclude that building a robust UI testing ecosystem for mobile clients requires significant effort, but it is possible. Having this ecosystem in place greatly increases our velocity in building all of our iOS apps and catalyzes our multi-app strategy. It has also had a huge impact on our mobile engineering team: it makes our engineers an example of an engineering culture that demands excellence and pursues craftsmanship, and it makes our developers much happier and much more productive.

Acknowledgements

Thanks to the team that made this happen:

Jacek Suliga, Serguei Kasianov, Yuichi Sasaki, Ashit Gandhi, Kamilah Taylor, Frederick Fung, Shaobo Sun, Ankit Goyal, Kane Ho
