Better ROI on end-to-end tests

Typically there are three major problems with browser-driven end-to-end tests:

  1. slowness
  2. flakiness
  3. brittleness

Testing terminology is hilariously overloaded, so before going any further, let me clarify what I mean by "end-to-end test":

"A helper robot that behaves like a user to click around the app and verify that it functions correctly." - Testing JavaScript

So the whole point is to verify that your app isn't busted, from your user's point of view.

But when you combine slowness and flakiness and brittleness, and multiply that by a growing team and a growing test suite, and throw it all in a blender, you end up with a VERY SUBOPTIMAL smoothie.

You end up with a sluggish feedback loop around the correctness of your app. When something IS busted it's unclear what – is it the app, or the test, or some external dependency that's busted? You end up with a compounding problem that throws a wrench in your team's workflow, wasting a lot of engineering time.

So...maybe you throw them out altogether.

But having NO end-to-end tests is an even worse position to be in! (<-- the hill I will die on)

There are a lot of different ideas about the relative proportions of different test types (unit, integration, e2e, etc.) that should make up your collection of automated tests. I'm not going to try to convince you that there is one ideal "shape" – a test pyramid, or an inverted pyramid, or a rutabaga, or whatever – that will work across all organizations and projects. It depends.

But a foundational assumption for the rest of this post is that it is usually worth having some number of end-to-end tests for critical business functions, because over time those tests are likely to save you a bunch of time and prevent a nontrivial number of customer-facing incidents.

This post will discuss strategies for chipping away at each of slowness, flakiness, and brittleness, to improve the return on time-investment for end-to-end tests.

Automate less

As developers, it's easy to get carried away with the idea of automating EVERYTHING.

We often overlook the fact that the expected value of automated test cases varies widely across different user stories / features, and we underestimate the actual time commitment:

  • time to write the test in the first place (one-time)
  • time to maintain the test (ongoing)
  • time that each person has to wait on each additional test running (ongoing)
  • time spent interpreting noise and fixing things when a given test fails (ongoing)

Thus the most powerful lever we can pull is choosing not to automate something in the first place unless it's really worth it.

This will positively impact slowness, flakiness, and brittleness, all in one go.

But how do we decide if a given test case is worth automating?

Watch the amazing talk "Which Tests Should We Automate?" by Angie Jones.

Angie shares a step-by-step approach and an example rubric to put each potential automated test case through, factoring in risk, value, cost-efficiency, and historical data. (The breakdown of "actual time" above also comes from that talk).

It's easy to think you'd arrive at the same conclusions by carefully asking yourself for each test "is this one worth it?" – but you almost certainly wouldn't.

Use a repeatable, tunable process for carefully evaluating the factors that are important for you.
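To make "repeatable and tunable" concrete, here is a minimal sketch of a scoring rubric. The four factor names follow the categories mentioned above (risk, value, cost-efficiency, history), but the 1-5 scale, the weights, and the 0.7 threshold are illustrative assumptions, not the rubric from the talk. Tune all of them for your own team.

```typescript
// Hypothetical rubric: scale, weights, and threshold are assumptions for illustration.
interface CandidateTest {
  name: string;
  risk: number;           // 1-5: how likely and how costly is a failure here?
  value: number;          // 1-5: how much does this flow matter to users and the business?
  costEfficiency: number; // 1-5: how cheap is the test to write, run, and maintain?
  history: number;        // 1-5: how often has this area broken before?
}

interface RubricWeights {
  risk: number;
  value: number;
  costEfficiency: number;
  history: number;
}

const defaultWeights: RubricWeights = { risk: 2, value: 2, costEfficiency: 1, history: 1 };

// Weighted sum of the four factors.
function rubricScore(t: CandidateTest, w: RubricWeights = defaultWeights): number {
  return (
    t.risk * w.risk +
    t.value * w.value +
    t.costEfficiency * w.costEfficiency +
    t.history * w.history
  );
}

// Highest possible score: every factor rated 5.
function maxRubricScore(w: RubricWeights = defaultWeights): number {
  return 5 * (w.risk + w.value + w.costEfficiency + w.history);
}

// Automate only when the score reaches the chosen fraction of the maximum.
function shouldAutomate(
  t: CandidateTest,
  threshold: number = 0.7,
  w: RubricWeights = defaultWeights
): boolean {
  return rubricScore(t, w) / maxRubricScore(w) >= threshold;
}
```

The point of writing it down is that the weights and threshold become something the team can argue about and adjust, instead of a gut call made differently by each person.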

If you only do one thing, do this!

Also, give careful consideration to how many browsers you really need to test against.


Slowness

One way to address end-to-end tests that are painfully slow to run is hand-optimization of individual tests that are slow.

But global optimizations seem like a better place to start.

Balancing tests across a fixed number of parallel workers, so that each test group runs in about the same time, is a huge improvement over running the tests one-at-a-time.
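As a rough illustration of what "balancing" can mean, here is a sketch that assumes you have a recorded duration for each test. It uses a simple greedy heuristic (slowest tests first, each assigned to the least-loaded worker); the type and function names are hypothetical, and real test runners may balance differently.

```typescript
interface TimedTest {
  name: string;
  seconds: number; // duration from a previous run (assumed to be available)
}

// Greedy balancing: sort slowest first, then always hand the next test
// to the worker with the smallest total time so far.
function balanceTests(tests: TimedTest[], workerCount: number): TimedTest[][] {
  if (!Number.isInteger(workerCount) || workerCount < 1) {
    throw new Error("workerCount must be a positive integer");
  }
  const groups: TimedTest[][] = Array.from({ length: workerCount }, () => []);
  const totals: number[] = new Array<number>(workerCount).fill(0);
  const slowestFirst = [...tests].sort((a, b) => b.seconds - a.seconds);
  for (const test of slowestFirst) {
    const lightest = totals.indexOf(Math.min(...totals));
    groups[lightest].push(test);
    totals[lightest] += test.seconds;
  }
  return groups;
}

// Wall-clock time for the whole suite is set by the slowest worker.
function wallClockSeconds(groups: TimedTest[][]): number {
  return Math.max(0, ...groups.map((g) => g.reduce((sum, t) => sum + t.seconds, 0)));
}
```

With one worker the wall-clock time is the sum of all durations. As the worker count approaches the number of tests, it shrinks toward the duration of the single slowest test.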

But what if we could run all the tests at the same time, so that the total time is only about as long as the slowest test?

"UI Testing at Scale with AWS Lambda" describes Blackboard's journey from:

  • running tests one-at-a-time (three hours)
  • to balancing tests across Docker containers, each running a fixed number of tests concurrently via threads (16 minutes)
  • to maximizing parallelization on AWS Lambda (39 seconds!)

I've written about saving loads of time and money by parallelizing both end-to-end tests and performance tests on AWS Lambda.

There are plenty of use cases and workloads for which functions-as-a-service are not currently a great fit, but if you can work around the complexities of getting a browser running in this environment, the advantages of running a spiky workload like end-to-end tests here can be huge.

Regardless of specifics, the most important thing is to focus on the global optimizations first and find a way to achieve greater than 1x concurrency.


Flakiness

What exactly is flaky/janky/intermittent/unreliable, and can we directly impact it? Some common sources of flakiness:

  • web services our app depends on that our organization controls
  • web services our app depends on that our organization does NOT control
  • the platform – the testing tool itself, the browser itself, the network, our dev and staging environments, etc.

If our end-to-end tests are flaky because our staging environment is generally unreliable, the solution is to address that unreliability directly – the flakiness in the e2e tests is just a symptom of a bigger problem.

If we can see what is unreliable and directly fix it at the source, we should.

But it's possible we see the source of flakiness and can't directly impact it – perhaps it's a service controlled by another organization.

And there is a good chance that the tools we're using limit our visibility into the problem, so maybe we can't identify the source of flakiness or debug effectively in the first place.

Another common issue with end-to-end tests is that it's hard to isolate exactly what is broken. Part of this is unavoidable – by definition we are testing EVERYTHING, so there is a wide range of things that can fail – but by improving visibility and reducing flakiness, we can lower the frequency of failures and make it easier to understand them when they do happen.

I'm cautiously optimistic that the latest generation of end-to-end testing tools – most notably Puppeteer and Cypress – offer promising solutions that tackle visibility, debuggability, and flakiness head-on.

Cypress in particular distinguishes itself by running within the browser, in the same run loop as your app, rather than executing remote commands over the network. Cypress is fast, offers unprecedented visibility into what's happening in your app when failures occur, comes with nice tools for debugging, and has an active and helpful community (and great docs!).

Here are a couple of useful posts about folks' experience migrating from Selenium to Cypress.

In the event that you know what's flaky but can't directly control it, both Puppeteer and Cypress offer much simpler methods to stub external services than what you'd have to do with Selenium, if you decide to go down that road.
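For a sense of what stubbing can look like, here is a hedged sketch built around Puppeteer-style request interception. The `InterceptedRequest` interface is a hand-written stand-in covering only the three request methods the sketch uses (`url`, `respond`, `continue`), so the block runs without Puppeteer installed. The `Stub` type, `makeStubHandler`, and the example URL are hypothetical names for illustration.

```typescript
// Minimal stand-in for the part of Puppeteer's request object used here (assumption:
// only url(), respond(), and continue() are needed).
interface InterceptedRequest {
  url(): string;
  respond(response: { status: number; contentType: string; body: string }): Promise<void>;
  continue(): Promise<void>;
}

// A canned response for any request whose URL starts with urlPrefix.
interface Stub {
  urlPrefix: string;
  status: number;
  body: unknown; // serialized to JSON before being returned
}

function findStub(url: string, stubs: Stub[]): Stub | undefined {
  return stubs.find((s) => url.startsWith(s.urlPrefix));
}

// Builds a request handler: stubbed URLs get the canned response,
// everything else goes to the real network untouched.
function makeStubHandler(stubs: Stub[]): (request: InterceptedRequest) => Promise<void> {
  return (request: InterceptedRequest): Promise<void> => {
    const stub = findStub(request.url(), stubs);
    return stub
      ? request.respond({
          status: stub.status,
          contentType: "application/json",
          body: JSON.stringify(stub.body),
        })
      : request.continue();
  };
}

// Wiring it into a Puppeteer page would look roughly like this:
//   await page.setRequestInterception(true);
//   page.on("request", makeStubHandler([
//     { urlPrefix: "https://payments.example.com/", status: 200, body: { ok: true } },
//   ]));
```

Only the flaky dependency is replaced; requests to your own app still hit the real thing, so the test keeps most of its end-to-end value. Current versions of Cypress expose a similar capability through `cy.intercept`.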


Brittleness

brit·tle: hard but liable to break or shatter easily.

End-to-end tests are typically liable to shatter easily – which makes them a real pain to maintain over time – because of problems related to specificity.

For example: we want to verify that after a user clicks a button to order a pizza, they see a "success" message. In our automated test, we need to somehow select the button to click.

  • If we say "give me the first button element", that's probably not specific enough. If somebody adds another button to the page, the test could break.
  • If we say "give me the button that says ORDER NOW", that is probably too specific. If somebody changes the text to "ORDER", the test will break.
  • If we say "give me the button with the CSS class btn-primary", that might be wrongly specific. Somebody might change the class to btn-secondary, totally unaware of the test depending on it, and the test will fail.

The challenge is that an actual user will look at the page and figure out where to click, regardless of the minor details, because they want to get some of that sweet pizza. The automated test, sadly, isn't motivated by actual pizza and depends heavily upon the fickle details.

I've found using data attributes to be a solid approach to reduce brittleness. For example, we could add an attribute of data-test-id="order-pizza-button" to the button and select the button based on that. This also makes it obvious within the code that a test depends on the attribute.
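A small helper keeps these selectors consistent across a test suite. This is a minimal sketch: `byTestId` is a hypothetical helper name, and it assumes the button is rendered with markup like `<button data-test-id="order-pizza-button">`.

```typescript
// Builds a CSS attribute selector for an element tagged with data-test-id.
function byTestId(id: string): string {
  // Escape backslashes and double quotes so the id is safe inside the quoted value.
  const escaped = id.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `[data-test-id="${escaped}"]`;
}

// The three brittle selectors from the list above, next to the data-attribute one.
const orderButtonSelectors = {
  notSpecificEnough: "button",       // breaks when another button is added
  wronglySpecific: ".btn-primary",   // breaks when styling classes change
  byDataAttribute: byTestId("order-pizza-button"),
};
```

The resulting string works anywhere a CSS selector is accepted, for example `cy.get(byTestId("order-pizza-button")).click()` in Cypress or `await page.click(byTestId("order-pizza-button"))` in Puppeteer. Changing the button's text or styling classes no longer affects the test.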

At the very least, if you're selecting elements based on classes, it's useful to make sure the classes are strong indicators of the page structure and semantics, not the presentation (e.g. don't select based on center-align).


There's no magic bullet here, but we can chip away at the downsides of end-to-end tests by:

  • Using a process to carefully choose what to test in the first place
  • Addressing slowness/flakiness/brittleness one-by-one, starting with the more global optimizations and then moving toward individual test cases
  • Exploring and embracing new tools where possible (I recommend trying Cypress)

And taken together, these improvements can dramatically change the ratio of time spent on end-to-end tests to the value captured from them.
