The golden rule of A/B testing: look beyond validation

Data Analyst, Intercom

Main illustration: Andrew Watch

A/B tests provide more than statistical validation of one execution over another. They can and should impact how your team prioritizes projects.

Too often, teams use A/B testing to validate bad ideas. They make minor changes and hope the test will produce big wins. But these tests can be counterproductive. Results that could be a product of random variation (i.e. not statistically significant) yield unhelpful insights, and even good results are not guaranteed to hold true when your winning variant is shipped to a full audience.

If you adhere to A/B testing best practices and ask the right questions before running a test, you’ll learn what types of changes are actually worth your time and focus on projects that produce meaningful insights.

The impact of statistical power

There’s a lot written on frequent A/B testing mistakes. The most common error is calling tests too soon based on statistical significance. Sometimes your test result is significant because there’s an actual effect; other times it’s due to sheer noise. After all, a random sample is never going to be a perfect representation of the full population.

To differentiate an effect from noise, you need not only statistical significance but also statistical power: the probability that your test detects an effect when one really exists. The more power you have, the more certain you can be that a significant lift is actually real.
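To make the idea of power concrete, here is a minimal Python sketch that approximates the power of a two-sided test comparing two conversion rates. It uses the standard normal approximation. The function name `approx_power`, the 5% significance level and the visitor counts are illustrative assumptions, not figures from our own tests.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Normal approximation with unpooled variance. It ignores the negligible
    chance of a significant result in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = sqrt(
        (p_control * (1 - p_control) + p_variant * (1 - p_variant)) / n_per_variant
    )
    return NormalDist().cdf(abs(p_variant - p_control) / std_err - z_crit)


# Hypothetical numbers: 700 visitors per variant, 10% baseline conversion.
big_lift = approx_power(0.10, 0.15, 700)    # true rate moves to 15%
small_lift = approx_power(0.10, 0.11, 700)  # true rate moves to 11%
print(f"power for 10% -> 15%: {big_lift:.2f}")
print(f"power for 10% -> 11%: {small_lift:.2f}")
```

Under these assumptions, the larger lift is detected roughly four times out of five, while the one-point lift is detected less than one time in ten with the same traffic.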

To get enough power and run a test correctly, ask yourself:

1. How much do you think the change will increase the associated key performance indicator (KPI)?
2. Given this desired effect, how long will you need to run the test to get accurate results?
3. Is it worth the wait?

1. How much do you think the change will increase the KPI?

Say you want to improve your signup flow. You have a list of ideas and are trying to decide which ones to work on. You’re not happy with the UI, but a complete overhaul will take a month to design and build. On the other hand, you could try out a different color scheme, which won’t take long to change. Your team hypothesizes that the complete redesign can boost conversion from 10% to 15% whereas the color scheme change may boost conversion from 10% to 11%.

2. Given this desired effect, how long will you need to run the test to get accurate results?

Take your answer to the question above and plug it into a sample size calculator. Under the hood there are serious statistics involved, but the basic logic is that smaller effects take longer to detect whereas larger effects will be obvious sooner. This is an important insight: detecting small changes is expensive. In our example, the color scheme change only takes two days to build, but we’ll need 24x more data to test a 1 percentage point lift than a 5 percentage point lift.
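For readers who want to see what such a calculator does, here is a rough Python sketch of one common textbook formula for comparing two proportions. The function name `sample_size_per_variant`, the 5% significance level and the 80% power are assumptions for illustration. The calculator referenced above may use a different formula or different defaults.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# The two hypotheses from the signup flow example
redesign_n = sample_size_per_variant(0.10, 0.15)  # complete redesign
color_n = sample_size_per_variant(0.10, 0.11)     # color scheme change
print(redesign_n, color_n, round(color_n / redesign_n, 1))
```

The exact figures will differ from one calculator to another, but with these settings the ratio lands in the low twenties, the same ballpark as the 24x quoted above. The required sample grows with the inverse square of the effect you want to detect, which is why small lifts are so expensive to test.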

3. Is it worth the wait?

With sample sizes in mind, you should consider how long a test would need to run to achieve reliable results. Say your signup flow gets 600 visits per day. The complete redesign will require two days to gather enough data, while the color scheme change will take 47 days. So the larger project takes 32 days to develop and test (a 30-day build plus 2 days of testing), while the smaller project takes 49 (a 2-day build plus 47 days of testing). They both take a lot of time, but the complete redesign has more potential.
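The arithmetic behind that comparison can be sketched in a few lines of Python. The helper names are hypothetical, and the per-variant sample sizes are not stated in this article: they are back-calculated from the test durations implied by the totals above (32 days minus a 30-day build, 49 days minus a 2-day build), assuming traffic is split evenly between two variants.

```python
from math import ceil


def days_to_test(n_per_variant, daily_visits, variants=2):
    """Days until every variant reaches its sample size, with an even split."""
    return ceil(n_per_variant * variants / daily_visits)


def total_days(build_days, n_per_variant, daily_visits):
    """Build time plus test time for one project."""
    return build_days + days_to_test(n_per_variant, daily_visits)


DAILY_VISITS = 600  # signup flow traffic from the example

# Assumed sample sizes per variant (back-calculated, see the note above)
redesign = total_days(build_days=30, n_per_variant=600, daily_visits=DAILY_VISITS)
color = total_days(build_days=2, n_per_variant=14_100, daily_visits=DAILY_VISITS)
print(redesign, color)  # 32 49
```

Swapping in your own traffic and the sample sizes from your calculator gives a quick estimate of the full cost of each idea before you commit to it.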

Focus on projects with bigger potential upside

Late last year, we thought our signup flow could do better. The layout on our previous signup page (shown below as the Control) was not logically organized. The integration methods were not displayed in an order that was most likely to be relevant to new users. And since our signup flow doesn’t see hundreds of thousands of visitors, we knew we had to test something that we thought was much better.

We wanted our signup conversion to increase by 10%. And after getting enough data, we looked at the signup rates between versions. Disappointingly, the difference was small and not statistically significant. The test was a wash, but that’s okay. We still decided to use the new version, because the signup experience was cleaner and put us in a better position to iterate and improve in the future.

What this approach means for startups

For young companies, small optimization projects just don’t make sense. Tests take a long time to run and distract you from working on projects that matter. You might see a slight uptick, but you’re likely working towards a local maximum. To get on the path towards the global maximum, young companies need to make big changes.

Larger companies have spent a lot of time working on flows and understanding their customers. They’re better suited for small improvements because they have the traffic to conclude tests faster. Plus, a 1% improvement means a lot more when you have hundreds of thousands of visitors a day.

Once you’ve run several large tests, understand your customers well and have high traffic volume, you can work on small optimizations. Remember, not every A/B test is going to yield the result you want. What’s important is that you determine your improvement goal and account for how long you need to run the tests to get reliable results. Otherwise you risk spending a lot of resources only to end up with a negative ROI.