A/B Testing and Statistical Significance

Phillip_Wild · 06-02-2016

* Posting this as a discussion since I don't have blogging rights. But hey, the best articles are discussions anyway, right?

Stick your hand in a jar and pick out five jellybeans. What colour are they? Three blue and two red?

Great. So the jar must be 60% blue jellybeans and 40% red. Right?

Well, no, not necessarily, most of us would say. But this is exactly the mistake we often make when A/B testing our email marketing programs. There's a harsh reality that we need to face as A/B testers...sometimes the effects you are "seeing in the data" aren't really there at all.
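To put rough numbers on the jellybean intuition, here is a minimal sketch. It assumes the jar is large enough that each draw is effectively independent (a binomial model), and the function name `prob_exact_blue` and the list of candidate jars are made up for illustration.

```python
from math import comb

def prob_exact_blue(n_draws, n_blue, blue_fraction):
    """Binomial probability of drawing exactly n_blue blue beans in n_draws.

    Assumes independent draws, i.e. a jar big enough that removing a few
    beans does not noticeably change the mix.
    """
    return (
        comb(n_draws, n_blue)
        * blue_fraction ** n_blue
        * (1 - blue_fraction) ** (n_draws - n_blue)
    )

# How likely is "3 blue, 2 red" for jars with different true mixes?
for frac in (0.4, 0.5, 0.6, 0.7):
    chance = prob_exact_blue(5, 3, frac)
    print(f"jar is {frac:.0%} blue -> P(3 blue, 2 red) = {chance:.3f}")
```

Every one of those jars produces "3 blue, 2 red" fairly often, so a handful of five beans tells you very little about which jar you are holding.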

Setting up your tests correctly and analysing the results with care will ensure that false assumptions are minimised and the majority of effects are found.

False Positives

Cast your mind back to high school maths....yep, it's further than we might like, agreed. But if the phrase "p-value" doesn't mean anything to you now, then it should soon!

Let's say you run an A/B test in Marketo on a subject line. We theorise that including a person's first name in the subject line will increase open rates. Here's our test setup:

H0 (or the null hypothesis): Including a person's first name in the subject line does not have an effect on the open rate.

H1 (or the alternative hypothesis): Including a person's first name in the subject line DOES have an effect on the open rate.

Version A, the control without the first name used, results in 400 opens from 1200 deliveries. Version B, the alternative WITH a first name used, results in 420 opens from 1200 deliveries. Great, right? Let's put that first name in every email we send! Well, possibly. But not necessarily.

While Marketo shows you a pretty little column chart, you will have to delve a bit deeper to get any certainty about your results. For this step, you're going to need to use a mathematical function known as the chi-squared test. Don't stress, it's easy to use, and there's an online version you can find here.

You plug in the numbers and get the result: "one-tailed p-value: 0.194". What does this mean? Technically speaking, it's the chance of seeing a difference at least as large as the one you observed, assuming that the null hypothesis is true. In this case, 19.4%. That's a lot.
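If you'd rather check the calculator's answer yourself, here is a minimal sketch using only the Python standard library. It uses a pooled two-proportion z-test, which for a 2x2 table matches the chi-squared test without continuity correction (z squared equals the chi-squared statistic); whether the online calculator does exactly this is an assumption on my part, and the function name `one_tailed_p` is just illustrative.

```python
from math import erfc, sqrt

def one_tailed_p(opens_a, sent_a, opens_b, sent_b):
    """One-tailed p-value that B's open rate exceeds A's.

    Pooled two-proportion z-test; assumed (not confirmed) to match the
    online chi-squared calculator's one-tailed figure.
    """
    rate_a = opens_a / sent_a
    rate_b = opens_b / sent_b
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (rate_b - rate_a) / se
    # Upper-tail area of the standard normal distribution beyond z.
    return 0.5 * erfc(z / sqrt(2))

# Version A: 400 opens / 1200 delivered. Version B: 420 opens / 1200 delivered.
p = one_tailed_p(400, 1200, 420, 1200)
print(f"one-tailed p-value: {p:.3f}")
```

This lands at roughly 0.19, in line with the calculator result quoted above.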

In our situation, it means that roughly one time in five, random variation alone would show you a lift like this, and you might modify your email strategy going forward based on an effect that simply is not there. That's a big deal. This might be misleading you for years to come if it becomes part of your core subject line strategy.

So if you see a p-value of 0.194....you probably couldn't make an informed decision with simply this one test to guide you. As a good rule of thumb, anything above 0.05 is considered not "statistically significant". That means that what you observed could easily be random variation - like our jellybeans earlier.

A p-value threshold of 0.05 means accepting a 5% chance of declaring a winner when there is really no difference. So, how sure do you want to be? This depends on the importance of the test to your overall strategy. If it's quite important, I wouldn't bet the farm with a 5% chance of being wrong - much less a 20% chance!
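One way to see what the 0.05 cut-off buys you is to simulate "A/A" tests, where both versions truly have the same open rate, and count how often a test still looks significant. This is a rough sketch only: the 33% open rate, 500 deliveries per arm, 1000 trials and the seed are all made-up parameters, and the p-value function is the same pooled z-test approximation of the chi-squared test.

```python
import random
from math import erfc, sqrt

def one_tailed_p(opens_a, sent_a, opens_b, sent_b):
    """One-tailed p-value that B's open rate exceeds A's (pooled z-test)."""
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (opens_b / sent_b - opens_a / sent_a) / se
    return 0.5 * erfc(z / sqrt(2))

def simulate_false_positive_rate(true_rate=0.33, sent=500, trials=1000,
                                 alpha=0.05, seed=1):
    """Share of A/A tests (no real effect) that still come out 'significant'."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        # Both arms share the same true open rate, so the null hypothesis holds.
        opens_a = sum(rng.random() < true_rate for _ in range(sent))
        opens_b = sum(rng.random() < true_rate for _ in range(sent))
        if one_tailed_p(opens_a, sent, opens_b, sent) < alpha:
            false_positives += 1
    return false_positives / trials

rate = simulate_false_positive_rate()
print(f"share of no-effect tests flagged as significant: {rate:.1%}")
```

The share should hover around 5%, which is exactly the risk you sign up for when you treat 0.05 as the bar.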

So...how do I reduce the p-value and get more confidence in my test results?

There are a couple of ways you can do this.

Increase the sample size. The more people that are in your test, the more confident you can be. Going back to our jar example - if you draw out 2 red jellybeans in a row, what's more likely - that the jar only has red jellybeans, or that it's pure coincidence? What about if you drew out 50 in a row? 500? 5 million? You get the idea. With a larger sample comes more confidence. Try it yourself with the chi-squared calculator - instead of 400 / 1200 and 420 / 1200, try 40,000 / 120,000 and 42,000 / 120,000. The p-value drops to effectively zero. Virtually no chance of a false positive! I like those odds.
Increase the effect. The larger a difference your effect makes, the smaller the p-value. If instead of 420 opens, our alternative resulted in 500 opens, the p-value drops to virtually zero. Of course, increasing the effect is easier said than done!
Repeat the test. If you can repeat the same test on the same audience, then you can be more certain that the effect you are seeing is actually there. Let's say that you complete the same test twice, with p-values of 0.05 both times. Now - the chance of both tests producing a false positive if the null hypothesis is true is 0.05 * 0.05 - which is 0.0025, or 0.25%. You're now getting 1 in 400 wrong, instead of 1 in 20. Be careful doing this though, since the simple fact that you've already tested the effect on the audience will bias the second test somewhat. Unfortunately, you can't test in a vacuum!
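The three levers above can be sketched with the numbers from this post. As before, this is a pooled two-proportion z-test standing in for the chi-squared calculator (an assumption, not necessarily what the online tool does), and the variable names are illustrative. The "repeat" line also assumes the two tests are independent, which, as noted, is not quite true on the same audience.

```python
from math import erfc, sqrt

def one_tailed_p(opens_a, sent_a, opens_b, sent_b):
    """One-tailed p-value that B's open rate exceeds A's (pooled z-test)."""
    pooled = (opens_a + opens_b) / (sent_a + sent_b)
    se = sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (opens_b / sent_b - opens_a / sent_a) / se
    return 0.5 * erfc(z / sqrt(2))

# The original test: same rates, 1200 deliveries per version.
baseline = one_tailed_p(400, 1200, 420, 1200)

# Lever 1: same open rates, 100 times the sample.
bigger_sample = one_tailed_p(40_000, 120_000, 42_000, 120_000)

# Lever 2: same sample, but the alternative gets 500 opens instead of 420.
bigger_effect = one_tailed_p(400, 1200, 500, 1200)

# Lever 3: two independent tests that each just clear the 0.05 bar.
both_false_positive = 0.05 * 0.05

print(f"baseline:       {baseline:.3f}")
print(f"bigger sample:  {bigger_sample:.2e}")
print(f"bigger effect:  {bigger_effect:.2e}")
print(f"repeated twice: {both_false_positive:.4f}")
```

The bigger-sample and bigger-effect p-values come out tiny but not literally zero, which is why "effectively zero" is the safer way to describe them.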

By testing with one eye on the p-value, you will ensure that you're not running too large a risk of incorrect conclusions. If you'd like some further reading on this, I recommend the Email Marketing Tipps website and email newsletter. You can use this article as a starting point. Ignore the 1990s styling; it's an excellent resource for those interested in both statistics and email marketing.

Happy testing!

Julie_Leo · 10-16-2018

We recently ran an A/B test yielding high custom conversion results, shown below. The custom conversion was set for clicks on a specific URL in the email. The test was set for each email to be delivered to an equal number of people. The metrics appear high compared to other A/B metrics I've seen. Do they make sense to you?

Email A: 0.901

Email B: 0.994

Thanks,

Julie

Phillip_Wild · 10-17-2018

Hi Julie

Depending on the email, to have a click rate of close to 1% on a particular URL seems quite reasonable.

Grace_Brebner3 · 10-16-2018

Hey Julie,

You might want to create a new thread for your question, as this thread is two years old. You'll also need to provide more info about what your custom conversion rules were; we can't say whether the numbers make sense without knowing what they relate to.

Kristy_Murphy · 09-19-2017

Very helpful! Great add.

Colin_Mann · 06-23-2016

Very interesting thinking, and it makes a lot of sense. Thank you. It's been 20 years since I last carried out chi-squared testing!

Ayan_Talukder · 06-22-2016

Thanks for sharing Phillip, this is great!

Geoff_Krajeski1 · 06-08-2016

Great add Phillip Wild. Gets my mind thinking...

Much agreed with the 5% rule!

Geoff_Krajeski1 · 06-08-2016

Gets me thinking of the long-tail effect of our efforts:

What was the 'touch influence' of the test options?
How many changed RCE Stage?
How many opportunities were opened as a result?
How many became Closed/Won?
Etc?

Phillip_Wild · 06-08-2016

Yep. Of course, you need larger and larger samples to test anything further down the funnel, since conversion rates are typically pretty low for those metrics. But if you can use that as your criterion, that's better tied to revenue than simply opens or clicks!
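To give a feel for how quickly the required sample grows, here is a rough sketch using the standard normal-approximation formula for comparing two proportions. All the inputs are hypothetical: a 33% open rate versus a 1% down-funnel conversion rate, a 5% relative lift, a one-tailed 0.05 threshold and 80% power. The function name `sample_size_per_arm` is mine, not something from Marketo.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate recipients needed per version for a one-tailed test.

    Normal-approximation formula for two proportions; treat the result as a
    ballpark figure, not an exact requirement.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # significance threshold
    z_power = NormalDist().inv_cdf(power)      # chance of catching a real effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical comparison: the same 5% relative lift on two very different metrics.
opens_n = sample_size_per_arm(0.33, 0.05)  # open rate around 33%
deals_n = sample_size_per_arm(0.01, 0.05)  # down-funnel conversion around 1%

print(f"per-version sample to test opens:       {opens_n:,}")
print(f"per-version sample to test conversions: {deals_n:,}")
```

Under these assumptions the low-rate metric needs a far larger audience than the open-rate test, which is the practical obstacle to testing on revenue-level outcomes.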

Anonymous · 06-07-2016

Wild thing! Great add.

Justin_Norris1 · 06-07-2016

This is super useful. Thanks Phillip Wild.