
A/B Testing and Statistical Significance

Discussion created by Phillip Wild on Jun 2, 2016
Latest reply on Oct 17, 2018 by Phillip Wild

* Posting this as a discussion since I don't have blogging rights. But hey, the best articles are discussions anyway, right?

 

 

Stick your hand in a jar and pick out five jellybeans. What colour are they? Three blue and two red?

 

Great. So the jar must be 60% blue jellybeans and 40% red. Right?

 

Well, no, not necessarily, most of us would say. But this is exactly the mistake we often make when A/B testing our email marketing programs. There's a harsh reality that we need to face as A/B testers... sometimes the effects you are "seeing in the data" aren't really there at all.

 

Setting up your tests correctly and analysing the results with care will keep false conclusions to a minimum and help you find the effects that really are there.

 

False Positives

 

Cast your mind back to high school maths... yep, it's further back than we might like, agreed. But if the phrase "p-value" doesn't mean anything to you now, it should soon!

 

Let's say you run an A/B test in Marketo on subject lines. We theorise that including a person's first name in the subject line will increase open rates. Here's our test setup:

 

H0 (or the null hypothesis): Including a person's first name in the subject line does not have an effect on the open rate.

H1 (or the alternative hypothesis): Including a person's first name in the subject line DOES have an effect on the open rate.

 

Version A, the control without the first name used, results in 400 opens from 1200 deliveries. Version B, the alternative WITH a first name used, results in 420 opens from 1200 deliveries. Great, right? Let's put that first name in every email we send! Well, possibly. But not necessarily.

 

While Marketo shows you a pretty little column chart, you will have to delve a bit deeper to get any certainty about your results. For this step, you're going to need a statistical test known as the chi-squared test. Don't stress, it's easy to use, and there are plenty of online calculators that will run it for you.

You plug in the numbers and get the result: "one-tailed p-value: 0.194". What does this mean? Technically speaking, it's the chance of seeing results at least as extreme as the ones you did, assuming that the null hypothesis is true. In this case, 19.4%. That's a lot.
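
If you'd rather script this than use a web calculator, here's a minimal sketch of the same calculation in Python. It assumes the scipy library (my choice, not something from the original post) and runs a chi-squared test on the 2x2 table of opens and non-opens, reproducing the roughly 0.19 one-tailed p-value:

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows = versions, columns = [opened, did not open]
table = [[400, 1200 - 400],   # Version A: control, no first name
         [420, 1200 - 420]]   # Version B: first name in subject line

chi2, p_two_sided, dof, expected = chi2_contingency(table, correction=False)

print(f"Open rates: A = {400/1200:.1%}, B = {420/1200:.1%}")
print(f"Two-sided p-value: {p_two_sided:.3f}")

# The difference runs in the direction we predicted (B > A), so the
# one-tailed p-value is roughly half the two-sided value: about 0.19
print(f"One-tailed p-value: {p_two_sided / 2:.3f}")
```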

 

In our situation, it's the chance that you concluded there was an effect (and probably modified your email strategy going forward) when that effect simply isn't there. That's a big deal. It could mislead you for years to come if it becomes part of your core subject line strategy.

 

So if you see a p-value of 0.194... you probably couldn't make an informed decision with this one test alone to guide you. As a good rule of thumb, anything above 0.05 is considered not "statistically significant". That means that what you observed could easily be random variation - like our jellybeans earlier.

 

A p-value threshold of 0.05 means accepting a 5% chance of a false positive. So, how sure do you want to be? That depends on how important the test is to your overall strategy. If it's quite important, I wouldn't bet the farm with a 5% chance of being wrong - much less a 20% chance!

 

So...how do I reduce the p-value and get more confidence in my test results?

 

There are a few ways you can do this.

 

  1. Increase the sample size. The more people that are in your test, the more confident you can be. Going back to our jar example - if you draw out two red jellybeans in a row, what's more likely - that the jar only has red jellybeans, or that it's pure coincidence? What about if you drew out 50 in a row? 500? 5 million? You get the idea. With a larger sample comes more confidence. Try it yourself with the chi-squared calculator (or the sketch after this list) - instead of 400 / 1200 and 420 / 1200, try 40,000 / 120,000 and 42,000 / 120,000. The p-value drops to effectively zero - practically no chance of a false positive. I like those odds.
  2. Increase the effect. The larger the difference your change makes, the smaller the p-value. If instead of 420 opens, our alternative resulted in 500 opens, the p-value drops to virtually zero. Of course, increasing the effect is easier said than done!
  3. Repeat the test. If you can repeat the same test on the same audience, then you can be more certain that the effect you are seeing is actually there. Let's say that you complete the same test twice, with p-values of 0.05 both times. Now the chance of seeing those results if the null hypothesis is true is 0.05 * 0.05 - which is 0.0025, or 0.25%. You're now getting 1 in 400 wrong, instead of 1 in 20. Be careful doing this though, since the simple fact that you've already tested the effect on the audience will bias the second test somewhat. Unfortunately, you can't test in a vacuum!
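
Here's the sketch mentioned above - a rough Python illustration (again assuming scipy, with a hypothetical helper called one_tailed_p that I've made up for this post) that works through all three ideas with the numbers used here:

```python
from scipy.stats import chi2_contingency

def one_tailed_p(opens_a, sends_a, opens_b, sends_b):
    """One-tailed p-value for B beating A, via a 2x2 chi-squared test."""
    table = [[opens_a, sends_a - opens_a],
             [opens_b, sends_b - opens_b]]
    _, p_two_sided, _, _ = chi2_contingency(table, correction=False)
    return p_two_sided / 2  # valid as long as B's open rate is the higher one

# 1. Same open rates, bigger sample: the p-value shrinks
print(one_tailed_p(400, 1200, 420, 1200))              # ~0.19
print(one_tailed_p(40_000, 120_000, 42_000, 120_000))  # effectively zero

# 2. Same sample, bigger effect: the p-value also shrinks
print(one_tailed_p(400, 1200, 500, 1200))              # virtually zero

# 3. Repeat the test: two independent results at p = 0.05 multiply
print(0.05 * 0.05)                                     # 0.0025, i.e. 1 in 400
```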

 

By testing with one eye on the p-value, you will ensure that you're not running too large a risk of drawing incorrect conclusions. If you'd like some further reading on this, I recommend the Email Marketing Tipps website and email newsletter. You can use this article as a starting point. Ignore the 1990s styling; it's an excellent resource for those interested in both statistics and email marketing.

 

Happy testing!
