* Posting this as a discussion since I don't have blogging rights. But hey, the best articles are discussions anyway, right?
Stick your hand in a jar and pick out five jellybeans. What colour are they? Three blue and two red?
Great. So the jar must be 60% blue jellybeans and 40% red. Right?
Well, no, not necessarily, most of us would say. But this is exactly the mistake we often make when A/B testing our email marketing programs. There's a harsh reality that we need to face as A/B testers...sometimes the effects you are "seeing in the data" aren't really there at all.
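To put a number on the jellybean problem: even if the jar is a perfectly even 50/50 mix, a handful of five beans will come out split 3-2 one way or the other most of the time. Here's a quick check (my own illustration, assuming you have Python and SciPy handy):

```python
# How often does a five-bean handful look "lopsided" even when the jar is 50/50?
from scipy.stats import binom

n_drawn = 5
p_blue = 0.5  # assume the jar really is half blue, half red

p_exactly_three_blue = binom.pmf(3, n_drawn, p_blue)
p_any_3_2_split = binom.pmf(3, n_drawn, p_blue) + binom.pmf(2, n_drawn, p_blue)

print(f"P(exactly 3 of 5 are blue): {p_exactly_three_blue:.3f}")  # ~0.313
print(f"P(a 3-2 split either way):  {p_any_3_2_split:.3f}")       # ~0.625
```

So a 3-2 split tells you almost nothing about the jar. Keep that in mind for what follows.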
Setting up your tests correctly and analysing the results with care will ensure that false conclusions are minimised and that real effects are actually detected.
False Positives
Cast your mind back to high school maths....yep, it's further than we might like, agreed. But if the phrase "p-value" doesn't mean anything to you now, then it should soon!
Let's say we run an A/B test in Marketo on subject lines. We theorise that including a person's first name in the subject line will increase open rates. Here's our test setup:
H0 (or the null hypothesis): Including a person's first name in the subject line does not have an effect on the open rate.
H1 (or the alternative hypothesis): Including a person's first name in the subject line DOES have an effect on the open rate.
Version A, the control without the first name used, results in 400 opens from 1200 deliveries. Version B, the alternative WITH a first name used, results in 420 opens from 1200 deliveries. Great, right? Let's put that first name in every email we send! Well, possibly. But not necessarily.
While Marketo shows you a pretty little column chart, you will have to delve a bit deeper to get any certainty about your results. For this step, you're going to need a statistical test known as the chi-squared test. Don't stress, it's easy to use, and there's an online version you can find here.
You plug in the numbers and get the result: "one-tailed p-value: 0.194". What does this mean? Technically speaking, it's the probability of seeing a result at least as extreme as the one you observed, assuming that the null hypothesis is true. In this case, 19.4%. That's a lot.
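If you'd rather script it than use the online calculator, here's a minimal sketch in Python (assuming SciPy is available) that reproduces the same figure from the counts above:

```python
from scipy.stats import chi2_contingency

# Counts straight from the example: opens vs non-opens for each version.
table = [
    [400, 1200 - 400],  # version A (no first name)
    [420, 1200 - 420],  # version B (first name)
]

chi2, p_two_tailed, dof, expected = chi2_contingency(table, correction=False)
print(f"one-tailed p-value: {p_two_tailed / 2:.3f}")  # ~0.194
```

Halving the two-tailed p-value gives the one-tailed figure the calculator reports; for a 2x2 table like this, the chi-squared test is equivalent to a two-proportion z-test.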
In our situation, it's the chance that you saw an effect (and probably modified your email strategy going forward) based on an effect that simply is not there. That's a big deal. This might mislead you for years to come if it becomes part of your core subject line strategy.
So if you see a p-value of 0.194... you probably couldn't make an informed decision with just this one test to guide you. As a good rule of thumb, anything above 0.05 is considered not "statistically significant". That means that what you observed could easily be random variation, like our jellybeans earlier.
A p-value threshold of 0.05 means accepting a 5% chance of calling an effect real when it isn't. So, how sure do you want to be? That depends on how important the test is to your overall strategy. If it's quite important, I wouldn't bet the farm with a 5% chance of being wrong, much less a 20% chance!
So...how do I reduce the p-value and get more confidence in my test results?
There are a couple of ways you can do this.
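The simplest lever is volume: with the same observed open rates, more deliveries mean a smaller p-value. Here's a quick sketch (my own, reusing the counts from the example above and simply scaling them up):

```python
from scipy.stats import chi2_contingency

# Same observed open rates (33.3% vs 35.0%), just more deliveries per version.
for multiplier in (1, 3, 5, 10):
    delivered = 1200 * multiplier
    table = [
        [400 * multiplier, delivered - 400 * multiplier],
        [420 * multiplier, delivered - 420 * multiplier],
    ]
    _, p_two_tailed, _, _ = chi2_contingency(table, correction=False)
    print(f"{delivered:>6} per version -> one-tailed p = {p_two_tailed / 2:.3f}")
```

The other side of the same coin, speaking generally, is testing changes big enough to produce a lift you can actually detect at the volume you have.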
By testing with one eye on the p-value, you will ensure that you're not running too large a risk of drawing incorrect conclusions. If you'd like some further reading on this, I recommend the Email Marketing Tipps website and email newsletter. You can use this article as a starting point. Ignore the 1990s styling; it's an excellent resource for anyone interested in both statistics and email marketing.
Happy testing!
This is super useful. Thanks Phillip Wild.
Wild thing! Great add.
Great add Phillip Wild. Gets my mind thinking...
Much agreed with the 5% rule!
Gets me thinking of the long-tail effect of our efforts:
Yep. Of course, you need larger and larger samples to test anything further down the funnel, since conversion rates are typically pretty low for those metrics. But if you can use a down-funnel metric as your criterion, it's better tied to revenue than simply opens or clicks!
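To give a rough feel for the numbers (the rates below are assumptions for illustration, not figures from this thread): detecting the same 25% relative lift takes vastly more recipients on a 2% conversion rate than on a 33% open rate. A sketch using the standard two-proportion sample size approximation:

```python
from scipy.stats import norm

def n_per_version(p1, p2, alpha=0.05, power=0.8):
    """Approximate sample size per variant for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2

# Same 25% relative lift, very different volume requirements.
print(f"Opens, 33.3% -> 41.6%:     {n_per_version(0.333, 0.416):,.0f} per version")  # ~530
print(f"Conversions, 2.0% -> 2.5%: {n_per_version(0.020, 0.025):,.0f} per version")  # ~13,800
```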
Thanks for sharing Phillip, this is great!
Very interesting thinking, and it makes a lot of sense, thank you. It's been 20 years since I last carried out chi-squared testing!
Very helpful! Great add.
We recently ran an A/B test yielding the high custom conversion results shown below. The custom conversion was set for clicks on a specific URL in the email. The test was set so that each email was delivered to an equal number of people. The metrics appear high compared to other A/B metrics I've seen. Do they make sense to you?
Email A: 0.901
Email B: 0.994
Thanks,
Julie