Making the Most out of Your Marketo A/B Test: The Statistics you Need to Know

Jessica_Kao3 · ‎09-26-2016

We’ve all used Marketo or other automation tools to A/B test emails and landing pages. We do it because we want to optimize engagement through constant iterations, and we can use the results to give our content its best shot at provoking responses from our prospects.

But have you ever had the nagging feeling that your high school statistics teacher wouldn’t approve of your testing technique? You remember terms like sample size, variables, and p-value that were important parts of your hypothesis testing, but they all seem to be missing in Marketo’s tool today.

It turns out those principles we learned are still integral to executing a successful A/B test and preventing incorrect conclusions. Luckily you don’t need a stats degree to implement these principles and enhance the tests that your organization performs. So let’s dive into how to design and interpret a more meaningful Marketo A/B test.

I. Designing your A/B Test

Your test design is the most important factor in determining whether you will get insightful information from your results. Over and over, we see the same common experimental design fallacies in tests run by marketers. Let’s take a look at what they are and how to overcome them.

Sample Size is Too Small

How large does my sample size really need to be? We get this question a lot and wish there were a definitive answer. But we would be lying to you if we said there was, because it depends on how big a difference you want to be able to detect.
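If you want a rough number before you send, one common approach is a power calculation for comparing two proportions. The sketch below is only an illustration and is not something Marketo provides: the function name, the 5% significance level, and the 80% power are our assumptions, and the 6% and 7.4% open rates are borrowed from the subject line example that follows.

```python
from math import ceil, sqrt
from statistics import NormalDist


def recipients_per_group(rate_a, rate_b, alpha=0.05, power=0.80):
    """Approximate recipients needed in EACH group for a two-sided test
    to detect the gap between rate_a and rate_b (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 when power = 0.80
    pooled = (rate_a + rate_b) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_power * sqrt(rate_a * (1 - rate_a) + rate_b * (1 - rate_b))
    ) ** 2
    return ceil(numerator / (rate_a - rate_b) ** 2)


# Hypothetical planning question: 6% vs 7.4% open rate
small_gap = recipients_per_group(0.06, 0.074)
# A wider gap (6% vs 9%) needs far fewer recipients
large_gap = recipients_per_group(0.06, 0.09)
print(small_gap, large_gap)
```

Under these assumptions the first call comes out at roughly 5,000 recipients per group and the second at roughly 1,200, which is the point of the paragraph above: the smaller the difference you care about, the larger the list you need.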

Say you want to do a simple subject line A/B test and you send to 1000 recipients.

Subject Line A: [Webinar] How to make the most of your A/B tests

Subject Line B: [Webinar] Register Now: How to make the most of your A/B tests

Half get Subject A and half get Subject B. If 6% open A and 7.4% open B, can you conclude that having the CTA “Register Now” performed better? Is the difference between A and B significant enough to declare that B is “better”? We can’t really answer that until we look at the p-value; how to get it is covered later. For now, smaller p-values are better, and in this case the p-value is 0.376, which is not good. You might think, “Subject Line B still got a higher number of opens, so why don’t we just go with that?” What the results are also saying is that the chances of getting the opposite result if you ran the test again are pretty high.

If we run the test with 10,000 recipients total, with the same percentages opening A and B respectively, the p-value is much smaller at 0.0051, which is excellent. (Scientific publication guidelines accept <0.05 and this is just marketing.) With the results from the second scenario you can confidently conclude that adding a CTA makes a difference. The combination of your target size and the difference between your two test groups determines what conclusions you can draw from your results.
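To see how list size alone moves the p-value, here is a minimal sketch that runs both scenarios through a pooled two-proportion z-test. We are assuming that test for illustration; it may not be the exact method behind any particular online calculator, and the function name is ours.

```python
from math import erfc, sqrt


def two_tail_p_value(successes_a, total_a, successes_b, total_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # erfc(|z| / sqrt(2)) is the two-tailed area under the standard normal curve
    return erfc(abs(z) / sqrt(2))


# 1,000 recipients: 500 per group, 6% open A (30 opens), 7.4% open B (37 opens)
p_small = two_tail_p_value(30, 500, 37, 500)
# 10,000 recipients: 5,000 per group, same rates (300 and 370 opens)
p_large = two_tail_p_value(300, 5000, 370, 5000)
print(f"{p_small:.3f} {p_large:.4f}")
```

The two values should land close to the 0.376 and 0.0051 quoted above. The open rates are identical in both runs; only the number of recipients changed.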

Changing too many variables at once

As marketers we get excited about testing different variables. Sometimes we go overboard and test too many variables at once which leads to the failure to conclude anything. Let’s demonstrate with a landing page test.

Landing Page A: Blue button with CTA = Submit

Landing Page B: Green button with CTA = Download Now

In this case we have a question: Which button performs better? If Landing Page A has a significantly higher conversion rate than Landing Page B, what is my actionable intelligence moving forward? Unfortunately, we do not know if it is the color or the words on the button or both that was the contributing factor. (If you want to geek out this is called a confounded experiment.)

The proper way to carry this out is to break the testing into two rounds.

Test #1

Landing Page A: Blue button with CTA = Submit

Landing Page B: Green button with CTA = Submit

Result: Landing Page A performed significantly better.

Test #2

Landing Page A: Blue button with CTA = Submit

Landing Page B: Blue button with CTA = Download Now

Result: Landing Page B performed significantly better.

Conclusion: LP with a blue button and an active CTA should be implemented.

If you vary multiple factors at once in the two test groups, you will not be able to conclude which of the variables that you changed contributed to the performance of one group over the other. Setting a series of tests to vary one variable at a time allows you to truly understand the contribution of each.

Testing without a clear question or hypothesis

Have you ever carried out an A/B test and then asked yourself “What do I do with the result? How can I apply this to future campaigns?” This confusion often occurs because you designed your test without a clear hypothesis.

Here’s an example of a subject line test with 6 groups.

A: Learn from CMOs: Engagement Strategies

B: How to effectively market to your prospects

C: Top strategies for engaging your prospects

D: Top strategies for reaching your prospects

E: Web Personalization: Reach and engage your prospects

F: Drive greater engagement this holiday season

If subject line C was declared the winner with the greatest number of clicks (albeit by a slim margin), what have we learned that we can apply next time? Also, with this many variants you will need a very large sample size for the result to be significant.

A better strategy would be to break out into a series of tests where we can test a single variable at a time with a clearly defined question or hypothesis.

Question #1: Does having CMO in the subject line drive more opens?

Test #1:

Subj A: Learn from CMOs: Engagement Strategies for your Marketing

Subj B: Learn Engagement Strategies for your Marketing

Question #2: Does the word “reaching” or “engaging” drive more opens?

Test #2 (Assuming CMO won test #1):

Subj A: Learn from CMOs: Top strategies for reaching your prospects

Subj B: Learn from CMOs: Top strategies for engaging your prospects

Question #3: Does mentioning “holiday season” result in a greater open rate?

Test #3 (Assuming reaching won test #2):

Subj A: Learn from CMOs: Top strategies for reaching your prospects

Subj B: Learn from CMOs: Top strategies for reaching your prospects this holiday season

Remember that it’s called an A/B test, not an A/B/C/D/E/F test. Break down your question into specific parts that can be tested in a series of A/B tests, rather than trying to get an immediate answer by testing all at once. The next time you are deciding what individual elements of a subject line will maximize engagement, you can look back at the results of these tests.

Using the Email Program A/B test results to declare a “Winner”

In Marketo, it is really easy to set up an A/B test using the Email Program and see the results. Let’s go back to our simple subject line test for registering for a webinar.

Subject Line A: [Webinar] How to make the most of your A/B tests

Subject Line B: [Webinar] Register Now: How to make the most of your A/B tests

Say you have 50,000 leads in your target list and you choose to test 20% of your list and send the remainder the winner. That means 5,000 will get subject line A and 5,000 will get subject line B. The subject line that is declared the winner will be sent to the remaining 40,000. That sounds pretty straightforward. But (and you knew there was a but...) how is a winner determined and which one should you choose?

Marketo lets you set the winning criteria and automatically send the winner a minimum of 4 hours later. You can choose from the following:

Opens

Clicks

Clicks to Open %

Engagement Score

Custom Conversion

In this case, if we choose opens, we are judging the two subject lines only on whether someone opened the email. Is this the behavior that matters most? In some cases it might be, but for a webinar we probably want to look at clicks instead. For example, we once saw the email with the higher open rate also get fewer registrations and a 10 times higher unsubscribe rate. This led us to conclude that our message was not resonating with the target audience.

Setting the winning criteria to Clicks to Open % could also be problematic. If email A had 1000 opens and 40 clicks (4%) but email B had 200 opens and 20 clicks (10%), email B would be declared the winner even though the absolute number of people who clicked is lower.
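The disagreement between the two criteria is easy to check with the figures from this example. The snippet below is just an illustrative sketch; the dictionary layout and variable names are ours, not a Marketo export format.

```python
# Figures from the example above
emails = {
    "A": {"opens": 1000, "clicks": 40},
    "B": {"opens": 200, "clicks": 20},
}


def click_to_open(stats):
    """Clicks divided by opens, the ratio behind the Clicks to Open % criterion."""
    return stats["clicks"] / stats["opens"]


winner_by_ratio = max(emails, key=lambda name: click_to_open(emails[name]))
winner_by_clicks = max(emails, key=lambda name: emails[name]["clicks"])
print(winner_by_ratio, winner_by_clicks)  # prints "B A"
```

Same data, two different winners, depending only on which criterion was chosen before the send.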

What about setting the winning criteria to clicks? If Email A had 1000 clicks and Email B had 100 clicks, Email A would be declared the winner. But if the desired behavior is registering for the webinar and Email A had 10 people register for the webinar vs 25 for Email B, was email A really the “winner”?

So… which one should you pick?

Unfortunately you won’t know until you look at the data after the results come in. There is no way to predict. We can think of a potential situation where any of the choices above would work or not work; it will just depend on what the data says. So if you are going to declare a winner in a Marketo A/B test, we prefer to do it manually.

“When I test, I typically test on 100% of my target list. If I have an A/B test with 2 groups, I set the slider bar to 100%. That way, 50% get A and 50% get B. I do this for a number of reasons, chiefly because you won’t know if you have a large enough sample size until after the test. If you run 10 different tests on 1000 people and the difference is small, your results will all be inconclusive. I would rather run 1 test on 10,000 targets and get a really solid conclusion.”

When you are designing a test, ask yourself, “What am I going to do with this information? What am I going to change?” Don’t test for the sake of testing. Whatever you decide to test, ensure that the question you are asking is going to be actionable. Now that you know how to design robust A/B tests, how do you interpret those results?

II. Testing and Interpretation of Results

Setting up the test correctly is half the story; making sure that we draw the correct conclusions is the other half, and it is just as important.

Unfortunately, we cannot “declare a winner” by simply picking the test group with the most opens or clicks. When we run a test we are saying that this small population of 1000 people is a representation of the whole universe. It is not possible to test everyone in the whole world, so we are extrapolating that how this sample population behaves will predict how the rest of the world would behave. But we know that if we ran the test on 10 different sets of 1000 people, we would get slightly different results each time. So there is a chance, albeit small, that we picked a sample population that is an outlier, so different from the rest of the world that our results could lead us astray. This variation is what we need to account for by calculating a p-value.
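A quick simulation makes this sample-to-sample variation concrete. In the sketch below both emails have exactly the same true open rate, yet one of them still comes out ahead in a large share of runs. The 6% rate, the 500 recipients per group, the number of trials, and the seed are all illustrative assumptions.

```python
import random


def simulate_opens(recipients, true_open_rate, rng):
    """Count opens when each recipient opens independently at the given rate."""
    return sum(rng.random() < true_open_rate for _ in range(recipients))


rng = random.Random(2016)  # fixed seed so the run is repeatable
trials = 1000
b_ahead = 0
for _ in range(trials):
    opens_a = simulate_opens(500, 0.06, rng)
    opens_b = simulate_opens(500, 0.06, rng)  # same true rate as A
    if opens_b > opens_a:
        b_ahead += 1

share_b_ahead = b_ahead / trials
print(f"B had more opens in {share_b_ahead:.0%} of simulated tests")
```

Because the two groups are identical by construction, every one of those apparent wins for B is noise. That is the situation a p-value is designed to flag.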

Let’s go back to our subject line test.

If you sent a total of 1000 emails and 30 people opened email A and 31 people opened email B, could you say email B leads to more opens? The answer is no (based on the calculation of the p-value). Just because the opens of email B are greater than the opens of email A doesn’t mean that you would get the same result if you hypothetically ran the test again. In this case it’s about as good as flipping a coin. You could get either result.

The real question in A/B testing is: “Is the difference between A and B SIGNIFICANT enough for you to CONFIDENTLY conclude that B is greater than A when you run the test again and again?” You want to be able to confidently say, based on the results of the test, “I believe B will most likely yield more than A if I were to run the same test in the future. Therefore, we should move forward with B.” That’s the goal.

To determine whether the difference is significant or not we look at the p-value of our test. We are not going to go into how this value is calculated, but we will examine:

How to use a very simple tool to obtain the p-value
How to interpret the p-value
What it means in plain English

You can use this website to input the results of your A/B test and generate a p-value. (This calculator was posted by @ Phillip Wild. A/B Testing and Statistical Significance. Great suggestion)

Let’s take a look at another example.

You run an Email A/B test separated into groups with two different button colors, green and blue for the call to action. Your question is which button color is associated with more clicks.

Green: 93 clicks on 4,000 emails delivered

Blue: 68 clicks on 4,000 emails delivered

You take the number of clicks for each group and plug them into the calculator as the successes for each group. You then enter 4,000 as the total for each group.

The resulting two-tail p-value = .047.
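For readers curious where a number like this comes from, here is a worked version. It assumes the calculator runs a pooled two-proportion z-test; we have not verified which test it actually uses, but this one gives a matching result. The subscripts G and B stand for the green and blue groups, and Φ is the standard normal cumulative distribution function.

```latex
\hat{p}_G = \frac{93}{4000} = 0.02325, \qquad
\hat{p}_B = \frac{68}{4000} = 0.01700, \qquad
\hat{p} = \frac{93 + 68}{4000 + 4000} = 0.020125

z = \frac{\hat{p}_G - \hat{p}_B}
         {\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{4000} + \frac{1}{4000}\right)}}
  \approx \frac{0.00625}{0.00314} \approx 1.99

p_{\text{two-tail}} = 2\,\bigl(1 - \Phi(|z|)\bigr) \approx 0.047
```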

It is generally accepted that a p-value of <=0.05 is considered a significant result. The smaller the p-value, the more confident you can be in your results. We can conclude that there is a significantly higher number of clicks using Green vs Blue. I am confident that if I were to run this experiment again and again, I would obtain the same result. Therefore, I would make the recommendation to change the CTA button color to green.

What does this p-value number mean in plain English?

A p-value of 0.047 is saying that if button color truly made no difference, there would be only a 4.7% chance of seeing a gap in clicks at least this large from random variation alone.

What is so special about a p-value cut off of 0.05?

It is in fact an arbitrary cutoff, but it is the conventional standard in the scientific and medical communities, including in the most highly respected peer-reviewed publications.

If your p-value is slightly more than .05, say .052, don’t automatically write off the result as inconclusive. If you have the ability, test the same hypothesis again with a different or larger sample size.

Note: When using this tool, plug in your number of successes (opens, clicks etc.) and total (number of delivered emails) for each group. Note that when using click to open ratio, you will be using number of clicks as the success and number of opens as the total, NOT the number of emails sent.

This calculator gives us the p-value of the test, and we want to look at the two-tail value specifically. The two-tail p-value is the probability of seeing a difference at least as large as the one we measured, in either direction, if there were actually no true difference between the two groups. If the p-value is smaller than .05, the difference is considered statistically significant: a gap that large would show up less than 5% of the time by chance alone, so we can act upon it in our decisions for future marketing communications. If the p-value is above .05, then the results of the test are inconclusive. Using the same value and interpretation every time allows us to stay consistent from test to test.

A key here is to not consider the test a failure if the results are inconclusive (p-value is greater than .05). Learning that a change to certain email content or timing is unlikely to have a large effect on your audience is just as useful for future communication strategies. If you still feel strongly that the first experiment wasn’t enough to capture the difference in your groups’ responses, then replicate the experiment to add to the strength of your results.

Organizing your results for future use

“As a lab scientist, I was taught to keep meticulous records of every experiment that I did. My professor once said to me, ‘If you got hit by a bus or abducted by aliens, I need to be able to reproduce and interpret what you did.’ As a marketer you probably don’t need to be that detailed, but nonetheless it’s nice to have a record of what you have done so you can refer back to it and, more importantly, share it with your colleagues. For testing marketing campaigns, I kept a Google Doc, an Excel sheet, or a collection of paper napkins (true story).”

Keep a record of what the test was, the results, and the conclusions. And don’t be afraid to share your results in a presentation once a quarter. You immediately increase the value of your hard work by sharing your findings with your organization.

Here’s an example of a test result entry:

Aug 4, 2015

Test day of the week

Target Audience: All leads with job title = Manager, Director, VP

10,000 Leads

Email A - Send on Wednesday 10 AM

# Sent = 5,000

# Opens = 624

# Clicks = 65

# of Unsubscribes = 68

Email B - Send on Sunday 10 AM

# Sent = 5,000

# Opens = 580

# Clicks = 94

# of Unsubscribes = 74

P-value (Opens) = 0.176

P-value (Clicks) = 0.020

P-value (# Unsubscribes) = 0.612

Conclusion: Emails sent on Sunday resulted in significantly more clicks, but there was no significant difference in opens or unsubscribes.
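If you would rather keep the log in a file than on napkins, an entry like this maps naturally onto one spreadsheet row per test. The sketch below is one possible layout, not a prescribed format: the class name, field names, and column order are our own invention, and the values are copied from the entry above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class AbTestRecord:
    """One row per completed test; the field names are illustrative."""
    date: str
    question: str
    audience: str
    variant_a: str
    variant_b: str
    p_opens: float
    p_clicks: float
    p_unsubscribes: float
    conclusion: str


record = AbTestRecord(
    date="2015-08-04",
    question="Test day of the week",
    audience="Job title = Manager, Director, VP",
    variant_a="Wednesday 10 AM",
    variant_b="Sunday 10 AM",
    p_opens=0.176,
    p_clicks=0.020,
    p_unsubscribes=0.612,
    conclusion="Sunday sends got more clicks; opens and unsubscribes did not differ",
)

# Written to memory here; swap the buffer for a real file to keep a running log
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(AbTestRecord)])
writer.writeheader()
writer.writerow(asdict(record))
log_text = buffer.getvalue()
print(log_text)
```

Keeping the question and the conclusion in the same row as the numbers means a colleague can read the log without having to ask what the test was for.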

If you clearly document and organize your test results, you’ll soon have a customer engagement reference guide that’s unique to your organization. And if you’ve designed your experiments as advised above, you’ll know that the conclusions drawn are based on sound statistical analyses of your data. Put those “fire and forget” Marketo A/B tests to rest and you’ll make your way towards optimal customer engagement.

What is your experience with Marketo’s A/B testing? Have you found any results that are interesting or unexpected? Feel free to share your experiences with testing.

I'd like to thank Nate Hall for co-authoring and editing this blog post.