1. Identify a goal – increase email CTR, landing page signups, form submissions, etc.
2. Create two variants – A (current/default), B (new/test).
3. Split the audience – Randomly show A to 50%, B to 50% (see the sketch after this list).
4. Measure performance – Track key metrics.
5. Analyze results – Use statistical significance to decide if B beats A.
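For the split in step 3, tools like Marketo handle the randomization for you, but it helps to see how simple the idea is. Here is a minimal Python sketch of a deterministic 50/50 split (the hashing approach and the email field are illustrative assumptions, not a Marketo mechanism):

```python
import hashlib

def assign_variant(email: str) -> str:
    """Deterministically assign a lead to variant A or B based on their email."""
    # Hash the normalized email and map it into 100 buckets
    bucket = int(hashlib.md5(email.strip().lower().encode()).hexdigest(), 16) % 100
    # Buckets 0-49 -> A, buckets 50-99 -> B (a 50/50 split)
    return "A" if bucket < 50 else "B"

print(assign_variant("jane.doe@example.com"))  # same email always lands in the same variant
```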
In a nutshell, A/B testing is an experiment: an iteration with a hypothesis (assumption) that the new variant is better. The job of A/B testing is to accept or reject this hypothesis. A/B testing tells you what actually works better, not what you think works better. Hence, use it when you are looking for improvements through iterations. There can be many scenarios.
However - not everything needs A/B testing. The decision should be strategic. For example, if you're localizing content for a new region, there's already strong evidence that localization improves engagement. In such cases, skip the test and apply what’s proven. A/B testing is best reserved for decisions where data is unclear or stakes are high.
Test when the stakes are high or when you face ambiguity. But skip it when the answer is obvious.
Emails and landing pages have a very straightforward built-in feature for A/B testing. You create variants and get the results within Marketo, then interpret those results to conclude with statistical confidence. Below is a simplified overview of how to A/B test emails in Marketo. For detailed steps, refer to the documentation.
When it comes to email optimization, the simplest and most effective strategy is to test one variation at a time. Whether you're experimenting with the subject line, from address, or send date/time, isolating each element helps you understand what truly impacts performance.
But don’t stop there.
If you want to go beyond the basics, explore more creative variables - like content style, placement of your call-to-action, or even the overall layout. In such cases, selecting the ‘Whole Emails’ option lets you test multiple elements together for a broader impact.
Whether testing subject lines or entire algorithms for scoring leads, the core principle stays the same: isolate one variable, run a controlled test, and let significance guide you.
Test with as large a sample as you can, while still leaving enough of the audience for the actual campaign once the results are in. In other words, test enough to learn, but not so much that you have nothing left to act on. As a rule of thumb: not below 1,000 records, and not beyond 10-15K records unless you understand the data science behind it.
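If you want something more principled than a rule of thumb, a power calculation tells you roughly how many records per variant are needed to detect a given lift. Here is a minimal sketch using statsmodels; the 3% baseline CTR and the hoped-for 4% are made-up numbers for illustration:

```python
# pip install statsmodels
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_ctr = 0.03   # assumed current click-through rate (illustrative)
target_ctr = 0.04     # the lift you hope variant B delivers (illustrative)

effect_size = abs(proportion_effectsize(baseline_ctr, target_ctr))
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 5% chance of a false positive
    power=0.8,    # 80% chance of detecting the lift if it is real
    ratio=1.0,    # equal-sized A and B groups
)
print(f"Records needed per variant: {n_per_variant:.0f}")
# With these illustrative numbers this comes out to roughly 2,600 records per variant
```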
Let's assume we ran an A/B test on Email A and Email B to test deliverability. We had a hypothesis that the 'From email address' might be causing deliverability drops in Email A. Hence, we created Email B with the 'From email address' as the only change, and got the following results.
Is Email B really better? This is the tricky (and fascinating) part about A/B testing.
Email B has better deliverability.
However, that doesn’t automatically mean it's the better option overall.
It could just be a random fluctuation - a normal part of testing.
This is where A/B testing gets super technical, and some data science concepts become unavoidable.
Most of us marketers tend to focus more on the creative side. But if you want to confidently conclude that Email B’s better performance is due to using a different “From” address, you’ll need something called statistical significance.
Don't be fooled by random success. Statistical significance ensures you're seeing real impact - not just noise.
I’ll leave a deeper explanation to someone in the comments who can break it down better. But in a nutshell: statistical significance helps determine whether a result happened for a real reason or was just random. It is a data science concept.
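For a rate like deliverability, the usual way to answer "real or random?" is a two-proportion z-test. Here is a minimal Python sketch; the delivered/sent counts are hypothetical placeholders since the actual test results are not reproduced here, so substitute your own numbers from the Marketo report:

```python
# pip install statsmodels
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts -- replace with your own report data
delivered = [9100, 9350]   # Email A, Email B
sent = [10000, 10000]

z_stat, p_value = proportions_ztest(count=delivered, nobs=sent)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A small p-value (commonly < 0.05) suggests the difference is unlikely to be just noise
```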
An A/B testing strategy that relies on variants of emails or landing pages benefits from these built-in features. However, the A/B testing concept itself isn't a Marketo feature, and it can be applied to many more areas.
Think of scoring. What is the overall measure of good scoring? My understanding is that it's a good predictor of SALs (Sales Accepted Leads) - that is, a lead with a higher score is more likely to reach the SAL stage. However, things get more interesting here. When A/B testing emails, we know we will have most of the data within about 2-3 days, since most of us see our emails within a few days. But conversions take longer, and we don't control that timeline. Someone may MQL today and SAL next year. Hence, besides a good sample size, A/B tests on scoring also need to run for a significant amount of time.
Let's take an example where we want to update the scoring algorithm to take relative importance more seriously. The existing scoring (Scoring A) uses flat increments, which don't reflect the nuance of relative importance: a form fill is surely more important than a web page visit, but that difference might not be exactly 5 points. In other words, Scoring B weighs the relative importance of activities more deliberately.
| Engagement Metric | Scoring A (flat increments) | Scoring B (relative importance) |
| --- | --- | --- |
| Email Open | 5 | 1 |
| Clicked link in email | 10 | 5 |
| Visited web page | 15 | 8 |
| Form fill | 20 | 6 |
| Contact Us Form Fill | 25 | 9 |
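To make the difference concrete, here is a small sketch of how the two models would score the same lead. The weights come from the table above; the activity names and the example activity log are made up for illustration and are not Marketo field names:

```python
# Weights taken from the table above
SCORING_A = {"email_open": 5, "email_click": 10, "web_visit": 15, "form_fill": 20, "contact_us_fill": 25}
SCORING_B = {"email_open": 1, "email_click": 5,  "web_visit": 8,  "form_fill": 6,  "contact_us_fill": 9}

def lead_score(activities, weights):
    """Sum the weight of every logged activity; unknown activities score 0."""
    return sum(weights.get(activity, 0) for activity in activities)

# Hypothetical activity log for one lead
activities = ["email_open", "email_click", "web_visit", "form_fill"]

print("Scoring A:", lead_score(activities, SCORING_A))  # 5 + 10 + 15 + 20 = 50
print("Scoring B:", lead_score(activities, SCORING_B))  # 1 + 5 + 8 + 6 = 20
```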
After testing the new model for a year, let's assume we get the following results.
Variant A (Scoring A) Results
| Leads | MQL Count | SAL in MQL | SAL / MQL Ratio |
| --- | --- | --- | --- |
| 1,000 | 200 | 14 | 7.0 % |
Variant B (Scoring B) Results
| Leads | MQL Count | SAL in MQL | SAL / MQL Ratio |
| --- | --- | --- | --- |
| 1,000 | 180 | 18 | 10.0 % |
Looks like Scoring B is better. However, is it statistically significant? We might not be data scientists, but this is where we can use AI. Let's check with ChatGPT. I uploaded the above data to ChatGPT and asked the following.
If SAL/MQL ratio is the metric which we want to improve and we did an AB test on variants - Scoring A and B. Are these results statistically significant?
Response Snapshot
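If you'd rather double-check the math yourself instead of relying only on the AI answer, the same question can be framed as a two-proportion z-test on SAL counts within MQLs (14 of 200 vs. 18 of 180, from the tables above). A minimal sketch, using scipy only for the p-value:

```python
# pip install scipy
from math import sqrt
from scipy.stats import norm

# SAL-within-MQL counts from the tables above
sal_a, mql_a = 14, 200   # Scoring A: 7.0 %
sal_b, mql_b = 18, 180   # Scoring B: 10.0 %

p_a, p_b = sal_a / mql_a, sal_b / mql_b
p_pooled = (sal_a + sal_b) / (mql_a + mql_b)            # overall SAL/MQL rate
se = sqrt(p_pooled * (1 - p_pooled) * (1 / mql_a + 1 / mql_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))                     # two-sided test

print(f"z = {z:.2f}, p = {p_value:.2f}")
# With these counts this works out to roughly z ≈ 1.05, p ≈ 0.29 -- not significant at the usual 0.05 level
```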
Judgement and next steps
While the result may not be statistically significant, in this case my intuition wants to test more: a difference of 3 percentage points (7% vs. 10%) is too good to dismiss. If the result is NOT statistically significant, it only means you don’t have enough evidence to say the variation caused the difference. That’s it.
It doesn’t mean the variation had no effect. Maybe it did, maybe it did not.
Testing isn’t only about picking winners. It’s about learning, iterating, and understanding your audience. Even a "losing" variation teaches you something valuable.
Applying A/B testing to different scenarios gives you data-backed information that can guide more precise strategies and, hence, deliver measurable improvements. Where are you using A/B testing?