Meta-analysis of A/B Testing Data

This idea is closely related to a previously suggested one: Select A/B Test Winner by Statistical Significance

First, it is important to reiterate that we should be using statistical tests to determine the winner of our A/B tests. Raw numbers alone (e.g., the total number of opens for version A versus version B) cannot tell us whether one version is actually different from the other. For example, is one version of a subject line test truly better than the other with 1% more opens? What about 2%? Where is the cut-off for declaring a true winner? The best way to decide is to run a statistical test.
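To make this concrete, here is a minimal sketch of the kind of statistical test that answers the "is 1% more opens a true winner?" question: a two-sided, two-proportion z-test written with only the Python standard library. The send counts and open counts below are made-up numbers for illustration.

```python
from math import sqrt, erfc

def two_proportion_z_test(opens_a, n_a, opens_b, n_b):
    """Two-sided z-test for a difference between two open rates.

    Returns (z, p_value). A small p-value (conventionally < 0.05)
    suggests the observed difference is unlikely to be chance alone.
    """
    p_a, p_b = opens_a / n_a, opens_b / n_b
    pooled = (opens_a + opens_b) / (n_a + n_b)              # pooled open rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error of the difference
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))                        # two-sided p-value from the normal distribution
    return z, p_value

# Hypothetical test: version B got 1% more opens (17.0% vs 18.0%).
z, p = two_proportion_z_test(850, 5000, 900, 5000)
print(f"z = {z:.2f}, p = {p:.3f}")  # p is well above 0.05 here: no statistically clear winner
```

With 5,000 sends per arm, a 17% vs. 18% split does not reach significance, which is exactly why eyeballing raw open counts can mislead us.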

Rather than going into great detail about statistical testing (since other users have already addressed this issue), here are some articles about statistical testing when conducting A/B tests.

It is entirely possible that two versions of a test do not truly differ in effectiveness. Because of this, it would be desirable to have some sort of message following an A/B test saying that there is no true difference between the versions. This gets back to the question of whether a 17% open rate for version one versus an 18% open rate for version two means that version two is a "winner." Of course, this would mean we could not automatically send a winning version of an email to our target audience. However, we may actually be better off doing an even 50-50 split of our target audience to test the two versions of our email.

Say we run a 50-50 split of a declarative vs. interrogative subject line test. After the emails deploy, we can look at the results to determine whether we have a true winner. If so, that is great, but it does not mean we are done testing. We will want to run multiple declarative vs. interrogative subject line tests and meta-analyze the results to see whether our first result generalizes to other emails. If we tested declarative vs. interrogative subject lines on emails promoting a free sample, would we see the same results on an email whose primary CTA is contacting a sales rep? To determine whether our effect holds across different email topics and target audiences, we should conduct conceptual replications of our tests and meta-analyze the data. A conceptual replication here is one where the content of the subject lines changes with the type of email: testing "Try X Product Today!" vs. "Try X Product Today?" would be a conceptual replication of "Try Y Product Today!" vs. "Try Y Product Today?". This is a much simplified example, but it makes the point that the same test is being run on a different email. This matters because it is statistically possible for a single test to show no significant difference while the overall effect is significant when analyzed across all tests of the same sort (i.e., all declarative vs. interrogative subject line tests).
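The "individually inconclusive, jointly significant" point can be sketched with a simple fixed-effect (inverse-variance) meta-analysis, one common way to pool A/B test results. All test counts below are invented for illustration, and this is only one of several pooling methods.

```python
from math import sqrt, erfc

def risk_difference(opens_a, n_a, opens_b, n_b):
    """Per-test effect size (difference in open rates) and its variance."""
    p_a, p_b = opens_a / n_a, opens_b / n_b
    effect = p_a - p_b
    variance = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    return effect, variance

def fixed_effect_meta(tests):
    """Inverse-variance (fixed-effect) pooling of several A/B tests.

    Each test is weighted by the inverse of its variance, so larger,
    more precise tests count for more. Returns the pooled effect,
    its z statistic, and a two-sided p-value.
    """
    effects = [risk_difference(*t) for t in tests]
    weights = [1 / v for _, v in effects]
    pooled = sum(w * e for (e, _), w in zip(effects, weights)) / sum(weights)
    se = sqrt(1 / sum(weights))
    z = pooled / se
    return pooled, z, erfc(abs(z) / sqrt(2))

# Three hypothetical interrogative-vs-declarative subject line tests:
# (opens_interrogative, sends, opens_declarative, sends). Numbers are made up.
tests = [(360, 2000, 330, 2000),
         (410, 2500, 380, 2500),
         (275, 1500, 250, 1500)]

for t in tests:
    e, v = risk_difference(*t)
    print(f"single test: diff = {e:+.3f}, p = {erfc(abs(e / sqrt(v)) / sqrt(2)):.2f}")

pooled, z, p = fixed_effect_meta(tests)
print(f"pooled:      diff = {pooled:+.3f}, p = {p:.2f}")
```

In this made-up example, each test on its own is non-significant (p around 0.2), yet the pooled estimate crosses the 0.05 threshold, which is exactly the scenario where meta-analysis earns its keep.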

That being said, we should pause before interpreting the outcome of any single A/B test, and instead look to analyze all of the tests of the same sort to find what is really resonating with our audience. I would love for Marketo to include a meta-analysis function, perhaps in RCA or RCE. We could tag an A/B test as a certain type (like the existing drop-down when setting up an A/B test, but with more detail to specify the exact kind of test) and have Marketo run a large report telling us whether one version performs significantly better than the other across all of the tests of that type we have run. I know this could be a large undertaking, but it should be possible with a script written in statistical software; there are plenty of open-source statistics programs that could accomplish this task, such as R.
