Every marketer and their grandma is running A/B tests today. But is it as simple as testing one version against another, letting it run to, say, 5,000 impressions, taking the difference and calling one of them a winner?
What are the parameters of your test and are the A/B test results bullet-proof?
Personally, I feel that it’s important to know what goes on under the hood if we are supposed to lean on this thing called “A/B testing” for most of our decision making.
With that in mind, I created this article to solidify everything I’ve learnt so far.
Null and Alternative Hypothesis
In the context of A/B testing, the null hypothesis is the view that whatever treatment you are giving to the test group has no effect on the metric you care about.
So for example, if you want to introduce a change to an ad and you think that this will lead to a change in CTR, then:
Your null hypothesis will be that this treatment will not impact the CTR and whatever difference you see between the CTR of the test and control group is just due to randomness.
Your alternative hypothesis is basically the opposite of the null hypothesis, i.e. the treatment you’re applying to the ad has an impact on CTR and that difference is not due to randomness.
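To make “just due to randomness” concrete, here’s a minimal simulation sketch (the 5% CTR and 5,000 impressions are made-up numbers, not from any real campaign): both ads share the exact same true CTR, yet the observed CTRs still come out different.

```python
import numpy as np

rng = np.random.default_rng(7)

true_ctr = 0.05        # the same true CTR for both ads, so the null is true
impressions = 5_000    # impressions served to each group

clicks_control = rng.binomial(impressions, true_ctr)
clicks_test = rng.binomial(impressions, true_ctr)

print(f"Control CTR: {clicks_control / impressions:.4f}")
print(f"Test CTR:    {clicks_test / impressions:.4f}")
# The observed CTRs will almost never match exactly even though no real
# difference exists; that noise is what the hypothesis test has to account for.
```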
Sample Size
Okay, now that you’ve decided on your hypothesis, you want to determine your sample size, which is how many observations you need on each side of the test before you start testing for statistical significance.
In the world of frequentist hypothesis testing, the widely held view is that you do not measure significance until you hit the minimum sample size. This is also known as fixed sample size testing.
The act of testing for significance before hitting the minimum sample size is called peeking; here’s an article that goes into this in more detail. Peeking is generally frowned upon because it can cause you to make decisions based on half-baked results. I say generally because there are also arguments against the strict practice of not peeking, which I will cover further down in this post.
How big your sample size needs to be depends on three variables:
- Effect size/Minimum Detectable Effect
- Power
- Significance level
Effect Size/Minimum Detectable Effect
This is the smallest difference between the test and control test statistic (in our case, the CTR) that you want your test to be able to detect.
Of course, in an ideal world we want our test to be able to detect the smallest of differences, but the trade-off is that you’re going to need a larger sample size to detect a smaller difference. Nothing is for free :)
A positive outcome of this is that it pushes you to test things that would result in big lifts and not small incremental ones. So you might want to rethink testing the effect of adding an exclamation mark.
Power
Power is basically the probability that your experiment would be able to detect a difference in the test statistic (CTR) if there indeed was a difference between test and control.
A common power to use is 80% and by sheer intuition we know that requiring a higher power means a larger required sample size. Nothing’s for free, remember?
We can’t talk about power without going into Power Analysis.
A power analysis is simply a calculation of the required sample size, taking into account the required power, effect size and significance level.
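Here’s a rough sketch of what that calculation can look like in Python with statsmodels (the 5% baseline CTR and the candidate lifts are made-up numbers). It converts the CTR lift into a standardized effect size and solves for the required sample size per group at 80% power and a 5% significance level.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.05   # made-up control CTR
power = 0.8           # 80% power
alpha = 0.05          # 5% significance level

analysis = NormalIndPower()

# Required sample size per group for a few minimum detectable effects
for lift in (0.01, 0.005, 0.0025):  # absolute lifts in CTR
    effect_size = proportion_effectsize(baseline_ctr + lift, baseline_ctr)
    n = analysis.solve_power(effect_size=effect_size, power=power,
                             alpha=alpha, alternative='two-sided')
    print(f"MDE of {lift:.2%} (absolute) -> ~{n:,.0f} impressions per group")
```

Note how halving the minimum detectable lift roughly quadruples the required sample size, which is the trade-off mentioned in the effect size section.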
Before we get into significance level, let’s talk about p-values.
What are p-values?
Perhaps the most popular term to be associated with A/B testing.
Going back to our example, the p-value is the probability of seeing a CTR difference at least as extreme as the one you’ve observed between the test and control group, assuming the null hypothesis is true.
(As a recap the null hypothesis is the view that your treatment has no impact on CTR).
Because it’s a probability measure, p-values range from 0.0 to 1.0. A low p-value lends credence to the alternative hypothesis that there is indeed an impact on CTR due to the treatment in the test group, i.e. there’s a low chance that any observed CTR difference is merely due to randomness, and so whatever treatment we gave to the test ad must have had a real impact on the CTR.
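One common way to compute that p-value for CTRs is a two-proportion z-test; here’s a minimal sketch using statsmodels, with made-up click and impression counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Made-up results: clicks and impressions for the test and control ads
clicks = [290, 250]           # test, control
impressions = [5_000, 5_000]  # test, control

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.3f}, p-value = {p_value:.4f}")
# A small p-value means a CTR difference this extreme would be unlikely
# if the treatment truly had no effect.
```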
But what counts as a low p-value?
This is a great segue into the idea of statistical significance.
Statistical Significance & Significance Level (α)
So what does it mean to be statistically significant?
Well basically, it means the p-value is low enough for you to be confident saying that any CTR impact you see in the test group is not due to randomness.
But how low and against what, you ask?
The p-value is compared against the significance level, and the results are declared significant if the p-value falls below that agreed-upon level.
So what’s a good significance level to use?
A long time ago, some dude named Ronald Fisher proposed that we should use 0.05 and most people seemed to agree, so that has been the benchmark ever since. Basically it’s just a convention that stuck around.
Of course, no one is stopping you if you want to go for 0.01, and there are people who do, especially in scientific circles, because they want to keep false positives to a minimum. But be aware that testing at a 0.01 significance level means you need a much larger sample to detect the same effect, and in the context of paid marketing, that means more time and/or money.
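To put a rough number on that cost, here’s a quick sketch (reusing the made-up 5% baseline CTR and a one percentage point lift) comparing the required sample size per group at a 5% versus a 1% significance level.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect_size = proportion_effectsize(0.06, 0.05)  # made-up lift: 5% -> 6% CTR

for alpha in (0.05, 0.01):
    n = NormalIndPower().solve_power(effect_size=effect_size, power=0.8,
                                     alpha=alpha, alternative='two-sided')
    print(f"alpha = {alpha}: ~{n:,.0f} impressions per group")
# Dropping alpha from 0.05 to 0.01 pushes the required sample size up by
# roughly 50% in this setup, i.e. more impressions, more time, more money.
```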
A/A Test
Like the name suggests, an A/A test is one where your control and test treatments are actually the same.
It’s good practice to run an A/A test first to make sure that the experimental setup does not falsely detect a difference when there is none. It’s also a good way to check that your test/control splitting methodology is sound.
If your setup does not register any significance in an A/A test, then you’re good to go.
It’s also important to note that for a test at a significance level of 5%, there will be a 5% chance you get a significant result even though there is no real difference!
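You can see that for yourself with a quick simulation sketch (the CTR, sample size and number of experiments are made-up numbers): run many A/A tests where both groups share the same true CTR and count how often the p-value falls below 0.05.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)

true_ctr = 0.05          # identical true CTR for both groups (A/A)
impressions = 10_000     # per group
n_experiments = 2_000
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    clicks_a = rng.binomial(impressions, true_ctr)
    clicks_b = rng.binomial(impressions, true_ctr)
    _, p = proportions_ztest([clicks_a, clicks_b], [impressions, impressions])
    if p < alpha:
        false_positives += 1

print(f"Significant A/A tests: {false_positives / n_experiments:.1%}")
# Expect something close to 5%: significance without any real difference.
```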
And this little nugget of info brings us to…
p-value hacking
As I mentioned above, for an A/A test run at a 5% significance level, where there is no actual difference between control and test, there is a 5% chance you will get a significant result.
Based on that knowledge, p-value hacking is the process of continuously running your test until you hit that 5% chance of statistical significance. Sneaky!
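Here’s a rough simulation of what that kind of repeated checking does (again with made-up numbers): both groups share the same true CTR, but we peek at the p-value after every batch of impressions and stop as soon as it looks significant. The false positive rate ends up far above 5%.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(1)

true_ctr = 0.05
batch = 500              # impressions added to each group between peeks
n_batches = 20           # peek 20 times, up to 10,000 impressions per group
n_experiments = 1_000
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    clicks_a = clicks_b = impressions = 0
    for _ in range(n_batches):
        impressions += batch
        clicks_a += rng.binomial(batch, true_ctr)
        clicks_b += rng.binomial(batch, true_ctr)
        _, p = proportions_ztest([clicks_a, clicks_b],
                                 [impressions, impressions])
        if p < alpha:            # stop at the first "significant" peek
            false_positives += 1
            break

print(f"'Significant' results with peeking: "
      f"{false_positives / n_experiments:.1%}")
# Typically well above 5%, even though there is never a real difference.
```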
There are some black sheep in the scientific community who do just that to get published, so don’t take any research paper at face value when they claim their tests are significant; what’s important is that their results are reproducible.
Confidence Intervals
Let’s say your control group’s CTR is 5% in an A/B test.
A 95% confidence interval (CI) of 4.5% to 5.5% means that if you ran this test many times and computed an interval each time, about 95% of those intervals would contain the true ad CTR.
Why do we need CIs and why does the metric vary? Well, because of randomness: a fair coin most likely won’t give you exactly 5 heads and 5 tails in 10 flips, so we use CIs to communicate that uncertainty.
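If you want to compute such an interval yourself, statsmodels has a helper for proportions; here’s a minimal sketch with made-up click and impression counts.

```python
from statsmodels.stats.proportion import proportion_confint

clicks = 250             # made-up numbers: 250 clicks...
impressions = 5_000      # ...out of 5,000 impressions, an observed CTR of 5%

low, high = proportion_confint(clicks, impressions,
                               alpha=0.05, method='wilson')
print(f"Observed CTR: {clicks / impressions:.2%}")
print(f"95% CI: [{low:.2%}, {high:.2%}]")
# The Wilson interval is a common choice for proportions; the default
# 'normal' method also works well for large samples.
```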
Alternate Viewpoints
Everything I just described above is the widely accepted school of thought.
There’s actually another take on how early/late you should stop your test. These guys who did a bunch of experiments over at Wikipedia have proposed another way to look at measuring the lift of an experiment.
In their example, instead of waiting for the minimum sample size, they stopped at the first sign of 70% confidence (p = 0.3). The chance of getting the test wrong is then only 15% (assuming a two-tailed test where you only care about the losing tail), which is higher than the usual 2.5% chance based on p = 0.05, but they argued it’s not that much higher.
So what are they getting back in return if they stop the test earlier? Testing velocity.
Their thesis is that even if they stopped earlier and exposed themselves to the risk of rolling out a loser, they would still be able to net out ahead due to a more aggressive test velocity. Furthermore, in their case they found that the chance of picking a loser despite stopping early is only 10%.
There’s definitely more to the world of A/B testing than what I just described, but this should hopefully be a good basic primer.