A/B Testing Statistics: Sample Size, Significance, and When to Stop Your Test

Published: January 10, 2025 • 11 min read • Category: Testing

The majority of A/B tests run by marketing teams are statistically invalid — not because teams are unsophisticated, but because A/B testing statistics are counterintuitive and the tooling often obscures what is really happening. The result: businesses make confident decisions based on tests that were stopped too early, run on too little traffic, or measured the wrong thing. This guide explains the statistics clearly enough to run valid tests and trust the results.

Use our free A/B test planner to calculate exact sample sizes and test durations for your specific situation.

What Statistical Significance Actually Means

When we say a result is "significant at 95% confidence," we mean: if there were truly no difference between the variants, there is only a 5% probability of observing a difference this large or larger due to random variation. It does not mean the result is definitely real. At 95% confidence, 1 in 20 statistically significant results will be false positives. This is mathematically unavoidable — knowing it changes how you interpret and act on results.
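The 1-in-20 rate is easy to see in an A/A simulation. The sketch below (plain Python, standard library only; the helper name and the specific trial counts are our own choices) runs many tests in which both variants truly convert at the same rate, and counts how often a two-proportion z-test still reports significance at the 95% level:

```python
import math
import random

def is_significant(conv_a, conv_b, n):
    """Two-sided two-proportion z-test at 95% confidence, equal group sizes."""
    pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pool * (1 - pool) / n)
    return se > 0 and abs(conv_b / n - conv_a / n) / se > 1.96

random.seed(42)
trials, n, p = 1_000, 2_000, 0.03  # A/A test: both variants truly convert at 3%
false_positives = sum(
    is_significant(
        sum(random.random() < p for _ in range(n)),  # conversions, variant A
        sum(random.random() < p for _ in range(n)),  # conversions, variant B
        n,
    )
    for _ in range(trials)
)
print(f"False positive rate: {false_positives / trials:.1%}")  # close to 5%
```

Despite no variant ever being better, roughly 1 in 20 simulated tests comes back "significant" — exactly the false positive rate the confidence level promises.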

The Four Numbers You Must Set Before Running Any Test

1. Baseline Conversion Rate

Your current conversion rate on the control page. The lower your baseline, the larger your required sample. A page converting at 1% requires roughly five times the sample size of a page converting at 5% to detect the same relative improvement. Measure your baseline for at least two weeks before starting a test.

2. Minimum Detectable Effect (MDE)

The smallest improvement you care enough to detect reliably. This is a business decision, not a statistical one. For high-traffic pages (100k+ monthly visitors), a 10–15% relative improvement is practical. For medium-traffic pages (10k–100k), aim for 20–30% relative. For low-traffic pages under 10k monthly visitors, A/B testing is often not feasible — use AI analysis instead.

3. Statistical Power (80%)

Power is the probability that your test will detect a real effect of your MDE size when one actually exists. The standard is 80% — meaning 20% of real improvements at your MDE size will not be detected (false negatives). Increasing power to 90% or 95% significantly increases required sample size.

4. Significance Level (5%)

The false positive rate you accept. The industry standard is 5% (95% confidence). For high-stakes decisions — pricing changes, major redesigns — this is appropriate. For quick iterative tests, some teams use a 10% significance level (90% confidence) to reduce required sample size, accepting the tradeoff of a higher false positive rate.

Sample Size Calculation

At 95% confidence, 80% power, testing a 20% relative improvement from a 3% baseline (3% → 3.6%), you need approximately 13,900 visitors per variant, roughly 27,800 total. The formula is: n = (Z_α/2 + Z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)². Our A/B test planner calculates this automatically — enter your baseline rate, traffic volume, and MDE to get exact numbers.
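The formula is straightforward to implement. A minimal sketch (plain Python; the function name is our own, and the z-values are hard-coded for 95% confidence and 80% power to avoid a scipy dependency):

```python
import math

def sample_size_per_variant(baseline, relative_mde, z_alpha=1.960, z_beta=0.842):
    """Visitors needed per variant for a two-sided two-proportion test.

    Defaults: z_alpha = 1.960 (95% confidence), z_beta = 0.842 (80% power).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p2 - p1) ** 2)

n = sample_size_per_variant(0.03, 0.20)  # 3% baseline, 20% relative MDE
print(n, 2 * n)  # roughly 13,900 per variant, 27,800 total
```

Note how the denominator squares the absolute difference (p₂−p₁): halving your MDE quadruples the required sample, which is why ambitious "detect a 5% lift" tests are rarely feasible on modest traffic.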

Test Duration: Why Sample Size Alone Is Not Enough

You must run tests for a minimum of 2 full business cycles (typically 2 weeks, ideally 4) regardless of when you hit sample size. B2B pages convert differently on Tuesday than Saturday. E-commerce pages behave differently on payday dates. The novelty effect — returning visitors converting differently on a new variant simply because it is new — diminishes after 5–7 days. A test stopped at day 3 may be capturing novelty, not genuine preference.

The Four Most Dangerous Statistical Mistakes

Mistake 1: Peeking and Stopping Early

Checking results and stopping as soon as you see p < 0.05 inflates your false positive rate well beyond the nominal 5%: with 10 interim checks it climbs to roughly 20%, and it rises further the more often you look. This is the most widespread mistake in A/B testing. Fix: set a fixed end date before the test starts. Do not act on results before it.
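The inflation is easy to demonstrate by simulation. This sketch (plain Python; the peek count and batch size are illustrative) runs A/A tests, peeks after every batch of visitors, and stops at the first "significant" reading:

```python
import math
import random

def is_significant(conv_a, conv_b, n):
    """Two-sided two-proportion z-test at 95% confidence, equal group sizes."""
    pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pool * (1 - pool) / n)
    return se > 0 and abs(conv_b / n - conv_a / n) / se > 1.96

random.seed(7)
trials, p, peeks, batch = 500, 0.05, 10, 500  # A/A test: no real difference
stopped_early = 0
for _ in range(trials):
    conv_a = conv_b = n = 0
    for _ in range(peeks):
        conv_a += sum(random.random() < p for _ in range(batch))
        conv_b += sum(random.random() < p for _ in range(batch))
        n += batch
        if is_significant(conv_a, conv_b, n):  # peek; stop on "significance"
            stopped_early += 1
            break

print(f"False positive rate with {peeks} peeks: {stopped_early / trials:.0%}")
```

Each individual peek still has a 5% false positive rate; the problem is that ten chances to cross the threshold compound, so a test with no real effect "wins" far more than 5% of the time.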

Mistake 2: Multiple Variants Without Correction

Running A/B/C/D tests (4 variants) without adjusting your significance threshold. With 6 pairwise comparisons, even at 95% confidence per comparison, the probability of at least one false positive exceeds 26%. Fix: apply Bonferroni correction (divide α by the number of comparisons) or run sequential A/B tests.
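Both numbers fall out of a two-line calculation, sketched here in plain Python:

```python
comparisons = 6      # A/B/C/D: 4 variants -> C(4, 2) = 6 pairwise comparisons
alpha = 0.05

# Family-wise error rate without correction: P(at least one false positive)
fwer = 1 - (1 - alpha) ** comparisons
print(f"Uncorrected FWER: {fwer:.1%}")                 # 26.5%

# Bonferroni correction: test each comparison at alpha / comparisons
alpha_bonferroni = alpha / comparisons
fwer_corrected = 1 - (1 - alpha_bonferroni) ** comparisons
print(f"Corrected threshold: {alpha_bonferroni:.4f}")  # 0.0083
print(f"Corrected FWER: {fwer_corrected:.1%}")         # about 4.9%
```

The corrected per-comparison threshold of roughly 0.008 brings the chance of any false positive across all six comparisons back under 5%, at the cost of requiring larger samples per variant.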

Mistake 3: Measuring the Wrong Conversion Event

Optimising for scroll depth, time on page, or click-through rate when your real goal is purchases. Improving an intermediate metric can leave actual revenue unchanged. Fix: define your primary metric as the action that generates direct business value.

Mistake 4: Ignoring Segment-Level Results

An overall 15% conversion improvement that masks a 40% desktop improvement and a 5% mobile decline is not a win — it is a masked mobile disaster. Always break results down by device type, traffic source, and new vs returning visitors before declaring a winner.
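A toy breakdown (all numbers hypothetical) shows how an aggregate lift can hide a segment decline:

```python
# Hypothetical conversions / visitors per segment and variant
segments = {
    "desktop": {"control": (300, 10_000), "variant": (420, 10_000)},
    "mobile":  {"control": (200, 10_000), "variant": (190, 10_000)},
}

def lift(control, variant):
    """Relative conversion-rate change of the variant over the control."""
    (c_conv, c_n), (v_conv, v_n) = control, variant
    return (v_conv / v_n) / (c_conv / c_n) - 1

overall = lift(
    (sum(s["control"][0] for s in segments.values()),
     sum(s["control"][1] for s in segments.values())),
    (sum(s["variant"][0] for s in segments.values()),
     sum(s["variant"][1] for s in segments.values())),
)
print(f"Overall: {overall:+.0%}")  # +22%: looks like a clean win
for name, s in segments.items():
    print(f"{name}: {lift(s['control'], s['variant']):+.0%}")  # +40% / -5%
```

Shipping this "winner" site-wide would trade mobile revenue for desktop revenue, which is the decision the segment breakdown exists to surface.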

When A/B Testing Is Not the Right Tool

A/B testing requires sufficient converting traffic. At 1,000 monthly visitors converting at 2%, reaching statistical significance for a 25% relative improvement at 80% power takes roughly 28 months, well over two years. That is far too slow to be useful.
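The duration estimate follows from the same sample-size formula, assuming a 50/50 traffic split and the standard z-values (95% confidence, 80% power); the function name is our own:

```python
import math

def months_to_significance(baseline, relative_mde, monthly_visitors,
                           z_alpha=1.960, z_beta=0.842):
    """Months to collect the required total sample with a 50/50 traffic split."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    per_variant = ((z_alpha + z_beta) ** 2
                   * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)
    return 2 * per_variant / monthly_visitors

months = months_to_significance(0.02, 0.25, 1_000)
print(f"{months:.0f} months")  # roughly 28 months
```

Running the same numbers at 20,000 monthly visitors brings the duration down to about six weeks, which is the practical threshold the next paragraph is pointing at.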

For low-traffic pages, use AI-powered analysis to identify and fix the most obvious issues first. Once traffic and conversion rates improve, A/B testing becomes feasible. The most efficient approach: AI analysis identifies the top 3–5 opportunities, A/B testing validates the specific implementation. Read our comparison of AI analysis vs A/B testing to understand when each wins.

Next Steps