A/B testing is how data-driven companies make product decisions. Large tech companies routinely run many experiments in parallel. This guide walks through the full statistical framework, from sample size calculation to the final decision.
The A/B Testing Process
- Define the hypothesis: what change do you want to test?
- Choose your primary metric (conversion rate, revenue, CTR)
- Calculate required sample size before starting
- Randomly split users into control (A) and treatment (B)
- Run until sample size is reached — do not stop early
- Analyse with a statistical test and make a decision
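The random split is often implemented by hashing a user identifier, so that a returning user always lands in the same group. The following is a minimal sketch of that idea using only the standard library. The function name `assign_variant`, the experiment label, and the 50/50 default are assumptions made for illustration, not part of any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment).

    Hypothetical helper: hashing the experiment name together with the user id
    keeps assignments stable per user and independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1)
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"


# The same user always gets the same variant for a given experiment
print(assign_variant("user-123", "checkout-button"))
```

Hashing is a stand-in for a true random draw here. It trades a little statistical purity for the practical benefit that no assignment table needs to be stored.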
Sample Size Calculation
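The calculation in this section uses the standard normal-approximation formula for comparing two proportions. Written out, with p_1 as the baseline rate, p_2 as the baseline plus the minimum detectable effect, and 1 - beta as the desired power, the per-group sample size is:

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
         {(p_2 - p_1)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

The symbols map directly onto the variables `z_a`, `z_b`, `p_pool`, `p1`, and `p2` in the code.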
from scipy import stats
import numpy as np
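def required_sample_size(baseline, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift of `mde` (two-sided test)."""
    p1, p2 = baseline, baseline + mde
    z_a = stats.norm.ppf(1 - alpha/2)  # critical value for the significance level
    z_b = stats.norm.ppf(power)        # quantile for the desired power
    p_pool = (p1 + p2) / 2
    n = ((z_a * np.sqrt(2*p_pool*(1-p_pool)) + z_b * np.sqrt(p1*(1-p1)+p2*(1-p2)))**2
         / (p2-p1)**2)
    return int(np.ceil(n))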
n = required_sample_size(baseline=0.10, mde=0.02)
print(f'Need {n:,} users per group ({n*2:,} total)')
Simulate and Run the Test
np.random.seed(42)
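# Use the pre-calculated sample size so the simulated test is adequately powered
n = required_sample_size(baseline=0.10, mde=0.02)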
control = np.random.binomial(1, 0.10, n)
treatment = np.random.binomial(1, 0.12, n)
print(f'Control: {control.mean()*100:.2f}% ({control.sum()}/{n})')
print(f'Treatment: {treatment.mean()*100:.2f}% ({treatment.sum()}/{n})')
Statistical Test
# proportions_ztest is provided by statsmodels, not scipy
from statsmodels.stats.proportion import proportions_ztest
counts = [treatment.sum(), control.sum()]
nobs = [n, n]
z_stat, p_value = proportions_ztest(counts, nobs)
print(f'p-value: {p_value:.4f}')
if p_value < 0.05:
    lift = (treatment.mean() - control.mean()) / control.mean() * 100
    # The test is two-sided, so a significant result can also be a negative lift
    print(f'Significant! Lift: {lift:+.1f}%')
else:
    print('Not significant: cannot conclude B is better')
Common Mistakes
- Peeking: Stopping early when you see significance inflates false positives
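- Multiple metrics: Testing many metrics at once raises the chance that at least one looks significant by luck; apply a correction such as Bonferroni (divide alpha by the number of metrics tested)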
- Novelty effect: Run long enough for initial curiosity to wear off
- Ignoring practical significance: A tiny lift may be statistically significant but not worth shipping
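The cost of peeking can be made concrete with a small simulation. The sketch below runs A/A tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It then compares checking the p-value at several interim looks against checking only once at the end. The function names, the 10 looks, and the 1,000 simulated experiments are illustrative assumptions, and the z-test is written out by hand with the standard library so the example has no extra dependencies.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(rate=0.10, n_per_group=1000, looks=10,
                     n_sims=1000, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    step = n_per_group // looks
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        x_a = x_b = 0
        ever_significant = False
        p = 1.0
        for look in range(1, looks + 1):
            # Add the next batch of users to each group, then test the data so far
            x_a += sum(rng.random() < rate for _ in range(step))
            x_b += sum(rng.random() < rate for _ in range(step))
            p = two_proportion_p_value(x_a, look * step, x_b, look * step)
            if p < alpha:
                ever_significant = True
        peeking_hits += ever_significant  # would have stopped at some look
        final_hits += p < alpha           # p from the last look only
    return peeking_hits / n_sims, final_hits / n_sims


peek_rate, final_rate = simulate_peeking()
print(f"False positive rate with peeking:   {peek_rate:.1%}")
print(f"False positive rate, single check:  {final_rate:.1%}")
```

The single-check rate should land near the nominal 5%, while the peeking rate comes out well above it, which is why the sample size is fixed in advance.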
FAQ
How long should I run an A/B test?
Until your pre-calculated sample size is reached — never stop early because results look good. At minimum, run for one full business cycle (7 days) to account for weekday/weekend effects.
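As a rough planning aid, the run time follows from the required sample size and the daily traffic entering the experiment. The sketch below assumes a hypothetical helper `planned_duration_days` and rounds up to whole weeks so that every weekday is represented equally; that rounding is a design choice of this example rather than a rule from the guide.

```python
import math


def planned_duration_days(n_per_group, daily_users, min_days=7):
    """Estimate how many days a two-group test needs to run.

    n_per_group: required users per group from the sample size calculation
    daily_users: eligible users entering the experiment per day (both groups combined)
    """
    total_needed = 2 * n_per_group
    days_for_sample = math.ceil(total_needed / daily_users)
    days = max(days_for_sample, min_days)
    # Round up to whole weeks to balance weekday and weekend traffic
    return math.ceil(days / 7) * 7


# 8,000 users at 1,000 per day is 8 days of traffic, rounded up to two weeks
print(planned_duration_days(n_per_group=4000, daily_users=1000))  # 14
```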
What if my test is inconclusive?
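Inconclusive is a valid result. It means the data do not provide enough evidence of an effect. Any true effect is probably smaller than your minimum detectable effect, although a test with 80% power still misses an effect of that size about 20% of the time. You can keep the control (failing to reject the null is not proof that there is no effect), plan a new test with a larger sample, or revise your hypothesis.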
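One way to see what an inconclusive result does and does not rule out is a confidence interval for the difference in conversion rates. The sketch below uses the unpooled normal approximation (a Wald interval) and only the standard library. The function name `diff_confidence_interval` and the counts in the usage example are made up for illustration.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(x_a, n_a, x_b, n_b, alpha=0.05):
    """Wald confidence interval for (treatment rate - control rate)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Hypothetical counts: 200/2000 conversions in control, 230/2000 in treatment
low, high = diff_confidence_interval(x_a=200, n_a=2000, x_b=230, n_b=2000)
mde = 0.02
print(f"95% CI for the absolute lift: [{low:+.4f}, {high:+.4f}]")
if low <= 0 <= high:
    print("Interval contains zero: the result is inconclusive")
if high >= mde:
    print("Interval also contains the MDE: a meaningful lift is not ruled out")
```

When the interval contains both zero and the minimum detectable effect, the test has not shown that the change is useless. It has only failed to show that it helps.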



