Wednesday, July 1, 2026

A/B Testing for Data Scientists: Complete Statistical Guide (2026)


A/B testing is how data-driven companies make product decisions. Large tech companies routinely run many experiments in parallel. This guide walks through the full statistical framework, from sample size calculation to the final decision.

The A/B Testing Process

  1. Define the hypothesis: what change do you want to test?
  2. Choose your primary metric (conversion rate, revenue, CTR)
  3. Calculate required sample size before starting
  4. Randomly split users into control (A) and treatment (B)
  5. Run until sample size is reached — do not stop early
  6. Analyse with a statistical test and make a decision
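Step 4 is often implemented by hashing a user identifier, so that the same user always lands in the same group without storing an assignment table. The following is a minimal sketch of that idea, not a prescribed method; the function name `assign_variant`, the experiment name, and the user ID format are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as an integer and scale it to [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "B" if bucket < treatment_share else "A"

# Hypothetical user IDs and experiment name, for illustration only
users = [f"user-{i}" for i in range(10_000)]
groups = [assign_variant(u, "checkout-button-test") for u in users]
share_b = groups.count("B") / len(groups)
print(f"Share assigned to B: {share_b:.3f}")
```

Because the assignment depends only on the user ID and the experiment name, a returning user sees the same variant on every visit.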

Sample Size Calculation

from scipy import stats
import numpy as np

def required_sample_size(baseline, mde, alpha=0.05, power=0.8):
    """Users needed per group to detect an absolute lift of `mde` (two-sided test)."""
    p1, p2 = baseline, baseline + mde
    z_a = stats.norm.ppf(1 - alpha/2)   # critical value for the significance level
    z_b = stats.norm.ppf(power)         # quantile for the desired power
    p_pool = (p1 + p2) / 2
    n = ((z_a * np.sqrt(2*p_pool*(1-p_pool)) + z_b * np.sqrt(p1*(1-p1)+p2*(1-p2)))**2
         / (p2-p1)**2)
    return int(np.ceil(n))

# mde is absolute: 0.02 means a change from 10% to 12%
n = required_sample_size(baseline=0.10, mde=0.02)
print(f'Need {n:,} users per group ({n*2:,} total)')
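For reference, the function above corresponds to the usual normal-approximation formula for comparing two proportions. In the notation below (chosen here for illustration), $p_1$ is the baseline rate, $p_2 = p_1 + \text{MDE}$, $\bar{p} = (p_1 + p_2)/2$, and $1 - \beta$ is the power:

```latex
n = \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
         {(p_2 - p_1)^{2}}
```

With a 10% baseline and a 2 percentage point MDE at the default settings, this works out to about 3,841 users per group. The squared denominator means that halving the MDE roughly quadruples the required sample.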

Simulate and Run the Test

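np.random.seed(42)
n = 3841  # per-group size given by required_sample_size(baseline=0.10, mde=0.02)
control   = np.random.binomial(1, 0.10, n)   # 10% true conversion rate
treatment = np.random.binomial(1, 0.12, n)   # 12% true conversion rate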

print(f'Control:   {control.mean()*100:.2f}% ({control.sum()}/{n})')
print(f'Treatment: {treatment.mean()*100:.2f}% ({treatment.sum()}/{n})')

Statistical Test

from statsmodels.stats.proportion import proportions_ztest

# Two-sided z-test for the difference between two proportions
counts = [treatment.sum(), control.sum()]
nobs   = [n, n]
z_stat, p_value = proportions_ztest(counts, nobs)

print(f'p-value: {p_value:.4f}')
if p_value < 0.05:
    # Relative lift of treatment over control; the sign shows the direction
    lift = (treatment.mean() - control.mean()) / control.mean() * 100
    print(f'Significant! Lift: {lift:+.1f}%')
else:
    print('Not significant: cannot conclude that A and B differ')
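A p-value says whether a difference is detectable, not how large it is. One common companion is a confidence interval for the absolute difference in conversion rates. The sketch below uses a simple Wald interval with an unpooled standard error; the function name `diff_confidence_interval` and the counts are hypothetical and are not output from the simulation above.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (rate_B - rate_A), using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts, for illustration only
low, high = diff_confidence_interval(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(f"95% CI for the absolute lift: [{low*100:.2f}, {high*100:.2f}] percentage points")
```

If the whole interval sits below the smallest lift worth shipping, the result may be statistically significant yet not practically useful.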

Common Mistakes

  • Peeking: Stopping early when you see significance inflates false positives
  • Multiple metrics: Testing many metrics at once raises the chance that at least one looks significant by luck; apply a correction such as Bonferroni
  • Novelty effect: Run long enough for initial curiosity to wear off
  • Ignoring practical significance: A tiny lift may be statistically significant but not worth shipping
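The peeking problem is easy to see in a simulation. The sketch below runs A/A tests (both groups share the same true rate, so every "significant" result is a false positive) and compares stopping at the first significant interim look against a single look at the end. The helper names, the five-look schedule, and the group size are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_p(conv_a, conv_b, n):
    """Pooled two-proportion z-test p-value for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_aa(rng, rate=0.10, n=1000, looks=5, alpha=0.05):
    """One A/A test. Returns (significant at any look, significant at final look)."""
    conv_a = conv_b = 0
    step = n // looks
    any_look = False
    p = 1.0
    for look in range(1, looks + 1):
        # Add the next batch of users to each group
        for _ in range(step):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
        p = two_sided_p(conv_a, conv_b, look * step)
        any_look = any_look or p < alpha
    return any_look, p < alpha

rng = random.Random(42)
results = [simulate_aa(rng) for _ in range(1000)]
rate_peeking = sum(r[0] for r in results) / len(results)
rate_final = sum(r[1] for r in results) / len(results)
print(f"False positive rate, stop at first significant look: {rate_peeking:.1%}")
print(f"False positive rate, single look at the end:          {rate_final:.1%}")
```

The single final look should stay near the nominal 5%, while the stop-when-significant rule triggers noticeably more often, even though there is no real difference between the groups.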

FAQ

How long should I run an A/B test?

Until your pre-calculated sample size is reached — never stop early because results look good. At minimum, run for one full business cycle (7 days) to account for weekday/weekend effects.
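To turn a sample size into a run time, divide by the traffic entering the experiment and round up to whole weeks. The sketch below assumes a steady daily traffic figure; the function name `run_length_days` and the traffic number are hypothetical.

```python
from math import ceil

def run_length_days(n_per_group: int, daily_users: int, groups: int = 2) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days = ceil(n_per_group * groups / daily_users)
    # Round up to a multiple of 7 so every weekday is represented equally.
    return max(7, ceil(days / 7) * 7)

# Hypothetical traffic: 900 eligible users per day enter the experiment
print(run_length_days(n_per_group=3841, daily_users=900))  # → 14
```

Here the raw requirement is 9 days, which rounds up to two full weeks.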

What if my test is inconclusive?

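Inconclusive is a valid result. It means the data are compatible with no effect, or with an effect too small for this test to detect reliably; it does not prove that the change has no effect. You can keep the control (you have failed to reject the null, which is not the same as proving it), rerun the test with a larger pre-calculated sample size, or revise your hypothesis.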
