AI Implements A/B Tests with Math.random() — And It Is Wrong
AI implements A/B testing as: if (Math.random() > 0.5) { showNewVersion(); } else { showOldVersion(); }. This has three fatal flaws: the user sees a different variant on every page load (confusing UX and invalid data), there is no tracking of which variant the user saw (impossible to measure impact), and there is no statistical significance check (declaring a winner after 10 data points is not science — it is noise).
Proper A/B testing requires: deterministic assignment (the same user always sees the same variant), sufficient sample size (statistical power), significance testing (p-value or Bayesian credible interval), and clean measurement (one change per experiment, primary metric defined upfront). AI generates none of these.
These rules cover: deterministic assignment, sample size and significance, experiment platforms, variant tracking and analysis, and experiment lifecycle (start, run, conclude, clean up).
Rule 1: Deterministic Assignment — Same User, Same Variant
The rule: 'Assign users to variants deterministically using a hash: const bucket = murmurhash(userId + experimentId) % 100. bucket < 50 = control, bucket >= 50 = treatment. The same userId + experimentId always produces the same bucket — the user sees the same variant every time, across page loads, across sessions, across devices (if authenticated). Never use Math.random() for variant assignment.'
For anonymous users: 'For unauthenticated users, generate a stable anonymous ID: store in a cookie (ab_visitor_id) or localStorage. Use this ID for hashing. When the user authenticates, migrate their experiment assignments to their user ID — maintaining consistency across the anonymous-to-authenticated transition.'
For traffic allocation: 'Use the hash bucket for both: which experiment the user is in (only 20% of traffic participates) and which variant they see (50/50 split of participants). bucket 0-79 = not in experiment. bucket 80-89 = experiment control. bucket 90-99 = experiment treatment. This ensures: 20% participation, 50/50 split within participants, deterministic assignment.'
- Hash(userId + experimentId) % 100 — deterministic, consistent, reproducible
- Same user = same variant — across page loads, sessions, devices
- Never Math.random() — users flip between variants, data is invalid
- Anonymous: stable cookie ID — migrate to userId on authentication
- Traffic allocation with bucket ranges: 0-79 excluded, 80-89 control, 90-99 treatment
Math.random() gives user A the new checkout on load 1 and the old checkout on load 2. The user is confused and your data is meaningless. hash(userId + experimentId) gives user A the same variant every time — valid, measurable, consistent.
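The bucketing scheme above can be sketched in a few lines. FNV-1a is used here as a simple dependency-free stand-in for murmurhash — any stable, well-distributed hash works, as long as you never change it mid-experiment:

```typescript
// Deterministic bucketing: the same userId + experimentId always lands in
// the same bucket, across page loads, sessions, and devices.
// FNV-1a is a stand-in for murmurhash; any stable hash works.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

type Variant = "excluded" | "control" | "treatment";

// 20% participation: buckets 0-79 excluded, 80-89 control, 90-99 treatment.
function assignVariant(userId: string, experimentId: string): Variant {
  const bucket = fnv1a(userId + experimentId) % 100;
  if (bucket < 80) return "excluded";
  return bucket < 90 ? "control" : "treatment";
}
```

Because the hash input includes the experimentId, the same user gets independent buckets in different experiments — being in the treatment of one experiment does not correlate with being in the treatment of another.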
Rule 2: Statistical Significance — Not Gut Feeling
The rule: 'Define the primary metric and minimum detectable effect BEFORE starting the experiment. Calculate the required sample size: for a 2% conversion rate with a 10% relative minimum detectable effect, you need roughly 80,000 users per variant (at 80% power, α = 0.05). Run the experiment until you reach the required sample size — never peek and stop early. Use a significance threshold of p < 0.05 (frequentist) or 95% credible interval (Bayesian).'
For peeking: 'Do not check results daily and stop when you see significance — this inflates the false positive rate (peeking problem). Use sequential testing (group sequential design) if you need to peek: it adjusts the significance threshold for each look. Or use Bayesian methods — they handle early stopping more naturally than frequentist methods.'
AI declares winners after 50 data points — statistically meaningless noise. A proper experiment needs thousands of data points per variant. Use a sample size calculator before starting: input your baseline metric, minimum detectable effect, and significance level — it tells you how many users you need and how long to run.
50 data points is noise, not signal. A 2% conversion rate needs roughly 80,000 users per variant to detect a 10% relative lift at 80% power. Use a sample size calculator. If you cannot get enough users, the experiment is not viable — do not run it.
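The per-variant sample size can be sketched with the standard two-proportion normal approximation. The 80% power and two-sided α = 0.05 here are conventional defaults, not requirements — plug in your own:

```typescript
// Per-variant sample size for a two-proportion test (normal approximation).
// Assumptions: two-sided alpha = 0.05 (z = 1.96), power = 0.80 (z = 0.8416).
function sampleSizePerVariant(baseline: number, relativeMde: number): number {
  const zAlpha = 1.96;  // two-sided 95% significance
  const zBeta = 0.8416; // 80% power
  const p1 = baseline;
  const p2 = baseline * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// 2% baseline, 10% relative lift → roughly 80,000 users per variant.
const needed = sampleSizePerVariant(0.02, 0.10);
```

Note how the required n scales with the inverse square of the detectable effect: halving the MDE quadruples the sample size. That is why small lifts on low-baseline metrics need enormous traffic.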
Rule 3: Use an Experiment Platform
The rule: 'Use an experiment platform for all A/B tests: PostHog, Statsig, LaunchDarkly, Optimizely, or GrowthBook. These handle: deterministic assignment, variant tracking, statistical analysis, and result visualization. Never build a custom A/B testing system — the statistics, edge cases, and analysis tools are complex. The platform handles: sample size calculation, significance testing, variant assignment, and result dashboards.'
For integration: 'Initialize the experiment client at app startup. Check variant in code: const variant = experiment.getVariant("checkout-flow", userId); if (variant === "new") { ... }. Track the conversion event: experiment.track("purchase_completed", userId, { revenue }). The platform correlates: which users saw which variant, and which users converted — calculating the statistical significance automatically.'
AI builds custom A/B testing from scratch — assignment logic, tracking tables, analysis queries, and significance calculations. All of this exists in platforms that teams of statisticians have built and validated. Use the platform. Focus your engineering on the variants, not the infrastructure.
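One light pattern that keeps this discipline without building infrastructure: a thin interface in front of whatever platform SDK you adopt, so variant checks and conversion tracking stay in one place. The `ExperimentClient` interface below is an illustration, not any real SDK's API — the in-memory stand-in exists only so the sketch runs without a platform:

```typescript
// Hypothetical wrapper interface — the real shape comes from your platform's
// SDK (PostHog, Statsig, GrowthBook, etc.), not from this sketch.
interface ExperimentClient {
  getVariant(experimentId: string, userId: string): string;
  track(event: string, userId: string, props?: Record<string, unknown>): void;
}

// In-memory stand-in so the example runs without a real platform.
function makeFakeClient(assignments: Record<string, string>): ExperimentClient {
  const events: Array<{ event: string; userId: string }> = [];
  return {
    getVariant: (experimentId, userId) =>
      assignments[`${userId}:${experimentId}`] ?? "control",
    track: (event, userId) => {
      events.push({ event, userId });
    },
  };
}

const client = makeFakeClient({ "u1:checkout-flow": "new" });
const variant = client.getVariant("checkout-flow", "u1");
if (variant === "new") {
  // render the new checkout
}
client.track("purchase_completed", "u1", { revenue: 49.0 });
```

The wrapper is a seam, not a system: swapping platforms later means reimplementing two methods, not rewriting every call site.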
Rule 4: Track Everything — Variant, Exposure, Conversion
The rule: 'Track three events per experiment: exposure (user was shown the variant — not just assigned, but actually rendered), conversion (user completed the target action — purchase, signup, click), and context (user properties at exposure time — plan, device, country). Without exposure tracking, you cannot distinguish: users who were assigned but never saw the experiment vs. users who saw it and did not convert.'
For the exposure event: 'Fire the exposure event when the variant is actually rendered — not when assigned. If a user is assigned to the new checkout but never visits the checkout page, they should not be in the experiment analysis. Exposure-based analysis is more accurate than intent-to-treat (assignment-based) for web experiments.'
For guardrail metrics: 'Define guardrail metrics that must not degrade: page load time (new feature must not slow the page), error rate (new feature must not increase errors), and revenue (if testing a checkout change, revenue must not drop). If a guardrail metric degrades, stop the experiment — even if the primary metric improves.'
- Track: exposure (variant rendered), conversion (action completed), context (user props)
- Exposure on render, not assignment — users who never see the variant are excluded
- Guardrail metrics: page speed, error rate, revenue — must not degrade
- Primary metric defined upfront — do not change mid-experiment
- Context at exposure: plan, device, country — enables segment analysis later
Tracking assignment counts users who never saw the variant. Tracking exposure (on render) counts only users who actually experienced it. Exposure-based analysis is more accurate for web experiments where not everyone visits the experimental page.
Rule 5: Experiment Lifecycle — Start, Run, Conclude, Clean Up
The rule: 'Experiment lifecycle: 1) Design: define hypothesis, primary metric, guardrails, sample size, and duration. 2) Implement: code behind feature flag, add tracking. 3) Run: start experiment, do not peek or change. 4) Analyze: reach sample size, check significance, check guardrails. 5) Decide: ship winner, revert loser, or iterate. 6) Clean up: remove losing code path, remove flag, document results.'
For documentation: 'Document every experiment result: hypothesis, variants, metrics, sample size, significance, and decision. This prevents: re-running experiments that already failed (waste), shipping changes that were already tested and lost (regression), and losing institutional knowledge about what works and what does not.'
AI implements experiments with no documentation, no lifecycle, and no cleanup. After the experiment concludes, both code paths remain — the flag is never removed, the losing variant is never deleted. Apply the same cleanup rules as feature flags: winner at 100% for 2 weeks = remove the flag and the losing code path.
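A lightweight way to keep that documentation is one record per experiment, written at design time and completed at conclusion. The field names and values below are illustrative placeholders, not a standard:

```yaml
# Illustrative experiment record — every field name and value here is a
# placeholder; adapt to your team's conventions.
experiment: checkout-flow-v2
hypothesis: "One-page checkout increases purchase conversion"
variants: [control, one-page-checkout]
primary_metric: purchase_conversion
guardrails: [page_load_p95, error_rate, revenue_per_visitor]
sample_size_per_variant: 80000
status: concluded
decision: ship-treatment
cleanup: "flag removed, losing code path deleted"
```

Keeping these records in the repo next to the code means the next engineer who proposes "let's try a one-page checkout" finds out in seconds that it was already tested.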
Complete A/B Testing Rules Template
Consolidated rules for A/B testing.
- Deterministic assignment: hash(userId + experimentId) — never Math.random()
- Same user = same variant — across loads, sessions, devices
- Sample size calculated upfront — never declare winner with <1000 data points
- p < 0.05 or 95% credible interval — no peeking without sequential testing
- Experiment platform: PostHog/Statsig/GrowthBook — never custom infrastructure
- Track: exposure (on render), conversion (on action), guardrails (must not degrade)
- Lifecycle: design → implement → run → analyze → decide → clean up
- Document results — remove losing code — remove flag after conclusion