Replace Guesswork with Experiments
A common rule debate: 'Should the error handling rule use the Result pattern or try-catch with custom error classes?' Both sides: have reasonable arguments. Without an experiment: the debate continues indefinitely, or the loudest voice wins. With an A/B test: deploy the Result pattern to Team A's projects. Deploy try-catch to Team B's projects. After 2 weeks: compare code review time, defect rate, and developer satisfaction between the groups. The data: resolves the debate objectively.
When to A/B test: when two approaches are both reasonable and the team cannot decide (error handling patterns, testing strategies, file structure conventions), when a proposed rule change is controversial (some developers strongly favor the change, others strongly oppose it), and when the expected impact is significant (the rule affects a large portion of AI-generated code). Do NOT A/B test: trivial rules (naming preferences), security rules (always deploy the more secure option), or rules with one clearly superior option (no experiment needed).
The A/B test requirement: at least 2 teams of similar size and technology stack. If you have only 1 team: use a before/after comparison instead (measure with the old rule, then switch and measure with the new rule). A/B testing: more rigorous (controls for other variables) but requires more teams. Before/after: simpler but less rigorous. AI rule: 'A/B testing: for organizations with multiple comparable teams (20+ developers). Before/after: for smaller teams or single-team projects.'
Step 1: Design the Experiment (15 Minutes)
Define the variants: Variant A (the current rule, the control) and Variant B (the proposed change, the treatment). Both variants: identical except for the specific rule being tested. Example: Variant A: 'Error handling: use try-catch with custom AppError classes.' Variant B: 'Error handling: use the Result<T, E> pattern. Never throw in business logic.' Everything else in the rule file: identical between variants.
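In code, the two variants produce quite different call sites. A minimal TypeScript sketch of each style (the AppError class and parsePrice functions are illustrative names, not taken from the rule file):

```typescript
// Variant A: try-catch with a custom error class. Callers must remember
// to wrap calls in try-catch; forgetting is a silent failure mode.
class AppError extends Error {
  constructor(message: string, public code: string) {
    super(message);
  }
}

function parsePriceA(input: string): number {
  const n = Number(input);
  if (Number.isNaN(n)) throw new AppError(`invalid price: ${input}`, "PARSE_ERROR");
  return n;
}

// Variant B: Result<T, E>. Errors are values; business logic never throws.
// The type system forces callers to check `ok` before using the value.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function parsePriceB(input: string): Result<number, string> {
  const n = Number(input);
  return Number.isNaN(n)
    ? { ok: false, error: `invalid price: ${input}` }
    : { ok: true, value: n };
}
```

The experiment measures which of these two disciplines produces faster reviews and fewer defects in practice, rather than arguing about it in the abstract.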
Define the metrics: what will you measure to determine the winner? Primary metric: the one that determines the outcome (e.g., code review time for error-handling-related comments). Secondary metrics: additional signals (defect rate, developer satisfaction, AI compliance score). Define the success threshold: Variant B wins if it improves the primary metric by at least 10%. If improvement is less than 10%: the difference is not significant enough to justify the migration cost.
Assign teams: randomly assign comparable teams. Team A (3 developers, TypeScript backend): gets Variant A. Team B (3 developers, TypeScript backend): gets Variant B. The teams: should be similar in size, technology, and project complexity. If teams differ significantly: the results may reflect team differences rather than rule differences. Duration: 2-4 weeks (long enough to collect meaningful data, short enough to not burden teams with a potentially inferior variant). AI rule: 'Similar teams, same duration, one variable changed. The experiment isolates the rule's impact from other factors.'
Without a threshold: 'Variant B is 3% better on review time.' Is that significant? Worth migrating 50 repos? Probably not. With a threshold: 'Variant B wins if it improves the primary metric by at least 10%.' The 3% result: below threshold. Decision: keep the current rule (the improvement is not worth the migration cost). The threshold: prevents deploying changes that are statistically better but practically insignificant.
Step 2: Run the Experiment (2-4 Weeks)
Deploy the variants: create two versions of the CLAUDE.md (or the relevant section). Deploy Variant A to Team A's repos. Deploy Variant B to Team B's repos. Both teams: aware they are part of an experiment (transparency, not a secret test). They know: the experiment compares two error handling approaches. They do not know: which variant is expected to win (to avoid bias). AI rule: 'Transparent experiments: both teams know they are testing. Blind to the hypothesis: neither team knows which variant is expected to be better.'
Collect data continuously: during the experiment, track: PR review time (from GitHub/GitLab analytics, automated), code review comments about error handling (manual categorization or keyword search), defect rate (from the issue tracker, bugs related to error handling), AI compliance (weekly test prompt: does the AI follow the variant's rule?), and developer feedback (Slack comments, informal check-ins). AI rule: 'Automate metric collection where possible. Manual tracking: for nuanced metrics like review comment categorization. The less manual work: the more reliable the data.'
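The keyword-search approach to categorizing review comments can be as simple as a filter over exported comment bodies. A rough sketch; the keyword list and the ReviewComment shape are assumptions, so adapt them to whatever your GitHub/GitLab export actually returns:

```typescript
// Keywords that suggest a review comment is about error handling.
// Tune this list for your codebase; false positives are cheaper than
// manually reading every comment.
const ERROR_HANDLING_KEYWORDS = [
  "try-catch", "try/catch", "throw", "result<", "error handling", "apperror",
];

interface ReviewComment {
  author: string;
  body: string;
}

// Count review comments that mention error handling. Run weekly per team
// and track the trend over the experiment.
function countErrorHandlingComments(comments: ReviewComment[]): number {
  return comments.filter((c) =>
    ERROR_HANDLING_KEYWORDS.some((k) => c.body.toLowerCase().includes(k))
  ).length;
}
```

Keyword search is crude, but it is consistent across both teams, which matters more for a comparison than absolute accuracy.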
Do not intervene: during the experiment, do not: update either variant's rules (changing the variable invalidates the experiment), provide extra training to either team (differential training biases the results), or tell teams which variant you expect to win (expectation bias affects behavior). If a team encounters a genuine issue with their variant (a bug, a conflict): document the issue and factor it into the results, but do not switch their variant mid-experiment. AI rule: 'The experiment runs undisturbed for the full duration. Interventions: invalidate the results.'
Week 2 of a 4-week experiment. Team B (Variant B) reports: 'The Result pattern is confusing for our Express middleware.' You: provide extra training to Team B. The result: Team B's metrics improve, but because of the training, not because of the rule variant. The experiment: invalidated. If a team has issues: document them. Factor them into the analysis. But do not provide differential support. Both teams: same support level throughout.
Step 3: Analyze Results and Decide
Compare metrics: after the experiment ends, compare Team A's metrics against Team B's. For each metric: calculate the difference and whether it exceeds the success threshold. Example: Variant B (Result pattern): review time 2.1 hours (vs Variant A: 2.8 hours, a 25% improvement). Defect rate: 0.8 per sprint (vs 1.2, a 33% improvement). Developer satisfaction: 4.3/5 (vs 3.9/5, a 10% improvement). All metrics: favor Variant B. Decision: adopt Variant B for the entire organization.
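The arithmetic behind the comparison is straightforward but easy to get backwards for metrics where lower is better. A small sketch using the example numbers above (the 10% threshold is the one defined during experiment design):

```typescript
interface Metric {
  name: string;
  control: number;    // Variant A (current rule)
  treatment: number;  // Variant B (proposed change)
  lowerIsBetter: boolean;
}

// Percent improvement of the treatment over the control, rounded to one
// decimal place. Positive means Variant B did better.
function improvement(m: Metric): number {
  const delta = m.lowerIsBetter
    ? (m.control - m.treatment) / m.control
    : (m.treatment - m.control) / m.control;
  return Math.round(delta * 1000) / 10;
}

const THRESHOLD = 10; // percent; fixed before the experiment started

const metrics: Metric[] = [
  { name: "review time (h)", control: 2.8, treatment: 2.1, lowerIsBetter: true },
  { name: "defects/sprint", control: 1.2, treatment: 0.8, lowerIsBetter: true },
  { name: "satisfaction /5", control: 3.9, treatment: 4.3, lowerIsBetter: false },
];

for (const m of metrics) {
  const imp = improvement(m);
  console.log(`${m.name}: ${imp}% ${imp >= THRESHOLD ? "exceeds" : "below"} threshold`);
}
```

Note that the satisfaction improvement (3.9 to 4.3) lands at roughly 10%, right at the threshold; a result this close is exactly why the threshold must be fixed before the data comes in.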
When results are mixed: Variant B: better review time but worse developer satisfaction. The primary metric: determines the decision (if review time was the primary metric: adopt Variant B). The secondary metrics: inform the adoption approach (lower satisfaction means: provide training and rationale when deploying Variant B to all teams). If results are inconclusive (less than 10% difference on the primary metric): the rule change is not impactful enough to justify the migration. Keep the current rule.
Document and deploy: write up the experiment results: the variants tested, the metrics collected, the outcome, and the reasoning for the decision. Share with the team. Deploy the winning variant to all repos. The experiment documentation: becomes part of the rule's rationale ('This error handling rule was validated through a 4-week A/B test. Variant B outperformed Variant A by 25% on review time and 33% on defect rate.'). AI rule: 'The experiment: produces evidence. The documentation: preserves the evidence. Future debates about the rule: reference the experiment results instead of re-debating from scratch.'
The rationale for the Result pattern rule: 'Validated by a 4-week A/B test (March 2026). The Result pattern outperformed try-catch by 25% on review time and 33% on defect rate across 2 comparable teams. Experiment documentation: docs/experiments/error-handling-ab-test.md.' This rationale: is stronger than any opinion or theoretical argument. When someone later questions the rule: the experiment results answer the question. Evidence-based rules: rarely contested.
A/B Testing Summary
Summary of A/B testing AI rules.
- When: two reasonable approaches, controversial changes, significant expected impact. Not for trivial or security rules
- Design: one variable changed. Similar teams. 2-4 week duration. Success threshold defined upfront (e.g., 10%)
- Variants: control (current rule) vs treatment (proposed change). Everything else identical
- Execution: transparent to teams. Blind to hypothesis. No interventions during the experiment
- Metrics: primary (determines outcome) + secondary (informs adoption approach). Automate collection
- Analysis: compare metrics against success threshold. Mixed results: primary metric decides
- Inconclusive: <10% difference = not worth the migration cost. Keep current rule
- Documentation: experiment results become rule rationale. Prevents re-debating decided questions