Which Rules Actually Matter?
Your rule file has 35 rules. Aggregate metrics show: review time decreased 30%, defect rate dropped 25%. But: which of the 35 rules are responsible? Are all 35 contributing equally? Or are 5 rules doing most of the work while 30 are dead weight? Per-rule impact tracking answers these questions. It attributes quality improvements to specific rules, identifies which rules are high-impact (protect them), which are neutral (keep or simplify), and which are negative (remove — they add complexity without improving outcomes).
The attribution challenge: quality improvements come from multiple rules acting together. It is hard to isolate one rule's contribution from the collective effect. But: practical attribution does not need to be perfect. It needs to be directional — which rules are most important and which are least? The techniques: before-after measurement (measure metrics before and after adding a specific rule), correlation analysis (which rules correlate with quality improvements?), and developer feedback (which rules do developers say are most helpful?).
When to track: during the quarterly review (assess each major rule's contribution), after adding a significant new rule (measure its specific impact), and when considering removing a rule (is the rule currently contributing?). Not needed for: minor rules (the effort of tracking exceeds the value of the insight) or universally accepted rules (parameterized queries — everyone agrees this prevents SQL injection, no measurement needed).
Step 1: Before-After Measurement for Individual Rules
When a new rule is added: measure the relevant metric before and after. Adding an error handling rule: measure error-handling-related review comments before (baseline: 3.2 per PR) and after (treatment: 0.8 per PR). The delta: 2.4 fewer comments per PR, attributed to the error handling rule. This is the most direct attribution method: one rule changes, one metric changes, the delta is the rule's impact.
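The before-after delta can be sketched in a few lines. The comment counts below are illustrative data matching this section's example figures, and the helper name is hypothetical:

```python
def rule_impact_delta(before_counts, after_counts):
    """Average review comments per PR before vs. after a rule was added.

    The delta is attributed to the rule, assuming no other variable changed.
    """
    baseline = sum(before_counts) / len(before_counts)
    treatment = sum(after_counts) / len(after_counts)
    return round(baseline - treatment, 2)

# Error-handling comments counted across the last 10 PRs (example data).
before = [4, 3, 3, 4, 2, 3, 4, 3, 3, 3]   # mean: 3.2 per PR
after  = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]   # mean: 0.8 per PR
print(rule_impact_delta(before, after))    # → 2.4
```

The counting itself is the manual part (tally rule-related comments in recent PRs); the arithmetic is trivial once the counts exist.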
For existing rules: use the removal test. Temporarily remove a rule and measure the impact. If the AI-generated code quality drops: the rule was contributing. If nothing changes: the rule was not affecting AI output (it may be too vague for the AI to follow, or it may be redundant with other rules). The removal test: the most definitive per-rule measurement. But: it requires temporarily degrading AI output, which may not be acceptable for production teams. Use it in non-critical environments or during dedicated assessment periods.
Proxy measurement: if before-after is not feasible, use proxy metrics. For each rule: how many times was it the reason for a review comment before it existed? (Estimated from historical PR data.) How many times is it overridden now? (Low override = developers accept it, suggesting value.) How often does the AI follow it? (High compliance = the AI is using it, suggesting it affects output.) These proxies: less rigorous than before-after but practical for all rules. AI rule: 'Before-after: most rigorous. Removal test: most definitive. Proxy metrics: most practical. Use the method that fits the rule's importance and the team's capacity.'
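The proxy metrics reduce to two ratios. A minimal sketch, assuming a team can pull these counts from its PR tooling; the field names (`overrides`, `suggestions`, `follows`, `prompts_tested`) are hypothetical:

```python
def proxy_metrics(rule_stats):
    """Override rate and AI compliance rate for one rule (proxy measures)."""
    override_rate = rule_stats["overrides"] / rule_stats["suggestions"]
    compliance = rule_stats["follows"] / rule_stats["prompts_tested"]
    return {
        "override_rate": round(override_rate, 2),  # low = developers accept it
        "compliance": round(compliance, 2),        # high = the AI is using it
    }

# Example: overridden 2 of 40 times, followed in 18 of 20 test prompts.
stats = {"overrides": 2, "suggestions": 40, "follows": 18, "prompts_tested": 20}
print(proxy_metrics(stats))  # → {'override_rate': 0.05, 'compliance': 0.9}
```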
Adding the error handling rule: measure error-handling review comments before (3.2/PR) and after (0.8/PR). The delta: 2.4 comments/PR, attributed directly to the rule. No other variable changed. The attribution: clean and defensible. Before-after: the gold standard for individual rule impact. Use it every time a significant rule is added. The baseline measurement: takes 5 minutes (count comments in the last 10 PRs). The after measurement: the same count 2 weeks later.
Step 2: The Simple Attribution Model
For each rule: assign scores across 3 dimensions. Reach (1-5): how much code does this rule affect? A naming convention: affects every file (reach: 5). A database migration rule: affects only migration files (reach: 2). Compliance (1-5): does the AI follow this rule? Test with a prompt. High compliance: 5. Low: 1. Perceived value (1-5): do developers find this rule helpful? From survey data or informal feedback. The composite score: Reach × Compliance × Perceived Value. Maximum: 125 (5×5×5). Minimum: 1 (1×1×1).
The composite score: ranks rules by their overall contribution. A security rule with: reach 5 (affects all code) × compliance 5 (AI always follows) × perceived value 5 (developers agree it prevents vulnerabilities) = 125. An import ordering rule with: reach 5 × compliance 4 × perceived value 2 (nobody cares about import order — Prettier handles it) = 40. The security rule: 3x more impactful. The score: makes the comparison concrete and quantitative.
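The composite score and ranking can be sketched directly, using the two example rules above (rule names here are illustrative labels):

```python
def composite(reach, compliance, value):
    """Reach x Compliance x Perceived Value, each scored 1-5 (max 125)."""
    for score in (reach, compliance, value):
        assert 1 <= score <= 5, "each dimension is scored 1-5"
    return reach * compliance * value

# The two example rules from this section.
rules = {
    "security-rule": composite(5, 5, 5),    # 125
    "import-ordering": composite(5, 4, 2),  # 40
}
for name, score in sorted(rules.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score}/125")
```

Sorting by composite score gives the ranked list that the protect / improve / remove decisions are made against.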
Acting on scores: top 10 rules (highest composite): protect — never weaken without strong justification. These rules: deliver the most value. Bottom 10 rules (lowest composite): evaluate for removal or improvement. Low reach: is the rule too narrow? Low compliance: is the rule too vague? Low perceived value: does the rule address a real problem? Each dimension's score: suggests a specific improvement action. AI rule: 'The composite score: a simple model that provides actionable ranking. Not perfect attribution — but directionally correct and practically useful.'
Tracking impact for all 35 rules: 35 × before-after measurements = weeks of work. Not worth it. Track impact for: the top 5 most important rules (are they working?), newly added rules (did the addition help?), and rules being considered for removal (are they contributing?). The remaining rules: covered by the simple composite score (Reach × Compliance × Value), which takes 30 seconds per rule to estimate. Full attribution: for high-stakes decisions. Quick scoring: for everything else.
Step 3: Tracking Impact Over Time
Quarterly impact snapshots: score each rule quarterly. Track the composite score over time. Rules that consistently score high: the backbone of your rule set. Rules with declining scores: may be losing relevance (the codebase evolved and the rule is now less impactful) or compliance (the AI's behavior changed and the rule is less effective). Rules with improving scores: becoming more impactful (the team adopted the pattern more broadly, increasing reach).
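Trend detection over quarterly snapshots is a comparison of the latest score against the earliest. A sketch with illustrative data and an arbitrary threshold of 10 points (tune to taste):

```python
# Quarterly composite-score snapshots per rule (illustrative data).
history = {
    "error-handling":  [110, 115, 125, 125],
    "import-ordering": [60, 50, 45, 40],
}

def trend(scores):
    """Classify a rule by its score movement across quarters."""
    delta = scores[-1] - scores[0]
    if delta > 10:
        return "improving"   # gaining reach or compliance
    if delta < -10:
        return "declining"   # losing relevance or compliance
    return "stable"

for rule, scores in history.items():
    print(rule, trend(scores))  # error-handling improving; import-ordering declining
```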
Impact-based prioritization: when the platform team has limited time for rule maintenance, prioritize by impact. High-impact rules with declining compliance: fix first (these rules are important but the AI is not following them — refine the wording). High-reach rules with low perceived value: investigate (are they really unimportant, or do developers not realize their value?). Low-impact rules with high maintenance cost: remove (they are not worth the effort to maintain).
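The prioritization above can be encoded as a rough triage function. The thresholds here are illustrative assumptions, not prescribed values:

```python
def maintenance_action(reach, compliance, value):
    """Map a rule's dimension scores (1-5 each) to a maintenance priority."""
    if value >= 4 and compliance <= 2:
        return "fix first"    # important, but the AI ignores it: refine wording
    if reach >= 4 and value <= 2:
        return "investigate"  # broad reach, but developers see little value
    if reach * compliance * value <= 20:
        return "remove"       # low overall impact: not worth maintaining
    return "keep"

print(maintenance_action(5, 2, 5))  # → fix first
print(maintenance_action(5, 4, 2))  # → investigate
print(maintenance_action(2, 2, 1))  # → remove
```

The order of the checks matters: a high-value rule with low compliance should surface as "fix first" even if its current composite score is low.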
Communicating impact: in the quarterly review, present the top 5 highest-impact rules and the bottom 5. The top 5: 'These rules prevent the most issues and save the most review time. Here are their scores.' The bottom 5: 'These rules have the lowest impact. Recommendation: remove #32 and #35 (low reach, low value). Improve #28 (high reach but low compliance — the rule needs better wording).' The impact data: makes rule decisions evidence-based, not opinion-based. AI rule: 'Impact tracking: transforms rule maintenance from subjective debate to evidence-based decisions. The scores: settle arguments that otherwise go in circles.'
The team debates: 'Should we keep the import ordering rule?' Pro: 'It makes the code consistent.' Con: 'Nobody cares about import order.' The debate: circular, based on opinions. With impact scores: 'The import ordering rule: reach 5 (all files), compliance 4 (AI follows it), perceived value 2 (developers are neutral). Composite: 40/125. Compare to the error handling rule at 125/125. The import ordering rule: contributes 3x less than the error handling rule. If we are decluttering: this is a candidate for removal.' The score: resolves the debate with data.
Rule Impact Tracking Summary
Summary of tracking individual AI rule impact.
- Purpose: identify which rules contribute most. Protect high-impact. Remove low-impact. Improve mid-range
- Before-after: measure metric before and after adding a rule. The delta = the rule's contribution
- Removal test: temporarily remove a rule. Quality drops = the rule was contributing. No change = dead weight
- Proxy metrics: reach (how much code affected), compliance (AI follows it), perceived value (developer opinion)
- Composite score: Reach × Compliance × Value. Max 125. Ranks rules by overall contribution
- Top 10: protect — the backbone of the rule set. Bottom 10: evaluate for removal or improvement
- Quarterly snapshots: track scores over time. Declining = losing relevance. Improving = gaining traction
- Communication: present top 5 and bottom 5 at quarterly review. Evidence-based rule decisions