Not All Rules Are Equally Effective
Your rule file has 30 rules. Some: prevent bugs daily (high impact). Others: enforce a preference nobody cares about (low impact). Some: the AI follows perfectly (high compliance). Others: the AI ignores or misinterprets (low compliance). Some: developers appreciate (high satisfaction). Others: developers constantly override (low satisfaction). Per-rule scoring: identifies which rules deliver value and which consume space without contributing. The 80/20 principle: 20% of your rules likely produce 80% of the quality improvement.
The scoring purpose: protect high-impact rules (never remove or weaken them without strong justification), improve low-compliance rules (refine wording, add examples, resolve conflicts), investigate low-satisfaction rules (are they too rigid? Irrelevant? Misunderstood?), and remove low-impact rules (declutter the file, free context for more impactful rules). The scorecard: turns rule maintenance from guesswork into data-driven decisions.
When to score: during the quarterly rule review. The scoring: takes 30-45 minutes for a 30-rule file. The output: a ranked list of rules by effectiveness. The action: protect the top 5, improve the bottom 5, investigate anything with mismatched scores (high compliance but low satisfaction = the AI follows it but developers dislike it).
Step 1: The Four Scoring Dimensions
Dimension 1 — AI compliance (1-5): does the AI follow this rule consistently? 5: the AI follows it every time, no overrides needed. 3: the AI follows it sometimes — the output is inconsistent. 1: the AI ignores it — the output shows no evidence of the rule. Measurement: run 3 test prompts that should trigger the rule. Score based on how consistently the AI generates compliant code. AI rule: 'AI compliance is the foundation. A rule the AI does not follow: is a rule in name only.'
Dimension 2 — Override rate (1-5, inverted): how often do developers override this rule? 5: never overridden (0-5% override rate). 3: sometimes overridden (10-20%). 1: frequently overridden (30%+). Measurement: from override tracking data, or from the quarterly developer survey ('Which rules do you override most?'). High override rate: the rule does not fit real-world usage. AI rule: 'Override rate: the developer's vote on the rule. High overrides = the developers disagree with the rule, regardless of its theoretical merit.'
Dimension 3 — Developer satisfaction (1-5): do developers find this rule helpful? 5: developers cite it as one of the most helpful rules. 3: developers have no opinion (neutral). 1: developers cite it as frustrating or counterproductive. Measurement: quarterly survey ('Rate each rule category: very helpful, helpful, neutral, unhelpful, very unhelpful'). AI rule: 'Satisfaction: the leading indicator. Low satisfaction predicts: increasing overrides, decreasing adoption, and eventual rule abandonment.'
Dimension 4 — Impact (1-5): how much does this rule improve code quality? 5: prevents a class of bugs (security rules, data integrity rules). 3: improves consistency (naming conventions, import ordering). 1: enforces a preference with no measurable quality impact. Measurement: estimated by the tech lead based on: the severity of issues the rule prevents, the frequency of the pattern in the codebase, and the before/after quality difference. AI rule: 'Impact: the hardest dimension to measure but the most important. A high-impact rule that is hard to follow: worth the effort to improve. A low-impact rule that is easy to follow: may still not earn its place.'
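The four dimensions can be captured in a small record. A minimal sketch, assuming a Python workflow; the class name, field names, and validation are illustrative, not part of any prescribed tooling:

```python
from dataclasses import dataclass


@dataclass
class RuleScore:
    """One rule's scores; every dimension uses the 1-5 scale from the text."""
    name: str
    ai_compliance: int   # Dimension 1: 5 = followed every time, 1 = ignored
    override_rate: int   # Dimension 2 (inverted): 5 = never overridden
    satisfaction: int    # Dimension 3: 5 = cited as most helpful, 3 = neutral
    impact: int          # Dimension 4: 5 = prevents a bug class, 1 = preference

    def __post_init__(self):
        # Reject out-of-range scores early so the scorecard stays trustworthy.
        for dim in (self.ai_compliance, self.override_rate,
                    self.satisfaction, self.impact):
            if not 1 <= dim <= 5:
                raise ValueError(f"{self.name}: scores must be 1-5, got {dim}")
```

Keeping override rate pre-inverted in the record means every dimension reads the same way: higher is better.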
To score AI compliance for the error handling rule: run 3 prompts that should trigger it. Prompt 1: 'Create a service function that fetches data.' Prompt 2: 'Handle the case where the database query fails.' Prompt 3: 'Create an API endpoint with input validation.' All 3 generate the correct error handling pattern: compliance = 5. Two of 3 correct: compliance = 3. None correct: compliance = 1. Three prompts: sufficient to score one rule. 30 rules × 3 prompts = 90 prompts. But: batch by rule category to cut the count (one prompt can exercise 5 related rules).
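The three-prompt check reduces to a tiny mapping. A sketch: the 3/3 → 5, 2/3 → 3, and 0/3 → 1 anchors come from the text; mapping 1/3 → 2 is my assumption, since the text does not specify it:

```python
def compliance_score(results: list[bool]) -> int:
    """Map pass/fail results of the 3 test prompts to a 1-5 compliance score.

    3/3 -> 5, 2/3 -> 3, 0/3 -> 1 per the scoring anchors;
    1/3 -> 2 is an assumed intermediate value.
    """
    passed = sum(results)
    return {3: 5, 2: 3, 1: 2, 0: 1}[passed]


# Error handling rule: prompts 1 and 3 produced compliant code, prompt 2 did not.
print(compliance_score([True, False, True]))  # 3
```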
Step 2: The Rule Effectiveness Scorecard
For each rule: score all 4 dimensions (1-5). Calculate the aggregate: average of the 4 scores. Rank all rules by aggregate score. The result: a ranked list showing your most effective rules (high aggregate) to least effective (low aggregate). Present as a table: Rule | AI Compliance | Override Rate | Satisfaction | Impact | Aggregate.
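The aggregate-and-rank step is a few lines over the scorecard. A sketch; the rule names and scores below are invented for illustration:

```python
# (rule, ai_compliance, override_rate, satisfaction, impact) — scores are invented
scorecard = [
    ("Parameterize all SQL queries",        5, 5, 5, 5),
    ("Wrap service calls in error handler", 4, 4, 4, 5),
    ("Sort imports alphabetically",         5, 5, 3, 1),
    ("Prefix booleans with is/has",         2, 3, 2, 2),
]

# Aggregate = average of the 4 dimensions; rank highest first.
ranked = sorted(
    ((name, sum(dims) / 4) for name, *dims in scorecard),
    key=lambda row: row[1],
    reverse=True,
)
for name, aggregate in ranked:
    print(f"{aggregate:.2f}  {name}")
```

Run against real data, this produces the ranked table directly: the top rows are the rules to protect, the bottom rows the rules to fix or remove.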
Interpreting the scorecard: top quartile (aggregate 4.0+): your best rules. Protect them. These: generate the most value. Bottom quartile (aggregate below 2.5): your weakest rules. Investigate each one: is it fixable (improve wording, add examples) or should it be removed (low impact, high friction)? Middle range: functioning rules that could be improved. Address after the top and bottom are handled.
Mismatched scores reveal opportunities: high compliance + low satisfaction: the AI follows the rule, but developers dislike it (the rule may be correct but the rationale is not communicated — add the why). High impact + low compliance: an important rule the AI does not follow (refine the rule's specificity or add examples). Low impact + high compliance: the AI follows it perfectly, but it does not matter (candidate for removal to declutter the file). AI rule: 'Mismatched scores: the most actionable findings. They reveal: fixable problems with specific solutions.'
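The three mismatch patterns can be detected mechanically. A sketch: the high/low thresholds (4 and 2) are my assumptions, since the text describes the patterns qualitatively:

```python
def mismatches(name: str, compliance: int, override: int,
               satisfaction: int, impact: int,
               high: int = 4, low: int = 2) -> list[str]:
    """Flag the mismatched-score patterns; thresholds are assumed, not prescribed."""
    flags = []
    if compliance >= high and satisfaction <= low:
        flags.append("AI follows it, developers dislike it: add the why")
    if impact >= high and compliance <= low:
        flags.append("important but not followed: refine wording, add examples")
    if impact <= low and compliance >= high:
        flags.append("followed but irrelevant: candidate for removal")
    return flags


# The import-sorting rule from the example: perfect compliance, minimal impact.
print(mismatches("Sort imports alphabetically", 5, 5, 3, 1))
# ['followed but irrelevant: candidate for removal']
```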
Rule: 'Sort imports alphabetically within each group.' AI compliance: 5 (the AI sorts perfectly). Override rate: 5 (nobody overrides). Satisfaction: 3 (neutral — nobody cares). Impact: 1 (alphabetical vs non-alphabetical imports: zero quality difference). Aggregate: 3.5 (seems OK). But: this rule consumes AI context for zero quality impact. Prettier or ESLint can handle import sorting. The AI's context: better used for error handling, security, or architecture rules. Remove the low-impact rule. Free the context.
Step 3: From Scores to Actions
For each rule in the bottom quartile: determine the action. Low compliance (AI does not follow): improve specificity, add code examples, resolve conflicts with other rules, or test with the AI to find the wording that works. Low satisfaction (developers override): add an exception clause, relax the scope, improve the rationale, or remove if the rule is not justified. Low impact (does not matter): remove the rule. Free context for more impactful rules. The scorecard: turns 'our rules need improvement' into specific, per-rule action items.
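The bottom-quartile triage can be sketched as a decision function. Assumptions: the priority order (low impact checked first, so removal wins over improvement) is mine; the text lists the actions without ordering them:

```python
def bottom_quartile_action(compliance: int, satisfaction: int, impact: int) -> str:
    """Suggest an action for a bottom-quartile rule (scores 1-5, higher is better)."""
    if impact <= 2:
        # A rule that does not matter is removed regardless of how well it scores elsewhere.
        return "remove: free context for more impactful rules"
    if compliance <= 2:
        return "improve: add specificity and examples, resolve conflicts"
    if satisfaction <= 2:
        return "relax: add an exception clause or improve the rationale"
    return "monitor: no single dimension is failing"
```

Usage: `bottom_quartile_action(5, 3, 1)` returns the removal action, matching the import-sorting example above.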
Track improvement: after making changes, re-score the affected rules next quarter. The scores: should improve. If a rule's compliance was 2 (AI barely follows it) and you added examples: the next quarter's score should be 3-4 (AI follows it most of the time). If the score does not improve: the fix was not effective — try a different approach. AI rule: 'Scoring is not a one-time exercise. It is a quarterly feedback loop: score → act → re-score → verify improvement.'
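The quarterly re-score comparison is a simple diff over per-rule aggregates. A sketch; the rule name and aggregate values are invented:

```python
def rescore_report(previous: dict, current: dict) -> list[str]:
    """Compare per-rule aggregates across quarters; flag fixes that did not land."""
    report = []
    for rule, before in previous.items():
        after = current.get(rule)
        if after is None:
            report.append(f"{rule}: removed since last quarter")
        elif after > before:
            report.append(f"{rule}: improved {before} -> {after}")
        else:
            report.append(f"{rule}: no improvement ({before} -> {after}); try a different fix")
    return report


# Example: the error handling rule got new examples last quarter.
print(rescore_report(
    {"Wrap service calls in error handler": 2.0},
    {"Wrap service calls in error handler": 3.5},
))
```

Any "no improvement" line is the signal from the text: the fix was not effective, so try a different approach next quarter.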
Share the scorecard: present the top 5 and bottom 5 rules in the quarterly review. The top 5: celebrated (these rules are working). The bottom 5: discussed (what is the fix? Should we remove or improve?). The team: engaged in rule quality because the scorecard makes it tangible. 'Rule #7 has an aggregate of 4.8 — it prevents SQL injection every day' vs 'Rule #22 has an aggregate of 1.8 — nobody follows it and it adds no value.' AI rule: 'The scorecard makes rule quality visible. Visible quality: drives improvement. Invisible quality: stagnates.'
High impact (5) + low compliance (2): the most important rule is not being followed. Action: urgent — improve the rule's wording and add examples. High compliance (5) + low satisfaction (2): the AI follows the rule perfectly but developers hate it. Action: investigate — is the rule correct but poorly explained? Or is it genuinely wrong? Low impact (1) + high compliance (5): the AI follows a rule that does not matter. Action: remove to free context. Each mismatch: a specific problem with a specific solution.
Rule Scoring Summary
Summary of scoring AI rule effectiveness.
- 4 dimensions: AI compliance, override rate (inverted), developer satisfaction, impact. Each 1-5
- Aggregate: average of 4 scores. Rank all rules. Top quartile: protect. Bottom quartile: fix or remove
- Mismatched scores: the most actionable insights. High impact + low compliance = priority fix
- Bottom quartile actions: improve specificity (low compliance), add exceptions (low satisfaction), remove (low impact)
- Quarterly cadence: score → act → re-score. Track improvement over time
- Time investment: 30-45 minutes for a 30-rule file. One of the highest-ROI maintenance activities
- Share: top 5 and bottom 5 in the quarterly review. Makes rule quality visible and drives improvement
- 80/20: 20% of rules produce 80% of quality improvement. The scorecard identifies which 20%