The 4-Dimension Rules Audit
A rules audit: a structured assessment of your AI rule file's quality across 4 dimensions. Completeness: do the rules cover all the conventions the team follows? Effectiveness: do the rules produce the expected AI output? Freshness: are the rules current (no stale references to deprecated technologies)? Alignment: do the rules align with the organization's current technical direction? Each dimension: scored 1-5. The aggregate score: indicates overall rule health. The per-dimension scores: indicate where to focus improvement.
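As a sketch, the four dimensions and the aggregate could be held in a small record like this (the `RulesAudit` structure is a hypothetical illustration, not prescribed tooling):

```python
from dataclasses import dataclass

@dataclass
class RulesAudit:
    """One audit cycle: each of the 4 dimensions scored 1-5."""
    completeness: int   # do the rules cover the team's conventions?
    effectiveness: int  # does the AI actually follow the rules?
    freshness: int      # are references (versions, paths) current?
    alignment: int      # do the rules match the org's direction?

    def aggregate(self) -> float:
        """Overall rule health: the average of the 4 dimension scores."""
        return (self.completeness + self.effectiveness
                + self.freshness + self.alignment) / 4
```

For example, scores of 4/3/5/4 aggregate to 4.0, and the per-dimension view points at effectiveness as the place to focus.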
When to audit: the first audit at 3 months after initial deployment (the rules have been used enough to reveal issues). Quarterly thereafter (aligned with the quarterly rule review). After a major technology change (framework migration, new library adoption). After a team restructure (conventions may have shifted). The audit: 2 hours of focused work. The output: a scored assessment and a prioritized improvement plan.
Who conducts the audit: the tech lead or a senior engineer who understands both the rules and the codebase. For organization-level audits: the platform team or a staff engineer who works across teams. The auditor: reads the rules, tests them with prompts, compares them against the actual codebase, and scores each dimension. AI rule: 'The auditor should have used the rules daily for at least 1 month. They need practical experience with how the rules affect AI output, not just theoretical knowledge of what the rules say.'
Step 1: Completeness and Effectiveness (1 Hour)
Completeness assessment: read the last 20 code review comments on the team's PRs. For each convention-related comment: check if a corresponding rule exists. Count: comments with a matching rule (the rule exists but was not followed — effectiveness issue) and comments without a matching rule (the rule is missing — completeness gap). Score: 5 (every convention has a rule), 4 (1-2 missing), 3 (3-5 missing), 2 (6-10 missing), 1 (most conventions are unwritten). AI rule: 'Completeness is measured against actual practice, not against an ideal. The team follows 30 conventions. The rules cover 28. Score: 4 (2 missing). The 2 missing: added to the improvement plan.'
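The missing-rule count maps to the 1-5 score mechanically; a hypothetical helper, mirroring the bands above:

```python
def completeness_score(missing_rules: int, mostly_unwritten: bool = False) -> int:
    """Map the number of missing rules (found by checking the last 20
    review comments against the rule file) to the 1-5 completeness score."""
    if mostly_unwritten:        # most conventions have no written rule at all
        return 1
    if missing_rules == 0:
        return 5
    if missing_rules <= 2:
        return 4
    if missing_rules <= 5:
        return 3
    if missing_rules <= 10:
        return 2
    return 1

# The example from the text: 30 conventions, 28 covered -> 2 missing -> score 4.
print(completeness_score(2))
```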
Effectiveness assessment: run 5 benchmark prompts (from the benchmarking tutorial) against the current rules. Score each output on convention compliance (1-5). Average: the effectiveness score. If the average is 4.5+: the rules are highly effective (the AI follows them consistently). If 3.0-4.4: moderately effective (some rules need refinement). If below 3.0: significant issues (many rules are too vague, conflicting, or the AI is not following them). AI rule: 'Effectiveness is the most important dimension. A rule file can be complete and fresh — but if the AI does not follow the rules, the file is ineffective. Test with actual prompts, not by reading the rules.'
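A minimal sketch of the averaging and banding described above (assuming five 1-5 compliance scores, one per benchmark prompt):

```python
def effectiveness_band(prompt_scores: list[int]) -> tuple[float, str]:
    """Average the benchmark-prompt compliance scores (1-5 each)
    and map the average to an effectiveness band."""
    avg = sum(prompt_scores) / len(prompt_scores)
    if avg >= 4.5:
        band = "highly effective"
    elif avg >= 3.0:
        band = "moderately effective"
    else:
        band = "significant issues"
    return avg, band

print(effectiveness_band([5, 4, 4, 3, 4]))  # (4.0, 'moderately effective')
```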
Common findings: completeness gaps cluster around: recently adopted patterns (the team switched to a new testing approach but the rules still describe the old one), domain-specific conventions (project-specific terminology and patterns that are assumed but not documented), and cross-cutting concerns (logging, monitoring, deployment patterns that are not in the rules). Effectiveness gaps cluster around: vague rules (the AI interprets them differently), conflicting rules (the AI follows one and ignores the other), and rules without examples (the AI does not have a concrete pattern to follow).
The last 20 code review comments: are a pre-built checklist of what the rules should cover. 'Please use our Result pattern' (3 times) → the error handling rule exists but is not effective (effectiveness issue). 'We use pnpm, not npm' (2 times) → no rule about the package manager (completeness gap). 'Tests should use describe/it naming' (4 times) → the testing rule is missing or vague. Review comments: the most accurate source of what the rules should address.
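The covered-versus-missing split could be roughed out like this (keyword matching is a crude stand-in for the human judgment the audit actually requires; `classify_comments` is a hypothetical helper):

```python
def classify_comments(comments: list[str],
                      rule_topics: list[str]) -> tuple[list[str], list[str]]:
    """Split review comments into those covered by an existing rule topic
    (effectiveness issues) and those with no matching rule (completeness gaps)."""
    covered, missing = [], []
    for comment in comments:
        matched = any(topic.lower() in comment.lower() for topic in rule_topics)
        (covered if matched else missing).append(comment)
    return covered, missing

covered, missing = classify_comments(
    ["Please use our Result pattern", "We use pnpm, not npm"],
    ["Result pattern", "error handling"],
)
print(len(covered), len(missing))  # 1 1
```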
Step 2: Freshness and Alignment (30 Minutes)
Freshness assessment: read each rule and check for stale references. Stale indicators: library versions that do not match the project's current versions (the rule says TypeScript 5.3, the project uses 5.5), references to deprecated patterns or libraries (the rule mentions a library that was replaced), file path references that no longer exist (the rule says 'follow the pattern in src/utils/format.ts' but the file was renamed), and framework patterns from a previous version (Pages Router patterns in an App Router project). Score: 5 (all references current), 4 (1-2 stale), 3 (3-5 stale), 2 (6-10 stale), 1 (most rules reference outdated content).
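One of the stale indicators, file-path references that no longer exist, is mechanical enough to automate; a rough sketch (the regex is a heuristic and will miss some reference styles; version and library checks need their own project-specific probes):

```python
import re
from pathlib import Path

def stale_path_references(rules_text: str, repo_root: str = ".") -> list[str]:
    """Find file paths mentioned in the rules (e.g. 'follow the pattern in
    src/utils/format.ts') that no longer exist under repo_root."""
    # Rough heuristic: slash-separated paths ending in a file extension.
    candidates = re.findall(r"\b[\w./-]+/[\w.-]+\.\w+\b", rules_text)
    return [p for p in candidates if not (Path(repo_root) / p).exists()]
```

Each hit is either a rename (update the reference) or a removed pattern (rewrite or drop the rule).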
Alignment assessment: compare the rules against the organization's current technical direction. Alignment questions: do the rules support the target architecture (if the org is moving to microservices, do the rules reflect microservice patterns)? Do the rules encode the current technology strategy (approved languages, frameworks, libraries)? Do the rules reflect the current compliance requirements (if SOC 2 was recently achieved, are compliance rules encoded)? Score: 5 (fully aligned), 4 (mostly aligned with minor gaps), 3 (partially aligned — some rules reflect old direction), 2 (significantly misaligned), 1 (the rules describe a different technology strategy).
AI rule: 'Freshness and alignment are the fastest dimensions to assess (30 minutes total). They are also the easiest to fix: stale references are updated in minutes. Alignment gaps: require adding rules for the new direction and deprecating rules for the old direction.'

A rule file can be complete (covers everything), fresh (all references current), and aligned (matches org direction) and still fail: if the AI does not follow the rules, the file is ineffective. The AI generates generic code regardless of the comprehensive rules. Effectiveness: the dimension that determines whether the rules actually work. Test with prompts, not by reading the rules. A beautiful rule file that the AI ignores: a documentation project, not a coding standard.
Step 3: Scoring and Improvement Plan (30 Minutes)
Aggregate scoring: average the 4 dimension scores. Overall score interpretation: 4.5-5.0 (excellent — maintain with quarterly audits). 3.5-4.4 (good — address the weakest dimension). 2.5-3.4 (needs improvement — prioritize effectiveness and completeness). Below 2.5 (significant issues — consider rewriting the rules from scratch using a template). The aggregate: the headline metric for the quarterly review. The per-dimension scores: where to focus improvement effort.
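The averaging and interpretation bands can be sketched as a small function (hypothetical helper, mirroring the thresholds above):

```python
def interpret_aggregate(scores: dict[str, int]) -> str:
    """Average the 4 dimension scores and map the result to the
    overall interpretation bands."""
    avg = sum(scores.values()) / len(scores)
    if avg >= 4.5:
        return f"{avg:.1f}: excellent - maintain with quarterly audits"
    if avg >= 3.5:
        return f"{avg:.1f}: good - address the weakest dimension"
    if avg >= 2.5:
        return f"{avg:.1f}: needs improvement - prioritize effectiveness and completeness"
    return f"{avg:.1f}: significant issues - consider rewriting from a template"

print(interpret_aggregate(
    {"completeness": 4, "effectiveness": 3, "freshness": 5, "alignment": 4}
))  # 4.0: good - address the weakest dimension
```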
Prioritized improvement plan: based on the audit, create an ordered list of improvements. Priority 1 (this week): fix any stale references that cause the AI to generate wrong code. Priority 2 (this sprint): add the top 3 missing rules from the completeness assessment. Priority 3 (this quarter): refine the effectiveness of existing rules (add examples, resolve conflicts, increase specificity). Priority 4 (next quarter): align rules with the organization's evolving technical direction. The prioritization: addresses the most impactful issues first.
Track improvement: after implementing the improvements, re-run the audit (or at least the affected dimensions). The scores: should improve. If they do not: the improvements were not effective (revise the approach). Over time: the audit scores trend upward, demonstrating that the rules are continuously improving. AI rule: 'The audit score: the metric that demonstrates rule quality improvement over time. Track it quarterly. A rising trend: proves the investment in rules is paying off.'
Q1 audit: 3.2 (needs improvement — many stale references, 5 missing rules). Q2 audit: 3.8 (good — stale references fixed, 3 missing rules added). Q3 audit: 4.2 (good — effectiveness improved with examples added). Q4 audit: 4.5 (excellent — all dimensions strong). The trend: upward. The story: continuous improvement. Present to leadership: the audit score trend proves the rules program is working and improving. A flat or declining trend: signals the rules need more investment.
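A quarterly series like the one above can be classified mechanically (a minimal sketch; `trend` is a hypothetical helper):

```python
def trend(quarterly_scores: list[float]) -> str:
    """Classify a series of quarterly aggregate audit scores."""
    pairs = list(zip(quarterly_scores, quarterly_scores[1:]))
    if all(later > earlier for earlier, later in pairs):
        return "rising"
    if all(later < earlier for earlier, later in pairs):
        return "declining"
    return "flat/mixed"

print(trend([3.2, 3.8, 4.2, 4.5]))  # rising
```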
Rules Audit Summary
Summary of running an AI rules audit.
- 4 dimensions: completeness (coverage), effectiveness (AI follows them), freshness (current references), alignment (matches org direction)
- Scoring: each dimension 1-5. Aggregate: the average of the four. 4.5+ excellent, 3.5-4.4 good, 2.5-3.4 needs improvement, below 2.5 rewrite
- Completeness: compare last 20 review comments against rule file. Missing rules = completeness gaps
- Effectiveness: 5 benchmark prompts. Score AI output on convention compliance. The most important dimension
- Freshness: check every reference (versions, libraries, file paths). Stale = update immediately
- Alignment: compare rules against org technical direction. Gaps = add rules for new direction
- Time: 2 hours total. Completeness+effectiveness (1hr), freshness+alignment (30min), plan (30min)
- Improvement plan: prioritized. Stale fixes first, missing rules second, effectiveness third, alignment fourth