AI Rules for Regex Patterns

Why AI-Generated Regex Needs Rules

Regular expressions are one of the few areas where AI-generated code can be a genuine security vulnerability. A poorly constructed regex can cause catastrophic backtracking — where a single input string causes the regex engine to hang for hours, effectively becoming a denial-of-service attack vector (ReDoS). AI assistants generate these vulnerable patterns routinely because they optimize for matching correctness, not engine performance.

Beyond security, AI-generated regex has a readability problem. The AI produces patterns that are technically correct but incomprehensible: /^(?:(?:[a-zA-Z0-9._%+-]+)@(?:[a-zA-Z0-9.-]+)\.(?:[a-zA-Z]{2,}))$/ is a valid email regex, but no human can review it confidently. If you can't verify the regex is correct by reading it, it's a maintenance liability.

These rules cover three areas: safety (preventing ReDoS), readability (making patterns reviewable), and pragmatism (knowing when not to use regex at all).

Rule 1: Prevent Catastrophic Backtracking (ReDoS)

The rule: 'Never use nested quantifiers: (a+)+ or (a*)*. Never use overlapping alternations where both branches can match the same input: (a|a)+. Avoid unbounded repetition with complex groups: (.+)+ or (\s*\S*)*. Use atomic groups or possessive quantifiers where supported. Test all regex patterns against ReDoS detection tools (recheck, safe-regex) before committing.'

For validation regex: 'Set a maximum input length before regex evaluation. If validating user input, cap the string length first (if (input.length > 1000) return false), then apply the regex. The cap bounds the cost of polynomial backtracking, but it is defense in depth rather than a fix: an exponential pattern such as ^(a+)+$ can stall on a few dozen characters, so the pattern itself still has to be safe.'
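A minimal TypeScript sketch of the length cap described above. The 1000-character limit comes from the rule; the names MAX_INPUT_LENGTH, USERNAME_PATTERN, and isValidUsername are hypothetical, chosen only for illustration.

```typescript
// Hypothetical limit taken from the rule text; tune it per field.
const MAX_INPUT_LENGTH = 1000;

// Illustrative pattern: letters, digits, underscores, and hyphens only.
const USERNAME_PATTERN = /^[A-Za-z0-9_-]+$/;

function isValidUsername(input: string): boolean {
  // Reject oversized input before the regex engine ever sees it.
  if (input.length > MAX_INPUT_LENGTH) {
    return false;
  }
  return USERNAME_PATTERN.test(input);
}
```

The cap limits how much work the engine can be asked to do; it does not make an unsafe pattern safe.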

AI assistants generate ReDoS-vulnerable patterns because their training data contains vast numbers of regex examples that carry no safety annotations. The AI doesn't distinguish between a pattern that works on test data and one that hangs on adversarial input.

No nested quantifiers: (a+)+, (a*)*, (.+)+ are all dangerous
No overlapping alternations: (a|aa)+ can backtrack exponentially, because both branches can consume the same input
Cap input length before regex evaluation — short-circuit on oversized input
Test with safe-regex or recheck in CI — flag vulnerable patterns
Use a linear-time engine where available (RE2, Go's regexp package, Rust's regex crate): matching time is guaranteed linear in the input length
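To make the first bullets concrete, here is a hedged sketch of rewriting a nested quantifier into a flat equivalent. Both patterns are illustrative, not taken from any particular codebase. In the first, the optional \s? lets a run of word characters be divided between iterations in exponentially many ways; in the second, every repetition must start with a whitespace character, so there is only one way to match.

```typescript
// Vulnerable: the outer + repeats a group whose \s? is optional, so a run of
// word characters can be split across iterations in many different ways.
const VULNERABLE_WORDS = /^(\w+\s?)+$/;

// Flat rewrite accepting the same strings: words separated by a single
// whitespace character, with one optional trailing whitespace character.
const SAFE_WORDS = /^\w+(?:\s\w+)*\s?$/;

function isWordList(input: string): boolean {
  return SAFE_WORDS.test(input);
}

// On short inputs the two patterns agree.
const samples = ["one", "one two", "one two ", "one  two", " one", "", "one!"];
const disagreements = samples.filter(
  (s) => VULNERABLE_WORDS.test(s) !== SAFE_WORDS.test(s),
);
console.log(disagreements.length); // 0

// The safe pattern rejects a long adversarial input quickly. Do not try this
// with VULNERABLE_WORDS: a backtracking engine may not finish.
console.log(isWordList("a".repeat(50000) + "!")); // false
```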

⚠️ ReDoS Is Real

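A regex like ^(a+)+$ takes exponential time on input 'aaaaaaaaaaaaaaaaX': the trailing X forces the match to fail, and each additional a roughly doubles the work, so a few dozen characters are enough to stall a backtracking engine. AI generates these patterns routinely. Test every regex with safe-regex.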

Rule 2: Readable Regex Patterns

The rule: 'Use named capture groups for all captures: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}) instead of (\d{4})-(\d{2})-(\d{2}). Use the verbose/extended flag where the engine has one (/x in Perl, Ruby, and PCRE; re.VERBOSE in Python; JavaScript has no equivalent) for complex patterns, adding comments and whitespace for readability. Break complex patterns into composed smaller patterns where the language supports it.'
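A small TypeScript sketch of the named-group form from the rule. The pattern is the one quoted above with anchors added; parseIsoDate and the DateParts shape are hypothetical names used for illustration.

```typescript
const ISO_DATE_NAMED = /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/;

interface DateParts {
  year: number;
  month: number;
  day: number;
}

function parseIsoDate(text: string): DateParts | null {
  const groups = ISO_DATE_NAMED.exec(text)?.groups;
  if (!groups) {
    return null;
  }
  // Each capture is read by name, so reordering the pattern cannot silently
  // swap month and day the way positional indexes can.
  return {
    year: Number(groups["year"]),
    month: Number(groups["month"]),
    day: Number(groups["day"]),
  };
}

console.log(parseIsoDate("2024-03-15")); // { year: 2024, month: 3, day: 15 }
console.log(parseIsoDate("15/03/2024")); // null
```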

For JavaScript/TypeScript: 'Use the /v flag (ES2024 unicodeSets) for modern regex features. Use String.raw for patterns with many backslashes: new RegExp(String.raw`\d{4}-\d{2}-\d{2}`). For complex patterns, build from smaller pieces: const year = /\d{4}/; const date = new RegExp(`${year.source}-...`).'
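One way the composition advice might look in practice. The piece names (YEAR_PART, MONTH_PART, DAY_PART, TIME_PART) and the range checks inside them are assumptions added for illustration; only the String.raw and .source techniques come from the rule.

```typescript
// Small, individually reviewable pieces. String.raw keeps backslashes literal,
// so \d does not have to be written as \\d.
const YEAR_PART = String.raw`\d{4}`;
const MONTH_PART = String.raw`(?:0[1-9]|1[0-2])`;
const DAY_PART = String.raw`(?:0[1-9]|[12]\d|3[01])`;

// Compose the pieces into one anchored pattern with named groups.
const DATE_BODY = `(?<year>${YEAR_PART})-(?<month>${MONTH_PART})-(?<day>${DAY_PART})`;
const COMPOSED_DATE = new RegExp(`^${DATE_BODY}$`);

// The same idea works with regex literals via .source.
const TIME_PART = /(?:[01]\d|2[0-3]):[0-5]\d/;
const COMPOSED_TIMESTAMP = new RegExp(
  `^${DATE_BODY}T(?<time>${TIME_PART.source})$`,
);

console.log(COMPOSED_DATE.test("2024-12-31")); // true
console.log(COMPOSED_DATE.test("2024-13-01")); // false
console.log(COMPOSED_TIMESTAMP.test("2024-12-31T23:59")); // true
```

Each piece can be unit-tested on its own, which is the closest JavaScript gets to a commented, multi-line pattern.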

For Python: 'Use re.VERBOSE (re.X) for multi-line patterns with comments. Use raw strings (r"...") for all patterns. Use re.compile for patterns used multiple times — it's clearer and slightly faster.'

Rule 3: Use Validated Patterns for Common Tasks

The rule: 'Don't write regex from scratch for well-known formats. Use validated, tested patterns from a shared library or constant file. Common formats that need validated patterns: email (use a simplified RFC 5322 subset), URLs (use the URL constructor for validation instead), IPv4/IPv6 addresses, dates, phone numbers, and UUIDs.'

For email: 'Never use a regex to fully validate email addresses — the spec is too complex. Use a simple regex for basic format checking (/^[^\s@]+@[^\s@]+\.[^\s@]+$/) and verify deliverability by sending a confirmation email. Or use a validation library that handles the edge cases.'
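A hedged sketch of the basic format check, using the pattern quoted in the rule. The helper name looksLikeEmail and the 254-character cap are assumptions: 254 is a commonly cited upper bound for address length, not something the rule specifies. The cap matters here because [^\s@] also matches dots, which gives the engine room to backtrack on long inputs that almost match.

```typescript
// Pattern quoted in the rule: something, "@", something, ".", something.
const SIMPLE_EMAIL = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Assumed cap; 254 is a commonly cited upper bound for an address length.
const MAX_EMAIL_LENGTH = 254;

function looksLikeEmail(input: string): boolean {
  // Format check only. Deliverability is confirmed by sending an email.
  return input.length <= MAX_EMAIL_LENGTH && SIMPLE_EMAIL.test(input);
}

console.log(looksLikeEmail("user@example.com")); // true
console.log(looksLikeEmail("user@example")); // false
```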

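For URLs: 'Don't use regex for URL validation. Use the built-in URL constructor: try { new URL(input) } catch { /* invalid */ }. The URL parser handles edge cases that regex can't: internationalized domain names, unusual port numbers, encoded paths. Note that new URL accepts any scheme (mailto:, javascript:, and so on), so check url.protocol when only http(s) links are acceptable.'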
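A minimal sketch of URL validation with the constructor. The helper name parseHttpUrl and the restriction to http and https are assumptions for illustration; the constructor itself accepts any scheme, so the scheme check is a separate step.

```typescript
// Hypothetical helper: returns a parsed URL for http(s) input, otherwise null.
function parseHttpUrl(input: string): URL | null {
  try {
    const url = new URL(input);
    // new URL accepts any scheme (mailto:, javascript:, ...), so the
    // scheme check is a separate, explicit step.
    return url.protocol === "http:" || url.protocol === "https:" ? url : null;
  } catch {
    return null; // not parseable as an absolute URL
  }
}

console.log(parseHttpUrl("https://example.com:8443/a%20b?q=1")?.port); // 8443
console.log(parseHttpUrl("example.com")); // null (no scheme)
console.log(parseHttpUrl("javascript:alert(1)")); // null (wrong scheme)
```

Returning the parsed URL object rather than a boolean lets callers read hostname, port, and pathname without parsing the string a second time.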

Email: simple format check + confirmation email — not full RFC 5322 regex
URLs: URL constructor, not regex — handles IDN, ports, encoding
UUIDs: /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i
Dates: parse with Date or a date library — regex for format, not validity
Phone: use libphonenumber — phone formats are too complex for regex
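Two of the list entries can be sketched briefly. The UUID pattern is the one given above. The date helper illustrates "regex for format, not validity": the regex only checks the shape, and a round trip through Date decides whether the day exists. The names isUuid, DATE_SHAPE, and isRealCalendarDate are hypothetical.

```typescript
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function isUuid(input: string): boolean {
  return UUID_PATTERN.test(input);
}

// Shape only: four digits, two digits, two digits.
const DATE_SHAPE = /^(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})$/;

function isRealCalendarDate(input: string): boolean {
  const groups = DATE_SHAPE.exec(input)?.groups;
  if (!groups) return false; // wrong shape
  const y = Number(groups["y"]);
  const m = Number(groups["m"]);
  const d = Number(groups["d"]);
  // Date normalizes out-of-range values (Feb 30 rolls over into March), so a
  // round-trip comparison tells us whether the date exists.
  // Note: Date.UTC remaps years 0-99 to 1900-1999, so those are rejected here.
  const date = new Date(Date.UTC(y, m - 1, d));
  return (
    date.getUTCFullYear() === y &&
    date.getUTCMonth() === m - 1 &&
    date.getUTCDate() === d
  );
}

console.log(isRealCalendarDate("2024-02-29")); // true (leap year)
console.log(isRealCalendarDate("2023-02-29")); // false
```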

💡 URL Constructor > Regex

Don't validate URLs with regex. new URL(input) handles internationalized domains, unusual ports, and encoded paths. Regex can't. Same for email — simple format check + confirmation email.

Rule 4: When Not to Use Regex

The rule: 'Don't use regex to parse structured formats. Don't parse HTML with regex — use a DOM parser (cheerio, jsdom, Beautiful Soup). Don't parse JSON with regex — use JSON.parse. Don't parse CSV with regex — use a CSV parser library. Don't parse programming languages with regex — use a proper parser. Regex is for pattern matching in unstructured text, not for parsing structured data.'
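A short illustration of why the parser wins, using JSON as the example. Both helper names and the sample payload are made up for this sketch. The regex version looks reasonable but stops at the first double quote, so an escaped quote inside the value truncates the result.

```typescript
// Naive regex extraction: stops at the first double quote it sees.
const NAME_FIELD = /"name":\s*"([^"]*)"/;

function nameViaRegex(jsonText: string): string | null {
  return NAME_FIELD.exec(jsonText)?.[1] ?? null;
}

function nameViaParser(jsonText: string): string | null {
  try {
    const data: unknown = JSON.parse(jsonText);
    if (typeof data !== "object" || data === null) return null;
    const name = (data as Record<string, unknown>)["name"];
    return typeof name === "string" ? name : null;
  } catch {
    return null; // not valid JSON
  }
}

// The JSON text is: {"name": "Ann \"Ace\" Lee"}
const payload = '{"name": "Ann \\"Ace\\" Lee"}';
console.log(nameViaRegex(payload)); // Ann \
console.log(nameViaParser(payload)); // Ann "Ace" Lee
```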

For string operations: 'Use string methods when they suffice. string.includes() is clearer than /substring/.test(string). string.startsWith() is clearer than /^prefix/.test(string). string.split("delimiter") is clearer than string.split(/delimiter/). Only use regex when the pattern has actual regex features (character classes, quantifiers, alternation).'

This rule prevents the 'regex hammer' problem — when the AI treats every string operation as a regex problem. Simple string methods are faster, more readable, and easier to maintain.
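A small sketch of the difference, with a made-up record string. Besides being easier to read, the string methods avoid a correctness trap: a literal that contains regex metacharacters has to be escaped before it can be used as a pattern.

```typescript
const record = "total=$5.00;tax=$0.40";

// Plain string methods: no escaping, no special characters to think about.
const hasTotal = record.includes("$5.00");
const startsWithTotal = record.startsWith("total=");
const fields = record.split(";");

// The regex version of the same containment check is wrong unless the
// literal is escaped: $ is an end anchor and . matches any character.
const unescaped = /$5.00/.test(record);
const escaped = /\$5\.00/.test(record);

console.log(hasTotal, startsWithTotal); // true true
console.log(fields); // [ 'total=$5.00', 'tax=$0.40' ]
console.log(unescaped, escaped); // false true
```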

Rule 5: Testing Regex Thoroughly

The rule: 'Every regex pattern has a dedicated test with: positive matches (strings that should match), negative matches (strings that should not match), edge cases (empty string, very long string, unicode, special characters), and the documented intent (a comment explaining what the pattern is supposed to match). Use regex-specific testing tools (regex101 for development, unit tests for CI).'

For edge cases: 'Always test with: empty string, single character, maximum expected length, unicode characters (emoji, accented letters, CJK), strings with only whitespace, strings with regex special characters (.+*?[]{}()|^$\), and strings designed to trigger backtracking.'

Regex bugs are among the hardest to debug because the pattern is opaque and the failure mode (wrong match, missed match, hang) varies with input. Comprehensive tests are the only reliable way to verify correctness.
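A table-driven sketch of such a test, written without a test framework so it stays self-contained. The pattern (lowercase six-digit hex colors), the case list, and the helper failingCases are all hypothetical; in a real project the table would feed whatever unit-test runner the codebase already uses.

```typescript
// Intent: match lowercase hex color codes such as "#1a2b3c" (exactly six digits).
const HEX_COLOR = /^#[0-9a-f]{6}$/;

interface RegexCase {
  input: string;
  expected: boolean;
  note: string;
}

const hexColorCases: RegexCase[] = [
  { input: "#1a2b3c", expected: true, note: "typical value" },
  { input: "#000000", expected: true, note: "all digits" },
  { input: "1a2b3c", expected: false, note: "missing #" },
  { input: "#1A2B3C", expected: false, note: "uppercase rejected by design" },
  { input: "#1a2b3", expected: false, note: "too short" },
  { input: "#1a2b3cd", expected: false, note: "too long" },
  { input: "", expected: false, note: "empty string" },
  { input: "#", expected: false, note: "single character" },
  { input: "   ", expected: false, note: "whitespace only" },
  { input: "#1a2b3c\n", expected: false, note: "trailing newline" },
  { input: "#\u00e91a2b3", expected: false, note: "accented letter" },
  { input: "#😀😀😀", expected: false, note: "emoji" },
  { input: "#.+*?[]", expected: false, note: "regex special characters" },
  { input: "#" + "a".repeat(10000), expected: false, note: "very long input" },
];

// Returns the cases where the pattern's verdict differs from the expectation.
function failingCases(pattern: RegExp, table: RegexCase[]): RegexCase[] {
  return table.filter((c) => pattern.test(c.input) !== c.expected);
}

console.log(failingCases(HEX_COLOR, hexColorCases).length); // 0
```

Running the same table against a looser pattern such as /[0-9a-f]{6}/ reports failures (the missing-# case, for one), which is how the negative cases earn their place.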

ℹ️ Test Unicode

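Always test regex with emoji, accented characters, and CJK text. Many patterns that work on ASCII break on unicode. In JavaScript, \w is ASCII-only even with the /u flag, so it does not match accented letters; use \p{L} with /u (or /v) instead. Python 3's \w matches Unicode letters in str patterns by default, so the same pattern can behave differently across engines.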
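A quick TypeScript check of how JavaScript engines treat non-ASCII input. The sample strings and constant names are illustrative; the accented letter is written as an escape so that it is unambiguously the precomposed character.

```typescript
const accented = "caf\u00e9"; // "café" with a precomposed é
const cjk = "日本語";
const emoji = "😀"; // stored as two UTF-16 code units

// \w is ASCII-only in JavaScript, with or without the u flag.
const plainWord = /^\w+$/.test(accented);
const plainWordU = /^\w+$/u.test(accented);

// Unicode property escapes (u flag required) match letters in any script.
const UNICODE_WORD = /^[\p{L}\p{N}_]+$/u;
const unicodeAccented = UNICODE_WORD.test(accented);
const unicodeCjk = UNICODE_WORD.test(cjk);

// Without the u flag, . matches one code unit, so an emoji counts as two.
const dotWithoutU = /^.$/.test(emoji);
const dotWithU = /^.$/u.test(emoji);

console.log(plainWord, plainWordU); // false false
console.log(unicodeAccented, unicodeCjk); // true true
console.log(dotWithoutU, dotWithU); // false true
```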

Complete Regex Rules Template

Consolidated rules for regular expressions in any language.

No nested quantifiers (ReDoS) — test all patterns with safe-regex in CI
Cap input length before regex evaluation — short-circuit on oversized strings
Named capture groups — verbose mode for complex patterns — comments explaining intent
Validated patterns for email, UUID, dates — don't reinvent common formats
URL constructor over regex — DOM parser over regex for HTML — JSON.parse over regex
String methods when sufficient — includes() over /x/.test(), startsWith() over /^x/
Unit tests with positive, negative, edge cases, and unicode for every pattern
Document intent: what should this pattern match? What should it reject?