Best Practices

AI Rules for CSV Processing

AI splits CSV on commas and breaks on the first quoted field. Rules for proper CSV parsing libraries, streaming large files, encoding detection, header mapping, data validation, and edge case handling.

7 min read·March 11, 2025

line.split(',') breaks on 'Smith, Jr.' — the first quoted field destroys the entire parse

Papa Parse streaming, encoding detection, BOM handling, header mapping, per-row Zod validation

AI Parses CSV with string.split(',') and Prays

AI generates CSV parsing with: line.split(',') (breaks on the first field containing a comma: "Smith, Jr." becomes two fields), no quote handling ("field with "quotes" inside" is unparseable), no encoding detection (UTF-8 BOM marker becomes part of the first field name), loading the entire file into memory (a 500MB CSV crashes the process), and no validation (garbage data flows directly into the database). CSV looks simple until you encounter real-world data — then every edge case appears simultaneously.

Modern CSV processing is: library-parsed (Papa Parse or csv-parse handle all RFC 4180 edge cases), stream-processed (rows parsed and processed one at a time, constant memory), encoding-aware (BOM detection, UTF-8/Latin-1/Windows-1252 handling), header-mapped (column names to field names with aliases for common misspellings), and row-validated (Zod schema per row, errors collected, partial success). AI generates none of these.

These rules cover: proper CSV parsing libraries, streaming for large files, encoding detection, header mapping, per-row validation, and the edge cases that break naive parsers.

Rule 1: Use a CSV Parsing Library, Never split()

The rule: 'Use Papa Parse (browser + Node.js) or csv-parse (Node.js) for all CSV parsing. These libraries handle RFC 4180 correctly: quoted fields ("field with, comma"), escaped quotes ("field with ""quotes"" inside"), newlines within fields ("multi\nline\nfield"), and different delimiters (tab-separated, semicolon-separated). Papa Parse: Papa.parse(csvString, { header: true, dynamicTyping: true }). csv-parse: const parser = parse({ columns: true, cast: true }).'

For delimiter detection: 'Real-world CSVs use: commas (US standard), semicolons (European standard — because commas are decimal separators), tabs (TSV), and pipes (legacy systems). Papa Parse auto-detects the delimiter from the first few rows. csv-parse requires explicit delimiter configuration. Always auto-detect or ask the user — assuming commas breaks European CSVs silently (data appears correct but fields are merged).'

AI generates: const rows = csv.split('\n').map(row => row.split(',')) — two lines that fail on: fields containing commas, fields containing newlines, fields containing quotes, files with Windows line endings (\r\n), files with BOM markers, and tab-delimited files. A CSV library handles all of these. Two lines of split() vs one line of Papa.parse() — the library is shorter AND correct.

  • Papa Parse: browser + Node.js, auto-detect delimiter, streaming support
  • csv-parse: Node.js, RFC 4180 compliant, piping with transforms
  • Handles: quoted fields, escaped quotes, newlines in fields, BOM markers
  • Auto-detect delimiter: comma, semicolon, tab, pipe — do not assume commas
  • split(',') fails on the first real-world CSV — use a library, always
💡 Library Is Shorter AND Correct

split(',') on two lines fails on: quoted fields, escaped quotes, newlines, BOM, and tab delimiters. Papa.parse(csv, { header: true }) on one line handles all of them. The library is less code and handles every RFC 4180 edge case.
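In practice you should let the library auto-detect the delimiter, but the idea is simple enough to sketch. Here is a hypothetical `detectDelimiter` helper, a simplified version of what Papa Parse's auto-detection does: count each candidate delimiter on the first few lines and pick the one that appears a consistent number of times. Note this naive version ignores quoting, so a quoted field containing a delimiter can skew the count — one more reason to use the library's detection in real code.

```javascript
// Hypothetical helper: guess the delimiter by counting candidates in the
// first few lines. A real parser (Papa Parse) also accounts for quoting.
function detectDelimiter(sample, candidates = [',', ';', '\t', '|']) {
  const lines = sample.split(/\r?\n/).filter(Boolean).slice(0, 5);
  let best = ',';
  let bestScore = -1;
  for (const delim of candidates) {
    const counts = lines.map((line) => line.split(delim).length - 1);
    // A plausible delimiter appears at least once and the same number of
    // times on every sampled line (consistent column count per row).
    const consistent = counts.every((c) => c === counts[0]);
    if (counts[0] > 0 && consistent && counts[0] > bestScore) {
      bestScore = counts[0];
      best = delim;
    }
  }
  return best;
}
```

A European CSV (`name;amount`) detects as semicolon instead of silently merging fields under an assumed comma.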

Rule 2: Stream Processing for Large Files

The rule: 'Process CSV files as streams — never load the entire file into memory. Papa Parse: Papa.parse(readableStream, { step: (row) => { processRow(row.data); }, complete: () => { console.log('Done'); } }). csv-parse: fs.createReadStream('data.csv').pipe(parse({ columns: true })).on('data', (row) => { processRow(row); }). Memory usage: constant (one row in memory at a time), regardless of file size. A 5GB CSV uses the same memory as a 5KB CSV.'

For backpressure handling: 'When processing is slower than parsing (database insert slower than CSV read), implement backpressure: pause the read stream when the processing queue is full, resume when it drains. Node.js streams handle this automatically when using pipe(). For manual control: parser.pause() when the batch queue reaches 1000 rows, parser.resume() after the batch is inserted. Without backpressure: the parser fills memory with unprocessed rows until the process crashes.'

AI generates: const data = fs.readFileSync('data.csv', 'utf8'); const rows = Papa.parse(data).data; — entire file in memory as a string, then parsed into an array of arrays. 500MB file = 1.5GB memory (string + parsed objects). Streaming: 10MB memory regardless of file size. The difference between 'works on the test file' and 'works on the production file.'

Rule 3: Encoding Detection and BOM Handling

The rule: 'Detect file encoding before parsing. Common encodings: UTF-8 (standard), UTF-8 with BOM (Windows Excel exports — the BOM \xEF\xBB\xBF becomes part of the first header name if not stripped), Windows-1252 (legacy European), Latin-1 (ISO 8859-1). Libraries: chardet (auto-detect encoding) or iconv-lite (convert between encodings). Strip the BOM before parsing: if (buffer[0] === 0xEF && buffer[1] === 0xBB && buffer[2] === 0xBF) buffer = buffer.slice(3).'

For the BOM problem: 'Excel saves CSV files with a UTF-8 BOM (Byte Order Mark). The BOM is invisible but prepends three bytes to the file. If not stripped, the first column header becomes \xEF\xBB\xBFname instead of name. The header does not match your expected column names. Lookups fail silently: row['name'] is undefined because the actual key is '\xEF\xBB\xBFname'. One of the most common and hardest-to-debug CSV issues. Strip the BOM or use a library that handles it.'

AI generates: fs.readFileSync('data.csv', 'utf8') — assumes UTF-8, does not strip BOM, crashes on Windows-1252 characters (accented names like 'René' become garbage). With encoding detection: detect the actual encoding, convert to UTF-8, strip BOM, then parse. The data is correct regardless of how the file was exported.

⚠️ The Invisible BOM Problem

Excel exports CSV with a UTF-8 BOM (3 invisible bytes). First header becomes '\xEF\xBB\xBFname' instead of 'name'. row['name'] returns undefined. One of the most common and hardest-to-debug CSV issues. Strip BOM before parsing or use a library that handles it.
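The BOM strip itself is three bytes of Buffer inspection, sketched below with Node's stdlib only (full encoding detection and conversion would go through chardet and iconv-lite as described above):

```javascript
// Strip a UTF-8 BOM (0xEF 0xBB 0xBF) from a file buffer before decoding.
// Without this, the first header parses as '\uFEFFname' instead of 'name'
// and row['name'] is undefined for every row.
function stripBom(buffer) {
  if (
    buffer.length >= 3 &&
    buffer[0] === 0xef &&
    buffer[1] === 0xbb &&
    buffer[2] === 0xbf
  ) {
    return buffer.subarray(3); // view past the BOM, no copy
  }
  return buffer;
}
```

Run it on the raw buffer before converting to a string and handing it to the parser.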

Rule 4: Header Mapping with Column Aliases

The rule: 'Map CSV column headers to your data model field names, supporting aliases for common variations. Map: { "First Name": "firstName", "first_name": "firstName", "FirstName": "firstName", "fname": "firstName" }. After parsing: transform each row using the mapping. The user uploads a CSV with any common column naming convention, and it maps to your schema automatically. For unrecognized columns: ignore (do not fail), but warn the user which columns were skipped.'

For the mapping UI: 'When automatic mapping fails (unknown column names), show a mapping interface: left column = CSV headers, right column = dropdown of your fields. The user manually maps: "Customer Full Name" → firstName + lastName. Save the mapping for future imports from the same source. Returning users: their mapping is remembered, imports are one-click. First-time users: guided mapping, no ambiguity.'

AI generates: row['First Name'] with no aliases. The CSV has 'first_name' as the header. Result: undefined for every row. The user re-exports the CSV with different headers (if they even know the expected format). With alias mapping: 'first_name', 'First Name', 'FirstName', and 'fname' all map to firstName automatically. The import works regardless of the source system column naming.

  • Alias map: multiple CSV header names → one model field name
  • Normalize before matching — lowercase, treat spaces and underscores as equivalent: 'First Name' = 'first_name' = 'FIRST_NAME'
  • Trim whitespace from headers: ' name ' → 'name' (trailing spaces are common)
  • Unrecognized columns: warn, do not fail — user may have extra columns
  • Save mappings per source: returning imports are one-click with remembered mapping
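The alias-map pattern above might look like this sketch, where `HEADER_ALIASES` is a hypothetical table you would extend per data model; headers are trimmed and lowercased before lookup, and unrecognized columns are reported back rather than failing the import:

```javascript
// Hypothetical alias table: many source header spellings → one model field.
const HEADER_ALIASES = {
  firstName: ['first name', 'first_name', 'firstname', 'fname'],
  lastName: ['last name', 'last_name', 'lastname', 'lname'],
  email: ['email', 'e-mail', 'email address'],
};

// Map raw CSV headers to model field names. Unrecognized headers are
// returned separately so the caller can warn the user, not fail the import.
function mapHeaders(rawHeaders) {
  const lookup = new Map();
  for (const [field, aliases] of Object.entries(HEADER_ALIASES)) {
    for (const alias of aliases) lookup.set(alias, field);
  }
  const mapping = {};
  const unrecognized = [];
  for (const raw of rawHeaders) {
    const key = raw.trim().toLowerCase(); // ' First Name ' → 'first name'
    const field = lookup.get(key);
    if (field) mapping[raw] = field;
    else unrecognized.push(raw);
  }
  return { mapping, unrecognized };
}
```

Whatever the CSV library returns as the header row goes through `mapHeaders` once, and the resulting mapping renames keys on every parsed row.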

Rule 5: Per-Row Validation and Error Collection

The rule: 'Validate each row against a Zod schema after header mapping. Collect all errors with row numbers — do not fail on the first invalid row. Pattern: const errors = []; rows.forEach((row, i) => { const result = schema.safeParse(row); if (!result.success) errors.push({ row: i + 2, issues: result.error.issues }); else validRows.push(result.data); }). The +2 accounts for the header row and zero-indexing. Return: { imported: validRows.length, failed: errors.length, errors }.'

For common CSV data issues: 'Empty strings that should be null: transform '' to null before validation. Number fields with currency symbols: strip $ and commas before parsing ($1,234.56 → 1234.56). Date fields in inconsistent formats: try multiple formats (MM/DD/YYYY, DD/MM/YYYY, YYYY-MM-DD) with date-fns parse. Boolean fields as strings: map 'yes', 'true', '1', 'Y' to true. Each transformation is a pre-validation step in the parsing pipeline.'

AI generates: CSV data inserted directly into the database with no validation. A text field in a number column: database error, import aborts. A malformed date: stored as a string, breaks date queries later. With per-row validation: every row is type-checked and transformed before insert. Invalid rows are reported with specific errors. The database receives only clean, validated data.
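The collect-all-errors pattern is small enough to sketch in full. Here `schema` stands in for a Zod schema — anything exposing Zod's `safeParse(row)` shape, returning `{ success: true, data }` or `{ success: false, error }` — so the sketch runs without the library:

```javascript
// Validate every row, collecting errors with 1-based file line numbers
// instead of aborting on the first failure. `schema` mimics Zod's safeParse.
function validateRows(rows, schema) {
  const validRows = [];
  const errors = [];
  rows.forEach((row, i) => {
    const result = schema.safeParse(row);
    if (result.success) {
      validRows.push(result.data);
    } else {
      // +2: array index 0 is line 2 of the file (line 1 is the header).
      errors.push({ row: i + 2, issues: result.error.issues });
    }
  });
  return { imported: validRows.length, failed: errors.length, validRows, errors };
}
```

The caller gets partial success: valid rows go to the database, and the user sees exactly which file lines failed and why.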

ℹ️ $1,234.56 Is Not a Number

CSV number fields with currency symbols and commas: $1,234.56. Zod says: not a number. Pre-validation transforms: strip $, strip commas, parse as float → 1234.56. Dates, booleans, and currencies all need transforms before schema validation.
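The pre-validation transforms might be sketched as small pure functions, assuming US-style currency formatting and common yes/no spellings (date parsing is omitted here since it would pull in date-fns):

```javascript
// Empty string → null, so optional fields validate as missing, not as ''.
function emptyToNull(value) {
  return value === '' ? null : value;
}

// '$1,234.56' → 1234.56: strip currency symbol, thousands separators,
// and whitespace before parsing. Unparseable input → null for the schema
// to reject with a clear error.
function parseCurrency(value) {
  const cleaned = String(value).replace(/[$,\s]/g, '');
  const n = Number.parseFloat(cleaned);
  return Number.isNaN(n) ? null : n;
}

// 'yes'/'true'/'1'/'Y' → true, 'no'/'false'/'0'/'N' → false,
// anything else → null so validation reports it instead of guessing.
function parseBoolean(value) {
  const truthy = new Set(['yes', 'true', '1', 'y']);
  const falsy = new Set(['no', 'false', '0', 'n']);
  const v = String(value).trim().toLowerCase();
  if (truthy.has(v)) return true;
  if (falsy.has(v)) return false;
  return null;
}
```

Each transform runs on the mapped row before the Zod schema sees it, so the schema validates real types, not string soup.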

Complete CSV Processing Rules Template

Consolidated rules for CSV processing.

  • Papa Parse or csv-parse — never split(','), handles all RFC 4180 edge cases
  • Stream processing: constant memory, backpressure handling for slow consumers
  • Encoding detection: chardet for auto-detect, strip UTF-8 BOM before parsing
  • Delimiter auto-detection: comma, semicolon, tab, pipe — do not assume commas
  • Header mapping with aliases: 'First Name' = 'first_name' = 'fname' → firstName
  • Per-row Zod validation: collect all errors with row numbers, partial success
  • Data transforms: strip currency symbols, parse dates, normalize booleans
  • Save column mappings per source — returning imports are one-click