Best Practices

AI Rules for Data Export and Import

AI builds export as a synchronous endpoint that times out on large datasets. Rules for async export jobs, streaming CSV/JSON generation, chunked imports with validation, progress tracking, and GDPR data portability.

7 min read·March 9, 2025

Synchronous export times out at 50,000 rows — import fails on row 42 and rolls back all 9,999 valid rows

Async export jobs, streaming generation, chunked import, progress tracking, GDPR data portability

AI Exports Synchronously and Imports Blindly

AI generates data export with: synchronous response (SELECT all rows, serialize to JSON, send in one response — times out at 50,000 rows), in-memory generation (load entire dataset into memory — crashes at 500MB), no progress tracking (user clicks Export, waits, sees nothing, clicks again, waits again), no format options (JSON only, no CSV, no Excel). Import is worse: accept the entire file, parse at once, fail on the first error with no indication of which row failed, and no way to resume.

Modern data export/import is: async (background job generates the file, notifies when ready), streaming (rows written to file as they are read from the database — constant memory usage), progress-tracked (job progress available via API or WebSocket), format-flexible (CSV, JSON, Excel — user chooses), and error-resilient (imports validate per row, report all errors, allow partial success with an error report). AI generates none of these.

These rules cover: async export job pattern, streaming file generation, chunked import with per-row validation, progress tracking, error reporting, and GDPR data portability compliance.

Rule 1: Async Background Export Jobs

The rule: 'Export is a background job, not a synchronous API response. Flow: (1) user clicks Export, (2) API creates a job record and returns 202 Accepted with a job ID, (3) background worker streams data from database to file (S3/R2), (4) job status updated to completed with the download URL, (5) user is notified (email, push, or polling). The user does not wait — they can continue using the app while the export generates.'

For the job record: 'Table: export_jobs { id, userId, status (pending/processing/completed/failed), format (csv/json/xlsx), filters (JSON — what data to export), fileUrl (S3 URL when completed), rowCount, createdAt, completedAt, expiresAt }. The download URL expires after 24 hours (signed S3 URL). The job record is the contract between the user request and the background worker. Polling: GET /api/exports/:jobId returns status and progress.'

AI generates: app.get('/api/export', async (req, res) => { const data = await db.select().from(orders); res.json(data); }) — loads all orders into memory, serializes, sends. At 100,000 orders: 500MB in memory, 30-second response time, gateway timeout. Async job: constant memory (streaming), no timeout (background worker has no HTTP timeout), and the user gets a download link instead of waiting.

  • 202 Accepted with job ID — not synchronous response with data
  • Background worker streams to S3/R2 — constant memory, no timeout
  • Job record: status, format, filters, fileUrl, rowCount, expiresAt
  • Signed download URL: expires after 24 hours for security
  • Notify on completion: email, push, or client polls job status
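The flow above can be sketched end to end. This is a minimal in-memory illustration, not a real queue or storage integration: the job store is a `Map`, the worker is invoked directly instead of via a queue, and the signed URL is a placeholder. All names (`ExportJob`, `createExportJob`, `runExportWorker`) are hypothetical.

```typescript
import { randomUUID } from "node:crypto";

type ExportJob = {
  id: string;
  userId: string;
  status: "pending" | "processing" | "completed" | "failed";
  format: "csv" | "json" | "xlsx";
  filters: Record<string, unknown>;
  fileUrl?: string;
  rowCount?: number;
  createdAt: Date;
  completedAt?: Date;
  expiresAt?: Date;
};

const jobs = new Map<string, ExportJob>();

// POST /api/exports handler body: create the job record, return 202 + job ID.
// In a real app this also enqueues a message for the background worker.
function createExportJob(
  userId: string,
  format: ExportJob["format"],
  filters: Record<string, unknown> = {},
) {
  const job: ExportJob = {
    id: randomUUID(),
    userId,
    status: "pending",
    format,
    filters,
    createdAt: new Date(),
  };
  jobs.set(job.id, job);
  return { statusCode: 202, jobId: job.id };
}

// Background worker: stream rows to storage (S3/R2 in production), then mark
// the job completed with a signed URL that expires after 24 hours.
async function runExportWorker(jobId: string, uploadRows: () => Promise<number>) {
  const job = jobs.get(jobId);
  if (!job) return;
  job.status = "processing";
  try {
    job.rowCount = await uploadRows();
    job.fileUrl = `https://files.example.com/exports/${jobId}?sig=...`; // placeholder signed URL
    job.completedAt = new Date();
    job.expiresAt = new Date(Date.now() + 24 * 60 * 60 * 1000);
    job.status = "completed";
  } catch {
    job.status = "failed";
  }
}

// GET /api/exports/:jobId handler body: the polling endpoint.
function getExportJob(jobId: string): ExportJob | null {
  return jobs.get(jobId) ?? null;
}
```

The client receives the 202 immediately, polls `getExportJob` until `status` is `completed`, then follows `fileUrl`.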

Rule 2: Streaming File Generation

The rule: 'Generate export files with streaming — never load the full dataset into memory. Pattern: open a write stream to S3 (using multipart upload), open a database cursor (read rows in batches of 1000), for each batch write to the stream, close the stream when done. Memory usage: constant (one batch in memory at a time), regardless of dataset size. A 10GB export uses the same memory as a 10KB export.'

For CSV streaming: 'Stream CSV rows as they are read: const csvStream = new Transform({ objectMode: true, transform(row, enc, cb) { cb(null, formatCSVRow(row)); } }); databaseCursor.pipe(csvStream).pipe(s3UploadStream). Without objectMode: true the Transform rejects row objects. For JSON: use JSON streaming (json-stream-stringify) to write array elements one at a time instead of JSON.stringify(entireArray). For Excel: use streaming XLSX libraries (exceljs with streaming workbook writer) that write rows without buffering the entire sheet.'

AI generates: const data = await db.select().from(orders); const csv = data.map(row => formatCSV(row)).join('\n'); — entire dataset in memory twice (once as objects, once as string). At 1 million rows: 2GB in memory, Node.js heap overflow. Streaming: 10MB memory usage regardless of row count. Same output file, 200x less memory.
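The streaming pipeline can be sketched with Node's built-in stream primitives. The async iterable stands in for a database cursor reading batches, and the `Writable` sink stands in for an S3 multipart upload stream; `csvTransform` and `exportCsv` are illustrative names.

```typescript
import { Readable, Transform, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

type OrderRow = { id: number; total: number };

// CSV formatter: object-mode in, text out. Only one row is held at a time.
function csvTransform() {
  let first = true;
  return new Transform({
    writableObjectMode: true,
    transform(row: OrderRow, _enc, cb) {
      const header = first ? "id,total\n" : "";
      first = false;
      cb(null, header + `${row.id},${row.total}\n`);
    },
  });
}

// Streams rows from a cursor-like async iterable into a sink.
// In production the sink is an S3 multipart upload; here it collects chunks
// so the result can be inspected.
async function exportCsv(cursor: AsyncIterable<OrderRow>): Promise<string> {
  const out: string[] = [];
  const sink = new Writable({
    write(chunk, _enc, cb) {
      out.push(chunk.toString());
      cb();
    },
  });
  await pipeline(Readable.from(cursor), csvTransform(), sink);
  return out.join("");
}
```

`Readable.from` wraps the cursor in an object-mode stream, and `pipeline` propagates backpressure automatically, so a slow upload naturally throttles the database reads.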

💡 10GB Export, 10MB Memory

AI loads the entire dataset into memory: 1 million rows = 2GB, Node.js crashes. Streaming with database cursor + S3 multipart upload: 10MB memory usage regardless of dataset size. A 10GB export uses the same memory as a 10KB export.

Rule 3: Chunked Import with Per-Row Validation

The rule: 'Import files in chunks, validate each row independently, and report all errors — do not fail on the first error. Flow: (1) upload file to S3 (presigned URL for direct upload), (2) create import job, (3) background worker reads the file in chunks (1000 rows), (4) validate each row against a Zod schema, (5) insert valid rows, (6) collect errors with row numbers, (7) generate an error report. Result: 9,500 rows imported successfully, 500 rows failed with specific errors per row.'

For the error report: 'Generate a downloadable error report: { totalRows: 10000, imported: 9500, failed: 500, errors: [{ row: 42, field: "email", value: "not-an-email", error: "Invalid email format" }, { row: 156, field: "amount", value: "abc", error: "Must be a number" }] }. The user downloads the error report, fixes the 500 rows in their spreadsheet, and re-imports only the failed rows. Partial success is always better than complete failure.'

AI generates: parse the entire CSV at once, INSERT all rows in one transaction — one invalid row rolls back the entire import. 9,999 valid rows lost because row 42 has a bad email. With chunked import: 9,999 rows imported, row 42 reported as failed with the specific error. The user fixes one row instead of re-importing 10,000.
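The chunked flow can be sketched as follows. A hand-rolled `validateRow` stands in for a Zod schema's safeParse call, and `insertBatch` stands in for a per-chunk database transaction; all names are illustrative.

```typescript
type RowError = { row: number; field: string; value: string; error: string };
type ImportReport = {
  totalRows: number;
  imported: number;
  failed: number;
  errors: RowError[];
};

// Per-row validation (stand-in for a Zod schema). Returns ALL errors for the
// row, not just the first.
function validateRow(raw: { email: string; amount: string }): Omit<RowError, "row">[] {
  const errors: Omit<RowError, "row">[] = [];
  if (!/^[^@\s]+@[^@\s]+\.[^@\s]+$/.test(raw.email)) {
    errors.push({ field: "email", value: raw.email, error: "Invalid email format" });
  }
  if (raw.amount.trim() === "" || Number.isNaN(Number(raw.amount))) {
    errors.push({ field: "amount", value: raw.amount, error: "Must be a number" });
  }
  return errors;
}

// Processes rows in chunks: valid rows are inserted (one transaction per
// chunk), invalid rows are collected into the error report with 1-based row
// numbers. Partial success instead of all-or-nothing.
async function importRows(
  rows: Array<{ email: string; amount: string }>,
  insertBatch: (batch: Array<{ email: string; amount: number }>) => Promise<void>,
  chunkSize = 1000,
): Promise<ImportReport> {
  const errors: RowError[] = [];
  let imported = 0;
  for (let start = 0; start < rows.length; start += chunkSize) {
    const chunk = rows.slice(start, start + chunkSize);
    const valid: Array<{ email: string; amount: number }> = [];
    chunk.forEach((raw, i) => {
      const rowNumber = start + i + 1;
      const rowErrors = validateRow(raw);
      if (rowErrors.length > 0) {
        errors.push(...rowErrors.map((e) => ({ ...e, row: rowNumber })));
      } else {
        valid.push({ email: raw.email, amount: Number(raw.amount) });
      }
    });
    if (valid.length > 0) await insertBatch(valid);
    imported += valid.length;
  }
  return { totalRows: rows.length, imported, failed: rows.length - imported, errors };
}
```

The returned report serializes directly into the downloadable error file described above, so the user can fix only the failed rows and re-import them.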

  • Upload to S3 first (presigned URL) — do not stream upload through your API server
  • Validate each row against Zod schema — collect all errors, not fail-fast
  • Partial success: import valid rows, report invalid rows with specific errors
  • Error report: row number + field + value + error message — downloadable
  • Re-import only failed rows — do not force re-import of the entire file

⚠️ 9,999 Valid Rows Lost

One transaction for all rows: row 42 has a bad email, entire import rolls back. 9,999 valid rows lost. Chunked import with per-row validation: 9,999 rows imported, row 42 reported with specific error. User fixes one row, re-imports only the failures.

Rule 4: Progress Tracking and Cancellation

The rule: 'Track and expose job progress: { status: "processing", progress: { current: 45000, total: 100000, percentage: 45 }, estimatedTimeRemaining: "2 minutes" }. Update progress every 1000 rows (not every row — too many updates). Expose via: polling endpoint (GET /api/exports/:jobId), Server-Sent Events (real-time updates without polling), or WebSocket (bidirectional, allows cancellation). Support cancellation: user can abort a long-running export/import, the worker checks a cancelled flag between chunks.'

For estimated time: 'Track rows per second during processing. Estimated time = (total - current) / rowsPerSecond. Display as: "Processing: 45,000 of 100,000 rows (45%) — about 2 minutes remaining." The estimate improves over time as the average stabilizes. For the first 10%: show "Calculating..." instead of a wildly inaccurate estimate. Progress bars without estimates feel slower than progress bars with estimates — even if the estimate is approximate.'
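The estimate logic and the cancellation check can be sketched together. Function names and the "Calculating..." cutoff at 10% follow the rule above; everything else is illustrative.

```typescript
type Progress = {
  current: number;
  total: number;
  percentage: number;
  estimatedTimeRemaining: string;
};

// Estimated time = (total - current) / rowsPerSecond, where rowsPerSecond is
// the running average since the job started. Below 10% the average is too
// noisy, so show "Calculating..." instead of a wild guess.
function computeProgress(current: number, total: number, elapsedMs: number): Progress {
  const percentage = Math.floor((current / total) * 100);
  if (current < total * 0.1 || elapsedMs === 0) {
    return { current, total, percentage, estimatedTimeRemaining: "Calculating..." };
  }
  const rowsPerSecond = current / (elapsedMs / 1000);
  const secondsLeft = Math.ceil((total - current) / rowsPerSecond);
  const estimatedTimeRemaining =
    secondsLeft >= 60
      ? `about ${Math.round(secondsLeft / 60)} minutes`
      : `about ${secondsLeft} seconds`;
  return { current, total, percentage, estimatedTimeRemaining };
}

// Worker loop: progress is updated once per chunk (not per row), and the
// cancelled flag is checked between chunks so a cancel takes effect quickly.
async function processWithProgress(
  total: number,
  chunkSize: number,
  isCancelled: () => boolean,
  onProgress: (p: Progress) => void,
): Promise<"completed" | "cancelled"> {
  const startedAt = Date.now();
  for (let current = 0; current < total; current += chunkSize) {
    if (isCancelled()) return "cancelled";
    // ... process one chunk of rows here ...
    const done = Math.min(current + chunkSize, total);
    onProgress(computeProgress(done, total, Date.now() - startedAt));
  }
  return "completed";
}
```

`onProgress` would write the progress object to the job record, where the polling endpoint, SSE stream, or WebSocket picks it up.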

AI generates: the user clicks Export, sees a spinner, waits 3 minutes with no feedback. Are they at 10% or 90%? Is it stuck? Should they click again? They click again: now two export jobs are running. With progress tracking: the user sees real-time progress, knows the estimate, and can cancel if they started the wrong export. Confidence replaces anxiety.

Rule 5: GDPR Data Portability (Article 20)

The rule: 'GDPR Article 20 gives users the right to receive their personal data in a structured, commonly used, machine-readable format. Implementation: a "Download My Data" button in account settings that triggers an async export of all user data: profile, orders, preferences, activity history, uploaded files. Format: JSON (machine-readable) with a ZIP containing the JSON plus any uploaded files. Generate within 30 days (GDPR deadline), but aim for under 24 hours.'

For the data package: 'Structure the export as: user-data-export.zip containing: profile.json (name, email, address, preferences), orders.json (all orders with items and payments), activity.json (login history, actions taken), uploads/ (any files the user uploaded). Each JSON file includes a schema description header explaining the fields. The export is self-contained — a user can understand their data without documentation. Include a README.txt explaining what each file contains.'
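Assembling the package can be sketched as a map of file paths to contents. The data fetchers and per-file schema headers are illustrative; zipping the entries (for example with the archiver package) and attaching the user's `uploads/` files are assumed steps outside this sketch.

```typescript
type UserData = {
  profile: Record<string, unknown>;
  orders: unknown[];
  activity: unknown[];
};

// Builds the file set for user-data-export.zip. Each JSON file carries a
// schema description header so the export is understandable without docs.
function buildExportPackage(data: UserData): Map<string, string> {
  const files = new Map<string, string>();
  const withSchema = (description: string, payload: unknown) =>
    JSON.stringify({ _schema: description, data: payload }, null, 2);

  files.set("profile.json", withSchema("Name, email, address, preferences", data.profile));
  files.set("orders.json", withSchema("All orders with items and payments", data.orders));
  files.set("activity.json", withSchema("Login history, actions taken", data.activity));
  files.set(
    "README.txt",
    [
      "Your personal data export (GDPR Article 20).",
      "profile.json  - your account profile",
      "orders.json   - your order history",
      "activity.json - your activity log",
      "uploads/      - files you uploaded",
    ].join("\n"),
  );
  return files; // zip these entries plus uploads/, upload the archive, notify the user
}
```

Because the export runs through the same async job pipeline as Rule 1, the "Download My Data" button is just another job type with a fixed file layout.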

AI generates: no data portability feature. A GDPR request arrives: a developer manually queries the database, exports to CSV, emails it to the user. Two developer-hours per request. At 100 requests per month: a full-time job. Automated export: user clicks a button, receives a ZIP in 24 hours, zero developer involvement. Compliance at scale without operational burden.

ℹ️ Two Dev-Hours vs One Button Click

Manual GDPR export: developer queries DB, exports CSV, emails user. Two hours per request, 100 requests/month = full-time job. Automated: user clicks Download My Data, receives ZIP in 24 hours, zero developer involvement. Compliance at scale.

Complete Data Export and Import Rules Template

Consolidated rules for data export and import.

  • Async export: 202 Accepted + background job + signed download URL — not synchronous response
  • Streaming generation: constant memory, database cursor, multipart S3 upload
  • Chunked import: 1000-row batches, per-row Zod validation, partial success with error report
  • Error report: row number + field + value + error — user fixes and re-imports only failed rows
  • Progress tracking: current/total/percentage + estimated time + cancellation support
  • GDPR Article 20: Download My Data button, JSON + ZIP, all user data, under 24 hours
  • Signed URLs expire after 24 hours — export files are not permanently accessible
  • Support CSV, JSON, and XLSX formats — user chooses on export