AI Rules for File Upload Handling

7 min read · February 28, 2025

AI saves uploads to the local filesystem with the client filename — four vulnerabilities in one line

Object storage, safe filenames, MIME validation, size limits, signed URLs, and virus scanning

How AI Handles File Uploads (Every Pattern Is Insecure)

AI generates file upload code with four consistent vulnerabilities: storing files on the local filesystem (lost on server restart, not shared across instances), trusting the client filename (path traversal — ../../../etc/passwd), no MIME type validation (upload a .exe renamed to .jpg), and no file size limit (upload a 10GB file, crash the server). Each vulnerability is well-known and preventable with a single rule.

The correct pattern: upload to object storage (S3, R2, GCS) not the filesystem, generate a safe filename server-side (UUID — never use the client filename), validate MIME type by reading file headers (not by checking the extension), set a size limit in the upload middleware, and scan for viruses before making the file available.

These rules apply to any file upload scenario: user avatars, document uploads, image galleries, CSV imports, and file attachments. The patterns are the same regardless of the file type.

Rule 1: Object Storage, Not Filesystem

The rule: 'Upload files to object storage (AWS S3, Cloudflare R2, Google Cloud Storage, Vercel Blob) — never the local filesystem. Object storage provides persistence across deploys, shared access across server instances, CDN distribution, and virtually unlimited capacity. The local filesystem fails when the server restarts (files are lost), when you scale to multiple instances (each file exists on only one server), and when the disk fills up (the server crashes).'

For the upload flow: 'Client → server validates + generates key → server uploads to S3 → server stores the key in the database → client accesses via CDN URL or signed URL. Never let the client upload directly to your server filesystem. For large files, use presigned upload URLs: the server generates a signed S3 URL, the client uploads directly to S3, bypassing your server entirely.'

AI generates fs.writeFile(path.join(__dirname, 'uploads', file.name), file.data) — files saved to the server disk under the client-provided filename. That single line is a path traversal vulnerability, a disk-space bomb, data loss on the next restart, and a blocker for horizontal scaling. Object storage eliminates all four.

  • S3/R2/GCS/Vercel Blob — never local filesystem
  • Presigned URLs for large files — client uploads directly to S3, skips server
  • Store the object key in the database — not the file itself
  • CDN URL for public files — signed URL for private files with expiration
  • Local filesystem: lost on restart, not shared, fills disk — never in production
⚠️ Filesystem = Data Loss

Files on the local filesystem are: lost on restart/redeploy, not shared across instances, and fill the disk until it crashes. Object storage (S3/R2) is persistent, shared, CDN-distributed, and virtually unlimited.

Rule 2: Safe Filenames and MIME Validation

The rule: 'Never use the client-provided filename for storage. Generate a safe filename server-side: const key = `uploads/${crypto.randomUUID()}.${extension}`. Validate the MIME type by reading file headers (magic bytes) — not by checking the file extension. An .exe can be renamed to .jpg — the extension lies, the magic bytes do not. Use the file-type library (Node.js) or python-magic (Python) for header-based detection.'

For allowed types: 'Define an allowlist of accepted MIME types: const ALLOWED = ["image/jpeg", "image/png", "image/webp", "application/pdf"]. Reject any file whose detected MIME type is not in the allowlist. Never use a denylist (block .exe, .bat) — attackers find extensions you did not think of. An allowlist only permits what you explicitly support.'

For the original filename: 'Store the original filename as metadata in the database — for display to the user. Use the generated UUID key for storage and retrieval. This decouples the display name from the storage path: the user sees "quarterly-report.pdf", the storage sees "uploads/a1b2c3d4.pdf". Path traversal is impossible because the storage key is a UUID.'
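Magic-byte detection plus an allowlist fits in a few lines. The sketch below hand-rolls signatures for three formats to show the principle; in production, the file-type package the rule recommends covers far more formats and edge cases.

```javascript
// Detect MIME type from magic bytes (file headers), never the extension.
const SIGNATURES = [
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47] },
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
];

// Allowlist: only what you explicitly support, never a denylist.
const ALLOWED = new Set(["image/jpeg", "image/png", "application/pdf"]);

function detectMime(buffer) {
  const sig = SIGNATURES.find(({ bytes }) =>
    bytes.every((b, i) => buffer[i] === b));
  return sig ? sig.mime : null;
}

function validateUpload(buffer) {
  const mime = detectMime(buffer); // what the bytes say, not the filename
  if (!mime || !ALLOWED.has(mime)) {
    throw new Error("file type not allowed");
  }
  return mime;
}
```

A renamed executable fails here regardless of its extension: its bytes start with the PE header, not a JPEG or PDF signature.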

💡 UUID, Not Client Filename

Client filename '../../../etc/passwd' is a path traversal attack. UUID filename 'a1b2c3d4.pdf' is safe by construction. Store the original name as display metadata in the database — never use it for storage.

Rule 3: File Size Limits

The rule: 'Set file size limits at every layer: web server (nginx: client_max_body_size 10m), application middleware (multer: limits: { fileSize: 10 * 1024 * 1024 }), and object storage (S3 bucket policy). Reject oversized files before reading the entire body — stream and abort when the limit is exceeded. Never read the entire upload into memory before checking size — a 10GB upload will OOM your server.'

For streaming: 'Process uploads as streams, not buffers. Multer (Node.js) streams to disk or S3 — never store the entire file in memory. For S3: use multipart upload for files >5MB — S3 handles chunking. For presigned URLs: set Content-Length-Range in the presigned URL policy to enforce size on the client-to-S3 upload.'

For limits by type: 'Avatar images: 5MB max. Document uploads: 25MB max. Video: 500MB max (use presigned URLs — never through your server). CSV imports: 50MB max. Set realistic limits per endpoint — a single global limit is either too generous (allows abuse) or too restrictive (blocks legitimate large uploads).'

  • Limits at every layer: nginx, middleware, S3 — defense in depth
  • Stream uploads — never buffer entire file in memory
  • Multipart upload for >5MB — S3 handles chunking
  • Presigned URLs with Content-Length-Range for client-direct uploads
  • Per-endpoint limits: avatar 5MB, document 25MB, video 500MB

Rule 4: Signed URLs for Private Files

The rule: 'Use signed URLs for private file access. With AWS SDK v3: const url = await getSignedUrl(s3, new GetObjectCommand({ Bucket, Key }), { expiresIn: 3600 }) — getSignedUrl comes from @aws-sdk/s3-request-presigner. The URL is valid for 1 hour — after that, it returns 403. Never make private files publicly accessible through a guessable URL. Signed URLs provide: time-limited access, no auth header needed (the signature is in the URL), and auditable access (the server generates the URL, so you know who requested it).'

For public files: 'Make the S3 bucket (or a specific prefix) public-read only for genuinely public assets: avatars are public (anyone can view), documents are private (signed URL required). Use CloudFront or a CDN for public files — cache at the edge for fast delivery. Use signed URLs for private files — each access goes through your auth check before the URL is generated.'

AI generates public URLs for all uploads — including private documents, financial records, and user data. One missing auth check = every uploaded file is publicly accessible to anyone who guesses the URL. Signed URLs enforce auth at the access point, not just the upload point.

Rule 5: Virus Scanning and Post-Upload Processing

The rule: 'Scan uploaded files for malware before making them available to other users. Use ClamAV (open source) or a cloud scanning service (AWS GuardDuty, Cloudflare). Upload flow: receive file → store in quarantine bucket → scan → if clean, move to public bucket → if infected, delete and notify. Never serve unscanned user uploads directly — one infected PDF affects every user who downloads it.'

For image processing: 'Process images after upload: resize to standard dimensions, strip EXIF metadata (contains GPS coordinates, camera info), convert to WebP for web delivery, and generate thumbnails. Use sharp (Node.js), Pillow (Python), or a CDN with image transformation (Cloudflare Images, ImageKit). Never serve user-uploaded images at original size — they can be 20MB+.'

For the quarantine pattern: 'Uploads go to a quarantine bucket/prefix first. A background job (Lambda, queue worker) scans and processes. If clean, move to the serving bucket. If infected, delete and log. The user sees a processing state until the file is cleared. Never skip quarantine for any file type — even images can carry malware.'

  • Scan before serving: ClamAV or cloud scanner — quarantine → scan → serve
  • Strip EXIF from images — GPS coordinates, camera info are PII
  • Resize images: standard dimensions, WebP format, thumbnails
  • Quarantine bucket → process → clean → serving bucket — never direct serving
  • Background processing: scan + resize + convert in a queue worker, not in the request
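The quarantine flow reduces to a small state machine. The sketch below injects the scanner and the two buckets so the flow itself is testable; in production the buckets would be S3 prefixes and scanFile would call ClamAV or a cloud scanner. All names here are illustrative.

```javascript
// Quarantine flow: scan, then either promote to the serving bucket
// or delete. Runs in a background worker, never in the request handler.
async function processQuarantinedUpload(key, { quarantine, serving, scanFile, log }) {
  const file = quarantine.get(key);
  const verdict = await scanFile(file); // "clean" | "infected"
  if (verdict === "clean") {
    serving.set(key, file);             // promote to serving bucket
    quarantine.delete(key);
    return "available";
  }
  quarantine.delete(key);               // infected: delete, never serve
  log(`infected upload deleted: ${key}`);
  return "rejected";
}
```

Until the worker returns "available", the UI shows the file as processing; an infected file never reaches the serving bucket at all.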
ℹ️ Quarantine First

Upload → quarantine bucket → virus scan → if clean, move to serving bucket. Never serve user uploads directly — one infected PDF affects every downloader. Scanning takes seconds in a background job. Worth every millisecond.

Complete File Upload Rules Template

Consolidated rules for file upload handling.

  • Object storage (S3/R2/GCS) — never local filesystem — presigned URLs for large files
  • UUID filenames server-side — never client filename — store original as display metadata
  • MIME validation by magic bytes — allowlist of accepted types — never trust extension
  • Size limits at every layer: nginx, middleware, S3 — stream, never buffer in memory
  • Signed URLs for private files — public bucket only for public assets (avatars)
  • Virus scanning: quarantine → scan → serve — never serve unscanned uploads
  • Image processing: resize, strip EXIF, WebP convert — never serve at original size
  • Background processing in queue workers — not in the upload request handler