Why MongoDB Needs Document-Oriented Rules
MongoDB is a document database — data is stored as flexible, JSON-like BSON documents, not relational rows. AI assistants trained mostly on SQL databases apply relational patterns to MongoDB: normalizing data across multiple collections, using references everywhere instead of embedding, creating join-like queries with $lookup, and ignoring document size limits. The result is a MongoDB database that performs like a slow relational database without the relational guarantees.
The most common AI failures: creating a collection per 'table' (Users, Orders, OrderItems, Addresses — all separate, all requiring lookups), ignoring embedding (the entire purpose of document databases), no indexes on query fields, using .find() loops instead of aggregation pipelines, and defining Mongoose schemas without TypeScript types.
These rules target Mongoose 8+ with TypeScript. They cover document design, Mongoose schema patterns, indexing, and the aggregation framework.
Rule 1: Embed Data That's Accessed Together
The rule: 'Embed data that is always read/written together. An Order document contains its OrderItems as a subdocument array — not a separate collection with references. A User document contains their Address as an embedded object. The question is: "When I read X, do I always need Y?" If yes, embed Y in X. If Y is accessed independently or shared across documents, reference it.'
For embedding vs referencing: 'Embed when: data is always accessed with the parent, data doesn't change independently, the embedded array is bounded (order items, addresses — not unbounded like all comments ever). Reference when: data is shared across documents (categories, tags), data grows unboundedly (millions of log entries), or data is accessed independently (user profile separate from user settings).'
AI normalizes everything because SQL training data says normalization is good. In MongoDB, normalization means multiple queries for data that could be one read. The 16MB document size limit is the constraint — not normalization principles.
- Embed: data accessed together, bounded arrays, owned by parent
- Reference: shared data, unbounded growth, independently accessed
- Order + OrderItems: embed — always read together, bounded per order
- User + Comments: reference — unbounded, accessed independently
- 16MB document limit — embed until you approach it, then restructure
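A minimal sketch of what the two shapes look like in TypeScript — the interface and field names here are illustrative, not from any particular codebase:

```typescript
// Hypothetical shapes illustrating the embed-vs-reference decision.
interface OrderItem {
  sku: string;
  quantity: number;
  price: number;
}

// Embed: items are bounded, owned by the order, always read with it.
interface Order {
  _id: string;
  userId: string; // reference — users are shared and accessed independently
  items: OrderItem[];
  total: number;
}

// Reference: comments grow without bound and are queried on their own,
// so they live in their own collection and point back at the user.
interface Comment {
  _id: string;
  userId: string;
  body: string;
}

const order: Order = {
  _id: "o1",
  userId: "u1",
  items: [
    { sku: "A-100", quantity: 2, price: 9.99 },
    { sku: "B-200", quantity: 1, price: 24.5 },
  ],
  total: 44.48, // 2 * 9.99 + 24.50
};

const comments: Comment[] = [{ _id: "c1", userId: "u1", body: "hello" }];
```

Reading `order` returns the user's items in one round trip; reading all of a user's comments is a separate query against its own collection.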
Rule 2: Typed Mongoose Schemas
The rule: 'Define Mongoose schemas with TypeScript interfaces: interface IUser { name: string; email: string; orders: Types.ObjectId[]; createdAt: Date; }. Create schema: const userSchema = new Schema<IUser>({ name: { type: String, required: true }, email: { type: String, required: true, unique: true }, orders: [{ type: Schema.Types.ObjectId, ref: "Order" }], createdAt: { type: Date, default: Date.now } }). Export model: export const User = model<IUser>("User", userSchema).'
For validation: 'Use Mongoose built-in validators: required, min, max, minlength, maxlength, enum, match (regex). Use custom validators for complex rules: validate: { validator: (v) => /\S+@\S+/.test(v), message: "Invalid email" }. Validate at the schema level — not in the route handler. Mongoose validation runs automatically on save() and create().'
For virtuals and methods: 'Use virtuals for computed properties: userSchema.virtual("fullName").get(function() { return `${this.firstName} ${this.lastName}` }). Use methods for instance operations: userSchema.methods.comparePassword = async function(password) { ... }. Use statics for collection operations: userSchema.statics.findByEmail = function(email) { return this.findOne({ email }) }.'
Rule 3: Indexing Strategy
The rule: 'Create indexes for every field used in query filters, sort, and lookup: userSchema.index({ email: 1 }, { unique: true }). Create compound indexes for queries that filter on multiple fields: orderSchema.index({ userId: 1, createdAt: -1 }). Field order matters — put equality-filtered fields first, then sort fields, then range fields (the ESR rule), rather than simply the most selective field first. Use text indexes for search: productSchema.index({ name: "text", description: "text" }).'
For compound indexes: 'A compound index on { userId: 1, status: 1, createdAt: -1 } supports queries on: userId alone, userId + status, and userId + status + createdAt. It does NOT efficiently support: status alone or createdAt alone (they're not the index prefix). Design compound indexes to match your query patterns.'
AI generates queries without indexes — every find() is a full collection scan. On a 100-document development database, it's unnoticeable. On a million-document production database, it's a timeout. Create indexes alongside the schema — not as an afterthought.
- Index every field in where/sort/lookup — schema.index()
- Compound indexes: match query patterns — equality fields, then sort, then range (ESR)
- Unique indexes: schema.index({ email: 1 }, { unique: true })
- Text indexes for search: schema.index({ name: 'text', description: 'text' })
- Use explain() to verify queries use indexes — db.collection.find().explain()
Rule 4: Aggregation Pipelines Over Application Logic
The rule: 'Use MongoDB aggregation pipelines for: grouping, counting, averaging, joining ($lookup), transforming, and filtering data at the database level. Never fetch all documents and process in JavaScript — the database is faster and uses less memory. Pipeline stages: $match (filter), $group (aggregate), $project (reshape), $sort, $limit, $lookup (join), $unwind (flatten arrays).'
For common patterns: 'Count by status: [{ $group: { _id: "$status", count: { $sum: 1 } } }]. Average order value: [{ $group: { _id: null, avg: { $avg: "$total" } } }]. Join with another collection: [{ $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "orders" } }]. Top N: [{ $sort: { score: -1 } }, { $limit: 10 }].'
AI generates find().then(results => results.filter().map().reduce()) — processing millions of documents in JavaScript. Aggregation pipelines push this work to the database: faster, less memory, and returns only the results you need.
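The "count by status" pattern above, written out as a pipeline — stages are plain objects, so the pipeline can be built and inspected without a connection (the Order model in the usage comment is hypothetical):

```typescript
// Count recent orders by status. $match runs first so every later
// stage touches fewer documents.
const countByStatus = [
  // Filter: only orders created this year (example cutoff date)
  { $match: { createdAt: { $gte: new Date("2024-01-01") } } },
  // Aggregate: one output document per distinct status value
  { $group: { _id: "$status", count: { $sum: 1 } } },
  // Largest groups first
  { $sort: { count: -1 } },
];

// Usage (hypothetical model):
// const rows = await Order.aggregate(countByStatus);
```

The database returns a handful of `{ _id, count }` documents instead of streaming every order into application memory.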
Rule 5: Mongoose-Specific Patterns
The rule: 'Use lean() for read-only queries: User.find({}).lean() returns plain objects instead of Mongoose documents — 3-5x faster, less memory. Use populate() for reference loading: Order.find({}).populate("user", "name email") — second argument selects fields. Use bulkWrite for batch operations: User.bulkWrite([{ insertOne: { document: user1 } }, { updateOne: { filter: { _id: id }, update: { $set: { active: false } } } }]).'
For transactions: 'Use MongoDB transactions for multi-document atomic operations: const session = await mongoose.startSession(); session.startTransaction(); try { await User.create([userData], { session }); await Account.create([accountData], { session }); await session.commitTransaction(); } catch { await session.abortTransaction(); } finally { session.endSession(); }. Transactions require a replica set (or sharded cluster) — not available on standalone MongoDB.'
For connection management: 'Connect once at startup: await mongoose.connect(uri). Set connection options: maxPoolSize, serverSelectionTimeoutMS, heartbeatFrequencyMS. Handle connection events: mongoose.connection.on("error"), on("disconnected"). Use mongoose.connection.readyState to check connection health in health endpoints.'
Complete Mongoose/MongoDB Rules Template
Consolidated rules for Mongoose and MongoDB projects.
- Embed data accessed together — reference for shared/unbounded/independent data
- Typed schemas: interface + Schema<IType> + model<IType> — validators on schema
- Index every query field — compound indexes match query patterns — explain() to verify
- Aggregation pipelines for grouping/counting/joining — never fetch-all + JS processing
- lean() for read-only — populate() for references — bulkWrite for batch operations
- Transactions with sessions for multi-document atomicity (requires replica set)
- Connect once at startup — maxPoolSize — handle error/disconnected events
- Virtuals for computed — methods for instance — statics for collection operations