
AI Rules for Data Engineering Teams

Data engineering teams build pipelines, warehouses, and data platforms. AI rules must encode pipeline idempotency, schema evolution, data quality checks, and the conventions that keep data infrastructure reliable.

6 min read·July 5, 2025

Data pipeline bugs are silent — wrong numbers that look right. AI rules enforce idempotency and quality checks that catch failures early.

Pipeline idempotency, schema evolution, data quality checks, warehouse layers, dbt conventions, and data contracts

Data Pipelines: Where AI Rules Prevent Silent Failures

Data engineering code is uniquely dangerous: failures are often silent. A web application bug shows an error page. A data pipeline bug produces incorrect data that looks correct — wrong aggregations, missing records, stale joins — that feeds into dashboards, ML models, and business decisions without anyone noticing. AI rules for data engineering focus on preventing these silent failures through idempotency, validation, and observability.

The data engineering stack: orchestration (Airflow, Dagster, Prefect), transformation (dbt, Spark, SQL), storage (data warehouses like Snowflake, BigQuery, Redshift; data lakes on S3/GCS), and streaming (Kafka, Kinesis, Flink). AI rule: 'Detect the project's data stack from existing code and configuration. Generate code that follows the stack's conventions. dbt project: generate SQL models with the dbt pattern. Airflow project: generate DAGs with the Airflow pattern. Do not mix conventions across tools.'

The core data engineering AI rules: every pipeline must be idempotent (re-running produces the same result), every transformation must have data quality checks, every schema change must be backward-compatible, and every pipeline must have observability (row counts, freshness, latency).

Pipeline Idempotency and Reliability

Idempotency: running a pipeline twice with the same inputs produces the same output. This is critical because: pipelines fail mid-execution and must be retried, orchestrators may trigger duplicate runs, and data reprocessing (backfills) must not create duplicates. AI rule: 'Every pipeline step must be idempotent. For batch writes: use MERGE/UPSERT (not INSERT which creates duplicates). For partition-based loads: overwrite the entire partition (DELETE + INSERT or REPLACE). For streaming: use exactly-once semantics or idempotent writes with deduplication keys.'
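The upsert pattern can be sketched in a few lines. This is a minimal, illustrative demo using SQLite's `INSERT ... ON CONFLICT DO UPDATE` (the same syntax PostgreSQL uses); the table and column names are made up:

```python
import sqlite3

def upsert_daily_revenue(conn, rows):
    """Idempotent batch write: re-running with the same rows leaves
    the table unchanged, because conflicts update instead of insert."""
    conn.executemany(
        """
        INSERT INTO daily_revenue (day, amount)
        VALUES (?, ?)
        ON CONFLICT (day) DO UPDATE SET amount = excluded.amount
        """,
        rows,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, amount REAL)")

rows = [("2026-03-15", 100.0), ("2026-03-16", 250.0)]
upsert_daily_revenue(conn, rows)
upsert_daily_revenue(conn, rows)  # simulated retry: no duplicates

count = conn.execute("SELECT COUNT(*) FROM daily_revenue").fetchone()[0]
print(count)  # 2, not 4
```

A plain `INSERT` in the same scenario would leave four rows after the retry; the upsert keeps the result identical no matter how many times the step runs.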

Backfill capability: every pipeline must support reprocessing historical data. AI rule: 'Design pipelines with a date parameter. Running the pipeline for 2026-03-15 processes that day's data. Backfilling: run the pipeline for each date in the range. The pipeline's idempotency ensures backfills do not create duplicates. Never hardcode dates or assume the pipeline runs only once per day.'
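A date-parameterized pipeline with partition overwrite might look like the sketch below (table names and the `run_pipeline`/`backfill` helpers are hypothetical; SQLite stands in for the warehouse):

```python
import sqlite3
from datetime import date, timedelta

def run_pipeline(conn, ds: str):
    """Process one logical date. DELETE + INSERT overwrites the day's
    partition, so re-running any date is safe (idempotent)."""
    conn.execute("DELETE FROM events_clean WHERE day = ?", (ds,))
    conn.execute(
        "INSERT INTO events_clean SELECT day, user_id FROM events_raw WHERE day = ?",
        (ds,),
    )
    conn.commit()

def backfill(conn, start: date, end: date):
    """Backfill = run the same date-parameterized pipeline per day."""
    d = start
    while d <= end:
        run_pipeline(conn, d.isoformat())
        d += timedelta(days=1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events_raw (day TEXT, user_id INTEGER)")
conn.execute("CREATE TABLE events_clean (day TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO events_raw VALUES (?, ?)",
    [("2026-03-15", 1), ("2026-03-15", 2), ("2026-03-16", 3)],
)

backfill(conn, date(2026, 3, 15), date(2026, 3, 16))
backfill(conn, date(2026, 3, 15), date(2026, 3, 16))  # re-run: still 3 rows
n = conn.execute("SELECT COUNT(*) FROM events_clean").fetchone()[0]
```

Orchestrators like Airflow pass this date in for you (the logical date of the run); the key design decision is that the pipeline never computes "today" internally.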

Failure handling: when a pipeline step fails, the orchestrator should: retry with exponential backoff (for transient failures), alert the data team (for persistent failures), and not proceed to downstream steps (partial data is worse than no data). AI rule: 'Pipeline steps: retry 3 times with exponential backoff. On persistent failure: alert and mark the run as failed. Downstream steps: depend on upstream success. Never process partial data — it creates incorrect aggregations that are harder to fix than missing data.'
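The retry policy can be expressed as a small helper. This is a hand-rolled sketch, not an orchestrator API; real orchestrators (Airflow, Dagster) provide retries as task configuration, and the `sleep` parameter here exists only so the demo runs instantly:

```python
import time

def retry(step, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Retry a pipeline step with exponential backoff (1s, 2s, 4s, ...).
    Re-raises after the final attempt so the orchestrator can alert
    and downstream steps never run on partial data."""
    for attempt in range(attempts):
        try:
            return step()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# A step that fails twice (transient timeouts), then succeeds:
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

result = retry(flaky_step, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["n"])  # ok 3
```

Note the deliberate asymmetry: transient failures are absorbed by retries, but the final failure propagates so the run is marked failed rather than silently continuing.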

⚠️ INSERT Without Deduplication = Duplicate Data

A pipeline step that runs INSERT INTO target SELECT ... FROM source creates duplicates every time it is retried. Run it once: correct data. Run it twice (a retry after a timeout): every record is doubled. The fix: use MERGE (SQL Server, Snowflake), INSERT ... ON CONFLICT DO UPDATE (PostgreSQL), or DELETE the partition then INSERT (partition-based loads). The AI must never generate INSERT-only pipeline steps for data that could be reprocessed.

Schema Evolution and Data Quality

Schema evolution: data schemas change over time (new columns, type changes, renamed fields). Changes must be backward-compatible to avoid breaking downstream consumers. AI rule: 'Schema changes: additive only (add columns, not remove). New columns: nullable with defaults. Column removal: deprecate first (stop populating, notify consumers), remove after all consumers have migrated. Column type changes: add a new column with the new type, migrate consumers, then deprecate the old column. Never change a column type in place.'
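A quick demonstration of why additive changes are safe, again using SQLite as a stand-in warehouse (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

# A consumer query written BEFORE the schema change, with an
# explicit column list (never SELECT *):
legacy_query = "SELECT id, name FROM customers"

# Additive, backward-compatible change: the new column is nullable
# with a default, so existing rows and consumers are unaffected.
conn.execute("ALTER TABLE customers ADD COLUMN tier TEXT DEFAULT NULL")

row = conn.execute(legacy_query).fetchone()   # still works
tier = conn.execute("SELECT tier FROM customers WHERE id = 1").fetchone()[0]
print(row, tier)  # (1, 'Ada') None
```

Dropping or retyping `name` in place would break `legacy_query` immediately; adding `tier` breaks nothing, which is the whole point of additive-only evolution. This also shows why explicit column lists matter: a consumer using `SELECT *` would see its result shape change even under an additive migration.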

Data quality checks: every pipeline should validate its output. Checks: row count (did we process the expected number of records?), null rate (are critical columns unexpectedly null?), uniqueness (are primary keys unique?), referential integrity (do foreign keys reference existing records?), and freshness (is the data from the expected time period?). AI rule: 'Generate data quality checks after every transformation step. dbt: use dbt tests (unique, not_null, accepted_values, relationships). SQL: generate CHECK constraints and validation queries. Fail the pipeline if quality checks fail.'
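Outside of dbt, the same checks can be written as a plain validation step that fails the run. A minimal sketch (the `DataQualityError` type, function name, and `orders` table are made up for illustration):

```python
import sqlite3

class DataQualityError(Exception):
    pass

def run_quality_checks(conn, table, pk, min_rows, max_rows):
    """Fail the pipeline if output volume, key uniqueness, or
    null rate on the primary key is off."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if not (min_rows <= n <= max_rows):
        raise DataQualityError(f"{table}: row count {n} outside [{min_rows}, {max_rows}]")
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {pk} IS NULL"
    ).fetchone()[0]
    if nulls:
        raise DataQualityError(f"{table}: {nulls} NULLs in {pk}")
    dupes = conn.execute(
        f"SELECT COUNT(*) - COUNT(DISTINCT {pk}) FROM {table}"
    ).fetchone()[0]
    if dupes:
        raise DataQualityError(f"{table}: {dupes} duplicate values in {pk}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (2,)])

try:
    run_quality_checks(conn, "orders", "order_id", min_rows=1, max_rows=10)
    caught = False
except DataQualityError:
    caught = True  # duplicate order_id detected; pipeline fails loudly
```

Raising (rather than logging) is the point: a failed check stops downstream steps the same way a failed task does.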

Data contracts: agreements between data producers and consumers about schema, quality, and SLAs. AI rule: 'Define data contracts for shared datasets: schema (column names, types, nullability), quality (minimum row count, maximum null rate, uniqueness constraints), freshness (data available by what time), and ownership (who to contact when the contract is violated). Generate contract validation as part of the pipeline.'
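One way to make a contract executable is a small validation step run at the end of the producing pipeline. The contract format below is hypothetical (there is no single standard); the table, columns, and thresholds are illustrative, with SQLite's `PRAGMA table_info` standing in for the warehouse's information schema:

```python
import sqlite3

# Hypothetical contract: schema, quality, and (elsewhere) freshness
# and ownership would live in version control next to the pipeline.
CONTRACT = {
    "table": "dim_customers",
    "columns": {"customer_id": "INTEGER", "email": "TEXT"},
    "not_null": ["customer_id"],
    "min_rows": 1,
}

def validate_contract(conn, contract):
    """Return a list of contract violations (empty list = contract holds)."""
    violations = []
    # Actual schema: {column_name: declared_type}
    cols = {row[1]: row[2]
            for row in conn.execute(f"PRAGMA table_info({contract['table']})")}
    for name, ctype in contract["columns"].items():
        if cols.get(name) != ctype:
            violations.append(f"column {name}: expected {ctype}, got {cols.get(name)}")
    n = conn.execute(f"SELECT COUNT(*) FROM {contract['table']}").fetchone()[0]
    if n < contract["min_rows"]:
        violations.append(f"row count {n} < {contract['min_rows']}")
    for col in contract["not_null"]:
        nulls = conn.execute(
            f"SELECT COUNT(*) FROM {contract['table']} WHERE {col} IS NULL"
        ).fetchone()[0]
        if nulls:
            violations.append(f"{nulls} NULLs in {col}")
    return violations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (customer_id INTEGER, email TEXT)")
conn.execute("INSERT INTO dim_customers VALUES (1, 'a@example.com')")
violations = validate_contract(conn, CONTRACT)
print(violations)  # []
```

Returning a list of violations (instead of failing on the first one) gives the owning team the full picture in one alert.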

💡 Data Quality Checks Catch Silent Failures

A pipeline runs successfully (no errors) but produces wrong data: a join condition was incorrect, producing 10x more rows than expected. Without quality checks: the inflated data feeds into dashboards, showing revenue 10x higher than reality. Decisions are made on bad data. With a row count check (expected ~100K rows, got 1M): the pipeline fails immediately, the team investigates, and bad data never reaches consumers. Quality checks are the data engineering equivalent of unit tests.

Warehouse and Transformation Conventions

Warehouse layer conventions: raw (source data as-is, append-only), staging (cleaned and typed, one-to-one with source), intermediate (business logic transformations, joins), and mart (final business entities optimized for consumption). AI rule: 'Follow the warehouse layer convention. Raw: no transformations, preserve source schema. Staging: type casting, renaming to snake_case, deduplication. Intermediate: joins, business logic, derived columns. Mart: aggregations, denormalization for BI tools. Never skip layers — raw to mart transformations are unmaintainable.'

dbt conventions (if the project uses dbt): models organized by layer (models/staging/, models/intermediate/, models/marts/), sources defined in schema.yml, tests on every model, incremental models for large datasets, and documentation for every model and column. AI rule: 'dbt project: follow the dbt style guide. Staging models: stg_{source}__{entity}.sql. Mart models: named by business entity (customers, orders, revenue). Every model: at least one test (unique on primary key, not_null on required columns). Generate schema.yml documentation alongside SQL models.'
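Put together, these conventions produce a `schema.yml` like the sketch below. The source and model names are invented for illustration; the test syntax (`unique`, `not_null`, `relationships`) is dbt's built-in generic tests:

```yaml
version: 2

models:
  - name: stg_shop__orders        # staging model: stg_{source}__{entity}
    description: "One row per order, cleaned and typed from the shop source."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: customer_id
        description: "References stg_shop__customers."
        tests:
          - not_null
          - relationships:
              to: ref('stg_shop__customers')
              field: customer_id
```

Documentation and tests living in the same file as the model definition is what keeps them from drifting out of date.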

SQL style: the AI must follow the project's SQL conventions. AI rule: 'SQL style: lowercase keywords (select, from, where) or uppercase (SELECT, FROM, WHERE) — match the existing codebase. CTEs over subqueries for readability. Explicit column lists (never SELECT *). Column aliases with the AS keyword. Table aliases that are meaningful (orders o, not orders a). Comments on complex logic.'

ℹ️ Never Skip Warehouse Layers

A raw-to-mart query that joins 5 source tables, applies business logic, and aggregates in one step: unmaintainable. When the business logic changes: you rewrite the entire query. When a source schema changes: the impact is unclear. With layers: raw (unchanged source), staging (cleaned), intermediate (joined with business logic), mart (aggregated). When logic changes: update the intermediate layer. When a source changes: update the staging layer. Each layer has a single responsibility.

Data Engineering AI Rules Summary

Summary of AI rules for data engineering teams building pipelines and data platforms.

  • Idempotency: MERGE/UPSERT for batch, partition overwrite, deduplication keys for streaming
  • Backfill: date-parameterized pipelines. Reprocessing must not create duplicates
  • Failure handling: retry with backoff. Alert on persistent failure. Never process partial data
  • Schema evolution: additive only. New columns nullable. Deprecate before removing. Never change types in place
  • Data quality: row counts, null rates, uniqueness, referential integrity, freshness checks after every step
  • Data contracts: schema + quality + freshness + ownership. Validated in the pipeline
  • Warehouse layers: raw → staging → intermediate → mart. Never skip layers
  • dbt: follow style guide. Tests on every model. Documentation in schema.yml alongside SQL