Why R and Data Science Projects Need AI Rules
R occupies an unusual niche among programming languages: it is used primarily for statistical analysis, data visualization, and research rather than for building applications. That changes the nature of AI mistakes: instead of security vulnerabilities and performance bugs, you get irreproducible analyses, mixed coding dialects (base R vs tidyverse vs data.table), and scripts that only run on one person's machine.
The most common AI failures in R: generating base R when the project uses tidyverse (or vice versa), ignoring renv/packrat for dependency management, creating analysis scripts instead of reusable functions, using absolute file paths, hardcoding data transformations instead of making them parameterized, and generating plots with base graphics when the project uses ggplot2.
Data science has an additional challenge: reproducibility. An analysis that can't be reproduced by another analyst on another machine is scientifically useless. AI rules for R must enforce reproducibility patterns from the start.
Rule 1: Tidyverse vs Base R
The rule: 'This project uses tidyverse. Use dplyr for data manipulation (filter, mutate, summarize, group_by), tidyr for reshaping (pivot_longer, pivot_wider), ggplot2 for all visualizations, readr for file I/O, stringr for string operations, and purrr for functional programming. Never use base R equivalents: no subset(), no apply(), no paste0() for string operations (use stringr::str_c()), and no plot() for visualizations.'
For the pipe: 'Use the native pipe |> (R 4.1+) or magrittr pipe %>% — be consistent, pick one. Pipe data through transformation chains: data |> filter(age > 18) |> mutate(bmi = weight / height^2) |> summarize(mean_bmi = mean(bmi)). Every transformation should be a piped step, not a nested function call.'
If your project uses data.table instead: 'This project uses data.table for performance. Use data.table's bracket syntax: dt[i, j, by]. Use := for modification by reference. Use .SD for subset operations. Never convert to a tibble for tidyverse operations; keep data in data.table throughout.'
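A minimal sketch of those data.table idioms; the columns (group, weight, height) are illustrative:

```r
library(data.table)

dt <- data.table(
  id     = 1:4,
  group  = c("a", "a", "b", "b"),
  weight = c(70, 80, 65, 90),    # kg
  height = c(1.75, 1.80, 1.60, 1.85)  # m
)

# := adds the column by reference -- no copy of dt is made
dt[, bmi := weight / height^2]

# dt[i, j, by]: filter rows (i), compute (j), grouped (by)
dt[bmi > 24, .(mean_bmi = mean(bmi)), by = group]
```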
- dplyr for manipulation — filter, mutate, summarize, group_by, join
- tidyr for reshaping — pivot_longer, pivot_wider, unnest, separate
- ggplot2 for all plots — never base plot(), hist(), or barplot()
- readr for file I/O — read_csv over read.csv, write_csv over write.csv
- stringr for strings — str_detect, str_replace over grep, gsub
- Native pipe |> or %>% — be consistent, pick one for the project
R has three main dialects: tidyverse, data.table, and base R. Tell the AI which one: 'This project uses tidyverse.' Without this, the AI mixes all three — the worst possible outcome.
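As a sketch of what the consistent tidyverse style looks like in practice (the patients tibble and its columns are hypothetical):

```r
library(dplyr)

# Hypothetical cohort data
patients <- tibble(
  age    = c(25, 17, 42, 36),
  weight = c(70, 60, 85, 72),         # kg
  height = c(1.75, 1.68, 1.80, 1.65)  # m
)

# Every transformation is one piped step, never a nested call
patients |>
  filter(age > 18) |>
  mutate(bmi = weight / height^2) |>
  summarize(mean_bmi = mean(bmi))
```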
Rule 2: Reproducibility Patterns
The rule: 'Use renv for dependency management — every project has a lockfile (renv.lock) that pins exact package versions. Never use install.packages() in analysis scripts — packages are installed via renv::restore(). Use relative paths for all file references — never absolute paths like /Users/alice/data/. Use here::here() for project-root-relative paths.'
For random processes: 'Set seeds before any random operation: set.seed(42). Document the seed in the analysis. Use withr::with_seed() for scoped random state. The same script with the same data and the same seed must produce identical results on any machine.'
For environment: 'Document the R version in the project. Use renv::snapshot() after adding packages. Include renv.lock in version control. Use Docker with rocker images for full environment reproducibility. Never rely on packages installed system-wide — the project's renv library is self-contained.'
Absolute paths like /Users/alice/data/ mean the analysis only runs on Alice's machine. Use here::here() for project-relative paths and renv for dependencies. Reproducibility is non-negotiable in data science.
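A small sketch of the path and seed patterns; the data/trial.csv location is illustrative:

```r
library(here)  # here() builds paths from the project root

# Root-anchored relative path: resolves on any machine that clones the repo
input_path <- here("data", "trial.csv")

# Seeded randomness: set the seed before any random step and document it
set.seed(42)
first_run <- rnorm(5)

set.seed(42)
second_run <- rnorm(5)

identical(first_run, second_run)  # TRUE: same seed, same draws
```

For a tighter scope, withr::with_seed(42, rnorm(5)) runs the expression under the seed without disturbing the global random state.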
Rule 3: R Package Structure for Analysis
The rule: 'For reusable analysis code, use the R package structure even if you're not publishing to CRAN. Put functions in R/ directory, tests in tests/testthat/, data in data/ or data-raw/. Use roxygen2 for function documentation. Use devtools::load_all() during development. This gives you: namespaces, dependency management, testing infrastructure, and documentation — for free.'
For analysis scripts: 'Keep analysis scripts in analysis/ or vignettes/. Scripts call functions from the package — they don't contain function definitions. Each script produces a specific output (a table, a plot, a model). Scripts are parameterized — use command-line arguments or YAML config, not hardcoded values.'
Left to themselves, AI assistants create monolithic scripts with everything inline: data loading, transformation, analysis, and visualization in one 500-line file. The package structure forces separation of reusable logic (package) from specific analysis (scripts).
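A sketch of what a package-resident function looks like; the function name and columns are illustrative:

```r
# R/clean_trial.R -- reusable logic lives in the package's R/ directory

#' Add a BMI column to raw trial data
#'
#' @param raw A data frame with `weight` (kg) and `height` (m) columns.
#' @return `raw` with a numeric `bmi` column added.
#' @export
clean_trial <- function(raw) {
  # Validate inputs early instead of failing deep in a pipeline
  stopifnot(all(c("weight", "height") %in% names(raw)))
  raw$bmi <- raw$weight / raw$height^2
  raw
}
```

An analysis script then stays thin: it runs devtools::load_all() (or library() on the installed package) and calls clean_trial() on freshly read data, keeping the function definition out of the script.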
Rule 4: Quarto and Literate Programming
The rule: 'Use Quarto (.qmd) for reports that combine code, results, and narrative. Label every code chunk explicitly, using Quarto's option syntax: #| label: load-data. Set chunk options for reproducibility: #| cache: true for expensive operations, #| echo: false for production reports, #| message: false and #| warning: false for clean output. Use cross-references for figures and tables.'
For figures: 'Every figure has: a descriptive label (fig-survival-curve), a caption, alt text for accessibility, and consistent dimensions (set fig-width and fig-height). Save publication-quality figures with ggsave() at 300 DPI minimum. Use a consistent theme: theme_minimal() or a custom project theme applied to all plots.'
For tables: 'Use gt or kableExtra for formatted tables. Never print raw data frames in reports. Tables have captions, column labels, and appropriate formatting (rounded numbers, comma separators, percentage symbols).'
- Quarto (.qmd) for all reports — not raw R scripts
- Labeled code chunks: #| label: load-data — never unnamed chunks
- #| cache: true for expensive operations — reproducible re-runs
- ggplot2 theme consistency — custom project theme for all plots
- gt/kableExtra for tables — never raw print(df) in reports
- ggsave at 300 DPI for publication — consistent fig-width/fig-height
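As a sketch, a minimal .qmd skeleton wiring these conventions together (the title, chunk contents, and plot_survival() helper are hypothetical):

````markdown
---
title: "Survival analysis"
format: html
---

```{r}
#| label: fig-survival-curve
#| fig-cap: "Kaplan-Meier survival curve by treatment arm."
#| cache: true
#| echo: false
#| fig-width: 7
#| fig-height: 4

# plot_survival() is an illustrative project function returning a ggplot
plot_survival(trial)
```

As @fig-survival-curve shows, the treatment arms diverge after week 12.
````

The fig- prefix on the label is what makes the @fig-survival-curve cross-reference resolve.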
Rule 5: Tidy Data Principles
The rule: 'All data transformations produce tidy data: each variable is a column, each observation is a row, each type of observational unit is a table. Use pivot_longer to convert wide data to long. Use pivot_wider for the reverse. Join data with left_join, inner_join — never merge() from base R. Document data transformations as a pipeline, not scattered mutations.'
For data validation: 'Validate data at the beginning of every analysis: check for expected columns, expected types, reasonable value ranges, and missing data patterns. Use assertr or pointblank for programmatic data validation. Log validation results — don't silently drop rows.'
For naming: 'Column names use snake_case: patient_id, measurement_date, body_mass_index. Never use spaces or special characters in column names. Use clean_names() from janitor to standardize imported data. Consistent naming across all datasets in the project.'
Every variable a column, every observation a row. If your data isn't tidy, tidy it first with pivot_longer/pivot_wider. All downstream analysis — plots, models, summaries — becomes dramatically simpler.
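A minimal sketch of the wide-to-long reshape; the patient and visit columns are illustrative:

```r
library(dplyr)
library(tidyr)

# Wide layout: one column per visit
wide <- tibble(
  patient_id = c("p1", "p2"),
  visit_1    = c(5.1, 6.0),
  visit_2    = c(4.8, 5.7)
)

# Tidy layout: one row per patient-visit observation
long <- wide |>
  pivot_longer(
    cols      = starts_with("visit_"),
    names_to  = "visit",
    values_to = "measurement"
  )
```

From here, group_by(visit) summaries or a faceted ggplot work on the same long table without further restructuring.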
Complete R/Data Science Rules Template
Consolidated rules for R data science projects.
- Tidyverse: dplyr, tidyr, ggplot2, readr, stringr, purrr — or data.table if chosen
- Native pipe |> for all transformation chains — consistent throughout project
- renv for dependency management — renv.lock in version control — relative paths only
- set.seed() before random operations — same data + same seed = same results
- R package structure for reusable code — analysis scripts call package functions
- Quarto for reports — labeled chunks, cached operations, cross-references
- Tidy data: each variable a column, each observation a row — pivot for reshaping
- lintr + styler for code style — testthat for testing — roxygen2 for documentation