The Scientific Computing Workflow

EPI XXX: Practical Computing for Population Health Research — Session 01

March 16, 2026

Learning Objectives

By the end of this lecture, you should be able to:

  1. Articulate why scientific computing workflows matter for transparent, reproducible research.
  2. Distinguish between “workflow” (personal habits) and “product” (the code, data, and outputs that others see).
  3. Set up a project-oriented directory structure for a research analysis.
  4. Use here::here() to construct file paths instead of setwd().
  5. Write a minimal reproducible example (reprex) to get help effectively.

Course Overview and Motivation

What this course is about

  • The gap: epi training teaches methods (regression, survival analysis, causal inference) but not the infrastructure that makes them reproducible, transparent, and efficient
  • You learn how to fit a Cox model, but not how to organize the project that contains it
  • This course fills that gap. Not a statistics course, not a methods course, not a coding course — it is about the infrastructure of quantitative research
  • Primary language: R + tidyverse, but principles apply to any language

Why “good enough” practices?

Best Practices (2014)

  • Wilson et al., PLOS Biology
  • Targets researchers who write substantial code
  • Aspirational but impractical for most of us

Good Enough (2017)

  • Wilson et al., PLOS Computational Biology
  • Scaled to an achievable bar for all researchers
  • Small changes → big improvements

This course focuses on the “good enough” practices and will sometimes expose you to the aspirational “best practices.”

Course structure

  • 10 content weeks + finals period for project presentations
  • 2 sessions/week (90 min each): one lecture, one lab
  • This week is an exception: two lectures, no lab
    • Session 01 (today): Scientific computing workflow
    • Session 02 (next meeting): Shell & file system fundamentals
    • Labs start Week 2
  • Final project: scaffolded assignments (Weeks 1–5), then independent project (Weeks 6–10)

The Research Computing Lifecycle

A project is more than a manuscript

Every manuscript integrates hundreds of decisions — which records to exclude, how to handle missing data, which model specification you settled on after trying alternatives.

These decisions are the intellectual core of the analysis. In many projects, they exist only in the analyst’s memory or uncommented scripts.

A research project includes the code, the data, the computational environment, and the documentation — the manuscript is one output among several.

The “works on my machine” problem

Scripts break when they move between computers. The script might depend on:

  • Absolute file paths that exist on one machine only
  • Undocumented package dependencies (installed but never documented)
  • Assumed working directory or OS-specific behavior
  • Manual data manipulation steps not captured in code

Environment vs. requirements

These are not exotic bugs. They are the regular consequences of conflating your personal computing environment with the project’s requirements.

Sandve et al. distill reproducibility to a handful of rules:

  1. Avoid manual data manipulation
  2. Record all intermediate results
  3. For every result, keep track of how it was produced

These rules sound obvious but are rarely applied.
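Rule 1 in practice: a fix that might be tempting to make by hand in a spreadsheet is instead captured in code, so the change is recorded and rerunnable (the county values below are invented for illustration):

```r
# A hand-edit in Excel would be invisible later; a coded fix is the record
# of how the result was produced (toy data, invented for illustration)
raw <- data.frame(
    county = c("Cook", "cook ", "DuPage"),
    deaths = c(1200, 15, 640)
)
clean <- raw
clean$county <- tolower(trimws(clean$county))  # documented, repeatable cleanup
```

Anyone rerunning the script gets the same cleaned data, and the diff shows exactly what changed and why.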

The research computing lifecycle

Figure: The research computing lifecycle.

What a well-organized project looks like

Well-organized

mortality_analysis/
├── mortality_analysis.Rproj
├── README.md
├── config.yml
├── renv.lock
├── .gitignore
├── code/
│   ├── 01_ingest_raw_data.R
│   ├── 02_create_analytic_data.R
│   ├── 03_fit_models.R
│   ├── 04_fig1_trends.R
│   └── utils.R
├── data/
├── data_raw/
├── data_private/
├── output/
├── plots/
├── qmd/
├── lit/
└── manuscript/

Disorganized

stuff/
├── analysis_FINAL.R
├── analysis_FINAL_v2.R
├── analysis_FINAL_v2_ACTUALLY_FINAL.R
├── data.csv
├── data2.csv
├── data_new.csv
├── fig1.png
├── Untitled.R
└── notes.docx


The first directory tells you what the project contains, how the code should be executed, and where to look for results. The second tells you almost nothing.

Project-Oriented Workflows

The setwd() anti-pattern

If the first line of your R script is setwd("C:\Users\jenny\..."), I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

Jenny Bryan

# BAD: This only works on one person's machine
setwd("/Users/matt/Dropbox/projects/mortality_analysis")
dat <- read.csv("data/raw_deaths.csv")
# Note: read.csv() is also not recommended — use readr::read_csv()
  • The absolute path doesn’t exist on anyone else’s computer
  • Creates an invisible dependency on your file system
  • The script looks self-contained but is not

The rm(list = ls()) myth

# BAD: This does NOT give you a clean slate.
rm(list = ls())

rm(list = ls()) clears objects but does NOT:

  • Unload packages (library() calls persist)
  • Reset options() you changed
  • Close database connections
  • Reset the working directory
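A quick way to see this for yourself, safe to run in a throwaway session:

```r
# Objects disappear, but session state survives rm(list = ls())
options(digits = 3)   # change an option
x <- 1                # create an object
rm(list = ls())
exists("x")            # the object is gone
getOption("digits")    # the changed option persists
```

The option (and any loaded packages, open connections, and so on) is still in effect, even though the workspace looks empty.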

The only reliable clean slate — restart R

  • Windows/Linux: Ctrl+Shift+F10
  • macOS: Cmd+Shift+F10

RStudio setting

Go to Tools → Global Options → General: uncheck “Restore .RData into workspace at startup” and set “Save workspace to .RData on exit” to Never. This ensures every R session starts clean.

RStudio Projects and .Rproj files

.Rproj files set the working directory for you. Open one, and RStudio:

  • Sets the working directory to that folder
  • Launches a fresh R session
  • Loads any project-specific settings

No hard-coded paths needed.

To create one: File → New Project → New Directory (or Existing Directory). The .Rproj file sits at the root of your project directory and acts as an anchor.

Tip

Open RStudio, create a new project, and confirm here::here() returns the project root. Follow along.

The here package

library(here)

# Finds the project root automatically
here::here()
#> [1] "/Users/matt/projects/mortality_analysis"

# Build paths relative to the project root
dat <- readr::read_csv(here::here("data_raw", "raw_deaths.csv"))

# Same pattern for saving
readr::write_csv(result_df, here::here("data", "cleaned_deaths.csv"))

How it works — walks up the directory tree to find an anchor file (.Rproj, .here, .git, among others) and builds paths from there.

Every file path in your scripts should use here::here(). Every project should have an .Rproj file.

Workflow versus product

Your workflow is personal and ephemeral — which text editor you use, how you organize your desktop, where on your hard drive you keep your projects.

Your product is what you share with the world — the R scripts, the data, the README, the manuscript.

Your product should not depend on your workflow. If your script requires knowledge of your personal file system layout to run, you have embedded your workflow into your product.

Canonical directory structure

A well-organized project separates files by function:

Directory      Purpose
-------------  ------------------------------------------
code/          R scripts, numbered for execution order
data_raw/      Raw input data — never modified
data/          Processed, shareable intermediate datasets
data_private/  Restricted data under DUA (gitignored)
output/        Tables, model objects, logs
plots/         Publication-ready figures (PDF, PNG)
qmd/           Quarto / R Markdown documents
lit/           Reference PDFs (gitignored)
manuscript/    Manuscript drafts
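A sketch of creating this skeleton from the console (the directory names are the ones in the table; `root` is wherever your project lives):

```r
# Create the standard project skeleton in one pass
root <- "."  # assume we are already at the project root (e.g., via the .Rproj file)
dirs <- c("code", "data_raw", "data", "data_private",
          "output", "plots", "qmd", "lit", "manuscript")
invisible(lapply(file.path(root, dirs), dir.create, showWarnings = FALSE))
```

Running it twice is harmless: `showWarnings = FALSE` silences the "already exists" warnings.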

One script, one job

Each script should do exactly one thing. Read inputs, do the work, save outputs.

  • Scripts communicate through files on disk — not objects lingering in the global environment
  • If 02_clean_data.R produces data/clean_deaths.RDS, then 03_fit_models.R reads that file
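The handoff can be sketched in a few lines (a temp file stands in for data/clean_deaths.RDS so the snippet runs anywhere):

```r
# End of 02_clean_data.R: write the intermediate dataset to disk
path <- tempfile(fileext = ".RDS")  # stands in for here::here("data", "clean_deaths.RDS")
clean_df <- data.frame(year = 1999:2001, deaths = c(10, 12, 11))
saveRDS(clean_df, path)

# Start of 03_fit_models.R: read it back — no shared global environment needed
clean_df <- readRDS(path)
```

The file on disk is the interface between the two scripts; either one can be rerun independently.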

Four benefits:

  1. Code review — a 60-line script can be reviewed, verified, and closed
  2. Reuse of intermediates — one cleaning step feeds multiple downstream analyses
  3. Skipping expensive steps — restart from any checkpoint
  4. Pipeline-ready — each script is already a node in a dependency graph (targets, Session 19)

A modular pipeline

Figure: A modular pipeline.

Choosing descriptive slugs

The number prefix handles order. The slug handles content.

Vague slugs (avoid)

01_data.R
02_analysis.R
03_results.R
04_figure.R

Descriptive slugs (prefer)

01_download_mortality_data.R
02_clean_county_covariates.R
03_fit_apc_models.R
04_fig_trends_by_state.R

Verb-noun pattern: the verb says what the script does; the noun says to what. The good slugs read like a pipeline summary — download, clean, model, plot.

What a real script looks like

## 01_ingest_raw_data.R ----
##
## Download raw NCHS mortality data from CDC WONDER and save
## as a compressed RDS file. Requires internet access.
## Input: CDC WONDER API
## Output: data_raw/raw_deaths_1999_2020.RDS

## Imports ----
library(tidyverse)
library(here)

## Constants ----
START_YEAR <- 1999
END_YEAR <- 2020

## Download ----
# ... (download code would go here)

## Save ----
saveRDS(raw_df, here::here("data_raw", "raw_deaths_1999_2020.RDS"),
        compress = "xz")

Conventions: header block with inputs/outputs, ## Section ---- markers for RStudio’s outline (Ctrl+Shift+O / Cmd+Shift+O), UPPER_SNAKE_CASE constants, here::here() paths.

Plain Text and Documentation

Why plain text matters

Durable

A .csv from 1995 is still readable today.

Try that with .sav (SPSS 12) or .xls (Excel 2003).

Portable

.R scripts run on macOS, Windows, Linux.

.csv files work in R, Python, Stata, SAS, Excel.

Version-controllable

Git tracks line-by-line changes in text files.

Binary files (.docx, .xlsx) are opaque to Git.

Your primary workflow should be plain text. Binary formats are not always wrong — but they should be outputs, not inputs.

The README: your project’s front door

Every project needs a README.md. It answers three questions: what does this project do, how do I run it, where is the data?

# Mortality Trends Analysis

## Overview
Analysis of US county-level mortality trends, 1999-2020.

## Requirements
- R >= 4.3.0
- See `renv.lock` for package dependencies

## Reproducing the Analysis
Run scripts in `code/` in numbered order:
1. `01_ingest_raw_data.R` — downloads and caches raw NCHS data
2. `02_create_analytic_data.R` — cleans and reshapes
3. `03_fit_models.R` — fits age-period-cohort models
4. `04_fig1_trends.R` — generates Figure 1

Twenty lines orient a reader completely. Update continuously as the project evolves.

A brief introduction to Quarto

  • What: open-source publishing system (by Posit) — Markdown + executable code → HTML, PDF, Word, slides
  • File extension: .qmd — plain text, version-controllable, reproducible
  • If you know R Markdown → Quarto is the next generation
  • These slides, the lecture notes, and your assignments are all written in Quarto
  • We introduce features progressively over the quarter
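A minimal .qmd document is just Markdown with an executable code chunk (this sketch is invented; any title and format work):

````markdown
---
title: "Mortality Trends"
format: html
---

Deaths per 100,000 declined over the study period.

```{r}
summary(c(8.1, 7.9, 7.5))
```
````

Rendering it (e.g., with `quarto render` or the Render button in RStudio) produces a standalone HTML file containing the prose, the code, and its output.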

Getting Help: How to Make a Reprex

The problem with “it doesn’t work”

Your code will break. When it does, a minimal reproducible example (reprex) is the fastest path to effective help.

The key word is reproducible. If the person helping you cannot reproduce the problem, they cannot diagnose it.

The reprex workflow: write minimal code, format with the reprex package, share.

The reprex package

library(dplyr)

df <- tibble(
    name = c("Alice", "Bob", NA),
    score = c(90, 85, 78)
)

# I expect this to drop the NA row — but it returns 0 rows!
df |>
    filter(name != NA)

Why? Comparisons with NA using == or != return NA rather than TRUE or FALSE, and filter() drops any row where the condition is NA, so every row is dropped.

Fix:

df |>
    filter(!is.na(name))
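Once the minimal example behaves (or fails) the way you want to show, the reprex package renders it for sharing; a sketch of typical interactive use:

```r
# Copy the minimal example to the clipboard, then render it with its output
# in paste-ready markdown (reprex() reads the clipboard by default)
reprex::reprex()
```

The rendered result, including any error messages, is placed back on the clipboard, ready to paste into a discussion board or GitHub issue.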

Where to ask for help

A good reprex gets useful answers almost anywhere:

  • Course discussion board — course-related questions
  • Posit Community — R and RStudio
  • Stack Overflow — general programming
  • GitHub Issues — package-specific bugs

The venue matters less than the quality of your question.

Wrap-Up

What’s next

Session 02: Your Computer and the Shell

Reading for Session 02:

  1. Healy K. Modern Plain Text Computing, Chs. 1–3. https://mptc.io/
  2. Bryan J (2015). “How to Name Files.” Slides.

Note

No lab this week. The hands-on project setup lab begins in Session 04 after we cover Git and GitHub.

Appendix

How here::here() resolves paths

Figure: Path resolution across machines.
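A simplified base-R sketch of the resolution algorithm (the real here package checks more anchor types and criteria; `find_root` is a name invented here):

```r
# Walk up from a starting directory until an anchor file is found
find_root <- function(path = getwd()) {
    repeat {
        has_rproj <- length(list.files(path, pattern = "\\.Rproj$")) > 0
        has_other <- any(file.exists(file.path(path, c(".here", ".git"))))
        if (has_rproj || has_other) return(path)
        parent <- dirname(path)
        if (parent == path) stop("no project anchor found above the start directory")
        path <- parent
    }
}
```

Because the search starts from the current directory and walks upward, any script inside the project finds the same root, no matter which machine the project lives on.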