Incident runbook¶
Four scenarios a solo operator is most likely to hit on this stack. Each section follows the same shape: symptoms → diagnosis → recovery → validation → rollback. Cross-references link to the deeper docs where relevant.
Pair with docs/environments.md (dev → prod loop) and docs/freshness.md (post-recovery SLA resume).
1. Blown DuckDB file (local backend)¶
Symptoms¶
duckdb.IOException: Could not read from fileon any Dagster run.dagster asset materializeexits non-zero immediately; traceback points atdatabox.config.settings.database_path.sqlmesh planfails withCatalog Error: Table ... does not existeven though the model code is unchanged.df -hnear 100% on the volume holdingdata/.
Diagnosis¶
# File readable at all?
uv run python -c "import duckdb; duckdb.connect('data/databox.duckdb', read_only=True).execute('SELECT 1').fetchall()"
# Disk space?
df -h data/
# SQLMesh state vs physical tables agree?
cd transforms/main && uv run sqlmesh audit
If the connect fails, the file is corrupt. If audit reports "table missing" but the file opens, SQLMesh state desynced.
Recovery¶
Path A — rebuild from raw dlt state (preferred; most scenarios).
# 1. Back up the broken file for forensics.
mv data/databox.duckdb data/databox.duckdb.broken
# 2. Drop SQLMesh state so it does not look for phantom physical tables.
rm -rf transforms/main/.sqlmesh transforms/main/state_sync.db
# 3. Rehydrate raw catalogs + rebuild every mart.
task full-refresh
dlt reloads raw_ebird / raw_noaa / raw_usgs from the APIs
(idempotent — primary keys are declared). SQLMesh then rebuilds
every staging view and mart against the fresh raw data.
Path B — restore from a file backup (if you keep one).
mv data/databox.duckdb data/databox.duckdb.broken
cp /path/to/backup/databox.duckdb data/databox.duckdb
cd transforms/main && uv run sqlmesh plan prod --auto-apply
The plan prod call reconciles SQLMesh state against whatever
table versions are in the restored file.
Validation¶
task verify # smoke full-refresh with DATABOX_SMOKE=1
task verify:dev # Soda contracts against __dev
cd transforms/main && uv run sqlmesh audit
All three must exit zero. Open the Dagster UI and confirm the
freshness checks for fct_daily_bird_observations and
fct_daily_weather read "ok".
Rollback¶
If full-refresh itself fails:
mv data/databox.duckdb data/databox.duckdb.rebuild-failed
mv data/databox.duckdb.broken data/databox.duckdb
You are back where you started. Investigate the failure with
uv run dagster asset materialize --select <failing_asset> and
the run logs in .dagster/.
2. Partial backfill of a single source¶
Symptoms¶
- One source had a bad ingest window (API returned 500s, upstream corrected a prior day, your code had a bug for one run).
- Mart values for those dates are wrong; every other source is fine.
Diagnosis¶
# Identify the date range.
uv run duckdb data/databox.duckdb "SELECT day_date, COUNT(*)
FROM ebird.fct_daily_bird_observations
WHERE day_date BETWEEN DATE '2026-04-18' AND DATE '2026-04-20'
GROUP BY 1 ORDER BY 1"
If row counts are zero / absurdly low / spiking, you have a window to replace.
Recovery¶
# 1. Drop the affected raw window so dlt reloads it.
uv run duckdb data/databox.duckdb "DELETE FROM raw_ebird.observations
WHERE observation_date BETWEEN DATE '2026-04-18' AND DATE '2026-04-20'"
# 2. Rerun the dlt source for that window (set DATABOX_BACKFILL_START/END
# if the source honours them; otherwise bounded full reload is fine).
uv run dagster asset materialize \
--select 'ebird_dlt_assets' \
-f packages/databox/databox/orchestration/definitions.py
# 3. Restate the affected SQLMesh models for the window only.
cd transforms/main
uv run sqlmesh plan prod \
--start 2026-04-18 \
--end 2026-04-20 \
--restate-model 'ebird.fct_daily_bird_observations' \
--restate-model 'analytics.fct_species_environment_daily' \
--auto-apply
SQLMesh recomputes only the affected partitions of the restated models. Other sources and other dates are untouched.
Validation¶
# Row counts in the restated window should now match expectations.
uv run duckdb data/databox.duckdb "SELECT day_date, COUNT(*)
FROM ebird.fct_daily_bird_observations
WHERE day_date BETWEEN DATE '2026-04-18' AND DATE '2026-04-20'
GROUP BY 1 ORDER BY 1"
# Re-run Soda contracts against prod.
uv run dagster asset-check execute \
--select '*' \
-f packages/databox/databox/orchestration/definitions.py
Every Soda check for ebird.* and analytics.* must pass.
Rollback¶
sqlmesh plan history is your undo. List prior plan IDs and
restore:
cd transforms/main
uv run sqlmesh state list # find the previous plan ID
uv run sqlmesh plan prod --restore-from <previous-plan-id> --auto-apply
See docs/environments.md#escape-hatches.
3. MotherDuck point-in-time recovery¶
Symptoms¶
- Bad write landed on prod MotherDuck (mass delete, wrong restate window, accidental DROP). You want the database back to a known-good moment.
analytics.*rows are missing or mutated where they should be stable.
Diagnosis¶
-- Run in MotherDuck UI or via uv run duckdb md:
SELECT snapshot_id, snapshot_name, created_at
FROM md_information_schema.database_snapshots
WHERE database_name = 'databox'
ORDER BY created_at DESC
LIMIT 20;
Snapshots are automatic (Business plan default 7d retention, up to 90d; Lite paid 1d; Lite free none — confirm your plan in the MotherDuck UI). Named snapshots persist until removed regardless of retention.
Recovery¶
Two options. Prefer Path A — it leaves the broken state inspectable.
Path A — clone to a new database, validate, then repoint.
-- 1. Clone a historical state to a new database.
CREATE DATABASE databox_restored FROM databox
(SNAPSHOT_TIME '2026-04-20 12:00:00');
-- 2. Spot-check row counts against your last-known-good date.
SELECT COUNT(*) FROM databox_restored.ebird.fct_daily_bird_observations
WHERE day_date = DATE '2026-04-19';
-- 3. Rename databases to swap. Keep the broken one for forensics.
ALTER DATABASE databox RENAME TO databox_broken;
ALTER DATABASE databox_restored RENAME TO databox;
Path B — overwrite in place (destructive; only after forensics).
ALTER DATABASE databox SET SNAPSHOT TO
(SNAPSHOT_TIME '2026-04-20 12:00:00');
Validation¶
# Point local tooling at the restored MotherDuck database.
export DATABOX_BACKEND=motherduck
# Smoke + contracts.
task verify
task verify:dev
Confirm Dagster freshness checks pass post-restore (no synthetic gap beyond the restore point).
Rollback¶
Path A is reversible because the broken DB is still named
databox_broken:
ALTER DATABASE databox RENAME TO databox_bad_restore;
ALTER DATABASE databox_broken RENAME TO databox;
Path B is reversible only if you cloned a forensic copy before running it. Always prefer Path A on prod data.
Retention gotcha¶
Named snapshots persist until you drop them. Anonymous
(auto) snapshots age out per snapshot_retention_days. If you
know a window is load-bearing, create a named snapshot before
any destructive operation:
CREATE SNAPSHOT 'pre_mart_rewrite_20260421' OF databox;
4. Paused-schedule resumption¶
Symptoms¶
- Dagster daemon was stopped for hours / days.
- Freshness checks red across every source.
Diagnosis¶
# Daemon status?
uv run dagster instance show
# When did each schedule last tick?
uv run dagster schedule list
If daemon is stopped or last-tick timestamps are stale, the schedule is paused.
Recovery¶
# 1. Start the daemon (dagster:dev does both UI + daemon for local use;
# use --workspace + `dagster-daemon run` in a prod deployment).
task dagster:dev # local
# or
uv run dagster-daemon run # prod
# 2. Do NOT rely on missed-tick catch-up — Dagster's default is to
# skip missed cron fires. Trigger an explicit backfill for the
# missed window instead.
uv run dagster job backfill \
--job all_pipelines \
--partition-set-name '*' \
--from 2026-04-18 --to 2026-04-20 \
-f packages/databox/databox/orchestration/definitions.py
Validation¶
# Every freshness check must return to green within its SLA window
# (see docs/freshness.md for the per-asset deadlines).
task verify
Open the Dagster UI → Assets → filter by last_materialized and
confirm every mart's latest materialization is within its SLA
timedelta.
Rollback¶
Nothing to undo — these operations are additive / idempotent. If a resumed run itself fails, treat it as a fresh incident and use scenario 2 (partial backfill) for the affected window.
Prevention appendix¶
Each scenario maps to the pre-commit / CI / runtime checks that would catch (or have caught) its underlying class of error.
| Scenario | Catching check | Where |
|---|---|---|
| Blown DuckDB file | check-added-large-files pre-commit |
blocks accidental commit of a 10 GB .duckdb |
| Blown DuckDB file (disk full) | df monitoring + ADR-0001 note on DuckDB as file |
gap — no alert today; see follow-up below |
| Partial backfill | schema-contract-ci (scripts/schema_gate.py) |
blocks contract changes that would silently drop rows |
| Partial backfill | Soda row_count + freshness checks |
asset-check failures block downstream materialization |
| MotherDuck PIT | Named snapshot before destructive ops | gap — no automation today; documented above |
| MotherDuck PIT | Soda contracts on every mart | fails fast if a restore produces wrong shape |
| Paused schedule | freshness_violation_sensor |
logs a warning line per missed SLA (docs/freshness.md) |
| Paused schedule | Dagster daemon health check | gap — no external heartbeat today |
Known gaps (follow-up tickets)¶
- Pre-destructive named snapshot: a wrapper around risky SQLMesh
restates that auto-creates
CREATE SNAPSHOTon MotherDuck would remove a common footgun. - Daemon heartbeat: today, a stopped daemon is only noticed when freshness checks fire. An external uptime ping against the Dagster webserver would catch it faster.
None of these are in scope for the scaffold-polish initiative — they are called out here so a forker knows what trust boundaries exist.