Skip to content

ADR-0004: Per-source raw DuckDB catalogs

Status: Accepted · 2026-03

Context

dlt writes ingested records into a landing zone. Early versions used a single catalog (data/databox.duckdb) with per-source schemas (raw_ebird.*, raw_noaa.*, raw_usgs.*). This worked for correctness but serialized all loads:

  • DuckDB holds a file-level lock for writes. Concurrent pipelines blocked each other.
  • A failed load partially touched the same file that downstream transforms were about to read.
  • The transform-side DuckDB file grew large, and checkpoint/rollback semantics interacted awkwardly with dlt's own staging commit.

Decision

Split raw storage into per-source DuckDB files:

data/raw_ebird.duckdb
data/raw_noaa.duckdb
data/raw_usgs.duckdb
data/databox.duckdb   # transform output catalog only

settings.raw_<source>_path returns the source-specific path for both local and MotherDuck backends. SQLMesh staging models ATTACH each raw file read-only as needed.

Consequences

Positive: - dlt pipelines run in parallel without lock contention. The full refresh time drops substantially when multiple sources ingest together. - A bad ingest for one source cannot corrupt another source's raw data. - Each source's raw file can be deleted/rebuilt independently during debugging. - The transform catalog stays small — it holds only modeled data.

Negative: - Multiple files to back up / sync to MotherDuck instead of one. - SQL that crosses raw sources must ATTACH all relevant files, which is slightly more verbose than referencing shared schemas. - list_databases / list_tables tooling needs to know where to look, not just open one file.

Neutral: - On MotherDuck, "files" become "databases" in the same account, so the per-source split carries over naturally without the filesystem-level lock concern (MotherDuck handles that internally).