ADR-0001: DuckDB as the primary warehouse¶

Status: Accepted · 2026-02 · reaffirmed 2026-07 for the local-only platform.

Context¶

Databox is a single-operator data platform. Three hard constraints shape every infrastructure choice:

Zero fixed cost. No always-on database instance, no managed warehouse contract, no per-seat BI licensing.
Zero-infra local mode. A new contributor (or a future me on a new laptop) should be productive in minutes, not after provisioning anything.
Analytical workload profile. The data is batch-ingested daily, modeled with SQL, and queried by a handful of dashboards. No OLTP, no high concurrency, no real-time writes.

Candidate stacks considered: Postgres + dbt, Snowflake (free tier), BigQuery (free tier), ClickHouse, DuckDB, Parquet-on-S3 + Athena.

Decision¶

Use DuckDB as the primary warehouse. All transformations run against DuckDB; all marts live in DuckDB files under data/.

Consequences¶

Positive: - Zero infra. The database is a .duckdb file on disk. - Analytical queries against the flagship mart finish in tens of milliseconds locally. No tuning, no clustering keys. - Columnar storage + vectorized execution make this genuinely fast for the workload, not just "fine for small data". - First-class Parquet, JSON, and HTTP extensions mean dlt and downstream consumers can read/write a wide variety of formats without glue code. - Open source (MIT). No vendor lock-in.

Negative: - Direct multi-process writes to the same .duckdb file are still not the local model; concurrent local ingest goes through Quack (ADR-0007). - No row-level security, no fine-grained access control. Out of scope for this project; would be disqualifying at enterprise scale. - The .duckdb file is binary and doesn't diff cleanly — source of truth is always the raw ingest + transform code, never the database file itself.