Databox
Dataset-agnostic single-operator data platform. Ingests bird, weather, and streamflow sources with dlt, transforms with SQLMesh, validates with Soda contracts, and orchestrates with Dagster on DuckDB / MotherDuck.
What's here
- Data dictionary — every model, its columns and types, the Soda contract in effect, and direct lineage. Auto-generated from SQLMesh and Soda metadata.
- Lineage — full model dependency graph rendered with Mermaid. Each node links to its dictionary page.
- Metrics — semantic metrics layer over the flagship mart (analytics.fct_species_environment_daily).
- Analytics examples — representative queries the flagship mart supports.
- Contracts — Soda quality contract conventions.
- Incremental loading — dlt incremental and SQLMesh incremental-by-time notes.
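To make the Lineage entry above concrete, here is a minimal sketch of how a model dependency map could be turned into Mermaid flowchart text in which each node links to a dictionary page. It is not the project's generator. The function name `to_mermaid`, the `dictionary/<model>.md` link layout, and the two upstream staging model names are assumptions made for illustration; only `analytics.fct_species_environment_daily` comes from this page.

```python
def to_mermaid(deps: dict[str, list[str]], page_dir: str = "dictionary") -> str:
    """Render {model: [upstream models]} as Mermaid flowchart text.

    The link layout (<page_dir>/<model>.md) is an assumed convention.
    """

    def node_id(model: str) -> str:
        # Use plain identifiers as node ids; the dotted name goes in the label.
        return model.replace(".", "_")

    # Every model that appears either as a key or as an upstream dependency.
    models = sorted(set(deps) | {up for ups in deps.values() for up in ups})

    lines = ["flowchart LR"]
    for model in models:
        lines.append(f'    {node_id(model)}["{model}"]')
    for model in sorted(deps):
        for upstream in sorted(deps[model]):
            # Edges point from the upstream model to the model that reads it.
            lines.append(f"    {node_id(upstream)} --> {node_id(model)}")
    for model in models:
        # Mermaid's click directive attaches a hyperlink to a node.
        lines.append(f'    click {node_id(model)} href "{page_dir}/{model}.md"')
    return "\n".join(lines)


# Hypothetical upstream models; only the flagship mart name is from this page.
example_deps = {
    "analytics.fct_species_environment_daily": [
        "staging.stg_bird_observations",
        "staging.stg_weather_daily",
    ],
}
print(to_mermaid(example_deps))
```

Sorting the models and edges keeps the generated text stable between runs, which keeps diffs small when generated pages are committed.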
Architecture decisions
Six backfilled ADRs (Nygard format) explain the load-bearing choices:
- ADR-0001 — DuckDB as the primary warehouse
- ADR-0002 — SQLMesh over dbt
- ADR-0003 — Single SQLMesh project across all sources
- ADR-0004 — Per-source raw DuckDB catalogs
- ADR-0005 — Dagster as the sole orchestrator
- ADR-0006 — MotherDuck as the cloud path
The root README frames the platform as a case study with system and data-flow diagrams in Mermaid.
Regenerate
Everything under dictionary/ is generated from the repo — do not hand-edit. Rebuild with:
uv run python scripts/generate_docs.py
Target runtime: under 30 seconds; observed runtime: ~1–2 seconds for the current model set.
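If the runtime target should be enforced rather than just noted, a small wrapper can time the rebuild and compare it with the budget. The sketch below assumes such a check is wanted. The helper names `run_timed` and `check_runtime` are hypothetical and not part of the repo; the command and the 30-second figure are the ones given above.

```python
import subprocess
import sys
import time

BUDGET_SECONDS = 30.0  # target runtime stated above
DOCS_CMD = ["uv", "run", "python", "scripts/generate_docs.py"]


def run_timed(cmd: list[str]) -> float:
    """Run cmd, raise CalledProcessError if it fails, return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def check_runtime(cmd: list[str], budget: float = BUDGET_SECONDS) -> tuple[float, bool]:
    """Return (elapsed seconds, True if the run finished under the budget)."""
    elapsed = run_timed(cmd)
    return elapsed, elapsed < budget


def main() -> int:
    # Hypothetical CI entry point: a non-zero exit code means the budget was missed.
    elapsed, ok = check_runtime(DOCS_CMD)
    print(f"generate_docs.py took {elapsed:.1f}s (budget {BUDGET_SECONDS:.0f}s)")
    return 0 if ok else 1
```

A CI step could call `sys.exit(main())`. With an observed runtime of about 1–2 seconds, the check would only trip after a large regression.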