Incremental Loading¶
Every dlt source in Databox writes to DuckDB (local or MotherDuck) using a declared primary key and write disposition. This page documents, per resource:
- write disposition — how new rows land:
merge,replace, orappend - primary / merge key — how dlt deduplicates on re-run
- watermark — how the extract window is bounded (date-based lookback, or none)
- idempotency guarantee — what re-running the same load against the same upstream data does to the final row set
- backfill command — exact command to reload a wider window
Summary¶
| Source | Resource | Disposition | Primary Key | Watermark |
|---|---|---|---|---|
| ebird | recent_observations |
merge | subId |
rolling days_back window |
| ebird | notable_observations |
merge | subId |
rolling days_back window |
| ebird | species_list |
replace | speciesCode |
none — full snapshot |
| ebird | hotspots |
merge | locId |
none — full snapshot |
| ebird | taxonomy |
replace | sciName |
none — full snapshot |
| ebird | region_stats |
merge | (regionCode, year, month, day) |
rolling days_back window |
| noaa | daily_weather |
merge | (date, datatype, station) |
rolling days_back window, chunked to 365-day API calls |
| noaa | stations |
merge | id |
none — full snapshot |
| noaa | datasets |
replace | id |
none — full snapshot |
| usgs | daily_values |
merge | (site_no, parameter_cd, observation_date) |
rolling days_back window, chunked to 90-day API calls |
| usgs | sites |
merge | site_no |
none — full snapshot |
Idempotency model¶
No resource uses dlt's dlt.sources.incremental cursor. Instead, every
resource re-fetches a bounded window on each run and relies on the merge
primary key to deduplicate. Consequences:
- merge-disposition resources are fully idempotent: re-running with the same API responses leaves the final table untouched.
- replace-disposition resources (
species_list,taxonomy,datasets) drop and reload the table every run; idempotency is trivial because the row set is always "whatever the API returned this time." - No resource uses
append— we accept the rate-limit cost of re-fetching over the complexity of durable cursor state.
The idempotency guarantee is validated in CI by
packages/databox-sources/tests/<source>/test_idempotency.py: each test
runs the merge-backed resource twice against the same VCR cassette and
asserts the primary-key set and row count are identical.
Backfill procedure¶
All three sources read their extract window from a per-source env-var
override (default 30) in
packages/databox/databox/orchestration/definitions.py:
DATABOX_EBIRD_DAYS_BACKDATABOX_NOAA_DAYS_BACKDATABOX_USGS_DAYS_BACK
To widen the window for one run, set the env var before launching Dagster and materialize the raw asset group:
# Pull eBird data for the last 90 days instead of the default 30
DATABOX_EBIRD_DAYS_BACK=90 uv run dagster asset materialize \
--select 'group:ebird_ingestion' \
-m databox.orchestration.definitions
# Full-year NOAA backfill (hits multiple 365-day chunks; expect 5-10 min)
DATABOX_NOAA_DAYS_BACK=365 uv run dagster asset materialize \
--select 'group:noaa_ingestion' \
-m databox.orchestration.definitions
# USGS daily values for the last year (chunked to 90-day API calls)
DATABOX_USGS_DAYS_BACK=365 uv run dagster asset materialize \
--select 'group:usgs_ingestion' \
-m databox.orchestration.definitions
Or launch the Dagster UI with the env var set and materialize the group interactively — the UI-driven path is what the operator uses day-to-day.
The merge disposition means a backfill never duplicates rows already present at the narrower window — it only fills in older dates. If the API has retroactively revised a row (which NOAA does for GHCND), the merge updates in place.
Blast radius¶
- merge resources: only the rows whose primary keys appear in the new window are touched. Existing rows outside the window are untouched.
- replace resources: the entire table is dropped and reloaded on every
run, so a backfill on
days_backhas no effect on these (they always reflect the current full snapshot). - downstream SQLMesh marts: all are declarative views/tables over the raw
layer. A backfill at the source layer is picked up on the next
task full-refresh(or the next scheduled Dagster run).
Dagster backfill¶
A per-source Dagster partition backfill is not wired yet — the current
orchestration layer schedules one daily materialization per source.
For ad-hoc historical loads, invoke dagster asset materialize directly
(see commands above) or materialize the raw asset via the Dagster UI with
a custom days_back config override.
See packages/databox/databox/orchestration/definitions.py
for the asset definitions that would host a future partition scheme.
When to rely on merge vs replace¶
- merge: the source API returns a durable, point-in-time row identified
by a stable key (
subId,(date, datatype, station),(site_no, parameter_cd, observation_date)). Revisions are in-place updates. - replace: the source API returns a catalog or reference list that should exactly mirror upstream (taxonomy, dataset manifest, species list for a region). Drift between a stale merge table and a current catalog would be worse than the cost of a full reload.