Incremental Loading¶

Every dlt source in Databox writes to DuckDB (local or MotherDuck) using a declared primary key and write disposition. This page documents, per resource:

write disposition — how new rows land: merge, replace, or append
primary / merge key — how dlt deduplicates on re-run
watermark — how the extract window is bounded (date-based lookback, or none)
idempotency guarantee — what re-running the same load against the same upstream data does to the final row set
backfill command — exact command to reload a wider window

Summary¶

Source	Resource	Disposition	Primary Key	Watermark
ebird	`recent_observations`	merge	`subId`	rolling `days_back` window
ebird	`notable_observations`	merge	`subId`	rolling `days_back` window
ebird	`species_list`	replace	`speciesCode`	none — full snapshot
ebird	`hotspots`	merge	`locId`	none — full snapshot
ebird	`taxonomy`	replace	`sciName`	none — full snapshot
ebird	`region_stats`	merge	`(regionCode, year, month, day)`	rolling `days_back` window
noaa	`daily_weather`	merge	`(date, datatype, station)`	rolling `days_back` window, chunked to 365-day API calls
noaa	`stations`	merge	`id`	none — full snapshot
noaa	`datasets`	replace	`id`	none — full snapshot
usgs	`daily_values`	merge	`(site_no, parameter_cd, observation_date)`	rolling `days_back` window, chunked to 90-day API calls
usgs	`sites`	merge	`site_no`	none — full snapshot

Idempotency model¶

No resource uses dlt's dlt.sources.incremental cursor. Instead, every resource re-fetches a bounded window on each run and relies on the merge primary key to deduplicate. Consequences:

merge-disposition resources are fully idempotent: re-running with the same API responses leaves the final table untouched.
replace-disposition resources (species_list, taxonomy, datasets) drop and reload the table every run; idempotency is trivial because the row set is always "whatever the API returned this time."
No resource uses append — we accept the rate-limit cost of re-fetching over the complexity of durable cursor state.

The idempotency guarantee is validated in CI by packages/databox-sources/tests/<source>/test_idempotency.py: each test runs the merge-backed resource twice against the same VCR cassette and asserts the primary-key set and row count are identical.

Backfill procedure¶

All three sources read their extract window from a per-source env-var override (default 30) in packages/databox/databox/orchestration/definitions.py:

DATABOX_EBIRD_DAYS_BACK
DATABOX_NOAA_DAYS_BACK
DATABOX_USGS_DAYS_BACK

To widen the window for one run, set the env var before launching Dagster and materialize the raw asset group:

# Pull eBird data for the last 90 days instead of the default 30
DATABOX_EBIRD_DAYS_BACK=90 uv run dagster asset materialize \
    --select 'group:ebird_ingestion' \
    -m databox.orchestration.definitions

# Full-year NOAA backfill (hits multiple 365-day chunks; expect 5-10 min)
DATABOX_NOAA_DAYS_BACK=365 uv run dagster asset materialize \
    --select 'group:noaa_ingestion' \
    -m databox.orchestration.definitions

# USGS daily values for the last year (chunked to 90-day API calls)
DATABOX_USGS_DAYS_BACK=365 uv run dagster asset materialize \
    --select 'group:usgs_ingestion' \
    -m databox.orchestration.definitions

Or launch the Dagster UI with the env var set and materialize the group interactively — the UI-driven path is what the operator uses day-to-day.

The merge disposition means a backfill never duplicates rows already present at the narrower window — it only fills in older dates. If the API has retroactively revised a row (which NOAA does for GHCND), the merge updates in place.

Blast radius¶

merge resources: only the rows whose primary keys appear in the new window are touched. Existing rows outside the window are untouched.
replace resources: the entire table is dropped and reloaded on every run, so a backfill on days_back has no effect on these (they always reflect the current full snapshot).
downstream SQLMesh marts: all are declarative views/tables over the raw layer. A backfill at the source layer is picked up on the next task full-refresh (or the next scheduled Dagster run).

Dagster backfill¶

A per-source Dagster partition backfill is not wired yet — the current orchestration layer schedules one daily materialization per source. For ad-hoc historical loads, invoke dagster asset materialize directly (see commands above) or materialize the raw asset via the Dagster UI with a custom days_back config override.

See packages/databox/databox/orchestration/definitions.py for the asset definitions that would host a future partition scheme.

When to rely on merge vs replace¶

merge: the source API returns a durable, point-in-time row identified by a stable key (subId, (date, datatype, station), (site_no, parameter_cd, observation_date)). Revisions are in-place updates.
replace: the source API returns a catalog or reference list that should exactly mirror upstream (taxonomy, dataset manifest, species list for a region). Drift between a stale merge table and a current catalog would be worse than the cost of a full reload.