Source Layout Convention

Every registered dlt source in Databox follows the same on-disk shape. Consistency means new sources cost a predictable amount to add, drift is visible to CI, and the new-source generator has a precise target.

scripts/check_source_layout.py enforces the convention — it runs as the source-layout-lint CI job on every PR and as python scripts/check_source_layout.py locally.

The shape

For a source named <name> (e.g. ebird, noaa, usgs):

packages/databox-sources/databox_sources/<name>/
  ├── source.py              # dlt @source / @resource definitions
  └── config.yaml            # pipeline config (dataset, schedule hints)

transforms/main/models/<name>/
  ├── staging/
  │   └── stg_*.sql          # at least one staging model
  └── marts/
      └── (fct_*|dim_*).sql  # at least one mart model

soda/contracts/<name>_staging/
  └── *.yaml                 # at least one staging contract

soda/contracts/<name>/
  └── *.yaml                 # at least one mart contract

packages/databox/databox/orchestration/domains/<name>.py
                             # Dagster assets, schedules, asset checks

Intermediate models under transforms/main/models/<name>/intermediate/ are optional — the lint does not require them.

What the linter checks

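The script walks packages/databox-sources/databox_sources/*/ looking for directories that contain a source.py. For each such source, it asserts that all seven components listed above are present: the fixed files (source.py, config.yaml, domains/<name>.py) must exist, and each glob pattern (staging models, mart models, staging contracts, mart contracts) must match at least one file.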
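
The check described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/check_source_layout.py: the names REQUIRED, discover_sources, and missing_components are hypothetical, and the glob patterns are copied from the layout listing.

```python
from pathlib import Path

# One tuple per required component; a component is satisfied when any of its
# alternative patterns matches at least one file. Patterns are relative to the
# repo root and mirror the layout listing (the structure itself is an assumption).
REQUIRED = [
    ("packages/databox-sources/databox_sources/{name}/source.py",),
    ("packages/databox-sources/databox_sources/{name}/config.yaml",),
    ("transforms/main/models/{name}/staging/stg_*.sql",),
    (
        "transforms/main/models/{name}/marts/fct_*.sql",
        "transforms/main/models/{name}/marts/dim_*.sql",
    ),
    ("soda/contracts/{name}_staging/*.yaml",),
    ("soda/contracts/{name}/*.yaml",),
    ("packages/databox/databox/orchestration/domains/{name}.py",),
]


def discover_sources(root: Path) -> list:
    """Return the names of all directories that contain a source.py."""
    base = root / "packages/databox-sources/databox_sources"
    return sorted(p.parent.name for p in base.glob("*/source.py"))


def missing_components(root: Path, name: str) -> list:
    """Return the patterns of every required component with no matching file."""
    missing = []
    for alternatives in REQUIRED:
        patterns = [alt.format(name=name) for alt in alternatives]
        if not any(any(root.glob(pattern)) for pattern in patterns):
            missing.append(" | ".join(patterns))
    return missing
```

Keeping the requirements in a single data table like this is one way to let the generator and the linter share a definition of the layout.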

Output format is line-oriented so CI logs stay diffable:

  ✓ ebird
  ✗ noaa
      missing: soda/contracts/noaa/*.yaml
  ✓ usgs

2 ok · 0 skipped · 1 failing (of 3)
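
As an illustration of how a report in this shape could be produced, here is a small sketch. The function name render_report and its inputs are hypothetical, and the exact layout of a skipped line (name followed by the reason) is an assumption rather than something taken from the script.

```python
def render_report(missing_by_source: dict, skipped: dict = None) -> str:
    """Render one line per source plus a summary line.

    missing_by_source maps a source name to its list of missing patterns;
    skipped maps a source name to its skip reason.
    """
    skipped = skipped or {}
    lines = []
    ok = failing = 0
    for name in sorted(set(missing_by_source) | set(skipped)):
        if name in skipped:
            # Assumed layout for skipped sources.
            lines.append(f"  ~ {name} (skipped: {skipped[name]})")
        elif missing_by_source[name]:
            failing += 1
            lines.append(f"  ✗ {name}")
            lines.extend(f"      missing: {m}" for m in missing_by_source[name])
        else:
            ok += 1
            lines.append(f"  ✓ {name}")
    total = ok + failing + len(skipped)
    lines.append("")
    lines.append(f"{ok} ok · {len(skipped)} skipped · {failing} failing (of {total})")
    return "\n".join(lines)


print(render_report({"ebird": [], "noaa": ["soda/contracts/noaa/*.yaml"], "usgs": []}))
```

Sorting by source name keeps the output stable between runs, which is what makes the CI logs diffable.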

--json emits the same data in machine-readable form for generator tooling.

Escape hatch: scaffold-lint: skip

Experimental or in-flight sources that don't yet satisfy the full layout can opt out by adding a line within the first 10 lines of source.py:

# scaffold-lint: skip=experimental

The reason after = is free text — common values: experimental, in-flight, wip-domain-refactor. Skipped sources appear in the lint output marked ~ (skipped: <reason>) but do not fail CI.
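
Detecting the marker could look something like the sketch below. The helper name skip_reason and the regular expression are hypothetical; in particular, accepting a bare "skip" with no "=reason" is an assumption, not confirmed behavior of the linter.

```python
import re
from itertools import islice
from pathlib import Path
from typing import Optional

# Matches "# scaffold-lint: skip" with an optional "=<free-text reason>".
SKIP_RE = re.compile(r"^#\s*scaffold-lint:\s*skip(?:=(?P<reason>.*))?$")


def skip_reason(source_py: Path, max_lines: int = 10) -> Optional[str]:
    """Return the skip reason if the marker is in the first max_lines lines.

    Returns None when no marker is found, and "" when the marker has no reason.
    """
    with source_py.open(encoding="utf-8") as fh:
        for line in islice(fh, max_lines):
            match = SKIP_RE.match(line.strip())
            if match:
                return (match.group("reason") or "").strip()
    return None
```

Limiting the scan to the first few lines keeps the marker visible at the top of the file, where a reviewer is likely to notice it.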

Do not use the skip marker to silence drift in a finished source. If the lint complains about an existing source, the right answer is almost always "add the missing file" — the convention exists because each component has a concrete job.

Why each file is required

Component                        Why it is required
source.py                        Anchor file: if this doesn't exist, the source isn't loadable
config.yaml                      Pipeline configuration (dataset name, schedule hints, source-specific options)
staging/stg_*.sql                At least one staging model that turns raw data into typed columns
marts/(fct_*|dim_*).sql          At least one consumer-facing mart
soda/contracts/<name>_staging/   Data-quality contract for the staging layer
soda/contracts/<name>/           Data-quality contract for the mart layer
domains/<name>.py                Dagster wiring: assets, schedules, asset checks

Dropping any of these creates a source that half-works. The lint makes that state unshippable.

New-source generator

ticket:new-source-generator (Phase 2) will scaffold this layout given just the source name. Whatever the linter requires is what the generator creates — the two stay in lockstep.

Until that ticket lands, adding a source by hand means copying the shape above. See CLAUDE.md for the current manual checklist.

Typing the resource boundary

Sources that hit external APIs should validate each yielded record through a Pydantic model, so that upstream schema drift fails closed at extract time. See source-typing.md for the convention and the eBird RecentObservation pilot.