Staging Codegen¶
Trivial-rename staging models (transforms/main/models/*/staging/stg_*.sql) are generated from their Soda contracts rather than hand-written. The codegen exists because the staging layer is dominated by SELECT old_col AS new_col from a raw catalog — writing each one by hand scales poorly as sources grow.
Non-trivial staging (joins, UNION ALL, derived columns, filters) opts out with a skip marker and stays hand-maintained.
How it runs¶
python scripts/generate_staging.py— regenerate every non-skipped staging SQL in place.python scripts/generate_staging.py --check— exit1if any committed staging SQL differs from what the generator would emit. Run in CI as thestaging-codegen-driftjob.
The generator lives in databox.quality.staging_codegen; scripts/generate_staging.py is a thin CLI wrapper.
Contract extension¶
A codegen-driven staging contract adds two keys on top of the standard Soda layout:
dataset: databox/ebird_staging/stg_ebird_hotspots
source_table: raw_ebird.main.hotspots
description: Staging model for eBird birding hotspots
columns:
- name: location_id
source_column: loc_id # optional; defaults to `name` when omitted
checks:
- missing:
must_be: 0
- name: latitude
source_column: lat
data_type: DOUBLE # optional; emits `lat::DOUBLE AS latitude` when present
- name: country_code # identity — no rename, no cast
Rules applied by the template:
| contract fields | emitted SQL |
|---|---|
name: x |
x |
name: x, source_column: y |
y AS x |
name: x, data_type: T |
x::T AS x |
name: x, source_column: y, data_type: T |
y::T AS x |
description becomes the model's description '...'. Columns are emitted in the order they appear in the contract. The grant list is hardcoded to staging_reader; if a staging model needs a different grant, use the escape hatch.
Escape hatch¶
Add a -- staging-codegen: skip header on the first three lines of the target SQL and the generator will leave that file alone. Use it when the staging model needs behavior the template cannot express — UNION ALL, CASE, EXTRACT, joins, filters.
Example: transforms/main/models/ebird/staging/stg_ebird_observations.sql uses the skip marker because it UNIONs recent_observations and notable_observations and derives observation_year/month/day/hour columns.
When a skipped model's contract also omits source_table, that is fine — the generator checks for the skip marker before requiring the key.
When to use which¶
| fits codegen | needs escape hatch |
|---|---|
| pure column renames | joins across raw tables |
| cast-only transforms | UNION ALL |
| identity passthrough | CASE / EXTRACT / other derivations |
| row filtering |
The default should be codegen. Reach for the escape hatch only when the template genuinely cannot express the staging shape.