Overview › ETL / quality › BuildZoom walkthrough

BuildZoom ETL — the data's journey, in human language API → Bronze → Silver → Gold · audit notes · 2026-05-19 hover any dotted term for plain-English

One-line summary. A 4-table raw ingestion (Bronze, Hudi COW) joins into a single denormalized fact table (Silver), then enriches with First-American property identifiers via a 3-condition address match (Gold). Quality is gated at each layer by a QA framework with DynamoDB tracking and Slack alerts. The pipeline is technically solid but its contract with downstream consumers is incomplete — naming says "Roofing" when contents are all-permits, no permit_category enum is exposed, and the date-sanitize gate lets 2066 through. There is no Platinum tier — ML-ready data is built ad hoc by each consumer today.

1 · Architecture — the swim-lane view

Five actors, each with a distinct responsibility. (This is a swim-lane diagram, BPMN-style.) Read left to right:

Read this diagram as a contract

Each lane owns one job. Bronze ≠ Silver ≠ Gold — when the boundaries blur (e.g. "Roofing_Master_Gold" contains non-roofing rows), downstream consumers compensate, and that compensation is exactly the problem we're trying to fix.

2 · Bronze — raw ingestion, four tables

Bronze is the literal mirror of the BuildZoom vendor extract. CSVs land in S3, an EMR Spark job reads them with an explicit schema, runs a QA gate, and writes Hudi COW (copy-on-write) tables to s3://8020rei-bronze-data-lake/. Nothing is joined here.

The four tables (one job each)

Table	Record key	Partition	Why this partition
`Roofing_Building_Permits_V1`	`BUILDING_PERMIT_ID`	`PARTITION_YEAR` from `INITIAL_STATUS_DATE`	Time-pruned scans (most queries ask "permits in year X")
`Roofing_Building_Permit_Classifications_V1`	`COMPOSITE_KEY` = `BUILDING_PERMIT_ID`\|`PROJECT_TYPE`	none	Small table (~hundreds of MB); a permit may carry multiple `PROJECT_TYPE` rows that later collapse into an array
`Roofing_Building_Permit_Statuses_V1`	`ID` = `BUILDING_PERMIT_ID` + `PERMIT_STATUS` + `EFFECTIVE_DATE`	`PARTITION_YEAR` from `EFFECTIVE_DATE`	Status timeline per permit (multiple statuses per permit, kept as history)
`Roofing_Properties_V1`	`PROPERTY_ID`	`PARTITION_FIPS` = COUNTY_FIPS	Geographic pruning (queries scan a few counties at a time)

Quality gate before write

Every Bronze write must pass a 6-check QA framework before the Hudi write commits. The checks are configurable per table via shared/configs.py; defaults below.

Check	Threshold (default)	What it catches
Null % critical fields	< 1 %	Missing primary keys (`BUILDING_PERMIT_ID`, `PERMIT_NUMBER`)
Null % important fields	< 10 %	Missing context (`PROPERTY_ID`, `TYPE`, jurisdiction)
Duplicate % on record key	< 0.6 %	Pipeline producing two-of-the-same
Completeness score (weighted)	≥ 75 %	Overall field-fill health
Negative % positive-only fields	< 0.1 %	`JOB_VALUE` < 0, `SQUARE_FEET` < 0 …
Minimum record count	≥ 1	Empty deliveries

If QA fails: data is NOT written, the run is logged to DynamoDB with QA_FAILED, and a Slack alert fires. If QA passes: Hudi write commits, then a success record is logged. This is a real strength of the pipeline.

Date sanitization (important)

Before QA runs, all date columns are clipped:

year > 2100  → 2100-12-31
year < 1900  → 1900-01-01
otherwise    → keep as-is

This prevents Spark long-overflow but is too permissive for ML. A date in 2066 (and we have observed these in the 2026-05 vintage) passes the sanitizer and survives into Silver and Gold. See Issue #2.

Hudi configuration

Table type: COW (copy-on-write) — every commit re-writes the affected files. Readers see consistent snapshots.
Target file size: 128 MB (clustering job compacts smaller files into this size on a schedule).
Schema evolution: enabled (new columns can be added without breaking older readers).
Cleaning: keep latest commit only.

3 · Silver — the fact table join

Silver is one Spark job (silver/Build_Zoom_Silver_Master.py, ~470 lines) that turns four Bronze tables into a single denormalized fact table: Roofing_Master_V1. Each row = one permit, with property and classification glued on.

The 3-join recipe

What Silver actually does to the data

Permits: extracts STATE from PERMIT_JURISDICTION, repartitions by ID_PREFIX (first char of permit ID) so subsequent joins distribute evenly.
Properties: uppercases CITY and STREET, coalesces null STATE to "00", casts COUNTY_FIPS to int. Persisted across batches.
Classifications: groupBy(BUILDING_PERMIT_ID).agg(collect_list("PROJECT_TYPE")) — turns the per-permit-per-type rows into one row per permit with PROJECT_TYPE as an ARRAY.
Statuses: Window ranking row_number() OVER (PARTITION BY BUILDING_PERMIT_ID ORDER BY EFFECTIVE_DATE DESC) then keeps row_num == 1. Status history is collapsed — only the latest is kept, exposed as AUX_PERMIT_STATUS and AUX_EFFECTIVE_STATUS_DATE.

This is where the "Roofing" name first misleads

The table is called Roofing_Master, but its filter is not a roofing filter. It's "every permit on every property" — a generic permits fact table. The name is a project-history artifact (the 8020REI vertical is roofing), not a content description.

4 · Gold — First-American address matching

Gold takes Roofing_Master_V1 (silver) and links each permit to one or more First-American property IDs. The output column FA_PROPERTYID is an array — a single permit can resolve to multiple FA properties.

The 3-condition match cascade

How the cascade resolves

The conditions are tried in order. (1) Standardized address — most accurate when present. (2) APN match — deterministic when both sides have one. (3) Street + ZIP fallback — fuzzy, can resolve one permit to many properties. A permit can match more than one condition (different FA properties for each), so the final array can contain multiple IDs. The conditions are tracked separately as FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3 internally before being combined.

FIPS salting — why the code is more complex than it looks

A handful of "big" counties (Maricopa 04013, Harris 48201, Miami-Dade 12086, etc.) hold a disproportionate share of records. A naive Spark partition-by-FIPS would put 60M Maricopa rows on one executor while others sit idle — classic data skew. The salting layer reads a config from S3 (SALTING_CONFIG_PATH) and rewrites the partition key:

Salt_Bucket = 0 → small partitions glued together as SMALL_PARTITION
Salt_Bucket = 1 → small partitions hashed across 30 sub-buckets
Salt_Bucket = -1 → smallest partitions hashed across 250 sub-buckets
Salt_Bucket > 1 → big partitions hashed across Salt_Bucket sub-buckets

The output column is Partition_FIPS_Salting. This is an operational hack done well — it absorbs skew but does not change the data semantics.

4b · The temporal contract — why we need `event_date` + `observation_period`

Property ID identifies the parcel — a permanent thing. Owner identifies who lived there, and owners change over time. To answer "who owned this property when the permit started" we need both: a property key AND a time anchor. The schema today exposes property ID but no canonical time anchor — eight date columns, no contract.

This proposal would derive two columns in Gold, used by every downstream consumer the same way. None of this exists today.

With this contract every model joins the same way: gold.property_id == rem.property_id AND rem.period == gold.observation_period. The owner returned is who lived there the month before the permit started — no leakage, no per-consumer divergence.

Edge case to document

If every date column is NULL the row gets event_date = NULL and the as-of join silently drops it. Surface this as a row-count metric in the QA gate so partitions with high NULL-date rates fail loud instead of disappearing.

5 · QA gate — same shape at every layer

Each layer (Bronze and Silver) uses a layer-specific QA class (BuildingPermitsQA, SilverMasterQA) that runs before the Hudi write. The Gold layer uses GoldMasterQA with the same pattern. Three outcomes:

┌────────────────────────┐ │ Read source / build │ │ dataframe │ └───────────┬────────────┘ │ ▼ ┌────────────────────────┐ │ Run QA checks │ │ (6 default + custom) │ └───────────┬────────────┘ │ ┌───────────────────┼───────────────────┐ ▼ ▼ ▼ ╔══════════╗ ╔══════════╗ ╔══════════════╗ ║ PASS ║ ║ QA_FAILED║ ║ ERROR ║ ╚════╤═════╝ ╚════╤═════╝ ╚══════╤═══════╝ │ Write Hudi │ Skip write │ Skip write │ Log SUCCESS │ Log QA_FAILED │ Log ERROR │ Slack ✓ │ Slack ✗ + reasons │ Slack ✗ + trace ▼ ▼ ▼ Production data Data quarantined Investigation

This pattern is uniform across layers. The output contract is "no QA-failed data ever reaches the next layer." That is real discipline.

6 · The data's journey — mental map

One row's story, end to end, from the moment BuildZoom hands us a CSV until a downstream consumer reads it:

┌─ DAY 1 ────────────────────────────────────────────────────┐ │ │ │ BuildZoom drops CSV in s3://buildzoom-landing-zone/ │ │ Faster_Format/<partition>/ │ │ • building_permits_<p>.csv │ │ • building_permit_classifications_<p>.csv │ │ • building_permit_statuses_<p>.csv │ │ • properties_<p>.csv │ │ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ BRONZE STEP ──────────────────────────────────────────────┐ │ EMR Spark job per table: │ │ │ │ read CSV(schema) ─► clip dates 1900-2100 ─► │ │ add ts + PARTITION_* ─► run QA ─► write Hudi COW │ │ │ │ → 4 tables in s3://8020rei-bronze-data-lake/ │ │ → execution row in DynamoDB │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ SILVER STEP ──────────────────────────────────────────────┐ │ One job, batched by PARTITION_YEAR: │ │ │ │ load 4 bronze tables ─► │ │ Classifications: groupBy(BPID).collect_list(PROJECT_TYPE) │ │ Statuses: window.row_number() keep latest │ │ Properties: UPPER(CITY,STREET), COUNTY_FIPS as int │ │ ─► permits ⋈ properties on (STATE, PROPERTY_ID) │ │ ─► ⋈ classifications on BUILDING_PERMIT_ID │ │ ─► ⋈ statuses on BUILDING_PERMIT_ID │ │ ─► add ts, run QA, write Hudi │ │ │ │ → Roofing_Master_V1 in s3://8020rei-silver-data-lake/ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ GOLD STEP ────────────────────────────────────────────────┐ │ One job, FIPS-salted to absorb skew: │ │ │ │ load silver permits + load Real_Estate_Master (FA) │ │ ─► load standardized addresses for both │ │ ─► condition 1: StdAddress + ZIP + FIPS │ │ ─► condition 2 (residuals): APN + FIPS │ │ ─► condition 3 (residuals): Street + ZIP + FIPS │ │ ─► collect FA_PROPERTYID's into ARRAY per permit │ │ ─► run GoldMasterQA, write Hudi │ │ │ │ → Roofing_Master_Gold_V1 in s3://8020rei-gold-data-lake/ │ └──────────────────────┬─────────────────────────────────────┘ ▼ ┌─ DOWNSTREAM (today) ───────────────────────────────────────┐ │ • Callzeke dashboard reads gold directly │ │ • Safeguard reads gold directly │ │ • Apollo (REI scoring) reads gold for Tier-G permit feats │ │ • Roofing model (new_model) reads gold to build labels │ │ │ │ ⚠ each consumer reapplies category / status filtering │ │ independently — divergence-by-construction │ └────────────────────────────────────────────────────────────┘

7 · Issues identified

HIGH = downstream consumers materially affected · MED = corner cases / future risk · LOW = naming / hygiene

The "Issue", "Where", and "Impact" columns describe what exists today (observation). The "Suggested fix" column is PROPOSAL — recommendations, not current behavior.

Sev Issue Where Impact Suggested fix

HIGH

Every table in the pipeline is named Roofing_* but contains permits of all categories — this is a permits pipeline, not a roofing pipeline

Bronze: Roofing_Building_Permits_V1, Roofing_Building_Permit_Classifications_V1, Roofing_Building_Permit_Statuses_V1, Roofing_Properties_V1
Silver: Roofing_Master_V1
Gold: Roofing_Master_Gold_V1
(Repo name buildzoom-etl and S3 bucket names *-bronze/silver/gold-data-lake are already neutral — only table names mislead.)

The naming assumes a roofing-only product, but the pipeline ingests every permit category that BuildZoom delivers. Empirical evidence: only 3.5 % of rows have TYPE='Roof' in the Pinellas 12 K sample, and only ~18 % contain "roof" anywhere in PROJECT_TYPE (finding 67 §B). Every downstream consumer (Callzeke, Safeguard, Apollo, the new model) re-applies its own roofing filter on top of data labelled as roofing-but-isn't — different filters, different counts, hours of investigation each time numbers diverge.

Treat as a coordinated rename across Bronze + Silver + Gold. Suggested mapping:

Today	Proposed
`Roofing_Building_Permits_V1`	`Permits_Master_V1`
`Roofing_Building_Permit_Classifications_V1`	`Permit_Categories_V1`
`Roofing_Building_Permit_Statuses_V1`	`Permit_Statuses_V1`
`Roofing_Properties_V1`	`Permit_Properties_V1`
`Roofing_Master_V1` (silver)	`Permits_Master_Silver_V1`
`Roofing_Master_Gold_V1` (gold)	`Permits_Master_Gold_V1`

Do it as a version bump (_V1 → _V2) with the new name rather than an in-place rename — keeps the existing consumers reading _V1 until they migrate. Adds breathing room for the cross-repo coordination (Callzeke dashboard, Apollo, new model, Glue catalog, DynamoDB log tables, Slack templates, CloudFormation specs all touch the names).

HIGH No permit_category enum exposed — separate but related to the rename across Bronze, Silver, Gold Renaming the tables to Permits_* only fixes the label. Until a coarse permit_category enum (ROOFING / HVAC / SOLAR / ELECTRICAL / PLUMBING / NEW_CONSTRUCTION / OTHER / UNKNOWN) is materialized as a column, every consumer still re-classifies from TYPE + SUBTYPE + DESCRIPTION + PROJECT_TYPE with regex, and findings 57 / 62 / 72 each produced different counts on the same data. Pair the rename with a deterministic enum column derived from the existing PROJECT_TYPE array in Silver. Fine sub-classification (e.g., REPLACEMENT vs REPAIR within ROOFING) stays downstream — that's a model decision.

HIGH Date sanitize lets 1900-2100 through, including 2066 bronze/Build_Zoom_Building_Permits.py:82-88 Future-corruption rows (we observed LATEST_STATUS_DATE = 2066-08-23 in the 2026-05-12 vintage) survive Bronze, Silver, Gold. Every label and feature pipeline downstream must re-cap. Clip to a narrower range like [1995-01-01, today + 90 days] in Bronze. Anything outside is a data-quality signal, not a value to preserve.

HIGH

No canonical event_date and no observation_period — owner-as-of joins are impossible without them

Across Bronze, Silver, Gold — the schema exposes eight date columns (INITIAL_STATUS_DATE, LATEST_STATUS_DATE, APPLIED_DATE, ISSUED_DATE, COMPLETED_DATE, CANCELLED_DATE, AUX_EFFECTIVE_STATUS_DATE, DateSnapshot) and zero canonical "this is the date that defines the permit event".

Property ID alone identifies the parcel. To answer "who owned this property when the permit started" you also need a time anchor — and there is no single canonical date in the schema. Every consumer picks a different column: Apollo uses LATEST_STATUS_DATE; the Callzeke audit used arg_max(CITY, permit_date) across the gold permit date; finding 67 ranged over multiple. The owner state in Silver REM changes monthly, so picking the wrong date silently attributes the permit to the wrong owner.

Materialize two columns in Gold (or Platinum) as a contract:

event_date = earliest non-null date across APPLIED_DATE, ISSUED_DATE, INITIAL_STATUS_DATE, COMPLETED_DATE, CANCELLED_DATE, LATEST_STATUS_DATE. Rationale: "when the permit started being worked on" — the moment the owner-at-the-time made the decision. NULL only when every date column is NULL.
observation_period = month_end(event_date) - 1 month (format YYYY-MM). This is the Silver REM snapshot key used for the as-of owner / distress / valuation join. One month before the event so the join uses the state that existed before the permit, not contaminated by it.

Both columns live in Gold so every downstream consumer joins to the same time anchor.

MED Silver collapses status history silver/Build_Zoom_Silver_Master.py · prepare_statuses Only the latest status is exposed (AUX_PERMIT_STATUS). Any analysis that needs the timeline (e.g. how long from APPLIED to ISSUED to COMPLETED, intermediate cancellations) must go back to Bronze. Keep both: AUX_* for latest + a side artifact permit_status_timeline at the per-permit grain (array of {status, date}).

MED No status normalization everywhere LATEST_STATUS mixes case and synonyms (complete, final, issued, canceled, void, expired, withdrawn …). Each consumer redefines what counts as "live" vs "dead." Add a status_normalized column in Silver with a finite enum (ACTIVE / COMPLETE / CANCELED / EXPIRED / PENDING / UNKNOWN) plus the raw value preserved.

MED FA_PROPERTYID as ARRAY surprises downstream gold/Build_Zoom_Gold_Master.py One permit → N FA properties (street-name matches especially) creates implicit fan-out at consumers that UNNEST without knowing. We measured 50 K-row samples where this matters. Document and surface this contract explicitly in a Gold README: "FA_PROPERTYID is a deduped array; consumers must decide whether to UNNEST (fan-out) or treat as a set."

MED JOB_VALUE outliers pass through Bronze QA only checks negative values $135 M permit values pass Bronze ("not negative") and contaminate downstream summaries. Add an upper-tail flag (e.g. job_value_outlier_flag at > p99.9) as a sidecar column — do not cap the value silently.

LOW Hard-coded date defaults from June 2025 shared/configs.py · DEFAULT_PERIOD = '2025-06-29' If read_last_period_executed ever falls back to default (e.g. corrupted DynamoDB row), the pipeline silently restarts from a year ago. Either remove the default and fail loud, or move it to a deploy-time env var.

LOW Silver/Gold READMEs missing repo root Bronze has an excellent README. Silver and Gold do not, so the contract lives only in code. Mirror the Bronze README structure for Silver and Gold (schemas, partitions, joins, QA thresholds).

LOW No Platinum tier — every model rebuilds features locally repo has only bronze/, silver/, gold/ ML consumers materialize feature snapshots and labels in their own caches. Reproducibility is per-consumer. See section 9.

8 · Opportunities ALL PROPOSAL

Every card in this section describes work that does not exist today. These are recommended improvements that the current architecture would absorb cleanly — not observations of current behavior.

Beyond fixes, here are higher-leverage improvements:

Silver Add `permit_category` enum

One deterministic mapping from PROJECT_TYPE array → coarse class. Lives in shared/utils.py, tested with a finite truth table. Unlocks multi-vertical without touching the rest of the pipeline.

Silver Status timeline as array

Keep AUX_* for latest. Add permit_status_timeline = array of {status, date, days_in_status}. No new joins; one window function more.

Gold Per-condition match provenance

Surface which of the 3 conditions matched (already computed internally — just keep the column). Lets downstream weight by match confidence (cond-1 strongest, cond-3 fuzzy).

Gold Generalize the table name

Rename Roofing_Master_Gold to Permits_Master_Gold. The contents are already generic; the name is the artifact.

Bronze Narrow date clip

Clip [1995-01-01, today + 90 days] and reject outliers loudly. Pipeline already has the QA hook; this is a 6-line change.

Bronze Outlier flags, not caps

JOB_VALUE, SQUARE_FEET, FEES: add side-flags at p99.9 cuts without altering the source value.

Platinum Vertical-aware feature snapshots

One ETL produces (T0, FIPS) feature snapshots that serve every vertical model (roofing, HVAC, solar). Labels per vertical live in vertical-scoped sub-paths. See section 9.

Gold Canonical `event_date` + `observation_period`

event_date = earliest non-null of the 6 status dates ("when the permit started being worked on"). observation_period = event_date month-end minus 1 month — the Silver REM snapshot key for joining to the owner / distress / valuation that existed before the permit. Materialize both in Gold so every downstream model joins to the same time anchor. See section 4b above for the diagram + edge cases.

Ops Coverage report as first-class artifact

Today coverage is reconstructed by scripts/roofing/build_coverage_universe.py in another repo. Move the per-(FIPS, jurisdiction) coverage table into Gold or Platinum as a canonical output so consumers don't keep rebuilding it.

8b · Missing cleaning layer between Bronze and Gold PROPOSAL

None of this exists today. The current Silver job is structurally a join — it consolidates the four Bronze tables into a denormalized fact, but its cleaning work is minimal: uppercase CITY and STREET, cast COUNTY_FIPS to int, coalesce missing STATE to "00", keep the latest status. Everything else flows through to Gold unchanged. Result: Gold is still dirty — future dates survive, OCR typos survive, status case-variants survive, sentinel values survive, outlier JOB_VALUE survives.

Each downstream consumer ends up re-doing the same cleaning work locally — inconsistently. This is exactly the kind of repeated work a tier is supposed to eliminate.

What the cleaning layer would do

BRONZE (raw mirror, immutable) │ ▼ ┌────────────────────────────────────────────────────────────┐ │ CLEANING LAYER (new, between Bronze and Silver) │ │ │ │ · clip dates strictly to [1995-01-01, today + 90d] │ │ (today's clip 1900-2100 lets 2066 through) │ │ · normalize LATEST_STATUS to a finite enum │ │ (lowercase + dedupe synonyms: complete/final → │ │ COMPLETE, void/cancelled/canceled → CANCELED, …) │ │ · derive permit_category (ROOFING/HVAC/SOLAR/…) │ │ from PROJECT_TYPE array │ │ · flag value outliers (JOB_VALUE > p99.9, SQUARE_FEET │ │ > p99.9) as sidecar columns — do NOT cap silently │ │ · detect sentinel / constant columns and emit warning │ │ · normalize address: trim, collapse whitespace, drop │ │ OCR artefacts ("linsincludes" type fragments) │ │ · derive event_date + observation_period (see § 4b) │ │ │ └─────────────────────┬──────────────────────────────────────┘ │ ▼ SILVER (now: pure join + denormalization, all values clean) │ ▼ GOLD (now: pure enrichment, FA matching)

Two architectural choices

Option	Shape	Pros	Cons
Option A — new sub-tier	Bronze → `Bronze_Cleaned` → Silver → Gold	Bronze stays a literal immutable mirror (the gold rule of raw data lakes). Cleaning is a separate ownable artifact with its own QA gate. Easy to compare cleaned-vs-raw rows for audit.	Another tier to maintain. Storage cost doubles for the cleaned mirror (although clustering can manage this).
Option B — harden Silver	Bronze → Silver (now: clean + join) → Gold	No new tier. Same path Silver already takes; just more work per row.	Silver's job grows from "join" to "join + clean" — bigger function, harder to test in isolation. Audit trail of cleaning vs raw is lost (cleaned values overwrite originals).

Recommendation

Option A if storage is cheap (it is — S3 + Hudi compresses well). The clean separation between "raw mirror" and "cleaned for analytics" makes the contract auditable: every cleaning rule lives in one place, every cleaned row carries a provenance pointer back to the Bronze row it came from. Silver and Gold simplify in return.

This layer also becomes the natural home for the rename (issue #1), the permit_category enum (issue #2), the date sanitize fix (issue #3), the event_date + observation_period contract (section 4b), the status normalization (issue #5), the outlier flagging (issue #7). Today these are scattered across "Bronze sanitize is too loose," "Silver doesn't normalize," and "every consumer re-derives." A cleaning layer collapses them into one job with one contract.

8c · Integrate QA + observability with the matrix hub PROPOSAL

None of this exists today. The current pipeline's observability is self-contained: DynamoDB tables (buildzoom_etl_logs, buildzoom_qa_metrics) + Slack webhook + CloudWatch. To know "did anything fail this week?" or "what's the trend of completeness scores?" requires logging into DynamoDB and writing queries.

The requirement: every Bronze / Cleaning / Silver / Gold / Platinum QA run should write its result to the central matrix hub on Eritwini (clarify platform name) so observability lives where the rest of the company already looks — not in pipeline-specific logs that only DE can read.

What "matrix hub integration" means concretely

Current	Proposed
QA result lands in `DynamoDB.buildzoom_qa_metrics`	Same row also pushed to the matrix hub via API / webhook
Failures alert Slack channel only	Failures alert Slack and create an entry in the matrix-hub incident view (visible to non-DE stakeholders)
Run history lives in DynamoDB tables, only queryable via AWS console / scripts	Run history surfaced as a matrix-hub dashboard: per-table, per-vintage, per-QA-rule. Time-series of completeness scores, duplicate %, null %, etc.
Each consumer (Callzeke dashboard, Apollo, new model) builds its own "did the upstream pipeline run cleanly?" check	Single matrix-hub query answers "is upstream healthy?" for every consumer
Schema drift detection lives in QA code, surfaced as Slack alerts	Schema drift surfaced as a matrix-hub dashboard tile with history

Why this matters (beyond convenience)

Cross-team visibility. DS and analytics teams shouldn't need DE help to verify upstream health. Today they do.
Audit trail. Compliance / external review can read the matrix-hub log without ever touching pipeline infrastructure.
Trend detection. Slack alerts catch acute failures; the matrix hub catches slow drifts (completeness score sliding from 88 → 82 → 76 over six weeks).
Single source of operational truth. Same pattern as for product KPIs and business metrics — pipeline health should live in the same place.

Open clarification

"Eritwini" — please confirm the exact platform name (suspecting transcription of an internal hub or a vendor like Airtable / Datadog / Monte Carlo / a homegrown tool). Once confirmed, the integration approach (REST API push, S3 sync, webhook, native connector) follows from the platform's supported ingestion methods.

9 · Where Platinum fits (and why) PROPOSAL

None of this exists today. The current repo has Bronze, Silver, Gold — no Platinum tier. Today every downstream model materializes its own feature snapshots and labels, divergently. This section describes what would close that gap.

What Platinum should produce

Artifact	Grain	Lives at	Built from
Labels (per vertical)	(property_id, T0)	`platinum/labels/<spec_v>/<gold_vintage>/<FIPS>/`	Gold + a versioned `SPEC.md` defining "what counts as a positive"
Feature snapshot (shared across verticals)	(property_id, T0, FIPS)	`platinum/features/<feature_set_v>/<T0>/<FIPS>/`	Silver REM (as-of T0) + Gold permit history per category + macro
Training set (labels ⋈ features)	(property_id, T0)	`platinum/training_sets/<run_id>/`	One labels artifact + one feature artifact, frozen with a manifest
Coverage table	(FIPS, jurisdiction, T0)	`platinum/coverage/<coverage_v>/<T0>/`	Gold + Silver SFH universe

Why a separate tier (and not a richer Gold)

Gold serves many consumers. Adding model-specific as-of joins or label decisions to Gold would force every consumer to inherit them. Platinum is consumer-specific (per-vertical, per-spec_v).
Spec lives with the domain. Whether a permit "counts" depends on the model's question (REPLACEMENT-only roofing vs all-permits HVAC). That decision belongs with the data scientist, not the data engineer.
Reproducibility. A Platinum manifest pins (Gold vintage SHA, spec version, feature_set version, contract violations at build time, row counts per drop rule). Every training run is auditable from a single JSON.

Recommendation

Add a platinum/ directory to this ETL repo. Production Platinum ETL runs on the same EMR / Hudi infrastructure (no new vendor). The label / feature spec lives in the DS repo (where the domain knowledge is) and is referenced by version from the Platinum build. Same operational discipline as Bronze / Silver / Gold; new contract.

11 · Deep technical audit — 3 specialist passes

Methodology

3 independent agents read the full repo (8 205 LOC across 11 PySpark scripts + CloudFormation), each with a different lens: Performance & cost, data quality & correctness, code review. 90+ findings combined. The table below distills only the cross-cutting and HIGH-severity items — every claim cites file:line. Recommendations carry the PROPOSAL tag because no fix has been implemented yet.

11a · Critical correctness bugs (must-fix; silent data loss or wrong output)

Sev	Bug	File:line	Why it matters
HIGH	`Config.QA_THRESHOLDS` attribute doesn't exist — every QA-class fallback path raises `AttributeError`	`shared/utils.py:840` (vs the real attribute `Config.DEFAULT_QA_THRESHOLDS` at `:47`)	Today masked because every caller passes `qa_thresholds=...` explicitly. The day someone forgets, the entire Bronze QA stage crashes. Single-character bug fix.
HIGH	*Hudi options misspelled `hudi.datasource.` instead of `hoodie.`* — silently ignored by Hudi	`shared/utils.py:573-580` · `bronze/Build_Zoom_Permit_Classifications.py:48-53`	Three schema-validation guards never run: `allow.schema.mismatch`, `schema.validate`, `reconcile.schema`. The team thinks schema is being validated; it's not. Vendor-side schema changes pass through invisibly.
HIGH	Silver join on derived `STATE` silently drops permits and creates fan-outs	`silver/Build_Zoom_Silver_Master.py:151-158, 184-190`	`STATE = upper(split(PERMIT_JURISDICTION, "_")[0])`. Any jurisdiction whose first underscore-token isn't a USPS code (`"CHICAGO_IL"`, `"NEWYORK_NY_MANHATTAN"`, formats without underscores) silently maps to a wrong state → join misses every property. Permits with null jurisdiction get `STATE="00"` and join to every `STATE="00"` property — unintentional cartesian explosion.
HIGH	`COUNTY_FIPS` cast to `int` strips leading zeros — violates project hard-rule #1	`bronze/Build_Zoom_Properties.py:54-63` · `silver/Build_Zoom_Silver_Master.py:91`	Alabama `01001` → `1001`. Partitions write to `PARTITION_FIPS=1001/`; every external lookup (Census, FRED, BLS, FHFA, ACS) joins on the 5-char string and silently returns zero rows. Internal cross-references also break.
HIGH	`EFFECTIVE_DATE` allowed NULL in composite Hudi record key — distinct rows silently collide	`bronze/Build_Zoom_Permit_Statuses.py:42-63` · `shared/configs.py:114`	`RECORD_KEY = concat_ws("-", BUILDING_PERMIT_ID, PERMIT_STATUS, EFFECTIVE_DATE)`. `concat_ws` skips nulls — so two genuinely distinct status events with null date collapse to the same key and Hudi upsert silently overwrites. Per config comment ~30 % of statuses have null date.
HIGH	Clustering health check reports row-count regression as `warning`, exits 2 (non-fatal)	`bronze/Hudi_Clustering_Maintenance.py:413-424` · `silver/Hudi_Clustering_Maintenance.py:291-301`	Pre-vs-post row count mismatch >1 % is logged as a warning with comment "might be OK if data was being written concurrently." It is not OK — that's exactly the data-loss / duplication scenario. Step Functions treats exit 2 as non-fatal; silent loss with a yellow icon.
HIGH	DynamoDB execution log has no `ConditionExpression` — concurrent writes overwrite each other	`shared/utils.py:178-242, 297-339`	Cursor (`read_last_period_executed`) computed by client-side sort over full history; writes are unconditional `put_item` keyed on `(table_name, execution_timestamp_iso)`. Two jobs landing in the same second silently overwrite; two concurrent workers both decide "partition X is next" and both write SUCCESS — duplicate Hudi upserts plus a lying log.
HIGH	Athena snapshot swap is non-atomic — old data deleted before new is confirmed queryable	`gold/Build_Zoom_Gold_Master.py:920-986`	Three-step sequence: ALTER LOCATION → ALTER TBLPROPERTIES → delete old prefix. Between steps the table can return zero rows or scan dead paths. No smoke-test query before the delete. If property update fails, no rollback. If delete succeeds but downstream issue forces rollback, data is gone. Polling loop `while True: sleep(1)` is unbounded — stuck Athena hangs Gold forever.
HIGH	Gold address matching has no key normalization — silently misses matches	`gold/Build_Zoom_Gold_Master.py:319-323, 347-356, 385-389`	(1) `"ST"` vs `"STREET"` vs `"AVE"`/`"AVENUE"` — straight string compare, no USPS expansion. (2) Suite / unit / apt suffixes not normalized (`"APT 4B"` ≠ `"#4B"` ≠ `"UNIT 4B"`). (3) `ZIP_CODE` cast to Integer strips leading zeros — every New England state (MA, CT, RI, NJ, VT, ME, NH) loses ZIP matching. (4) APN format varies by county (LA: `1234-567-890` vs `1234567890`; Cook: `12-34-567-890-1234`) — no `regexp_replace`, Cook can lose 100 % of APN matches.
HIGH	Gold `aggregate_properties` groupBy on all 60+ columns; `coalesce` kills multi-condition matches	`gold/Build_Zoom_Gold_Master.py:421-491`	(1) Two roofing rows differing in even one nullable text field produce two Gold rows for the same `BUILDING_PERMIT_ID` instead of one with a 2-element array. (2) Code uses `coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3)` before `collect_list` — so a permit matching FA `"a"` by std-address AND FA `"b"` by street name only keeps `["a"]`. The docstring's promise "Multiple FA_PROPERTYIDs aggregated into array" is contradicted by the code. (3) `collect_list` not `collect_set`, order non-deterministic — consumers doing `FA_PROPERTYID[0]` get different answers per rebuild.
HIGH	`F.current_timestamp()` as Hudi precombine key — re-runs are non-deterministic	`shared/utils.py:723` · all four Bronze drivers (`Permit_Statuses:69`, `Properties:69`, `Permit_Classifications:121`, `Building_Permits` via transform)	Re-ingesting the same source file produces different precombine values; you can never reliably skip a duplicate or verify "is this the same row as last week". Worse: in the same Spark stage, executor-local clocks can resolve out of order, so a tie-break inside Hudi picks arbitrarily.
HIGH	Silver latest-status window ties not broken — re-runs pick different "latest"	`silver/Build_Zoom_Silver_Master.py:118-129`	`Window.partitionBy("BUILDING_PERMIT_ID").orderBy(EFFECTIVE_DATE.desc())`. Two statuses on the same day → `row_number()=1` non-deterministic. Same Bronze → different Silver across runs.
HIGH	PERMISSIVE CSV mode silently nullifies bad rows — no `columnNameOfCorruptRecord` sidecar	`bronze/Build_Zoom_Building_Permits.py:67-75` + analogous in all 4 Bronze readers	A row that fails to parse becomes typed-nulls instead of being quarantined or logged. The QA's null-percentage gate may not trip if only one field is corrupt. No count of rejected rows surfaces anywhere.
HIGH	`BuildingPermitsQA` mutates class attributes at runtime across 4 drivers	`bronze/Build_Zoom_Building_Permits.py:350-352` + 3 analogous (Properties, Permit_Statuses, Permit_Classifications)	Each driver does `qa_checker.CRITICAL_FIELDS = SomeConfig.CRITICAL_FIELDS` on a single instance. Works only because each driver runs in a separate process. Any consolidation (test harness, notebook) bleeds state between tables. Class is also misnamed for one of the four tables it now serves.

11b · Performance & cost HIGH wins

Estimated achievable: 40-60 % wall-clock reduction + 30-50 % cost reduction on the Gold layer alone (the most expensive). Numbers are agent estimates; require real billing data to confirm.

Fix	Mechanism	Impact
Eliminate Gold's 16+ wide-shuffle `.count()` calls for logging	`gold/Build_Zoom_Gold_Master.py:151,182,226,271,341,373,402,440,445,451,452,488`. Replace per-step counts with one aggregated `.agg(F.count_when(...))` at the end; gate behind `logger.isEnabledFor(DEBUG)`.	30-50 % Gold wall-clock
EMR cluster lifecycle — keep one cluster across Bronze + Silver + Gold	Today three independent Step Functions each spin EMR up/down (`~8 min × 3 = 24 min` bootstrap per run). Single state-machine with sub-state-machines on one cluster ID.	24 min/run + 3× bootstrap cost
Add EMR managed scaling + spot TASK group	All three CFNs hard-code `3 × r7i.4xlarge core, no managed scaling, no TASK group, no spot, gp2 EBS`. Bronze tolerates spot churn (Hudi rollback). Switch EBS to `gp3` (20 % cheaper, faster).	2-4× throughput/$
Don't re-read the entire Hudi target for QA on every partition	`bronze/Build_Zoom_Building_Permits.py:109` + analogous. QA's new-vs-updated computation does `spark.read.format("hudi").load(hudi_path)` with no filter — scans 483 M rows for classifications, >50 M for properties, on every partition. Project only the record-key column and filter by source's partition keys before the join.	50-80 % per-partition QA time
Replace 60-column groupBy with key-only aggregation + back-join	`gold/Build_Zoom_Gold_Master.py:480-484`. `matched_df.dropDuplicates(['BUILDING_PERMIT_ID']).drop('FA_PROPERTYID')` then join the `collect_list` back on `BUILDING_PERMIT_ID` alone.	Huge — 60-tuple shuffle eliminated
Switch `bulk_insert` mode from per-partition append to single union write	`gold/Build_Zoom_Gold_Master.py:612-660` + loop at `:873-881`. `bulk_insert` is intended as a single shot — 67 calls inside `mode("append")` repeatedly cycles Hudi metadata. Either union all partitions into one DataFrame before writing, or scale `hoodie.bulkinsert.shuffle.parallelism` dynamically by row count.	30-60 % Gold write time
Bump `hoodie.cleaner.commits.retained` from 1 to 3	`shared/utils.py:586`. Current value strips all rollback ability and works against COW write amplification.	Recovery margin + fewer cleanup writes
Stop using `df.limit(1)` upsert to trigger clustering	The "limit-1 upsert with inline clustering" trick rewrites real file groups on every "trigger" call (`bronze/Hudi_Clustering_Maintenance.py:290`, `silver:237`, `gold/Run_Clustering.py:79`). Use Hudi native `CALL run_clustering(...)` procedure; on failure, log and skip — never write phantom records.	20-40 % clustering cost
Single Spark session builder; eliminate 7× copy-pasted boilerplate	`Build_Zoom_Building_Permits.py:294-334` + 6 analogous. `shuffle.partitions`, `coalescePartitions`, executor sizing all drift between copies. Extract `shared.spark.make_session()`.	Consistency + faster onboarding
DynamoDB query: add GSI or use `Limit + ScanIndexForward=False`	`shared/utils.py:198-216, 262-275, 369-394`. Today every job start scans full history.	O(N) DynamoDB RCU growth eliminated

11c · Silent-failure data-quality issues

Sev	Failure mode	File:line
HIGH	`"00"` sentinel STATE creates fan-out matching (covered in 11a)	`silver/Build_Zoom_Silver_Master.py:151-158`
HIGH	`PARTITION_YEAR=9999` sentinel for null INITIAL_STATUS_DATE rows is never in `batch_partitions` filter → Silver silently drops them	`shared/utils.py:716` · `silver/Build_Zoom_Silver_Master.py:148`
HIGH	Properties `schema_evolution=true` + `auto.evolution.column.type.override=STRING` means a new vendor column lands as STRING regardless of true type — silent type coercion	`shared/utils.py:573-580`
HIGH	`Salt_Bucket=0 → "SMALL_PARTITION"` literal groups all small-FIPS into one bucket — cross-county APN collisions become possible (BuildZoom permit in FIPS A matches FA property in FIPS B by APN string)	`gold/Build_Zoom_Gold_Master.py:91-93, 354-356`
HIGH	Classifications `collect_list` not `collect_set` — duplicate `PROJECT_TYPE` rows in Bronze produce `["Roofing", "Roofing"]` arrays in Silver	`silver/Build_Zoom_Silver_Master.py:67-69`
MED	Multi-line `DESCRIPTION` with embedded newlines breaks CSV parsing (`multiLine` not set on 3 of 4 readers)	`bronze/Build_Zoom_Building_Permits.py:67-75` + 2 analogous
MED	`extract_partitions_inside_folder` picks first match by sort if vendor accidentally drops a backup file (e.g. `.csv.bak`) — no assertion that exactly 1 file matches	`bronze/Build_Zoom_Building_Permits.py:419-422` + analogous
MED	No referential integrity check: every `BUILDING_PERMIT_ID` in `Permit_Statuses` should exist in `Building_Permits` — orphan status rows silently inflate Silver	no test exists
MED	Completeness score is gamable — weighted average over all columns means adding always-full Hudi meta columns dilutes bad fields	`shared/utils.py:988-1009`
MED	No value-distribution drift check vs previous partition — a jurisdiction dropping 80 % of permits goes unflagged	`shared/utils.py:1042-1060`
MED	QA "initial-load" path treats `df_target = None` as initial — if the target read fails silently (permission denied), every record becomes "new" and you ship a full duplicate	`shared/utils.py:923-924`
MED	Silver's "Real Estate Master period" is "freshest S3 prefix" — not deterministic across Gold partition batches if Real Estate publishes mid-Gold-run	`gold/Build_Zoom_Gold_Master.py:200-203`

11d · Cross-cutting architectural concerns

1. `shared/utils.py` is a 2 853-line megamodule

It contains Config, schemas, S3 helpers, DynamoDB helpers, checkpoint logic, Hudi write helpers, transformations, Slack helpers, and three near-duplicate QA classes (BuildingPermitsQA, SilverMasterQA, GoldMasterQA — ~85 % identical, including a bug fix made in Silver/Gold but missing from Bronze around _hoodie_* column filtering). Every change has high blast radius and every test requires booting Spark + faking AWS. Split into shared/{spark,config,s3,dynamodb,hudi,qa,notifications}.py with a thin BronzePipeline / SilverPipeline / GoldPipeline base class to absorb the partition-loop boilerplate (currently copy-pasted 4×) and the Spark-session builder (currently copy-pasted 7×).

2. The execution model assumes serial Step Functions and no human re-runs

Nothing in code enforces "one writer per table". Protection is by convention — README says "ONLY when no ETL jobs are running"; CLAUDE.md says "don't edit mid-run". A senior shop would put a DynamoDB-backed lock around each (table_name, partition) and around the clustering job, with TTL and ownership records, and would make every put_item use ConditionExpression. Until that exists, the _V1 / _V2 versioning scheme is the only rollback you have — and snapshot deletion in Gold (issue 11a #8 above) erases it.

3. Idempotency is partial and silently asymmetric across layers

Bronze uses upsert with a non-deterministic timestamp precombine — safe but expensive on re-run. Silver runs full-refresh upsert — safe but rewrites the entire universe per run. Gold uses bulk_insert — fast but not idempotent if a single partition re-runs (no built-in dedup). Three layers behave differently under the same Step Functions retry policy. Standardize on one strategy per layer with documented retry semantics and surface "this partition has been written N times this snapshot" in DynamoDB so monitoring catches partial-success retries.

4. No CI / lint config — the pipeline ships to prod with no static analysis gate

No ruff, no mypy, no black, no requirements.txt visible. Dead imports, misspelled Hudi options, missing type hints, and dead args accumulate without anyone noticing. This is the cheapest single fix on the entire list — a 1-hour add-CI PR catches future regressions of every category in 11a / 11c.

How to act on this section

The 13 HIGH bugs in 11a are correctness issues — they produce wrong output today, not just future risk. The performance wins in 11b are agent estimates that need real billing data to confirm. The full agent transcripts (90+ findings) are too long to embed here; they live in the agent task outputs and can be re-derived by re-running the three specialist agents over the repo (Performance Benchmarker, Data Engineer, Code Reviewer).

12 · Anexo — independent verification of section 11 findings

Methodology

The 13 HIGH-severity bugs in §11a were originally surfaced by three specialist agents in a single pass. To avoid rubber-stamping, two independent verifier agents (Code Reviewer + Data Engineer) were dispatched separately, told not to trust the original wording, and asked to read the cited code fresh. They produced verdicts of CONFIRMED / PARTIAL / EXAGGERATED / REFUTED / REQUIRES-DATA. The "Consensus" column below reconciles their two views.

12a · Consensus table

#	Finding (one-line)	V1	V2	Consensus	What changes from §11a
F1	`Config.QA_THRESHOLDS` attribute missing	CONFIRMED	CONFIRMED	CONFIRMED	None. Pure typo, latent because every caller passes `qa_thresholds=`.
F2	Hudi options `hudi.datasource.*` silently ignored	CONFIRMED	CONFIRMED	CONFIRMED	Refined fix: not a simple prefix swap — 2 of 3 keys (`allow.schema.mismatch`, `schema.validate`) have no real `hoodie.` equivalent*. Author intent ("validate + reconcile") needs translation to actual Hudi keys: `hoodie.datasource.write.reconcile.schema` + `hoodie.avro.schema.validate`.
F3	Silver join on derived STATE drops & fan-outs	CONFIRMED	CONFIRMED	CONFIRMED	Both failure modes empirically defensible. Magnitude data-dependent (depends on real `PERMIT_JURISDICTION` formats), but the `"00"` sentinel collision is structural — happens the first time any null arrives on either side.
F4	`COUNTY_FIPS` cast to int strips leading zeros	CONFIRMED	CONFIRMED	CONFIRMED	Bronze actually handles nulls/non-numeric with a "00000" sentinel at `:54-60` — and then casts to int at `:63`, throwing away the work. The author knew FIPS is a string then ignored it.
F5	NULL `EFFECTIVE_DATE` in composite Hudi key	CONFIRMED	CONFIRMED	CONFIRMED	`concat_ws` null-skip behavior verified in Spark 2.x / 3.x / 4.x. Per config comment ~30 % nulls → wide collision surface.
F6	Clustering row-count drift >1% becomes warning + exit 2	PARTIAL	PARTIAL / EXAGGERATED	PARTIAL	The drift-to-warning downgrade is real. The "exit 2 is silent" claim depends on how the Step Functions caller treats non-zero exits. REQUIRES-DATA: confirm the State Machine doesn't special-case exit 2. Bigger architectural smell: the "concurrent writes" excuse implies clustering can race with ETL writers — that's the real bug.
F7	DynamoDB execution log race	CONFIRMED	PARTIAL / EXAGGERATED	PARTIAL — refined	Real bug is different from original framing. The "two concurrent writes within the same second" is unlikely at ISO-microsecond precision. The genuine concern: `read_last_period_executed` does a full unpaginated DynamoDB `Query`, response is truncated to 1 MB by AWS — at ~10 K items the cursor silently misses older successes that may be the actual max-period. Fix: add pagination loop.
F8	Athena snapshot swap non-atomic + delete unverified	PARTIAL	CONFIRMED	CONFIRMED — refined	Code does check Athena query `state == SUCCEEDED` (V1's correction). What's still missing: a SELECT smoke-query against the new location after the ALTERs and before the delete. Polling loop is unbounded — stuck query hangs forever. Two-ALTER sequence is non-atomic.
F9	Gold address matching has no key normalization	CONFIRMED	PARTIAL	PARTIAL — refined	Std-line + Std-zip ARE handled upstream by the `Standardized_Addresses_*` tables — Condition 1 in Gold consumes them correctly. Real remaining gaps: (a) APN normalization (raw `PARCEL_NUMBER == APN`, no `regexp_replace`, format varies per county), (b) `ZIP_CODE` cast to `IntegerType` drops leading zeros (whole New England loses ZIP joins). Condition 3 street normalization opaque — needs more investigation.
F10	Gold `coalesce` + 60-col groupBy kills multi-condition matches	CONFIRMED	REFUTED-as-stated, real bug present	CONFIRMED	Both verifiers converged: bug is real, mechanism is exactly what was claimed (coalesce runs BEFORE collect_list, kills multi-match). V2's "refuted-as-stated" was about minor framing wording, not the bug substance.
F11	`F.current_timestamp()` as precombine non-deterministic	CONFIRMED	PARTIAL / EXAGGERATED	PARTIAL — context-dependent	Under single-writer architecture (this pipeline today — Step Functions serializes), `current_timestamp()` is safe enough. The risk bites if (a) anyone re-runs a partition expecting idempotency, (b) two parallel jobs ever touch the same key (forbidden by convention but not by code). Status as a HIGH depends on whether the team wants re-run idempotency.
F12	Silver latest-status window tie not broken	CONFIRMED	CONFIRMED	CONFIRMED	Spark window sort is provably not stable across the shuffle exchange. Reproducibility broken. Compounds with F5 (null-date status collisions) — a permit with 3 null-date statuses has 3 ties at the top → any can be "latest".
F13	PERMISSIVE CSV silently nullifies bad cells, no audit trail	CONFIRMED	CONFIRMED	CONFIRMED	Properties.py is worse than originally stated — it sets no mode at all, defaulting to PERMISSIVE implicitly (other three Bronze readers at least set it explicitly). Per-cell null (not whole-row null) — equally dangerous because it's silent.

12b · Final tally

8

CONFIRMED

F1 · F2 · F3 · F4 · F5 · F8 · F10 · F12 · F13

5

PARTIAL / REFINED

F6 · F7 · F9 · F11
(real bug, refined framing or context-dependent)

0

FULLY REFUTED

No finding survived as completely wrong.

12c · What, why, how — for each CONFIRMED finding

Compressed implementation-level summary. Numbers in parens are the original finding ID from §11a.

ID	WHAT (the bug)	WHY (consequence)	HOW (one-line fix)
F1	`Config.QA_THRESHOLDS` doesn't exist — fallback path crashes	Latent AttributeError; today masked by every caller passing kwarg explicitly	`shared/utils.py:840`: `Config.QA_THRESHOLDS` → `Config.DEFAULT_QA_THRESHOLDS`
F2	3 schema-validation options use `hudi.` prefix instead of `hoodie.` — silently ignored	Schema-validation guards never run; opposite of intent ships to prod	Replace with real keys: `hoodie.datasource.write.reconcile.schema` + `hoodie.avro.schema.validate`. Drop the third (no real equivalent).
F3	Silver derives STATE via `upper(split(JURISDICTION, "_")[0])` with `"00"` fallback; Properties does same with raw STATE	Permits with non-USPS jurisdiction → lost from join. Null jurisdiction → cartesian fan-out against all `"00"` properties.	Replace coalesce-to-sentinel with null-safe equals + explicit drop count; carry raw `PROPERTY_STATE` from Properties and join on that.
F4	`COUNTY_FIPS` + `PARTITION_FIPS` cast to `IntegerType`	Alabama `01001` → `1001`. S3 partition path mangled. External FIPS lookups (Census, FRED, BLS, FHFA, ACS) return zero rows.	Keep `StringType` with `F.lpad(col.cast("string"), 5, "0")` — no int cast at any layer.
F5	`concat_ws("-", BPID, STATUS, EFFECTIVE_DATE)` as RECORD_KEY with `EFFECTIVE_DATE` allowed NULL — concat_ws skips nulls	Two distinct status events with null date collapse to same Hudi key; upsert silently overwrites; per config ~30% of statuses have null date	`F.concat_ws("-", BPID, STATUS, F.coalesce(EFFECTIVE_DATE.cast("string"), F.lit("__NULL__")))`
F6	Clustering row-count drift >1% logged as warning + exit 2; "might be OK if concurrent writes" comment	Possible data loss/duplication exits with non-zero code but Step Functions handler MAY treat as success. Concurrent-write excuse signals deeper coordination gap.	Promote drift to hard failure (`sys.exit(1)`). Acquire write lock on table before clustering. Document exit-code contract.
F7	`read_last_period_executed` does unpaginated DynamoDB Query → response truncated to 1 MB at ~10K items → may miss older successes	Cursor advancement is wrong at scale; jobs may re-process completed partitions or skip uncompleted ones	Add pagination loop with `LastEvaluatedKey`; better, replace with `UpdateItem` on a single "last_known_success" row with conditional expression.
F8	Athena LOCATION swap = 2 ALTERs + delete-old, no SELECT smoke-test between, polling loop unbounded	Inconsistent window between ALTERs (zero-row queries possible); stuck Athena hangs forever; failed second ALTER leaves table broken + old data already gone	Single Glue catalog API update (transactional). SELECT COUNT(*) smoke-test after ALTERs, before delete. Bound polling with `max_seconds=300`. Soft-delete to `quarantine/` with lifecycle rule.
F9	Gold address matching: APN raw equality with no normalization, ZIP cast to int strips leading zeros	APN format varies by county (LA `1234-567-890` vs `1234567890`; Cook `12-34-567-890-1234`) → silent miss. New England ZIPs `07021` → `7021` (both sides same int cast, so they match each other — but downstream ZCTA joins via Census fail). Std-line + std-zip are handled upstream.	`regexp_replace(APN, "[^A-Z0-9]", "")` on both sides + per-FIPS APN format profile. Cast ZIP to `lpad(string, 5, "0")`, not int.
F10	`coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3)` picks first non-null BEFORE `collect_list`; 60-col groupBy keeps spurious row splits	Permit matching FA-A by std-address AND FA-B by APN keeps only FA-A — promised multi-match arrays don't materialize. Docstring contradicts code.	Replace coalesce with `array_distinct(array(id1, id2, id3))` per row, then `groupBy(BUILDING_PERMIT_ID).agg(array_distinct(flatten(collect_list(...))))`. Drop the 60-col group.
F11	`F.current_timestamp()` as Hudi precombine — value depends on processing time, not source data	Re-ingestion is not a no-op; re-run "wins" over original; concurrent writers race	For Hudi precombine, use source-derived date (e.g. `LATEST_STATUS_DATE`) or driver-side `lit(datetime.utcnow().isoformat())` materialized once per write. Safe under current single-writer model.
F12	Latest-status window: `orderBy(EFFECTIVE_DATE.desc())` with no tiebreaker → `row_number()=1` non-deterministic on ties	Same input → different Silver across runs. Reproducibility broken. Compounds with F5: null-date ties at the top.	Add a deterministic secondary key: `.orderBy(EFFECTIVE_DATE.desc_nulls_last(), F.col("PERMIT_STATUS").desc(), F.col("ts").desc())` — or define a domain status priority.
F13	PERMISSIVE CSV mode with typed schema nullifies bad cells per-cell; no `columnNameOfCorruptRecord` → bad rows discarded silently. Properties.py doesn't even set the mode (implicit PERMISSIVE).	Vendor format drift produces silent nulls (e.g. `JOB_VALUE = "N/A"` → null), QA threshold `max_null_pct_critical_fields = 1%` may not trip on narrow drifts	Add `.option("columnNameOfCorruptRecord", "_corrupt_record")` + sidecar column; count and emit to QA metrics. Set explicit mode on Properties.

Final note on this verification pass

Zero findings were fully refuted. Eight are confirmed outright, five are confirmed-with-refined-context. The five PARTIALs do not mean "go ahead and ignore" — they mean the recommended fix or the severity rating shifts. Treat all 13 as real items in the engineering backlog; the verifier nuance is what informs prioritization, not whether to fix at all.

12d · Architectural logic audit — structural issues (separate from code bugs)

Why this section exists

The bugs in §11a are code-level — typos, race conditions, missing normalizers. They can be fixed without touching the architecture. A 4th independent agent (Software Architect) audited the logic of the pipeline itself — layer responsibilities, contract boundaries, naming, dependency graph, reproducibility model. Goal: find structural issues that mean "even if every bug were fixed, this design would still be wrong in important ways."

Top 5 structural issues

#	Issue	Where	Severity
S1	Silver advertises a partition key (`PARTITION_FIPS`) that the Silver job never creates. Config declares `PARTITION_KEY = "PARTITION_FIPS"`; the Silver job batches by `PARTITION_YEAR` and produces a `PARTITION_FIPS` column only for permits whose property-join succeeds (when join fails → NULL FIPS → row silently disappears from any Gold filter on FIPS). Worse: every Silver batch touches every FIPS partition file — defeats partitioning as an idempotency boundary. The "zero-downtime full refresh" is a marketing label, not a property.	`shared/configs.py:270` + `silver/Build_Zoom_Silver_Master.py:164-283`	HIGH
S2	The `STATE` column shadows two different concepts and is the silent root of join brokenness. Properties: `STATE` = postal state from assessor. Permits: `STATE` = re-derived in Silver from `upper(split(PERMIT_JURISDICTION, "_")[0])`. Same column name, completely different meaning. Join treats them as equal — they are not. BuildZoom permit jurisdictions are city/county/agency identifiers whose prefix is sometimes a state code, sometimes a FIPS string, sometimes a city slug. Layer-responsibility leak: Silver re-derives a column with the same name to make a join work.	`silver/Build_Zoom_Silver_Master.py:94, 150-154, 186-190`	HIGH
S3	Snapshot semantics are pinned to Bronze, but Gold output fuses 4 independently-versioned upstream tables. Gold builds snapshot suffix from `get_last_bronze_date_snapshot()` and writes `Roofing_Master_Gold_V1_<YYYYMMDD>`. But the join body reads (1) Silver Roofing_Master with no period filter, (2) Silver Real_Estate_Master filtered by S3-folder-listed last_period, (3-4) two Standardized_Addresses tables with no version pin. Three of these tables are not even built by this repo. One Gold "snapshot" stamps a Bronze date onto an output fused from 4 vintages; reproducibility from Gold row → source is broken at the moment of writing. Athena ALTER + delete-old is destructive — no rollback after the flip.	`gold/Build_Zoom_Gold_Master.py:734-772, 188-209` + `shared/utils.py:245-282`	HIGH
S4	Gold mixes data engineering with business decisions — the FA enrichment is a domain join, not a tier. The medallion definition would have Gold contain aggregated business-ready facts. What this Gold actually contains is a probabilistic entity-resolution result: 3-condition fuzzy address match, fall-through coalesce with implicit trust ordering, multiple FA_PROPERTYIDs collected into an array. The match policy is invisible to consumers (no per-row quality flag); tuning the match logic requires a full Gold rebuild (10⁸ rows); the 1:N permit→property fanout becomes an array column without a documented grain change.	`gold/Build_Zoom_Gold_Master.py:8-13, 277-469`	MED-HIGH
S5	*The `Roofing_` naming is a domain lie that will block multi-vertical scaling.** Bronze contains every BuildZoom permit category. The labels are aspirational ("for the 8020REI Roofing vertical"), the contents are generic. Not cosmetic: a future HVAC or solar vertical forces either (a) re-ingesting the same BuildZoom data into `HVAC_*` tables (1× storage + 1× Hudi maintenance per vertical) or (b) accepting "Roofing_Master" as the master for everything (in which case the name is a permanent lie). Current code locks in option (a) — Step Functions stack name, CFN, QA classes all encode "Roofing".	Whole repo · `configs.py:30, 69, 112, 152, 254, 325`	MED

Other structural issues worth tracking

Category	Finding
Layer leak	Properties Bronze casts FIPS, derives partitions, generates DateSnapshot — that's Silver-cleanup work done in Bronze. Silver then re-casts COUNTY_FIPS again.
Layer leak	Silver collapses Statuses (event-stream) to "current status per permit" via window function — that's a business aggregation that destroys history. Gold or a view should own it; Silver should expose the un-collapsed event timeline.
Naming	`AUX_PERMIT_STATUS` / `AUX_EFFECTIVE_STATUS_DATE` are actually "latest status" — name tells the consumer nothing. There's already a `LATEST_STATUS` field on permits, creating two columns with overlapping meaning.
Naming	`S3_TABLE_PATH` on `RoofingGoldConfig` points to a base path that never exists in S3 — real tables live at `_<YYYYMMDD>` snapshots. QA reads from the non-existent base path.
Naming	`HUDI_PARALLELISM` constant is used as `record_size` in `write_table_with_bulk_insert` — name says parallelism, use says size estimate. They are not the same thing.
Dependency design	Gold has 3 hard dependencies on tables not built by this repo (Real_Estate_Master, two Standardized_Addresses tables). If FA changes its column name `PropertyId` → `propertyId`, Gold silently breaks with empty matches. No upstream contract documented.
Dependency design	Bronze QA reads Bronze table state from run N−1 to compute deltas. Re-running breaks QA determinism.
Idempotency	Three layers, three different write policies (Bronze upsert, Silver full-refresh-upsert, Gold bulk_insert into new snapshot). There is no single "this end-to-end vintage is consistent" record. Asking "what's the latest fully-consistent (Bronze, Silver, Gold)?" requires reconciling 3 DynamoDB tables on incompatible date formats.
Reproducibility	No source-file fingerprint stored on any row. A Gold row cannot be traced back to "this BuildZoom CSV, line N." DynamoDB records the partition (`2025-12-10`) but not the file ETag, byte count, or schema hash.
Schema contract	Schema source of truth is a switch statement inside `shared/utils.get_hudi_schema(...)` — the 2853-line god module. No column-level contract: `LATEST_STATUS` "important" but no enum / no range / no allowed-value check.
Schema contract	Schema evolution policies disagree: Hudi has auto-evolution enabled with type-override to STRING; the Spark reader uses a fixed schema and drops new columns at read. Two policies, opposite directions.
Coupling	Silver reads Bronze via `spark.read.format("hudi").load(BRONZE_PATH)` with no commit pin. If Bronze for the active year is mid-commit (Hudi COW), Silver may read a half-finished file group. Step-Function serialization papers over the issue for daily runs but breaks under manual re-processing.
Coupling	Athena DDL string is embedded inside Gold Spark job. Renaming the catalog or table requires editing Python code.
Modeling gap	No SCD treatment on Properties (assessor data, slowly-changing). Hudi upsert overwrites without effective_from/effective_to — historical analysis cannot answer "what did we know about this property on date X?"
Modeling gap	No event view of permits. Statuses is a natural event stream (applied → issued → inspected → completed) but Silver collapses to "latest only". The medallion pattern flattens an event source into a snapshot — the opposite of best practice for lifecycle analysis.
Cross-cutting	`shared/utils.py` at 2853 lines holds Spark schemas, Hudi writer, DynamoDB tracker, S3 lister, three QA classes, Slack formatter, snapshot resolver. A god module; refactoring it requires re-deploying every layer.
Cross-cutting	Salting config (`salting_partitions.json`) lives at `s3://8020rei-staging-data-lake/diego_staging/` — in someone's personal staging path. Deployment landmine.
Hard-coded assumption	One vendor (BuildZoom), one enrichment provider (First American, `FA_PROPERTYID` baked into Gold schema), one region (`us-east-1` hard-coded in 3 files), one vertical, one environment. A second of any of these requires editing string literals in 5+ files.
Future-fragility	Step Functions chain is hard-coded serial Bronze → Silver → Gold. A second vertical needs a second state machine + second EMR cluster, or a fork. CFN doesn't support multi-tenancy.
Future-fragility	Snapshot deletion on next successful run means the first successful run after a bug ships overwrites the last-known-good. No rollback.

Architectural verdict

The medallion pattern is conceptually appropriate for this domain — permit data does benefit from a raw mirror, a denormalized fact, and an enriched serving layer. But the implementation has not internalized what each tier owes the next. Bronze does Silver work; Silver does Gold work and also re-does Bronze work it shouldn't trust; Gold does platform work (Athena DDL) and business decisions (match-method priority) that should be a thin policy layer above pure data. The contracts between layers are implicit — no manifest, no version pin, no schema enum — and the moment a downstream consumer wants to ask "is the data I'm reading consistent?" they cannot.

The structural ceiling is single-vendor, single-domain, single-region naming. The codebase is shaped like a one-shot data engineering project for BuildZoom-Roofing, not like a platform. Adding a second vendor doubles the surface area. Adding a second enrichment provider requires schema changes everywhere FA_PROPERTYID appears. Adding a second region requires editing string literals in 5+ files. The Roofing_* prefix, the hard-coded us-east-1, and the _V1 suffix together signal that the team is not yet thinking in terms of a long-lived platform.

A re-design starting today would: (1) name Bronze + Silver after vendor + entity (buildzoom.permits_raw, buildzoom.permits_facts), reserve domain language ("Roofing", "Solar") for Gold serving views; (2) make the precombine key a deterministic source_file_etag + row_number, never current_timestamp(), so re-runs are bit-identical; (3) split entity-resolution out of Gold into a "match candidates" intermediate with match_method + match_score columns, and let Gold be a thin policy view; (4) replace the VERSION_SUFFIX = "_V1" pattern with a snapshot/manifest table that records {bronze_commit, silver_commit, fa_period, std_addr_commit} per run, with retention policy and rollback; (5) move shared/utils.py into 3–4 cohesive modules and treat each layer's input/output schema as a versioned contract, not a Python switch statement; (6) make Step Functions parameterized over {vendor, domain} so a second vertical is a config change, not a fork.

The medallion isn't wrong. The implementation has just not invested in the contracts between tiers — and contracts, not technology, are what makes a lakehouse maintainable as it grows.

How to read structural vs code-level findings together

The 13 HIGH bugs in §11a are tactical fixes. The 5 structural issues in §12d are strategic re-architecture. Fixing the code bugs without addressing the structural ones leaves the same shape vulnerable to the same class of bugs later. Fixing structure without code-level fixes leaves shipping data wrong today. Do both; sequence the structural changes around the natural rebuild cycle (e.g., when the team next versions the tier from _V1 → _V2).

10 · Glossary & sources

Term	Meaning
Hudi COW	Apache Hudi Copy-on-Write table. Each commit rewrites affected files; readers see consistent snapshots. Single-commit vintages = no version inflation (we verified 1.0× on Pinellas 12103).
Medallion architecture	Bronze (raw) → Silver (clean, joined) → Gold (enriched, joinable). Industry vocabulary (Databricks-coined).
FIPS	Federal Information Processing Standards — 5-digit county identifier (always zero-padded). Used as partition key for geographic pruning.
FA	First-American — the property-records vendor whose `PropertyId` Gold links permits to.
FA_PROPERTYID	Array of First-American property IDs matched to a permit via the 3-condition cascade.
FIPS salting	Spark anti-skew technique: hash-distribute big-FIPS rows across N sub-buckets so executors share load.
COMPOSITE_KEY	Bronze record key for classifications: `BUILDING_PERMIT_ID\|PROJECT_TYPE` (one permit may carry multiple project types).
T0	The prediction-time boundary. Features ≤ T0, labels > T0 + horizon. Not in this ETL today; introduced in the proposed Platinum tier.

Sources for this document

GitHub repo: 8020rei-com/buildzoom-etl (default branch main, cloned 2026-05-19).
Files inspected: README.md, bronze/README.md, shared/configs.py, shared/utils.py, bronze/Build_Zoom_*.py, silver/Build_Zoom_Silver_Master.py, gold/Build_Zoom_Gold_Master.py.
Empirical observations on the 2026-05-12 Pinellas (12103) Gold vintage: 5,402,808 rows, 1 Hudi commit, 1.0× version inflation, 54 columns.
Cross-references from this project's findings: 57 (BuildZoom naming audit), 62 (broad taxonomy), 67 (25K list false-positive audit), 72 (South-FL regex), 73 (coverage universe methodology), 74 (independent verification).

BuildZoom ETL — the data's journey, in human language API → Bronze → Silver → Gold · audit notes · 2026-05-19 hover any dotted term for plain-English

1 · Architecture — the swim-lane view

Read this diagram as a contract

2 · Bronze — raw ingestion, four tables

The four tables (one job each)

Quality gate before write

Date sanitization (important)

Hudi configuration

3 · Silver — the fact table join

The 3-join recipe

What Silver actually does to the data

This is where the "Roofing" name first misleads

4 · Gold — First-American address matching

The 3-condition match cascade

How the cascade resolves

FIPS salting — why the code is more complex than it looks

4b · The temporal contract — why we need event_date + observation_period

Edge case to document

5 · QA gate — same shape at every layer

6 · The data's journey — mental map

7 · Issues identified

8 · Opportunities ALL PROPOSAL

Silver Add permit_category enum

Silver Status timeline as array

Gold Per-condition match provenance

Gold Generalize the table name

Bronze Narrow date clip

Bronze Outlier flags, not caps

Platinum Vertical-aware feature snapshots

Gold Canonical event_date + observation_period

Ops Coverage report as first-class artifact

8b · Missing cleaning layer between Bronze and Gold PROPOSAL

What the cleaning layer would do

Two architectural choices

Recommendation

8c · Integrate QA + observability with the matrix hub PROPOSAL

What "matrix hub integration" means concretely

Why this matters (beyond convenience)

Open clarification

9 · Where Platinum fits (and why) PROPOSAL

What Platinum should produce

Why a separate tier (and not a richer Gold)

Recommendation

11 · Deep technical audit — 3 specialist passes

Methodology

11a · Critical correctness bugs (must-fix; silent data loss or wrong output)

11b · Performance & cost HIGH wins

11c · Silent-failure data-quality issues

11d · Cross-cutting architectural concerns

1. shared/utils.py is a 2 853-line megamodule

2. The execution model assumes serial Step Functions and no human re-runs

3. Idempotency is partial and silently asymmetric across layers

4. No CI / lint config — the pipeline ships to prod with no static analysis gate

How to act on this section

12 · Anexo — independent verification of section 11 findings

Methodology

12a · Consensus table

12b · Final tally

8

5

0

12c · What, why, how — for each CONFIRMED finding

Final note on this verification pass

12d · Architectural logic audit — structural issues (separate from code bugs)

Why this section exists

Top 5 structural issues

Other structural issues worth tracking

Architectural verdict

How to read structural vs code-level findings together

10 · Glossary & sources

Sources for this document

4b · The temporal contract — why we need `event_date` + `observation_period`

Silver Add `permit_category` enum

Gold Canonical `event_date` + `observation_period`

1. `shared/utils.py` is a 2 853-line megamodule