BuildZoom ETL — the data's journey, in human language API → Bronze → Silver → Gold · audit notes · 2026-05-19 hover any dotted term for plain-English
permit_category enum is exposed, and the date-sanitize gate lets 2066 through. There is no Platinum tier — ML-ready data is built ad hoc by each consumer today.
1 · Architecture — the swim-lane view
Five actors, each with a distinct responsibility. (This is a swim-lane diagram, BPMN-style.) Read left to right:
Read this diagram as a contract
Each lane owns one job. Bronze ≠ Silver ≠ Gold — when the boundaries blur (e.g. "Roofing_Master_Gold" contains non-roofing rows), downstream consumers compensate, and that compensation is exactly the problem we're trying to fix.2 · Bronze — raw ingestion, four tables
Bronze is the literal mirror of the BuildZoom vendor extract. CSVs land in S3, an EMR Spark job reads them with an explicit schema, runs a QA gate, and writes Hudi COW (copy-on-write) tables to s3://8020rei-bronze-data-lake/. Nothing is joined here.
The four tables (one job each)
| Table | Record key | Partition | Why this partition |
|---|---|---|---|
Roofing_Building_Permits_V1 |
BUILDING_PERMIT_ID |
PARTITION_YEAR from INITIAL_STATUS_DATE |
Time-pruned scans (most queries ask "permits in year X") |
Roofing_Building_Permit_Classifications_V1 |
COMPOSITE_KEY = BUILDING_PERMIT_ID|PROJECT_TYPE |
none | Small table (~hundreds of MB); a permit may carry multiple PROJECT_TYPE rows that later collapse into an array |
Roofing_Building_Permit_Statuses_V1 |
ID = BUILDING_PERMIT_ID + PERMIT_STATUS + EFFECTIVE_DATE |
PARTITION_YEAR from EFFECTIVE_DATE |
Status timeline per permit (multiple statuses per permit, kept as history) |
Roofing_Properties_V1 |
PROPERTY_ID |
PARTITION_FIPS = COUNTY_FIPS |
Geographic pruning (queries scan a few counties at a time) |
Quality gate before write
Every Bronze write must pass a 6-check QA framework before the Hudi write commits. The checks are configurable per table via shared/configs.py; defaults below.
| Check | Threshold (default) | What it catches |
|---|---|---|
| Null % critical fields | < 1 % | Missing primary keys (BUILDING_PERMIT_ID, PERMIT_NUMBER) |
| Null % important fields | < 10 % | Missing context (PROPERTY_ID, TYPE, jurisdiction) |
| Duplicate % on record key | < 0.6 % | Pipeline producing two-of-the-same |
| Completeness score (weighted) | ≥ 75 % | Overall field-fill health |
| Negative % positive-only fields | < 0.1 % | JOB_VALUE < 0, SQUARE_FEET < 0 … |
| Minimum record count | ≥ 1 | Empty deliveries |
If QA fails: data is NOT written, the run is logged to DynamoDB with QA_FAILED, and a Slack alert fires. If QA passes: Hudi write commits, then a success record is logged. This is a real strength of the pipeline.
Date sanitization (important)
Before QA runs, all date columns are clipped:
year > 2100 → 2100-12-31
year < 1900 → 1900-01-01
otherwise → keep as-is
This prevents Spark long-overflow but is too permissive for ML. A date in 2066 (and we have observed these in the 2026-05 vintage) passes the sanitizer and survives into Silver and Gold. See Issue #2.
Hudi configuration
- Table type: COW (copy-on-write) — every commit re-writes the affected files. Readers see consistent snapshots.
- Target file size: 128 MB (clustering job compacts smaller files into this size on a schedule).
- Schema evolution: enabled (new columns can be added without breaking older readers).
- Cleaning: keep latest commit only.
3 · Silver — the fact table join
Silver is one Spark job (silver/Build_Zoom_Silver_Master.py, ~470 lines) that turns four Bronze tables into a single denormalized fact table: Roofing_Master_V1. Each row = one permit, with property and classification glued on.
The 3-join recipe
What Silver actually does to the data
- Permits: extracts
STATEfromPERMIT_JURISDICTION, repartitions byID_PREFIX(first char of permit ID) so subsequent joins distribute evenly. - Properties: uppercases
CITYandSTREET, coalesces nullSTATEto"00", castsCOUNTY_FIPSto int. Persisted across batches. - Classifications:
groupBy(BUILDING_PERMIT_ID).agg(collect_list("PROJECT_TYPE"))— turns the per-permit-per-type rows into one row per permit withPROJECT_TYPEas an ARRAY. - Statuses: Window ranking
row_number() OVER (PARTITION BY BUILDING_PERMIT_ID ORDER BY EFFECTIVE_DATE DESC)then keepsrow_num == 1. Status history is collapsed — only the latest is kept, exposed asAUX_PERMIT_STATUSandAUX_EFFECTIVE_STATUS_DATE.
This is where the "Roofing" name first misleads
The table is calledRoofing_Master, but its filter is not a roofing filter. It's "every permit on every property" — a generic permits fact table. The name is a project-history artifact (the 8020REI vertical is roofing), not a content description.
4 · Gold — First-American address matching
Gold takes Roofing_Master_V1 (silver) and links each permit to one or more First-American property IDs. The output column FA_PROPERTYID is an array — a single permit can resolve to multiple FA properties.
The 3-condition match cascade
How the cascade resolves
The conditions are tried in order. (1) Standardized address — most accurate when present. (2) APN match — deterministic when both sides have one. (3) Street + ZIP fallback — fuzzy, can resolve one permit to many properties. A permit can match more than one condition (different FA properties for each), so the final array can contain multiple IDs. The conditions are tracked separately as FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3 internally before being combined.
FIPS salting — why the code is more complex than it looks
A handful of "big" counties (Maricopa 04013, Harris 48201, Miami-Dade 12086, etc.) hold a disproportionate share of records. A naive Spark partition-by-FIPS would put 60M Maricopa rows on one executor while others sit idle — classic data skew. The salting layer reads a config from S3 (SALTING_CONFIG_PATH) and rewrites the partition key:
Salt_Bucket = 0→ small partitions glued together asSMALL_PARTITIONSalt_Bucket = 1→ small partitions hashed across 30 sub-bucketsSalt_Bucket = -1→ smallest partitions hashed across 250 sub-bucketsSalt_Bucket > 1→ big partitions hashed acrossSalt_Bucketsub-buckets
The output column is Partition_FIPS_Salting. This is an operational hack done well — it absorbs skew but does not change the data semantics.
4b · The temporal contract — why we need event_date + observation_period
Property ID identifies the parcel — a permanent thing. Owner identifies who lived there, and owners change over time. To answer "who owned this property when the permit started" we need both: a property key AND a time anchor. The schema today exposes property ID but no canonical time anchor — eight date columns, no contract.
This proposal would derive two columns in Gold, used by every downstream consumer the same way. None of this exists today.
With this contract every model joins the same way: gold.property_id == rem.property_id AND rem.period == gold.observation_period. The owner returned is who lived there the month before the permit started — no leakage, no per-consumer divergence.
Edge case to document
If every date column is NULL the row getsevent_date = NULL and the as-of join silently drops it. Surface this as a row-count metric in the QA gate so partitions with high NULL-date rates fail loud instead of disappearing.
5 · QA gate — same shape at every layer
Each layer (Bronze and Silver) uses a layer-specific QA class (BuildingPermitsQA, SilverMasterQA) that runs before the Hudi write. The Gold layer uses GoldMasterQA with the same pattern. Three outcomes:
This pattern is uniform across layers. The output contract is "no QA-failed data ever reaches the next layer." That is real discipline.
6 · The data's journey — mental map
One row's story, end to end, from the moment BuildZoom hands us a CSV until a downstream consumer reads it:
7 · Issues identified
HIGH = downstream consumers materially affected · MED = corner cases / future risk · LOW = naming / hygiene
| Sev | Issue | Where | Impact | Suggested fix | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HIGH | Every table in the pipeline is named Roofing_* but contains permits of all categories — this is a permits pipeline, not a roofing pipeline |
Bronze: Roofing_Building_Permits_V1, Roofing_Building_Permit_Classifications_V1, Roofing_Building_Permit_Statuses_V1, Roofing_Properties_V1Silver: Roofing_Master_V1Gold: Roofing_Master_Gold_V1(Repo name buildzoom-etl and S3 bucket names *-bronze/silver/gold-data-lake are already neutral — only table names mislead.) |
The naming assumes a roofing-only product, but the pipeline ingests every permit category that BuildZoom delivers. Empirical evidence: only 3.5 % of rows have TYPE='Roof' in the Pinellas 12 K sample, and only ~18 % contain "roof" anywhere in PROJECT_TYPE (finding 67 §B). Every downstream consumer (Callzeke, Safeguard, Apollo, the new model) re-applies its own roofing filter on top of data labelled as roofing-but-isn't — different filters, different counts, hours of investigation each time numbers diverge. |
Treat as a coordinated rename across Bronze + Silver + Gold. Suggested mapping:
_V1 → _V2) with the new name rather than an in-place rename — keeps the existing consumers reading _V1 until they migrate. Adds breathing room for the cross-repo coordination (Callzeke dashboard, Apollo, new model, Glue catalog, DynamoDB log tables, Slack templates, CloudFormation specs all touch the names). |
||||||||||||||
| HIGH | No permit_category enum exposed — separate but related to the rename |
across Bronze, Silver, Gold | Renaming the tables to Permits_* only fixes the label. Until a coarse permit_category enum (ROOFING / HVAC / SOLAR / ELECTRICAL / PLUMBING / NEW_CONSTRUCTION / OTHER / UNKNOWN) is materialized as a column, every consumer still re-classifies from TYPE + SUBTYPE + DESCRIPTION + PROJECT_TYPE with regex, and findings 57 / 62 / 72 each produced different counts on the same data. |
Pair the rename with a deterministic enum column derived from the existing PROJECT_TYPE array in Silver. Fine sub-classification (e.g., REPLACEMENT vs REPAIR within ROOFING) stays downstream — that's a model decision. |
||||||||||||||
| HIGH | Date sanitize lets 1900-2100 through, including 2066 | bronze/Build_Zoom_Building_Permits.py:82-88 |
Future-corruption rows (we observed LATEST_STATUS_DATE = 2066-08-23 in the 2026-05-12 vintage) survive Bronze, Silver, Gold. Every label and feature pipeline downstream must re-cap. |
Clip to a narrower range like [1995-01-01, today + 90 days] in Bronze. Anything outside is a data-quality signal, not a value to preserve. |
||||||||||||||
| HIGH | No canonical event_date and no observation_period — owner-as-of joins are impossible without them |
Across Bronze, Silver, Gold — the schema exposes eight date columns (INITIAL_STATUS_DATE, LATEST_STATUS_DATE, APPLIED_DATE, ISSUED_DATE, COMPLETED_DATE, CANCELLED_DATE, AUX_EFFECTIVE_STATUS_DATE, DateSnapshot) and zero canonical "this is the date that defines the permit event". |
Property ID alone identifies the parcel. To answer "who owned this property when the permit started" you also need a time anchor — and there is no single canonical date in the schema. Every consumer picks a different column: Apollo uses LATEST_STATUS_DATE; the Callzeke audit used arg_max(CITY, permit_date) across the gold permit date; finding 67 ranged over multiple. The owner state in Silver REM changes monthly, so picking the wrong date silently attributes the permit to the wrong owner. |
Materialize two columns in Gold (or Platinum) as a contract:
|
||||||||||||||
| MED | Silver collapses status history | silver/Build_Zoom_Silver_Master.py · prepare_statuses |
Only the latest status is exposed (AUX_PERMIT_STATUS). Any analysis that needs the timeline (e.g. how long from APPLIED to ISSUED to COMPLETED, intermediate cancellations) must go back to Bronze. |
Keep both: AUX_* for latest + a side artifact permit_status_timeline at the per-permit grain (array of {status, date}). |
||||||||||||||
| MED | No status normalization | everywhere | LATEST_STATUS mixes case and synonyms (complete, final, issued, canceled, void, expired, withdrawn …). Each consumer redefines what counts as "live" vs "dead." |
Add a status_normalized column in Silver with a finite enum (ACTIVE / COMPLETE / CANCELED / EXPIRED / PENDING / UNKNOWN) plus the raw value preserved. |
||||||||||||||
| MED | FA_PROPERTYID as ARRAY surprises downstream |
gold/Build_Zoom_Gold_Master.py |
One permit → N FA properties (street-name matches especially) creates implicit fan-out at consumers that UNNEST without knowing. We measured 50 K-row samples where this matters. |
Document and surface this contract explicitly in a Gold README: "FA_PROPERTYID is a deduped array; consumers must decide whether to UNNEST (fan-out) or treat as a set." | ||||||||||||||
| MED | JOB_VALUE outliers pass through | Bronze QA only checks negative values | $135 M permit values pass Bronze ("not negative") and contaminate downstream summaries. | Add an upper-tail flag (e.g. job_value_outlier_flag at > p99.9) as a sidecar column — do not cap the value silently. |
||||||||||||||
| LOW | Hard-coded date defaults from June 2025 | shared/configs.py · DEFAULT_PERIOD = '2025-06-29' |
If read_last_period_executed ever falls back to default (e.g. corrupted DynamoDB row), the pipeline silently restarts from a year ago. |
Either remove the default and fail loud, or move it to a deploy-time env var. | ||||||||||||||
| LOW | Silver/Gold READMEs missing | repo root | Bronze has an excellent README. Silver and Gold do not, so the contract lives only in code. | Mirror the Bronze README structure for Silver and Gold (schemas, partitions, joins, QA thresholds). | ||||||||||||||
| LOW | No Platinum tier — every model rebuilds features locally | repo has only bronze/, silver/, gold/ |
ML consumers materialize feature snapshots and labels in their own caches. Reproducibility is per-consumer. | See section 9. |
8 · Opportunities ALL PROPOSAL
Beyond fixes, here are higher-leverage improvements:
Silver Add permit_category enum
One deterministic mapping from PROJECT_TYPE array → coarse class. Lives in shared/utils.py, tested with a finite truth table. Unlocks multi-vertical without touching the rest of the pipeline.
Silver Status timeline as array
Keep AUX_* for latest. Add permit_status_timeline = array of {status, date, days_in_status}. No new joins; one window function more.
Gold Per-condition match provenance
Surface which of the 3 conditions matched (already computed internally — just keep the column). Lets downstream weight by match confidence (cond-1 strongest, cond-3 fuzzy).
Gold Generalize the table name
Rename Roofing_Master_Gold to Permits_Master_Gold. The contents are already generic; the name is the artifact.
Bronze Narrow date clip
Clip [1995-01-01, today + 90 days] and reject outliers loudly. Pipeline already has the QA hook; this is a 6-line change.
Bronze Outlier flags, not caps
JOB_VALUE, SQUARE_FEET, FEES: add side-flags at p99.9 cuts without altering the source value.
Platinum Vertical-aware feature snapshots
One ETL produces (T0, FIPS) feature snapshots that serve every vertical model (roofing, HVAC, solar). Labels per vertical live in vertical-scoped sub-paths. See section 9.
Gold Canonical event_date + observation_period
event_date = earliest non-null of the 6 status dates ("when the permit started being worked on"). observation_period = event_date month-end minus 1 month — the Silver REM snapshot key for joining to the owner / distress / valuation that existed before the permit. Materialize both in Gold so every downstream model joins to the same time anchor. See section 4b above for the diagram + edge cases.
Ops Coverage report as first-class artifact
Today coverage is reconstructed by scripts/roofing/build_coverage_universe.py in another repo. Move the per-(FIPS, jurisdiction) coverage table into Gold or Platinum as a canonical output so consumers don't keep rebuilding it.
8b · Missing cleaning layer between Bronze and Gold PROPOSAL
None of this exists today. The current Silver job is structurally a join — it consolidates the four Bronze tables into a denormalized fact, but its cleaning work is minimal: uppercase CITY and STREET, cast COUNTY_FIPS to int, coalesce missing STATE to "00", keep the latest status. Everything else flows through to Gold unchanged. Result: Gold is still dirty — future dates survive, OCR typos survive, status case-variants survive, sentinel values survive, outlier JOB_VALUE survives.
Each downstream consumer ends up re-doing the same cleaning work locally — inconsistently. This is exactly the kind of repeated work a tier is supposed to eliminate.
What the cleaning layer would do
Two architectural choices
| Option | Shape | Pros | Cons |
|---|---|---|---|
| Option A — new sub-tier | Bronze → Bronze_Cleaned → Silver → Gold |
Bronze stays a literal immutable mirror (the gold rule of raw data lakes). Cleaning is a separate ownable artifact with its own QA gate. Easy to compare cleaned-vs-raw rows for audit. | Another tier to maintain. Storage cost doubles for the cleaned mirror (although clustering can manage this). |
| Option B — harden Silver | Bronze → Silver (now: clean + join) → Gold | No new tier. Same path Silver already takes; just more work per row. | Silver's job grows from "join" to "join + clean" — bigger function, harder to test in isolation. Audit trail of cleaning vs raw is lost (cleaned values overwrite originals). |
Recommendation
Option A if storage is cheap (it is — S3 + Hudi compresses well). The clean separation between "raw mirror" and "cleaned for analytics" makes the contract auditable: every cleaning rule lives in one place, every cleaned row carries a provenance pointer back to the Bronze row it came from. Silver and Gold simplify in return.This layer also becomes the natural home for the rename (issue #1), the permit_category enum (issue #2), the date sanitize fix (issue #3), the event_date + observation_period contract (section 4b), the status normalization (issue #5), the outlier flagging (issue #7). Today these are scattered across "Bronze sanitize is too loose," "Silver doesn't normalize," and "every consumer re-derives." A cleaning layer collapses them into one job with one contract.
8c · Integrate QA + observability with the matrix hub PROPOSAL
None of this exists today. The current pipeline's observability is self-contained: DynamoDB tables (buildzoom_etl_logs, buildzoom_qa_metrics) + Slack webhook + CloudWatch. To know "did anything fail this week?" or "what's the trend of completeness scores?" requires logging into DynamoDB and writing queries.
The requirement: every Bronze / Cleaning / Silver / Gold / Platinum QA run should write its result to the central matrix hub on Eritwini (clarify platform name) so observability lives where the rest of the company already looks — not in pipeline-specific logs that only DE can read.
What "matrix hub integration" means concretely
| Current | Proposed |
|---|---|
QA result lands in DynamoDB.buildzoom_qa_metrics |
Same row also pushed to the matrix hub via API / webhook |
| Failures alert Slack channel only | Failures alert Slack and create an entry in the matrix-hub incident view (visible to non-DE stakeholders) |
| Run history lives in DynamoDB tables, only queryable via AWS console / scripts | Run history surfaced as a matrix-hub dashboard: per-table, per-vintage, per-QA-rule. Time-series of completeness scores, duplicate %, null %, etc. |
| Each consumer (Callzeke dashboard, Apollo, new model) builds its own "did the upstream pipeline run cleanly?" check | Single matrix-hub query answers "is upstream healthy?" for every consumer |
| Schema drift detection lives in QA code, surfaced as Slack alerts | Schema drift surfaced as a matrix-hub dashboard tile with history |
Why this matters (beyond convenience)
- Cross-team visibility. DS and analytics teams shouldn't need DE help to verify upstream health. Today they do.
- Audit trail. Compliance / external review can read the matrix-hub log without ever touching pipeline infrastructure.
- Trend detection. Slack alerts catch acute failures; the matrix hub catches slow drifts (completeness score sliding from 88 → 82 → 76 over six weeks).
- Single source of operational truth. Same pattern as for product KPIs and business metrics — pipeline health should live in the same place.
Open clarification
"Eritwini" — please confirm the exact platform name (suspecting transcription of an internal hub or a vendor like Airtable / Datadog / Monte Carlo / a homegrown tool). Once confirmed, the integration approach (REST API push, S3 sync, webhook, native connector) follows from the platform's supported ingestion methods.9 · Where Platinum fits (and why) PROPOSAL
None of this exists today. The current repo has Bronze, Silver, Gold — no Platinum tier. Today every downstream model materializes its own feature snapshots and labels, divergently. This section describes what would close that gap.
What Platinum should produce
| Artifact | Grain | Lives at | Built from |
|---|---|---|---|
| Labels (per vertical) | (property_id, T0) | platinum/labels/<spec_v>/<gold_vintage>/<FIPS>/ |
Gold + a versioned SPEC.md defining "what counts as a positive" |
| Feature snapshot (shared across verticals) | (property_id, T0, FIPS) | platinum/features/<feature_set_v>/<T0>/<FIPS>/ |
Silver REM (as-of T0) + Gold permit history per category + macro |
| Training set (labels ⋈ features) | (property_id, T0) | platinum/training_sets/<run_id>/ |
One labels artifact + one feature artifact, frozen with a manifest |
| Coverage table | (FIPS, jurisdiction, T0) | platinum/coverage/<coverage_v>/<T0>/ |
Gold + Silver SFH universe |
Why a separate tier (and not a richer Gold)
- Gold serves many consumers. Adding model-specific as-of joins or label decisions to Gold would force every consumer to inherit them. Platinum is consumer-specific (per-vertical, per-spec_v).
- Spec lives with the domain. Whether a permit "counts" depends on the model's question (REPLACEMENT-only roofing vs all-permits HVAC). That decision belongs with the data scientist, not the data engineer.
- Reproducibility. A Platinum manifest pins (Gold vintage SHA, spec version, feature_set version, contract violations at build time, row counts per drop rule). Every training run is auditable from a single JSON.
Recommendation
Add aplatinum/ directory to this ETL repo. Production Platinum ETL runs on the same EMR / Hudi infrastructure (no new vendor). The label / feature spec lives in the DS repo (where the domain knowledge is) and is referenced by version from the Platinum build. Same operational discipline as Bronze / Silver / Gold; new contract.
11 · Deep technical audit — 3 specialist passes
Methodology
3 independent agents read the full repo (8 205 LOC across 11 PySpark scripts + CloudFormation), each with a different lens: Performance & cost, data quality & correctness, code review. 90+ findings combined. The table below distills only the cross-cutting and HIGH-severity items — every claim citesfile:line. Recommendations carry the PROPOSAL tag because no fix has been implemented yet.
11a · Critical correctness bugs (must-fix; silent data loss or wrong output)
| Sev | Bug | File:line | Why it matters |
|---|---|---|---|
| HIGH | Config.QA_THRESHOLDS attribute doesn't exist — every QA-class fallback path raises AttributeError |
shared/utils.py:840 (vs the real attribute Config.DEFAULT_QA_THRESHOLDS at :47) |
Today masked because every caller passes qa_thresholds=... explicitly. The day someone forgets, the entire Bronze QA stage crashes. Single-character bug fix. |
| HIGH | Hudi options misspelled hudi.datasource.* instead of hoodie.* — silently ignored by Hudi |
shared/utils.py:573-580 · bronze/Build_Zoom_Permit_Classifications.py:48-53 |
Three schema-validation guards never run: allow.schema.mismatch, schema.validate, reconcile.schema. The team thinks schema is being validated; it's not. Vendor-side schema changes pass through invisibly. |
| HIGH | Silver join on derived STATE silently drops permits and creates fan-outs |
silver/Build_Zoom_Silver_Master.py:151-158, 184-190 |
STATE = upper(split(PERMIT_JURISDICTION, "_")[0]). Any jurisdiction whose first underscore-token isn't a USPS code ("CHICAGO_IL", "NEWYORK_NY_MANHATTAN", formats without underscores) silently maps to a wrong state → join misses every property. Permits with null jurisdiction get STATE="00" and join to every STATE="00" property — unintentional cartesian explosion. |
| HIGH | COUNTY_FIPS cast to int strips leading zeros — violates project hard-rule #1 |
bronze/Build_Zoom_Properties.py:54-63 · silver/Build_Zoom_Silver_Master.py:91 |
Alabama 01001 → 1001. Partitions write to PARTITION_FIPS=1001/; every external lookup (Census, FRED, BLS, FHFA, ACS) joins on the 5-char string and silently returns zero rows. Internal cross-references also break. |
| HIGH | EFFECTIVE_DATE allowed NULL in composite Hudi record key — distinct rows silently collide |
bronze/Build_Zoom_Permit_Statuses.py:42-63 · shared/configs.py:114 |
RECORD_KEY = concat_ws("-", BUILDING_PERMIT_ID, PERMIT_STATUS, EFFECTIVE_DATE). concat_ws skips nulls — so two genuinely distinct status events with null date collapse to the same key and Hudi upsert silently overwrites. Per config comment ~30 % of statuses have null date. |
| HIGH | Clustering health check reports row-count regression as warning, exits 2 (non-fatal) |
bronze/Hudi_Clustering_Maintenance.py:413-424 · silver/Hudi_Clustering_Maintenance.py:291-301 |
Pre-vs-post row count mismatch >1 % is logged as a warning with comment "might be OK if data was being written concurrently." It is not OK — that's exactly the data-loss / duplication scenario. Step Functions treats exit 2 as non-fatal; silent loss with a yellow icon. |
| HIGH | DynamoDB execution log has no ConditionExpression — concurrent writes overwrite each other |
shared/utils.py:178-242, 297-339 |
Cursor (read_last_period_executed) computed by client-side sort over full history; writes are unconditional put_item keyed on (table_name, execution_timestamp_iso). Two jobs landing in the same second silently overwrite; two concurrent workers both decide "partition X is next" and both write SUCCESS — duplicate Hudi upserts plus a lying log. |
| HIGH | Athena snapshot swap is non-atomic — old data deleted before new is confirmed queryable | gold/Build_Zoom_Gold_Master.py:920-986 |
Three-step sequence: ALTER LOCATION → ALTER TBLPROPERTIES → delete old prefix. Between steps the table can return zero rows or scan dead paths. No smoke-test query before the delete. If property update fails, no rollback. If delete succeeds but downstream issue forces rollback, data is gone. Polling loop while True: sleep(1) is unbounded — stuck Athena hangs Gold forever. |
| HIGH | Gold address matching has no key normalization — silently misses matches | gold/Build_Zoom_Gold_Master.py:319-323, 347-356, 385-389 |
(1) "ST" vs "STREET" vs "AVE"/"AVENUE" — straight string compare, no USPS expansion. (2) Suite / unit / apt suffixes not normalized ("APT 4B" ≠ "#4B" ≠ "UNIT 4B"). (3) ZIP_CODE cast to Integer strips leading zeros — every New England state (MA, CT, RI, NJ, VT, ME, NH) loses ZIP matching. (4) APN format varies by county (LA: 1234-567-890 vs 1234567890; Cook: 12-34-567-890-1234) — no regexp_replace, Cook can lose 100 % of APN matches. |
| HIGH | Gold aggregate_properties groupBy on all 60+ columns; coalesce kills multi-condition matches |
gold/Build_Zoom_Gold_Master.py:421-491 |
(1) Two roofing rows differing in even one nullable text field produce two Gold rows for the same BUILDING_PERMIT_ID instead of one with a 2-element array. (2) Code uses coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3) before collect_list — so a permit matching FA "a" by std-address AND FA "b" by street name only keeps ["a"]. The docstring's promise "Multiple FA_PROPERTYIDs aggregated into array" is contradicted by the code. (3) collect_list not collect_set, order non-deterministic — consumers doing FA_PROPERTYID[0] get different answers per rebuild. |
| HIGH | F.current_timestamp() as Hudi precombine key — re-runs are non-deterministic |
shared/utils.py:723 · all four Bronze drivers (Permit_Statuses:69, Properties:69, Permit_Classifications:121, Building_Permits via transform) |
Re-ingesting the same source file produces different precombine values; you can never reliably skip a duplicate or verify "is this the same row as last week". Worse: in the same Spark stage, executor-local clocks can resolve out of order, so a tie-break inside Hudi picks arbitrarily. |
| HIGH | Silver latest-status window ties not broken — re-runs pick different "latest" | silver/Build_Zoom_Silver_Master.py:118-129 |
Window.partitionBy("BUILDING_PERMIT_ID").orderBy(EFFECTIVE_DATE.desc()). Two statuses on the same day → row_number()=1 non-deterministic. Same Bronze → different Silver across runs. |
| HIGH | PERMISSIVE CSV mode silently nullifies bad rows — no columnNameOfCorruptRecord sidecar |
bronze/Build_Zoom_Building_Permits.py:67-75 + analogous in all 4 Bronze readers |
A row that fails to parse becomes typed-nulls instead of being quarantined or logged. The QA's null-percentage gate may not trip if only one field is corrupt. No count of rejected rows surfaces anywhere. |
| HIGH | BuildingPermitsQA mutates class attributes at runtime across 4 drivers |
bronze/Build_Zoom_Building_Permits.py:350-352 + 3 analogous (Properties, Permit_Statuses, Permit_Classifications) |
Each driver does qa_checker.CRITICAL_FIELDS = SomeConfig.CRITICAL_FIELDS on a single instance. Works only because each driver runs in a separate process. Any consolidation (test harness, notebook) bleeds state between tables. Class is also misnamed for one of the four tables it now serves. |
11b · Performance & cost HIGH wins
Estimated achievable: 40-60 % wall-clock reduction + 30-50 % cost reduction on the Gold layer alone (the most expensive). Numbers are agent estimates; require real billing data to confirm.
| Fix | Mechanism | Impact |
|---|---|---|
Eliminate Gold's 16+ wide-shuffle .count() calls for logging |
gold/Build_Zoom_Gold_Master.py:151,182,226,271,341,373,402,440,445,451,452,488. Replace per-step counts with one aggregated .agg(F.count_when(...)) at the end; gate behind logger.isEnabledFor(DEBUG). |
30-50 % Gold wall-clock |
| EMR cluster lifecycle — keep one cluster across Bronze + Silver + Gold | Today three independent Step Functions each spin EMR up/down (~8 min × 3 = 24 min bootstrap per run). Single state-machine with sub-state-machines on one cluster ID. |
24 min/run + 3× bootstrap cost |
| Add EMR managed scaling + spot TASK group | All three CFNs hard-code 3 × r7i.4xlarge core, no managed scaling, no TASK group, no spot, gp2 EBS. Bronze tolerates spot churn (Hudi rollback). Switch EBS to gp3 (20 % cheaper, faster). |
2-4× throughput/$ |
| Don't re-read the entire Hudi target for QA on every partition | bronze/Build_Zoom_Building_Permits.py:109 + analogous. QA's new-vs-updated computation does spark.read.format("hudi").load(hudi_path) with no filter — scans 483 M rows for classifications, >50 M for properties, on every partition. Project only the record-key column and filter by source's partition keys before the join. |
50-80 % per-partition QA time |
| Replace 60-column groupBy with key-only aggregation + back-join | gold/Build_Zoom_Gold_Master.py:480-484. matched_df.dropDuplicates(['BUILDING_PERMIT_ID']).drop('FA_PROPERTYID') then join the collect_list back on BUILDING_PERMIT_ID alone. |
Huge — 60-tuple shuffle eliminated |
Switch bulk_insert mode from per-partition append to single union write |
gold/Build_Zoom_Gold_Master.py:612-660 + loop at :873-881. bulk_insert is intended as a single shot — 67 calls inside mode("append") repeatedly cycles Hudi metadata. Either union all partitions into one DataFrame before writing, or scale hoodie.bulkinsert.shuffle.parallelism dynamically by row count. |
30-60 % Gold write time |
Bump hoodie.cleaner.commits.retained from 1 to 3 |
shared/utils.py:586. Current value strips all rollback ability and works against COW write amplification. |
Recovery margin + fewer cleanup writes |
Stop using df.limit(1) upsert to trigger clustering |
The "limit-1 upsert with inline clustering" trick rewrites real file groups on every "trigger" call (bronze/Hudi_Clustering_Maintenance.py:290, silver:237, gold/Run_Clustering.py:79). Use Hudi native CALL run_clustering(...) procedure; on failure, log and skip — never write phantom records. |
20-40 % clustering cost |
| Single Spark session builder; eliminate 7× copy-pasted boilerplate | Build_Zoom_Building_Permits.py:294-334 + 6 analogous. shuffle.partitions, coalescePartitions, executor sizing all drift between copies. Extract shared.spark.make_session(). |
Consistency + faster onboarding |
DynamoDB query: add GSI or use Limit + ScanIndexForward=False |
shared/utils.py:198-216, 262-275, 369-394. Today every job start scans full history. |
O(N) DynamoDB RCU growth eliminated |
11c · Silent-failure data-quality issues
| Sev | Failure mode | File:line |
|---|---|---|
| HIGH | "00" sentinel STATE creates fan-out matching (covered in 11a) | silver/Build_Zoom_Silver_Master.py:151-158 |
| HIGH | PARTITION_YEAR=9999 sentinel for null INITIAL_STATUS_DATE rows is never in batch_partitions filter → Silver silently drops them | shared/utils.py:716 · silver/Build_Zoom_Silver_Master.py:148 |
| HIGH | Properties schema_evolution=true + auto.evolution.column.type.override=STRING means a new vendor column lands as STRING regardless of true type — silent type coercion | shared/utils.py:573-580 |
| HIGH | Salt_Bucket=0 → "SMALL_PARTITION" literal groups all small-FIPS into one bucket — cross-county APN collisions become possible (BuildZoom permit in FIPS A matches FA property in FIPS B by APN string) | gold/Build_Zoom_Gold_Master.py:91-93, 354-356 |
| HIGH | Classifications collect_list not collect_set — duplicate PROJECT_TYPE rows in Bronze produce ["Roofing", "Roofing"] arrays in Silver | silver/Build_Zoom_Silver_Master.py:67-69 |
| MED | Multi-line DESCRIPTION with embedded newlines breaks CSV parsing (multiLine not set on 3 of 4 readers) | bronze/Build_Zoom_Building_Permits.py:67-75 + 2 analogous |
| MED | extract_partitions_inside_folder picks first match by sort if vendor accidentally drops a backup file (e.g. .csv.bak) — no assertion that exactly 1 file matches | bronze/Build_Zoom_Building_Permits.py:419-422 + analogous |
| MED | No referential integrity check: every BUILDING_PERMIT_ID in Permit_Statuses should exist in Building_Permits — orphan status rows silently inflate Silver | no test exists |
| MED | Completeness score is gamable — weighted average over all columns means adding always-full Hudi meta columns dilutes bad fields | shared/utils.py:988-1009 |
| MED | No value-distribution drift check vs previous partition — a jurisdiction dropping 80 % of permits goes unflagged | shared/utils.py:1042-1060 |
| MED | QA "initial-load" path treats df_target = None as initial — if the target read fails silently (permission denied), every record becomes "new" and you ship a full duplicate | shared/utils.py:923-924 |
| MED | Silver's "Real Estate Master period" is "freshest S3 prefix" — not deterministic across Gold partition batches if Real Estate publishes mid-Gold-run | gold/Build_Zoom_Gold_Master.py:200-203 |
11d · Cross-cutting architectural concerns
1. shared/utils.py is a 2 853-line megamodule
It contains Config, schemas, S3 helpers, DynamoDB helpers, checkpoint logic, Hudi write helpers, transformations, Slack helpers, and three near-duplicate QA classes (BuildingPermitsQA, SilverMasterQA, GoldMasterQA — ~85 % identical, including a bug fix made in Silver/Gold but missing from Bronze around _hoodie_* column filtering). Every change has high blast radius and every test requires booting Spark + faking AWS. Split into shared/{spark,config,s3,dynamodb,hudi,qa,notifications}.py with a thin BronzePipeline / SilverPipeline / GoldPipeline base class to absorb the partition-loop boilerplate (currently copy-pasted 4×) and the Spark-session builder (currently copy-pasted 7×).
2. The execution model assumes serial Step Functions and no human re-runs
Nothing in code enforces "one writer per table". Protection is by convention — README says "ONLY when no ETL jobs are running"; CLAUDE.md says "don't edit mid-run". A senior shop would put a DynamoDB-backed lock around each (table_name, partition) and around the clustering job, with TTL and ownership records, and would make every put_item use ConditionExpression. Until that exists, the _V1 / _V2 versioning scheme is the only rollback you have — and snapshot deletion in Gold (issue 11a #8 above) erases it.
3. Idempotency is partial and silently asymmetric across layers
Bronze uses upsert with a non-deterministic timestamp precombine — safe but expensive on re-run. Silver runs full-refresh upsert — safe but rewrites the entire universe per run. Gold uses bulk_insert — fast but not idempotent if a single partition re-runs (no built-in dedup). Three layers behave differently under the same Step Functions retry policy. Standardize on one strategy per layer with documented retry semantics and surface "this partition has been written N times this snapshot" in DynamoDB so monitoring catches partial-success retries.
4. No CI / lint config — the pipeline ships to prod with no static analysis gate
No ruff, no mypy, no black, no requirements.txt visible. Dead imports, misspelled Hudi options, missing type hints, and dead args accumulate without anyone noticing. This is the cheapest single fix on the entire list — a 1-hour add-CI PR catches future regressions of every category in 11a / 11c.
How to act on this section
The 13 HIGH bugs in 11a are correctness issues — they produce wrong output today, not just future risk. The performance wins in 11b are agent estimates that need real billing data to confirm. The full agent transcripts (90+ findings) are too long to embed here; they live in the agent task outputs and can be re-derived by re-running the three specialist agents over the repo (Performance Benchmarker, Data Engineer, Code Reviewer).12 · Anexo — independent verification of section 11 findings
Methodology
The 13 HIGH-severity bugs in §11a were originally surfaced by three specialist agents in a single pass. To avoid rubber-stamping, two independent verifier agents (Code Reviewer + Data Engineer) were dispatched separately, told not to trust the original wording, and asked to read the cited code fresh. They produced verdicts of CONFIRMED / PARTIAL / EXAGGERATED / REFUTED / REQUIRES-DATA. The "Consensus" column below reconciles their two views.12a · Consensus table
| # | Finding (one-line) | V1 | V2 | Consensus | What changes from §11a |
|---|---|---|---|---|---|
| F1 | Config.QA_THRESHOLDS attribute missing | CONFIRMED | CONFIRMED | CONFIRMED | None. Pure typo, latent because every caller passes qa_thresholds=. |
| F2 | Hudi options hudi.datasource.* silently ignored | CONFIRMED | CONFIRMED | CONFIRMED | Refined fix: not a simple prefix swap — 2 of 3 keys (allow.schema.mismatch, schema.validate) have no real hoodie.* equivalent. Author intent ("validate + reconcile") needs translation to actual Hudi keys: hoodie.datasource.write.reconcile.schema + hoodie.avro.schema.validate. |
| F3 | Silver join on derived STATE drops & fan-outs | CONFIRMED | CONFIRMED | CONFIRMED | Both failure modes empirically defensible. Magnitude data-dependent (depends on real PERMIT_JURISDICTION formats), but the "00" sentinel collision is structural — happens the first time any null arrives on either side. |
| F4 | COUNTY_FIPS cast to int strips leading zeros | CONFIRMED | CONFIRMED | CONFIRMED | Bronze actually handles nulls/non-numeric with a "00000" sentinel at :54-60 — and then casts to int at :63, throwing away the work. The author knew FIPS is a string then ignored it. |
| F5 | NULL EFFECTIVE_DATE in composite Hudi key | CONFIRMED | CONFIRMED | CONFIRMED | concat_ws null-skip behavior verified in Spark 2.x / 3.x / 4.x. Per config comment ~30 % nulls → wide collision surface. |
| F6 | Clustering row-count drift >1% becomes warning + exit 2 | PARTIAL | PARTIAL / EXAGGERATED | PARTIAL | The drift-to-warning downgrade is real. The "exit 2 is silent" claim depends on how the Step Functions caller treats non-zero exits. REQUIRES-DATA: confirm the State Machine doesn't special-case exit 2. Bigger architectural smell: the "concurrent writes" excuse implies clustering can race with ETL writers — that's the real bug. |
| F7 | DynamoDB execution log race | CONFIRMED | PARTIAL / EXAGGERATED | PARTIAL — refined | Real bug is different from original framing. The "two concurrent writes within the same second" is unlikely at ISO-microsecond precision. The genuine concern: read_last_period_executed does a full unpaginated DynamoDB Query, response is truncated to 1 MB by AWS — at ~10 K items the cursor silently misses older successes that may be the actual max-period. Fix: add pagination loop. |
| F8 | Athena snapshot swap non-atomic + delete unverified | PARTIAL | CONFIRMED | CONFIRMED — refined | Code does check Athena query state == SUCCEEDED (V1's correction). What's still missing: a SELECT smoke-query against the new location after the ALTERs and before the delete. Polling loop is unbounded — stuck query hangs forever. Two-ALTER sequence is non-atomic. |
| F9 | Gold address matching has no key normalization | CONFIRMED | PARTIAL | PARTIAL — refined | Std-line + Std-zip ARE handled upstream by the Standardized_Addresses_* tables — Condition 1 in Gold consumes them correctly. Real remaining gaps: (a) APN normalization (raw PARCEL_NUMBER == APN, no regexp_replace, format varies per county), (b) ZIP_CODE cast to IntegerType drops leading zeros (whole New England loses ZIP joins). Condition 3 street normalization opaque — needs more investigation. |
| F10 | Gold coalesce + 60-col groupBy kills multi-condition matches | CONFIRMED | REFUTED-as-stated, real bug present | CONFIRMED | Both verifiers converged: bug is real, mechanism is exactly what was claimed (coalesce runs BEFORE collect_list, kills multi-match). V2's "refuted-as-stated" was about minor framing wording, not the bug substance. |
| F11 | F.current_timestamp() as precombine non-deterministic | CONFIRMED | PARTIAL / EXAGGERATED | PARTIAL — context-dependent | Under single-writer architecture (this pipeline today — Step Functions serializes), current_timestamp() is safe enough. The risk bites if (a) anyone re-runs a partition expecting idempotency, (b) two parallel jobs ever touch the same key (forbidden by convention but not by code). Status as a HIGH depends on whether the team wants re-run idempotency. |
| F12 | Silver latest-status window tie not broken | CONFIRMED | CONFIRMED | CONFIRMED | Spark window sort is provably not stable across the shuffle exchange. Reproducibility broken. Compounds with F5 (null-date status collisions) — a permit with 3 null-date statuses has 3 ties at the top → any can be "latest". |
| F13 | PERMISSIVE CSV silently nullifies bad cells, no audit trail | CONFIRMED | CONFIRMED | CONFIRMED | Properties.py is worse than originally stated — it sets no mode at all, defaulting to PERMISSIVE implicitly (other three Bronze readers at least set it explicitly). Per-cell null (not whole-row null) — equally dangerous because it's silent. |
12b · Final tally
8
CONFIRMED
F1 · F2 · F3 · F4 · F5 · F8 · F10 · F12 · F13
5
PARTIAL / REFINED
F6 · F7 · F9 · F11
(real bug, refined framing or context-dependent)
0
FULLY REFUTED
No finding survived as completely wrong.
12c · What, why, how — for each CONFIRMED finding
Compressed implementation-level summary. Numbers in parens are the original finding ID from §11a.
| ID | WHAT (the bug) | WHY (consequence) | HOW (one-line fix) |
|---|---|---|---|
| F1 | Config.QA_THRESHOLDS doesn't exist — fallback path crashes | Latent AttributeError; today masked by every caller passing kwarg explicitly | shared/utils.py:840: Config.QA_THRESHOLDS → Config.DEFAULT_QA_THRESHOLDS |
| F2 | 3 schema-validation options use hudi. prefix instead of hoodie. — silently ignored | Schema-validation guards never run; opposite of intent ships to prod | Replace with real keys: hoodie.datasource.write.reconcile.schema + hoodie.avro.schema.validate. Drop the third (no real equivalent). |
| F3 | Silver derives STATE via upper(split(JURISDICTION, "_")[0]) with "00" fallback; Properties does same with raw STATE | Permits with non-USPS jurisdiction → lost from join. Null jurisdiction → cartesian fan-out against all "00" properties. | Replace coalesce-to-sentinel with null-safe equals + explicit drop count; carry raw PROPERTY_STATE from Properties and join on that. |
| F4 | COUNTY_FIPS + PARTITION_FIPS cast to IntegerType | Alabama 01001 → 1001. S3 partition path mangled. External FIPS lookups (Census, FRED, BLS, FHFA, ACS) return zero rows. | Keep StringType with F.lpad(col.cast("string"), 5, "0") — no int cast at any layer. |
| F5 | concat_ws("-", BPID, STATUS, EFFECTIVE_DATE) as RECORD_KEY with EFFECTIVE_DATE allowed NULL — concat_ws skips nulls | Two distinct status events with null date collapse to same Hudi key; upsert silently overwrites; per config ~30% of statuses have null date | F.concat_ws("-", BPID, STATUS, F.coalesce(EFFECTIVE_DATE.cast("string"), F.lit("__NULL__"))) |
| F6 | Clustering row-count drift >1% logged as warning + exit 2; "might be OK if concurrent writes" comment | Possible data loss/duplication exits with non-zero code but Step Functions handler MAY treat as success. Concurrent-write excuse signals deeper coordination gap. | Promote drift to hard failure (sys.exit(1)). Acquire write lock on table before clustering. Document exit-code contract. |
| F7 | read_last_period_executed does unpaginated DynamoDB Query → response truncated to 1 MB at ~10K items → may miss older successes | Cursor advancement is wrong at scale; jobs may re-process completed partitions or skip uncompleted ones | Add pagination loop with LastEvaluatedKey; better, replace with UpdateItem on a single "last_known_success" row with conditional expression. |
| F8 | Athena LOCATION swap = 2 ALTERs + delete-old, no SELECT smoke-test between, polling loop unbounded | Inconsistent window between ALTERs (zero-row queries possible); stuck Athena hangs forever; failed second ALTER leaves table broken + old data already gone | Single Glue catalog API update (transactional). SELECT COUNT(*) smoke-test after ALTERs, before delete. Bound polling with max_seconds=300. Soft-delete to quarantine/ with lifecycle rule. |
| F9 | Gold address matching: APN raw equality with no normalization, ZIP cast to int strips leading zeros | APN format varies by county (LA 1234-567-890 vs 1234567890; Cook 12-34-567-890-1234) → silent miss. New England ZIPs 07021 → 7021 (both sides same int cast, so they match each other — but downstream ZCTA joins via Census fail). Std-line + std-zip are handled upstream. | regexp_replace(APN, "[^A-Z0-9]", "") on both sides + per-FIPS APN format profile. Cast ZIP to lpad(string, 5, "0"), not int. |
| F10 | coalesce(FA_PROPERTYID1, FA_PROPERTYID2, FA_PROPERTYID3) picks first non-null BEFORE collect_list; 60-col groupBy keeps spurious row splits | Permit matching FA-A by std-address AND FA-B by APN keeps only FA-A — promised multi-match arrays don't materialize. Docstring contradicts code. | Replace coalesce with array_distinct(array(id1, id2, id3)) per row, then groupBy(BUILDING_PERMIT_ID).agg(array_distinct(flatten(collect_list(...)))). Drop the 60-col group. |
| F11 | F.current_timestamp() as Hudi precombine — value depends on processing time, not source data | Re-ingestion is not a no-op; re-run "wins" over original; concurrent writers race | For Hudi precombine, use source-derived date (e.g. LATEST_STATUS_DATE) or driver-side lit(datetime.utcnow().isoformat()) materialized once per write. Safe under current single-writer model. |
| F12 | Latest-status window: orderBy(EFFECTIVE_DATE.desc()) with no tiebreaker → row_number()=1 non-deterministic on ties | Same input → different Silver across runs. Reproducibility broken. Compounds with F5: null-date ties at the top. | Add a deterministic secondary key: .orderBy(EFFECTIVE_DATE.desc_nulls_last(), F.col("PERMIT_STATUS").desc(), F.col("ts").desc()) — or define a domain status priority. |
| F13 | PERMISSIVE CSV mode with typed schema nullifies bad cells per-cell; no columnNameOfCorruptRecord → bad rows discarded silently. Properties.py doesn't even set the mode (implicit PERMISSIVE). | Vendor format drift produces silent nulls (e.g. JOB_VALUE = "N/A" → null), QA threshold max_null_pct_critical_fields = 1% may not trip on narrow drifts | Add .option("columnNameOfCorruptRecord", "_corrupt_record") + sidecar column; count and emit to QA metrics. Set explicit mode on Properties. |
Final note on this verification pass
Zero findings were fully refuted. Eight are confirmed outright, five are confirmed-with-refined-context. The five PARTIALs do not mean "go ahead and ignore" — they mean the recommended fix or the severity rating shifts. Treat all 13 as real items in the engineering backlog; the verifier nuance is what informs prioritization, not whether to fix at all.12d · Architectural logic audit — structural issues (separate from code bugs)
Why this section exists
The bugs in §11a are code-level — typos, race conditions, missing normalizers. They can be fixed without touching the architecture. A 4th independent agent (Software Architect) audited the logic of the pipeline itself — layer responsibilities, contract boundaries, naming, dependency graph, reproducibility model. Goal: find structural issues that mean "even if every bug were fixed, this design would still be wrong in important ways."Top 5 structural issues
| # | Issue | Where | Severity |
|---|---|---|---|
| S1 | Silver advertises a partition key (PARTITION_FIPS) that the Silver job never creates. Config declares PARTITION_KEY = "PARTITION_FIPS"; the Silver job batches by PARTITION_YEAR and produces a PARTITION_FIPS column only for permits whose property-join succeeds (when join fails → NULL FIPS → row silently disappears from any Gold filter on FIPS). Worse: every Silver batch touches every FIPS partition file — defeats partitioning as an idempotency boundary. The "zero-downtime full refresh" is a marketing label, not a property. |
shared/configs.py:270 + silver/Build_Zoom_Silver_Master.py:164-283 |
HIGH |
| S2 | The STATE column shadows two different concepts and is the silent root of join brokenness. Properties: STATE = postal state from assessor. Permits: STATE = re-derived in Silver from upper(split(PERMIT_JURISDICTION, "_")[0]). Same column name, completely different meaning. Join treats them as equal — they are not. BuildZoom permit jurisdictions are city/county/agency identifiers whose prefix is sometimes a state code, sometimes a FIPS string, sometimes a city slug. Layer-responsibility leak: Silver re-derives a column with the same name to make a join work. |
silver/Build_Zoom_Silver_Master.py:94, 150-154, 186-190 |
HIGH |
| S3 | Snapshot semantics are pinned to Bronze, but Gold output fuses 4 independently-versioned upstream tables. Gold builds snapshot suffix from get_last_bronze_date_snapshot() and writes Roofing_Master_Gold_V1_<YYYYMMDD>. But the join body reads (1) Silver Roofing_Master with no period filter, (2) Silver Real_Estate_Master filtered by S3-folder-listed last_period, (3-4) two Standardized_Addresses tables with no version pin. Three of these tables are not even built by this repo. One Gold "snapshot" stamps a Bronze date onto an output fused from 4 vintages; reproducibility from Gold row → source is broken at the moment of writing. Athena ALTER + delete-old is destructive — no rollback after the flip. |
gold/Build_Zoom_Gold_Master.py:734-772, 188-209 + shared/utils.py:245-282 |
HIGH |
| S4 | Gold mixes data engineering with business decisions — the FA enrichment is a domain join, not a tier. The medallion definition would have Gold contain aggregated business-ready facts. What this Gold actually contains is a probabilistic entity-resolution result: 3-condition fuzzy address match, fall-through coalesce with implicit trust ordering, multiple FA_PROPERTYIDs collected into an array. The match policy is invisible to consumers (no per-row quality flag); tuning the match logic requires a full Gold rebuild (10⁸ rows); the 1:N permit→property fanout becomes an array column without a documented grain change. | gold/Build_Zoom_Gold_Master.py:8-13, 277-469 |
MED-HIGH |
| S5 | The Roofing_* naming is a domain lie that will block multi-vertical scaling. Bronze contains every BuildZoom permit category. The labels are aspirational ("for the 8020REI Roofing vertical"), the contents are generic. Not cosmetic: a future HVAC or solar vertical forces either (a) re-ingesting the same BuildZoom data into HVAC_* tables (1× storage + 1× Hudi maintenance per vertical) or (b) accepting "Roofing_Master" as the master for everything (in which case the name is a permanent lie). Current code locks in option (a) — Step Functions stack name, CFN, QA classes all encode "Roofing". |
Whole repo · configs.py:30, 69, 112, 152, 254, 325 |
MED |
Other structural issues worth tracking
| Category | Finding |
|---|---|
| Layer leak | Properties Bronze casts FIPS, derives partitions, generates DateSnapshot — that's Silver-cleanup work done in Bronze. Silver then re-casts COUNTY_FIPS again. |
| Layer leak | Silver collapses Statuses (event-stream) to "current status per permit" via window function — that's a business aggregation that destroys history. Gold or a view should own it; Silver should expose the un-collapsed event timeline. |
| Naming | AUX_PERMIT_STATUS / AUX_EFFECTIVE_STATUS_DATE are actually "latest status" — name tells the consumer nothing. There's already a LATEST_STATUS field on permits, creating two columns with overlapping meaning. |
| Naming | S3_TABLE_PATH on RoofingGoldConfig points to a base path that never exists in S3 — real tables live at _<YYYYMMDD> snapshots. QA reads from the non-existent base path. |
| Naming | HUDI_PARALLELISM constant is used as record_size in write_table_with_bulk_insert — name says parallelism, use says size estimate. They are not the same thing. |
| Dependency design | Gold has 3 hard dependencies on tables not built by this repo (Real_Estate_Master, two Standardized_Addresses tables). If FA changes its column name PropertyId → propertyId, Gold silently breaks with empty matches. No upstream contract documented. |
| Dependency design | Bronze QA reads Bronze table state from run N−1 to compute deltas. Re-running breaks QA determinism. |
| Idempotency | Three layers, three different write policies (Bronze upsert, Silver full-refresh-upsert, Gold bulk_insert into new snapshot). There is no single "this end-to-end vintage is consistent" record. Asking "what's the latest fully-consistent (Bronze, Silver, Gold)?" requires reconciling 3 DynamoDB tables on incompatible date formats. |
| Reproducibility | No source-file fingerprint stored on any row. A Gold row cannot be traced back to "this BuildZoom CSV, line N." DynamoDB records the partition (2025-12-10) but not the file ETag, byte count, or schema hash. |
| Schema contract | Schema source of truth is a switch statement inside shared/utils.get_hudi_schema(...) — the 2853-line god module. No column-level contract: LATEST_STATUS "important" but no enum / no range / no allowed-value check. |
| Schema contract | Schema evolution policies disagree: Hudi has auto-evolution enabled with type-override to STRING; the Spark reader uses a fixed schema and drops new columns at read. Two policies, opposite directions. |
| Coupling | Silver reads Bronze via spark.read.format("hudi").load(BRONZE_PATH) with no commit pin. If Bronze for the active year is mid-commit (Hudi COW), Silver may read a half-finished file group. Step-Function serialization papers over the issue for daily runs but breaks under manual re-processing. |
| Coupling | Athena DDL string is embedded inside Gold Spark job. Renaming the catalog or table requires editing Python code. |
| Modeling gap | No SCD treatment on Properties (assessor data, slowly-changing). Hudi upsert overwrites without effective_from/effective_to — historical analysis cannot answer "what did we know about this property on date X?" |
| Modeling gap | No event view of permits. Statuses is a natural event stream (applied → issued → inspected → completed) but Silver collapses to "latest only". The medallion pattern flattens an event source into a snapshot — the opposite of best practice for lifecycle analysis. |
| Cross-cutting | shared/utils.py at 2853 lines holds Spark schemas, Hudi writer, DynamoDB tracker, S3 lister, three QA classes, Slack formatter, snapshot resolver. A god module; refactoring it requires re-deploying every layer. |
| Cross-cutting | Salting config (salting_partitions.json) lives at s3://8020rei-staging-data-lake/diego_staging/ — in someone's personal staging path. Deployment landmine. |
| Hard-coded assumption | One vendor (BuildZoom), one enrichment provider (First American, FA_PROPERTYID baked into Gold schema), one region (us-east-1 hard-coded in 3 files), one vertical, one environment. A second of any of these requires editing string literals in 5+ files. |
| Future-fragility | Step Functions chain is hard-coded serial Bronze → Silver → Gold. A second vertical needs a second state machine + second EMR cluster, or a fork. CFN doesn't support multi-tenancy. |
| Future-fragility | Snapshot deletion on next successful run means the first successful run after a bug ships overwrites the last-known-good. No rollback. |
Architectural verdict
The medallion pattern is conceptually appropriate for this domain — permit data does benefit from a raw mirror, a denormalized fact, and an enriched serving layer. But the implementation has not internalized what each tier owes the next. Bronze does Silver work; Silver does Gold work and also re-does Bronze work it shouldn't trust; Gold does platform work (Athena DDL) and business decisions (match-method priority) that should be a thin policy layer above pure data. The contracts between layers are implicit — no manifest, no version pin, no schema enum — and the moment a downstream consumer wants to ask "is the data I'm reading consistent?" they cannot.
The structural ceiling is single-vendor, single-domain, single-region naming. The codebase is shaped like a one-shot data engineering project for BuildZoom-Roofing, not like a platform. Adding a second vendor doubles the surface area. Adding a second enrichment provider requires schema changes everywhere FA_PROPERTYID appears. Adding a second region requires editing string literals in 5+ files. The Roofing_* prefix, the hard-coded us-east-1, and the _V1 suffix together signal that the team is not yet thinking in terms of a long-lived platform.
A re-design starting today would: (1) name Bronze + Silver after vendor + entity (buildzoom.permits_raw, buildzoom.permits_facts), reserve domain language ("Roofing", "Solar") for Gold serving views; (2) make the precombine key a deterministic source_file_etag + row_number, never current_timestamp(), so re-runs are bit-identical; (3) split entity-resolution out of Gold into a "match candidates" intermediate with match_method + match_score columns, and let Gold be a thin policy view; (4) replace the VERSION_SUFFIX = "_V1" pattern with a snapshot/manifest table that records {bronze_commit, silver_commit, fa_period, std_addr_commit} per run, with retention policy and rollback; (5) move shared/utils.py into 3–4 cohesive modules and treat each layer's input/output schema as a versioned contract, not a Python switch statement; (6) make Step Functions parameterized over {vendor, domain} so a second vertical is a config change, not a fork.
The medallion isn't wrong. The implementation has just not invested in the contracts between tiers — and contracts, not technology, are what makes a lakehouse maintainable as it grows.
How to read structural vs code-level findings together
The 13 HIGH bugs in §11a are tactical fixes. The 5 structural issues in §12d are strategic re-architecture. Fixing the code bugs without addressing the structural ones leaves the same shape vulnerable to the same class of bugs later. Fixing structure without code-level fixes leaves shipping data wrong today. Do both; sequence the structural changes around the natural rebuild cycle (e.g., when the team next versions the tier from_V1 → _V2).
10 · Glossary & sources
| Term | Meaning |
|---|---|
| Hudi COW | Apache Hudi Copy-on-Write table. Each commit rewrites affected files; readers see consistent snapshots. Single-commit vintages = no version inflation (we verified 1.0× on Pinellas 12103). |
| Medallion architecture | Bronze (raw) → Silver (clean, joined) → Gold (enriched, joinable). Industry vocabulary (Databricks-coined). |
| FIPS | Federal Information Processing Standards — 5-digit county identifier (always zero-padded). Used as partition key for geographic pruning. |
| FA | First-American — the property-records vendor whose PropertyId Gold links permits to. |
| FA_PROPERTYID | Array of First-American property IDs matched to a permit via the 3-condition cascade. |
| FIPS salting | Spark anti-skew technique: hash-distribute big-FIPS rows across N sub-buckets so executors share load. |
| COMPOSITE_KEY | Bronze record key for classifications: BUILDING_PERMIT_ID|PROJECT_TYPE (one permit may carry multiple project types). |
| T0 | The prediction-time boundary. Features ≤ T0, labels > T0 + horizon. Not in this ETL today; introduced in the proposed Platinum tier. |
Sources for this document
- GitHub repo:
8020rei-com/buildzoom-etl(default branchmain, cloned 2026-05-19). - Files inspected:
README.md,bronze/README.md,shared/configs.py,shared/utils.py,bronze/Build_Zoom_*.py,silver/Build_Zoom_Silver_Master.py,gold/Build_Zoom_Gold_Master.py. - Empirical observations on the 2026-05-12 Pinellas (12103) Gold vintage: 5,402,808 rows, 1 Hudi commit, 1.0× version inflation, 54 columns.
- Cross-references from this project's findings: 57 (BuildZoom naming audit), 62 (broad taxonomy), 67 (25K list false-positive audit), 72 (South-FL regex), 73 (coverage universe methodology), 74 (independent verification).