Data sources
Where every dataset comes from — the access path, the grain, and the join keys for each source system the models read.
Every 8020IQ model is built on a small set of source systems. This page is the map of each one: the system it lives in, how we reach it, the grain it arrives at, and the key it joins on. It is the provenance index behind how we match permits to properties and the gold variable inventory — and the Overview's answer to "where do the data come from?".
AWS data lake — three layers
The property + permit substrate lives in S3, read-only under the 8020rei-prod / 8020rei-coo profiles. Three layers, gold → silver → bronze, in decreasing degrees of curation.
GOLD read-only
Curated, model-ready property + permit master
s3://8020rei-gold-data-lake/Roofing_Master_Gold_V1_*/<FIPS> · 8020rei-prod, R/O
SILVER read-only
First American property records, lightly normalized
Use_Type · REM/<period>/<FIPS> · 8020rei-prod, R/O · <period> · PropertyId · situs address + ZIP
BRONZE scoped read
Raw roofing source tables (Hudi)
s3://8020rei-bronze-data-lake · 611... (cross-account)
Lake layout reference: exact partition paths (gold Roofing_Master_Gold_V1_*/<FIPS>, silver Use_Type and REM/<period>/<FIPS>) are the working strings used by the pipeline. The bronze scoped-read role name is not reproduced verbatim here; confirm against the lake-layout note before wiring a new reader. See engine transparency for how the layers feed the model.
BigQuery — feedback loop
The client-feedback loop (which properties actually transacted / were worked) is mirrored into BigQuery, separate from the S3 lake.
| Item | Value | Note |
|---|---|---|
| Project | bigquery-467404 | admin SA |
| Feedback table | fulfillment.gold_lookup_client_property | per-property client outcome lookup |
| Admin SA key | keys/bigquery-admin.json | path only — secret never printed; gitignored |
Known caveats (carry forward): the feedback zip column strips leading zeros (07002 → 7002), so any zip-join must LPAD both sides to 5 digits or every Northeast (0xxxx) row silently drops. The feedback event_date is also unreliable (future / 1972 / back-dated values) — do not use it as a clean time anchor without scrubbing.
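A minimal sketch of both caveats in plain Python, for readers who handle the feedback rows outside SQL. The function names, the ZIP+4 handling, and the `earliest` cutoff of 2000-01-01 are illustrative assumptions, not part of the pipeline; pick the real date bounds from the client's actual history.

```python
from datetime import date
from typing import Optional


def normalize_zip(raw) -> Optional[str]:
    """Return a 5-character zero-padded ZIP string, or None if unusable."""
    if raw is None:
        return None
    text = str(raw).strip()
    # Assumption: a ZIP+4 suffix ("07002-1234") may appear; keep the 5-digit part.
    text = text.split("-")[0]
    if not text.isdigit() or len(text) > 5:
        return None
    # Restore the leading zeros the feedback table strips (7002 -> 07002).
    return text.zfill(5)


def scrub_event_date(
    d: Optional[date],
    earliest: date = date(2000, 1, 1),  # hypothetical floor, not from the source
    today: Optional[date] = None,
) -> Optional[date]:
    """Return d only if it falls inside [earliest, today]; otherwise None."""
    if d is None:
        return None
    upper = today or date.today()
    return d if earliest <= d <= upper else None
```

Apply `normalize_zip` to both sides of a zip-join so that `7002` and `07002` land on the same key. Rows whose `event_date` comes back as `None` should be excluded from any time-anchored logic rather than imputed.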
Vendors
Two licensed vendors supply the raw property and permit signal that the lake curates.
First American (FA) licensed
Property + ownership reference data
Assessor, deed / mortgage, equity, pre-foreclosure, tax, value-history and building-permit feeds — the property substrate behind the silver layer. Field-level definitions for every FA feed are documented in the FA dictionaries (the full FA dictionary library).
Join key: PropertyId
BuildZoom (BZ) licensed
Aggregated building permits
Permit records aggregated across jurisdictions — the source feeding the gold permit master. The full extract-and-classify path is documented in the BuildZoom ETL walkthrough; field defs sit alongside the other dictionaries in Resources.
BZ permits are building-level, so the address standardizer is what carries a permit to the right parcel — see how we match and permit classification for the type/subtype taxonomy.
US demographics cache
A nationwide parquet cache of free government data at data/demographics/, curated to property-buyer-signal variables and joinable to property records on standard zero-padded FIPS string keys. It is pulled once and reused across every model and every model version; it is not specific to one build.
| Source | Grain (geo levels) | Join key | Vintage · safe-T0 rule |
|---|---|---|---|
| Census ACS 5-yr | Block-Group · Tract · ZCTA · County | geoid (bg 12-digit · tract 11-digit · zcta 5-digit · county 5-digit) | 2019–2023 · safe T0 ≥ 2025-Q1 (~14mo lag) |
| Decennial 2020 | Block · Block-Group (· County) | geoid (block 15-digit · bg 12-digit) | Apr-1-2020 PIT · safe T0 ≥ 2021-Q3 |
| IRS SOI | USPS ZIP (not ZCTA) | zip5 5-digit string | 2021 + 2022 tax year · safe T0 ≥ 2025-Q1 |
| HUD CHAS | Tract · County (+23 sub-tables) | tract_geoid / county_fips (summary-level prefix to strip) | 2018–2022 5-yr · safe T0 ≥ 2025-Q1 |
| HUD FMR | County | county_fips (first 5 of 10-digit fips) | FY25, forward-looking · already safe |
| HUD Income Limits | County | county_fips | FY25 (AMI + 30/50/80% × HH-size) |
| BLS QCEW | County × NAICS | county_fips | 2024 annual · safe T0 ≥ 2025-Q3 (~6mo lag) |
| FCC BDC | County · Place · CBSA · State · Tribal · National (broadband) | county_fips stored INTEGER — recast to padded string before join | Dec 2024 collection · safe T0 ≥ 2025-Q1 |
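The join-key column above implies a few small normalizations before any merge. The sketch below illustrates them with hypothetical helper names. It assumes that a CHAS summary-level prefix ends in the literal `US` (as in `14000US...`) and that Census GEOIDs nest by prefix (county 5, tract 11, block group 12, block 15 digits); verify both against the cached files before relying on it.

```python
GEOID_LENGTHS = {"county": 5, "tract": 11, "bg": 12, "block": 15}


def county_fips_from_int(value: int) -> str:
    """Recast an INTEGER county FIPS (e.g. FCC BDC) to a padded 5-char string."""
    return f"{int(value):05d}"


def county_fips_from_fmr(fips10: str) -> str:
    """Take the first 5 characters of the 10-digit HUD FMR fips code."""
    # Assumption: re-pad to 10 in case a leading zero was lost upstream.
    return str(fips10).zfill(10)[:5]


def strip_summary_prefix(raw_geoid: str) -> str:
    """Drop a summary-level prefix such as '14000US' from a CHAS geoid."""
    # If no 'US' marker is present, the value is returned unchanged.
    return raw_geoid.split("US", 1)[-1]


def rollup_geoid(geoid: str, level: str) -> str:
    """Truncate a finer-grained GEOID to a coarser level by prefix."""
    n = GEOID_LENGTHS[level]
    if not geoid.isdigit() or len(geoid) < n:
        raise ValueError(f"cannot roll {geoid!r} up to {level}")
    return geoid[:n]
```

For example, `rollup_geoid("040130101001", "tract")` yields the 11-digit tract key for a 12-digit block-group GEOID, so one property-side key can be joined against sources at several grains.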
ZCTA ≠ USPS ZIP (~3–5% mismatch — bridge via the HUD ZIP↔ZCTA crosswalk). ACS uses -666666688 as a "not available" sentinel — filter it before use. For T0 below a source's safe cutoff, fall back to the prior vintage or NaN. The cache also holds many more free-government families beyond the eight above (BEA, CDC, EPA, FEMA, FHFA, NOAA, USDA, Redfin/Zillow time-series, etc.); the eight here are the load-bearing demographic tiers. This #demographics section is the deep-link target for the data manifest.
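As an illustration of the sentinel filter, the sketch below replaces the "not available" value with `None` across one record. The function name and the dict-per-row shape are assumptions for the example. The sentinel set defaults to the single value named above and can be extended if the cache turns out to carry other codes.

```python
from typing import Any, Dict, FrozenSet

# Sentinel named in the note above; add further codes here if they appear.
ACS_SENTINELS: FrozenSet[int] = frozenset({-666666688})


def clean_acs_row(
    row: Dict[str, Any],
    sentinels: FrozenSet[int] = ACS_SENTINELS,
) -> Dict[str, Any]:
    """Return a copy of row with sentinel-valued numeric fields set to None."""
    cleaned = {}
    for key, value in row.items():
        # bool is excluded so True/False are never mistaken for numeric codes.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        cleaned[key] = None if is_number and value in sentinels else value
    return cleaned
```

Running the filter before any aggregation keeps a large negative sentinel from dragging a mean or median far below the true value.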
County-portal scraping
Direct-from-county source discovery to fill BuildZoom coverage gaps and add distress signal not present in licensed feeds. Stage 1 only — source discovery, no production scrape yet.
| Datatype | Scope | Stage | Status |
|---|---|---|---|
| Permits | top-10 counties (+ top-200 ext) | Stage 1 discovery | not production |
| Pre-foreclosure (NOD / Lis Pendens) | top-10 counties | Stage 1 discovery | not production |
| Other distress (tax-delinquent / code-violation / liens) | top-10 counties | Stage 1 discovery | not production |
| Court records (eviction / divorce / probate) | top-10 counties | Stage 1 discovery | default-skip · counsel gate |
Stage 1 catalogs each (county × datatype) source, proves reachability and scores viability. Stage 2 (pilot scrape) is gated on counsel review of bulk-scrape ToS + a QA pass; nothing here ships into a model today. Court records default to skip pending legal review. Cross-validation against the BuildZoom gold master is the eventual reconciliation check.
Conventions across every source
FIPS keys are zero-padded: 04013, never 4013; store as string, never integer. (Several cached families ship FIPS as INTEGER and must be recast before any join.)
T0 = month-end: every feature is computed as of the T0 month-end snapshot; nothing downstream of T0 may leak in.
Vintage / publication-lag safety — each external source has a safe-T0 floor (its snapshot date + publication lag). Below that floor, use the prior vintage or NaN — never the future-dated value.
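The T0 and safe-floor conventions can be sketched as follows. All function names are hypothetical. Rounding the floor to a month-end and the example vintage list passed to `pick_vintage` are assumptions made for illustration, not values taken from the cache.

```python
import calendar
from datetime import date
from typing import List, Optional, Tuple


def month_end(d: date) -> date:
    """Snap any date to the last day of its month (the T0 convention)."""
    return date(d.year, d.month, calendar.monthrange(d.year, d.month)[1])


def safe_floor(snapshot: date, lag_months: int) -> date:
    """Snapshot date plus publication lag, rounded to a month-end."""
    idx = snapshot.year * 12 + (snapshot.month - 1) + lag_months
    year, month0 = divmod(idx, 12)
    return month_end(date(year, month0 + 1, 1))


def pick_vintage(t0: date, vintages: List[Tuple[str, date]]) -> Optional[str]:
    """Return the newest vintage whose safe floor is <= t0, else None.

    vintages holds (label, safe_floor) pairs. None means "use NaN":
    no vintage had been published as of t0.
    """
    eligible = [(floor, label) for label, floor in vintages if floor <= t0]
    return max(eligible)[1] if eligible else None
```

With a snapshot of 2023-12-31 and a 14-month lag, `safe_floor` returns 2025-02-28, which falls in the 2025-Q1 window listed for ACS above. A feature row at an earlier T0 would fall back to the prior vintage, or to `None` when no earlier vintage is supplied.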
Rendered from CLAUDE.md external-context + notes/demographics/README.md + notes/scraping/ · 2026-06-02