Data sources

Where every dataset comes from — the access path, the grain, and the join keys for each source system the models read.

Every 8020IQ model is built on a small set of source systems. This page is the map of each one: the system it lives in, how we reach it, the grain it arrives at, and the key it joins on. It is the provenance index behind how we match permits to properties and the gold variable inventory — and the Overview's answer to "where do the data come from?".

AWS data lake — three layers

The property + permit substrate lives in S3, read-only under the 8020rei-prod / 8020rei-coo profiles. Three layers, gold → silver → bronze, in decreasing degrees of curation.

GOLD · read-only

Curated, model-ready property + permit master

bucket: s3://8020rei-gold-data-lake/
example dataset: Roofing_Master_Gold_V1_*/<FIPS>
access: profile 8020rei-prod, R/O
grain: one row per permit (per FIPS partition)
note: building permits live HERE, not in BigQuery

SILVER · read-only

First American property records, lightly normalized

datasets: Use_Type · REM/<period>/<FIPS>
access: profile 8020rei-prod, R/O
grain: one row per parcel
period: REM snapshots partitioned by <period>
join: PropertyId · situs address + ZIP

BRONZE · scoped read

Raw roofing source tables (Hudi)

bucket: s3://8020rei-bronze-data-lake
account: 611... (cross-account)
format: Hudi tables
access: scoped read role (role name TODO)
contents: roofing source tables (pre-curation)

Lake layout reference: exact partition paths — gold Roofing_Master_Gold_V1_*/<FIPS>, silver Use_Type and REM/<period>/<FIPS> — are the working strings used by the pipeline. The bronze scoped-read role name is not reproduced verbatim here; confirm against the lake-layout note before wiring a new reader. See engine transparency for how the layers feed the model.
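As a minimal sketch of how those path strings might be assembled, the helper below builds partition prefixes from a FIPS code. The function names are hypothetical. The gold dataset name is supplied by the caller because the `Roofing_Master_Gold_V1_*` suffix is not resolved on this page, and `<period>` is passed through unchanged because its format is not documented here.

```python
def _check_fips(fips: str) -> str:
    """Accept only a 5-digit zero-padded FIPS string (the page-wide convention)."""
    if not (isinstance(fips, str) and len(fips) == 5 and fips.isdigit()):
        raise ValueError(f"expected a 5-digit FIPS string, got {fips!r}")
    return fips


def gold_prefix(dataset: str, fips: str) -> str:
    """Key prefix for one FIPS partition of a gold dataset.

    `dataset` is the fully resolved dataset name, supplied by the caller.
    """
    return f"{dataset}/{_check_fips(fips)}"


def silver_rem_prefix(period: str, fips: str) -> str:
    """Key prefix for one FIPS partition of a silver REM snapshot."""
    return f"REM/{period}/{_check_fips(fips)}"
```

Rejecting an integer FIPS here, rather than silently padding it, surfaces a mis-typed key before it turns into an empty S3 listing.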

BigQuery — feedback loop

The client-feedback loop (which properties actually transacted / were worked) is mirrored into BigQuery, separate from the S3 lake.

Item | Value | Note
Project | bigquery-467404 | admin SA
Feedback table | fulfillment.gold_lookup_client_property | per-property client outcome lookup
Admin SA key | keys/bigquery-admin.json | path only; secret never printed; gitignored

Known caveats (carry forward): the feedback zip column strips leading zeros (07002 → 7002), so any zip-join must LPAD both sides to 5 digits or every Northeast (0xxxx) row silently drops. The feedback event_date is also unreliable (future / 1972 / back-dated values) — do not use it as a clean time anchor without scrubbing.
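The two caveats above can be sketched as a pair of small guards. This is an illustration only, not the pipeline's code: the function names, the sample rows, and the `earliest` bound are assumptions chosen for the example.

```python
from datetime import date
from typing import Optional


def normalize_zip(value) -> Optional[str]:
    """Left-pad a ZIP to 5 digits; return None for anything that is not 1-5 digits."""
    if value is None:
        return None
    text = str(value).strip()
    if not text.isdigit() or len(text) > 5:
        return None
    return text.zfill(5)


def scrub_event_date(value: Optional[date], earliest: date, as_of: date) -> Optional[date]:
    """Drop dates outside [earliest, as_of]; both bounds are chosen by the caller."""
    if value is None or value < earliest or value > as_of:
        return None
    return value


# Illustrative rows only: a feedback ZIP that lost its leading zero,
# and the same ZIP stored correctly on the property side.
feedback_rows = [{"zip": 7002, "event_date": date(1972, 1, 1)}]
property_zips = {"07002"}

# Normalizing before the comparison keeps the Northeast row in the join.
matched = [row for row in feedback_rows if normalize_zip(row["zip"]) in property_zips]
```

Without `normalize_zip`, the comparison of `7002` against `"07002"` would fail and the row would drop with no error, which is the silent loss described above.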

Vendors

Two licensed vendors supply the raw property and permit signal that the lake curates.

First American (FA) · licensed

Property + ownership reference data

Assessor, deed / mortgage, equity, pre-foreclosure, tax, value-history and building-permit feeds — the property substrate behind the silver layer. Field-level definitions for every FA feed are documented in the FA dictionaries (the full FA dictionary library).

grain: parcel-level (PropertyId)
join: situs address + ZIP, FIPS

BuildZoom (BZ) · licensed

Aggregated building permits

Permit records aggregated across jurisdictions — the source feeding the gold permit master. The full extract-and-classify path is documented in the BuildZoom ETL walkthrough; field defs sit alongside the other dictionaries in Resources.

grain: one row per permit
join: permit address → FA via standardizer

BZ permits are building-level, so the address standardizer is what carries a permit to the right parcel — see how we match and permit classification for the type/subtype taxonomy.

US demographics cache

A free, government-sourced, nationwide parquet cache at data/demographics/, curated to property-buyer-signal variables and joinable to property records on standard zero-padded FIPS string keys. It is pulled once and reused across every model and every model version; it is not specific to one build.

Source | Grain (geo levels) | Join key | Vintage · safe-T0 rule
Census ACS 5-yr | Block-Group · Tract · ZCTA · County | geoid (bg 12-digit · tract 11-digit · zcta 5-digit · county 5-digit) | 2019–2023 · safe T0 ≥ 2025-Q1 (~14mo lag)
Decennial 2020 | Block · Block-Group (· County) | geoid (block 15-digit · bg 12-digit) | Apr-1-2020 PIT · safe T0 ≥ 2021-Q3
IRS SOI | USPS ZIP (not ZCTA) | zip5 5-digit string | 2021 + 2022 tax year · safe T0 ≥ 2025-Q1
HUD CHAS | Tract · County (+23 sub-tables) | tract_geoid / county_fips (summary-level prefix to strip) | 2018–2022 5-yr · safe T0 ≥ 2025-Q1
HUD FMR | County | county_fips (first 5 of 10-digit fips) | FY25, forward-looking · already safe
HUD Income Limits | County | county_fips | FY25 (AMI + 30/50/80% × HH-size)
BLS QCEW | County × NAICS | county_fips | 2024 annual · safe T0 ≥ 2025-Q3 (~6mo lag)
FCC BDC | County · Place · CBSA · State · Tribal · National (broadband) | county_fips (stored as INTEGER; recast to padded string before join) | Dec 2024 collection · safe T0 ≥ 2025-Q1

ZCTA ≠ USPS ZIP (~3–5% mismatch — bridge via the HUD ZIP↔ZCTA crosswalk). ACS uses -666666688 as a "not available" sentinel — filter it before use. For T0 below a source's safe cutoff, fall back to the prior vintage or NaN. The cache also holds many more free-government families beyond the eight above (BEA, CDC, EPA, FEMA, FHFA, NOAA, USDA, Redfin/Zillow time-series, etc.); the eight here are the load-bearing demographic tiers. This #demographics section is the deep-link target for the data manifest.
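A short sketch of the key handling these notes imply: recasting an integer county_fips to a padded string, deriving coarser join keys from a Census geoid by prefix, and dropping the ACS sentinel. The helper names are hypothetical and the sample geoid is illustrative. The prefix lengths follow the digit counts in the table (county 5, tract 11, block group 12, block 15).

```python
from typing import Dict

ACS_NOT_AVAILABLE = -666666688  # "not available" sentinel named on this page

# Census geoids nest by prefix: a block geoid starts with its block-group
# geoid, which starts with its tract geoid, which starts with county FIPS.
GEOID_LENGTHS = {"county": 5, "tract": 11, "block_group": 12, "block": 15}


def pad_county_fips(value) -> str:
    """Recast an integer (or unpadded) county FIPS to a 5-digit string."""
    text = str(value).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS: {value!r}")
    return text.zfill(5)


def geoid_levels(geoid: str) -> Dict[str, str]:
    """Return the geoid at every level at or above the input's own level.

    Not meant for ZCTA codes, which are 5 digits but are not county FIPS.
    """
    if not geoid.isdigit() or len(geoid) not in GEOID_LENGTHS.values():
        raise ValueError(f"unexpected geoid: {geoid!r}")
    return {level: geoid[:n] for level, n in GEOID_LENGTHS.items() if n <= len(geoid)}


def clean_acs(value):
    """Map the ACS sentinel to None so it never enters an average or a model."""
    return None if value == ACS_NOT_AVAILABLE else value


# Illustrative 12-digit block-group geoid in county 04013.
levels = geoid_levels("040130101021")
```

Here `levels["county"]` is `"04013"` and `levels["tract"]` is `"04013010102"`, so a block-group row can join to tract-level or county-level sources without a separate lookup.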

County-portal scraping

Direct-from-county source discovery to fill BuildZoom coverage gaps and add distress signal not present in licensed feeds. Stage 1 only — source discovery, no production scrape yet.

Datatype | Scope | Stage | Status
Permits | top-10 counties (+ top-200 ext) | Stage 1 discovery | not production
Pre-foreclosure (NOD / Lis Pendens) | top-10 counties | Stage 1 discovery | not production
Other distress (tax-delinquent / code-violation / liens) | top-10 counties | Stage 1 discovery | not production
Court records (eviction / divorce / probate) | top-10 counties | Stage 1 discovery | default-skip · counsel gate

Stage 1 catalogs each (county × datatype) source, proves reachability and scores viability. Stage 2 (pilot scrape) is gated on counsel review of bulk-scrape ToS + a QA pass; nothing here ships into a model today. Court records default to skip pending legal review. Cross-validation against the BuildZoom gold master is the eventual reconciliation check.

Conventions across every source

FIPS are always 5-digit zero-padded strings — render 04013, never 4013; store as string, never integer. (Several cached families ship FIPS as INTEGER and must be recast before any join.)

T0 = month-end — every feature is computed as-of the T0 month-end snapshot; nothing downstream of T0 may leak in.

Vintage / publication-lag safety — each external source has a safe-T0 floor (its snapshot date + publication lag). Below that floor, use the prior vintage or NaN — never the future-dated value.
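The T0 and safe-floor conventions can be sketched together as below. Several things here are assumptions rather than documented behavior: the function names, the reading of "safe T0 ≥ YYYY-Qn" as the first month-end of that quarter, and the second (prior) vintage row, which is included only to show the fallback.

```python
import calendar
from datetime import date
from typing import List, Optional, Tuple


def month_end(d: date) -> date:
    """Snap any date to the last day of its month (the T0 convention)."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


def quarter_floor(year: int, quarter: int) -> date:
    """First month-end of a quarter (assumed reading of 'safe T0 >= YYYY-Qn')."""
    return month_end(date(year, 3 * (quarter - 1) + 1, 1))


def pick_vintage(t0: date, vintages: List[Tuple[str, date]]) -> Optional[str]:
    """Return the newest vintage whose safe floor is at or before T0.

    Returns None when no vintage is safe, which the caller maps to NaN.
    """
    t0 = month_end(t0)
    usable = [(floor, label) for label, floor in vintages if floor <= t0]
    return max(usable)[1] if usable else None


acs_vintages = [
    ("2019-2023", quarter_floor(2025, 1)),  # floor taken from the table above
    ("2018-2022", quarter_floor(2024, 1)),  # illustrative prior vintage (assumed floor)
]

current = pick_vintage(date(2025, 2, 10), acs_vintages)
fallback = pick_vintage(date(2024, 6, 15), acs_vintages)
unsafe = pick_vintage(date(2023, 5, 1), acs_vintages)
```

With these inputs `current` resolves to the 2019-2023 vintage, `fallback` to the prior one, and `unsafe` to `None`, so a feature built at an early T0 never reads a value published after it.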

Rendered from CLAUDE.md external-context + notes/demographics/README.md + notes/scraping/ · 2026-06-02