Demographics catalog
The free-government demographic cache at data/demographics/ — pulled once, reused across every model and every model version. This is the per-source reference.
Both models read property records (First American parcels, BuildZoom permits) and enrich them with free, nationwide, public-government context — income, vacancy, broadband, affordability, hazard, climate. That context lives in a curated parquet cache at data/demographics/, joined to property records on standard zero-padded FIPS string keys. It is not specific to one build; deleting it costs hours of bandwidth and dev time to re-pull (no auto-refresh). The plain-language permit↔property match is on Data sources; this page is the source-by-source catalog underneath it.
Counts vary by what is being measured: the README "Complete Inventory" reports 44 source families / 134 main parquets / 29,136,543 rows / 1.73 GB; the Wave-10 directory audit counts 52 source dirs (196 parquets incl. raw/sub-tables). Both are reported as-is; the gap comes from counting families versus counting directories. Vintage range across the cache is 2010–2025.
The eight load-bearing tiers
These eight families carry the primary socioeconomic, count, leverage, affordability, employment and infrastructure signal. All grains are joined on zero-padded FIPS strings (see Join keys below). Row counts and Maricopa 04013 anchor values are from the per-wave QA audits (2026-05-27/28).
| Source | Grain | Join key | Vintage · safe T0 floor | What it provides |
|---|---|---|---|---|
| ACS 5-yr (Census) | Block-Group · Tract · ZCTA · County | geoid (12/11/5/5) | 2019–2023 (live) + 2018–2022 (backfill); center ≈2021/2020 · T0 ≥ 2025-Q1 / 2024-Q1 | 139 curated variables across 30 tables — median HH income, home value, gross rent, tenure, year built, education, mobility, commute. BG = finest socioeconomic context. Maricopa MHI $85,518 (2023) / $80,675 (2022). |
| Decennial 2020 (Census) | Block · Block-Group · County | geoid_bg · fips5 (15/12/5) | April-1 2020 PIT · T0 ≥ 2021-Q3 | 100%-count population + housing-unit + vacancy. Freshest exact vacancy. Block file 8.13M rows; Maricopa county pop 4,420,568, hu_vacant 169,248. |
| IRS SOI (IRS) | USPS ZIP (not ZCTA) | zip5 | 2018–2022 (5 vintages) · TY2022 T0 ≥ 2025-Q1 | Income-distribution by ZIP — 5 AGI brackets per ZIP (multi-row). Mortgage-interest density / leverage proxy. Column is zip5 not zipcode. |
| HUD CHAS / FMR / IL (HUD) | Tract · County · ZIP | county_fips · geoid · zip5 | CHAS 2018–2022 (T0 ≥ 2025-Q1) · FMR/IL FY25 (forward-looking, already safe) | Cost-burden / affordability bands (CHAS, 23 sub-tables), Fair Market Rents (FMR), AMI Income Limits (IL), plus Small-Area FMR by ZIP. Maricopa 2BR FMR $1,950. |
| BLS QCEW (BLS) | County × NAICS · State | county_fips · state_fips | 2024 annual · T0 ≥ 2025-Q3 | Employer-side wage / employment cycle by industry. 264,923 county rows (audited). |
| FCC BDC (FCC) | County · Place · CBSA · CongDist · State · Tribal · National | county_fips | Dec 2024 collection · T0 ≥ 2025-Q1 | Broadband availability by technology / speed — infrastructure premium. County file 366,510 rows. Maricopa 100% any-tech, 92% gigabit. Plus a mobile (BDC-mobile) family. |
HMDA (CFPB mortgage applications, 2022+2023, county) appears in the README inventory alongside the eight tiers as a mortgage-origination signal — 152,065 (2022) + 141,940 (2023) county rows; county_fips 5-digit string, Maricopa present both years.
Additional families present (documented)
The cache also holds many more free-government families beyond the eight above. Only those with a spec or audit in notes/demographics/ are listed here. Row counts come from the README auto-inventory and per-wave audits; some sources carry documented gaps (flagged).
| Family | Grain | Join key | Vintage | What it provides · notes |
|---|---|---|---|---|
| BEA regional income | County | county_fips | 2019–2023 | Per-capita / personal income by county. 2019 has 3,113 rows vs 3,114 later (1-county delta, cosmetic). |
| CDC SVI · PLACES · WONDER | County · Tract | county_fips · geoid | 2022 (SVI), 2024 (PLACES), 2020–2023 (WONDER) | Social Vulnerability Index, local health prevalence, mortality. PLACES Maricopa 993 tracts vs SVI 1,009 (PLACES suppresses low-pop). |
| EPA EJSCREEN · Walkability · AQS | Block-Group · County | geoid · geoid20 · county_fips | 2024 (EJ/Walk), 2020–2024 (AQS) | Environmental-justice indicators, walkability index, air quality. AQS sparse (~1,030 monitored counties/yr; absent = no monitor, not clean air). |
| FEMA NRI · NFHL | County · Tract | county_fips · tract_geoid · fips5 | 2024 (NRI) | National Risk Index (natural-hazard risk) + flood-hazard layer. Use county_fips/tract_geoid, not legacy STCOFIPS. |
| NOAA storms · normals | County · Event | county_fips | 2015–2024 (storms), 1991–2020 (normals) | Storm-event history (943,100 events audited) + 30-yr climate normals. 61% of storm events were zone-based, crosswalked to counties; 154 normals counties via nearest-neighbor (has_station=False). |
| USDA RUCC · RUCA · SNAP · Food Atlas · FAR | County · Tract · ZIP | fips5 · tract_geoid · zip5 | 2010 (RUCA), 2020 (FAR), 2023 (rest) | Rural-urban continuum, commuting areas, SNAP participation, food-access atlas, frontier codes. |
| FHFA HPI | County · CBSA · State · ZIP3 | fips_code · cbsa · state_fips · zip3 | time-series | House Price Index. County annual 106,252 rows, 2,795 unique FIPS, Maricopa present. |
| Housing market — Zillow · Redfin · NAR EHS | County · ZIP · MSA · State | geoid · fips_code · cbsa · state_fips | time-series | ZHVI / ZORI / sale-price / velocity / inventory (Zillow 8,677,707 rows, 7 files), weekly market tracker (Redfin), existing-home-sales (NAR). See open issues below. |
Other documented dirs the audits also cover: Census secondary (CBP, ZBP, PEP, HVS, QWI, AHS, LODES), IRS Migration + County, IRS/ACS migration flows, SAIPE, US Drought Monitor, Tigerline (geocoding shapes), HUD ZIP crosswalk, BLM PAD-US, WRLURI, NHTSA FARS, FDIC SOD, NCES CCD, USGS NLCD, MS USBF, USPS vacancy, SBA loans, energy/minerals, DOE LEAD, FBI UCR, FRED macro.
Join keys
All keys are zero-padded strings — never integers (CLAUDE.md rule #1; 04013 not 4013). Across the 44 families the geo-key column travels under 10 different names; the canonical ones:
| Key | Format | Example | Used by |
|---|---|---|---|
| state_fips | 2-digit | 04 | all sources |
| county_fips / fips5 | 5-digit | 04013 | ACS · Decennial · HUD · BLS · FCC · HMDA · BEA · CDC · EPA · FEMA · property records |
| tract_geoid | 11-digit | 04013010101 | ACS · HUD CHAS · CDC · FEMA NRI tract |
| bg_geoid / geoid | 12-digit | 040130101011 | ACS · Decennial · FCC · EPA EJSCREEN/Walkability |
| block_geoid | 15-digit | 040130101011001 | Decennial · FCC |
| zcta | 5-digit | 85003 | ACS only |
| zip5 / zip_usps | 5-digit | 85003 | IRS SOI · HUD SAFMR · Census ZBP · USDA |
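A minimal sketch of the zero-padding rule, assuming a key may arrive as an integer or an unpadded string after a careless CSV read. The helper name `normalize_fips` and the `KEY_WIDTHS` mapping are illustrative and not part of the cache's codebase; the widths themselves are taken from the table above.

```python
# Hypothetical helper: digit widths follow the Join keys table.
KEY_WIDTHS = {
    "state_fips": 2,
    "county_fips": 5,
    "tract_geoid": 11,
    "bg_geoid": 12,
    "block_geoid": 15,
    "zip5": 5,
}

def normalize_fips(value, key):
    """Return `value` as a zero-padded string of the width expected for `key`."""
    width = KEY_WIDTHS[key]
    text = str(value).strip()
    if not text.isdigit():
        # Floats such as 4013.0 land here: reject rather than guess.
        raise ValueError(f"{key} must be all digits, got {value!r}")
    if len(text) > width:
        raise ValueError(f"{key} is longer than {width} digits: {value!r}")
    return text.zfill(width)

print(normalize_fips(4013, "county_fips"))  # 04013
print(normalize_fips("4", "state_fips"))    # 04
```

Rejecting non-digit input instead of coercing it is a design choice in this sketch: a float-typed key usually means the column was already mangled upstream and should be re-read as a string.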
Vintage-safety + join gotchas (no future leak)
Features must be computable from the T0 snapshot alone (CLAUDE.md rule #2). Each source publishes with a lag, which gives it a hard T0 floor; using the source for any T0 below that floor is future leakage. When T0 falls below the floor, fall back to the prior vintage or emit NaN. Safe-T0 floors (from README):
| Source | Snapshot | Pub lag | Safe T0 ≥ |
|---|---|---|---|
| ACS 5-yr 2019–2023 | rolling avg ~2021 | ~14mo | 2025-Q1 |
| Decennial 2020 | April-1 2020 PIT | ~18mo | 2021-Q3 |
| IRS SOI 2022 | tax year 2022 | ~24mo | 2025-Q1 |
| HUD CHAS 2018–2022 | ACS derivative | ~36mo | 2025-Q1 |
| BLS QCEW 2024 | calendar 2024 | ~6mo | 2025-Q3 |
| FCC BDC Dec 2024 | Dec 2024 collection | ~4mo | 2025-Q1 |
| HMDA 2023 | calendar 2023 | ~9mo | 2024-Q4 |
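The floor rule can be sketched as a lookup, assuming T0 is represented as a `(year, quarter)` tuple. The source keys (`acs5`, `irs_soi`, and so on) and the function name `pick_vintage` are hypothetical; the floors are copied from the table above, and the ACS backfill floor (2024-Q1) from the tiers table.

```python
# Safe-T0 floors as (year, quarter). Each list is ordered newest vintage first.
SAFE_T0_FLOORS = {
    "acs5": [("2019-2023", (2025, 1)), ("2018-2022", (2024, 1))],
    "decennial": [("2020", (2021, 3))],
    "irs_soi": [("2022", (2025, 1))],
    "hud_chas": [("2018-2022", (2025, 1))],
    "bls_qcew": [("2024", (2025, 3))],
    "fcc_bdc": [("2024-12", (2025, 1))],
    "hmda": [("2023", (2024, 4))],
}

def pick_vintage(source, t0):
    """Return the newest vintage whose floor is <= t0, or None (caller emits NaN)."""
    for vintage, floor in SAFE_T0_FLOORS[source]:
        if t0 >= floor:  # tuple comparison: year first, then quarter
            return vintage
    return None

print(pick_vintage("acs5", (2025, 2)))      # 2019-2023
print(pick_vintage("acs5", (2024, 3)))      # 2018-2022
print(pick_vintage("bls_qcew", (2025, 2)))  # None
```

Only the vintages named on this page are encoded; older IRS SOI or other vintages would need their own floors before they could serve as fallbacks.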
Four gotchas to apply before any join
ZCTA ≠ USPS ZIP. ZCTA is the Census polygon approximation of a mail ZIP, not 1:1. ~3–5% of property records have a USPS ZIP with no matching ZCTA, and ~2% of ZCTAs cross county lines. Bridge via the HUD ZIP↔ZCTA crosswalk. Prefer block-group / tract joins when lat/lon is available; only fall back to ZCTA when you have just a ZIP.
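The fallback order described above might look like the following sketch. The record shape (a dict with optional `bg_geoid` and `zip5` fields) and the `zip_to_zcta` dict are assumptions for illustration; the real crosswalk's column names are not shown on this page.

```python
def choose_join_key(record, zip_to_zcta):
    """Pick the finest usable geography for one property record.

    `record` is a hypothetical dict with optional 'bg_geoid' / 'zip5' fields;
    `zip_to_zcta` is a hypothetical dict built from the HUD crosswalk.
    """
    bg = record.get("bg_geoid")
    if bg:
        # lat/lon was geocoded, so the block group is available: prefer it.
        return ("bg_geoid", bg)
    zcta = zip_to_zcta.get(record.get("zip5"))
    if zcta:
        # Only a ZIP is known: bridge it to a ZCTA.
        return ("zcta", zcta)
    # USPS ZIP with no matching ZCTA: leave the record unjoined.
    return (None, None)

# Illustrative values only, not real crosswalk rows.
crosswalk = {"85003": "85003"}
print(choose_join_key({"bg_geoid": "040130101011", "zip5": "85003"}, crosswalk))
print(choose_join_key({"zip5": "85003"}, crosswalk))
print(choose_join_key({"zip5": "00000"}, crosswalk))
```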
ACS sentinel -666666688. Census encodes "estimate not available" as this numeric sentinel — the parquets retain it raw (NaN-normalization only covers textual nulls). Filter it before use: df['mhi'].where(df['mhi'] > 0). Example: BG 040130101021 has b19013_001_e = -666666688.
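To see why the filter matters, here is a small standard-library sketch of what an unmasked sentinel does to an average. The list of incomes is illustrative, and `mask_sentinel` is a hypothetical helper that applies the same "negative means missing" rule as the pandas one-liner above.

```python
import math
from statistics import mean

ACS_SENTINEL = -666666688  # value as stored in the parquets, per the text above

def mask_sentinel(values):
    """Replace negative values with NaN; median income is never negative."""
    return [math.nan if v < 0 else v for v in values]

incomes = [85518, ACS_SENTINEL, 80675]  # illustrative values
clean = [v for v in mask_sentinel(incomes) if not math.isnan(v)]

print(mean(incomes))  # dragged far below zero by the sentinel
print(mean(clean))    # 83096.5
```

Masking on sign rather than on the exact sentinel value keeps the sketch robust if another negative annotation code appears, but it only suits variables that cannot legitimately be negative.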
FCC county_fips INTEGER recast. The original VALIDATION_REPORT flagged 12 FCC parquets storing county_fips as INTEGER. County-level files are now fixed to zero-padded VARCHAR; non-county grains carry the column as 100%-null (correct — those grains have no county key). The remaining FAIL is fcc_bdc_mobile provider_summary provider_id as int64 — but that is an FCC identifier, not a FIPS, so cast-on-read suffices.
Prefixed / multi-row keys. HUD CHAS geoid is summary-level-prefixed (1400000US04013010102 — strip with SUBSTRING(geoid, 10) or LIKE); HUD FMR county FIPS is the first 5 chars of a 10-digit fips; IRS SOI is multi-row per ZIP (5 AGI brackets — aggregate or join on (zip5, agi_bracket)); Tigerline tract county_fips is 3-digit (use geoid[:5]).
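The key clean-ups above can be sketched in plain Python. All three function names are hypothetical, the 10-digit FMR value is an illustrative input, and `n_returns` stands in for whichever IRS SOI measure is being aggregated.

```python
from collections import defaultdict

def chas_tract_geoid(geoid):
    """Drop the summary-level prefix: '1400000US04013010102' -> '04013010102'."""
    _, sep, tract = geoid.partition("US")
    if not sep or len(tract) != 11 or not tract.isdigit():
        raise ValueError(f"unexpected CHAS geoid: {geoid!r}")
    return tract

def fmr_county_fips(fips10):
    """County FIPS is the first 5 characters of the 10-digit HUD FMR code."""
    if len(fips10) != 10:
        raise ValueError(f"expected a 10-digit fips, got {fips10!r}")
    return fips10[:5]

def soi_totals_by_zip(rows):
    """Collapse multi-row IRS SOI data to one total per zip5.

    `rows` is an iterable of (zip5, agi_bracket, n_returns) tuples.
    """
    totals = defaultdict(int)
    for zip5, _bracket, n_returns in rows:
        totals[zip5] += n_returns
    return dict(totals)

print(chas_tract_geoid("1400000US04013010102"))  # 04013010102
print(fmr_county_fips("0401399999"))             # 04013
print(soi_totals_by_zip([("85003", 1, 10), ("85003", 2, 5), ("85004", 1, 7)]))
```

Summing is only one way to collapse the brackets; joining on `(zip5, agi_bracket)` instead keeps the distribution intact when the model needs it.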
Known gaps & status
The cache is curated, not finished. Honest open items from the QA audits:
| Item | Status | Detail |
|---|---|---|
| Redfin ZIP weekly | FAIL — empty | 0-row parquet; ETL never wrote data. Raw TSV (1.5 GB) exists; re-ingest required. |
| Redfin county fips_code | FAIL — 99.8% null | Only 4 FIPS populated; region_name + state_fips + cbsa usable as workaround pending FIPS re-derive. |
| USGS NLCD county | FAIL — partial | 2,287 of ~3,143 counties (856 missing); re-pull required. |
| MS USBF | FAIL — stub | 16-row AZ+DC stub; national pull never completed. |
| ACS BG 34 dead columns | WARN | 34 cols 100% NULL at BG (B07001/B08006/B17001/B19083/B25097 — not published at BG). No data loss; schema bloat only. |
| FBI UCR · FRED macro | WARN — empty dirs | Spec + script exist, zero parquets. Status: planned. |
Overall validation across the cache: 115 PASS · 6 WARN · 13 FAIL (VALIDATION_REPORT totals). The 13 FAILs are the 12 FCC INTEGER-FIPS files (since recast) plus the empty Redfin ZIP. Tier-F (local market context) wiring of these into the REI feature builder is in progress; three model features (listing_duration_months, months_since_prev_sale, mortgage_age_months) remain under leakage audit and are not yet validated.
Where this connects
Summary card and deep-link target: Data sources → US demographics cache. The property-record side (how permits match to parcels) is on the same page. The geocoding helper that resolves lat/lon → block-group geoid for these joins lives at src/new_model/geocoding.py; the full per-source specs at notes/demographics/<source>_spec.md and the cross-source manifest at data/demographics/MANIFEST.json.
Rendered from notes/demographics/ (README + DATA_CATALOG + specs + audits).