
Roofing pipeline · Progress Notebook · single source of truth · 9-step MECE pipeline · 2026-05-20 rev2 (Step 1 restructure + dev-layer)

This is the only notebook. It is the macro / decision layer for the roof-replacement pipeline — every box below has a status pill and a link to its detailed Markdown spec. Detail lives in Markdown (for AI agents). Decisions live here (for humans to read and steer). The pipeline is 9 MECE steps: Labeling → Coverage → Enrichment → Folds → Features → Training → Calibration → Output gate → Deploy. Steps 1-2 are active workstreams; Steps 3-9 spec'd and audited 2026-05-21 (per-step + integration) — implementation not started. 2026-05-20 rev2: Step 1 restructured into 1.1 (variable inventory) + 1.2 (type+subtype MECE) — see ADR.
📋 Rules Reference — Step 1 & Step 2, in plain language
How a permit becomes a permit_scope label · how FA & BuildZoom municipalities are cleaned into one canonical key · the 4-gate coverage decision tree — all visual, all clear.
Open Rules Reference →
🏠 Engagement deck — CallZeke Roofing · Hernando · Pasco · Pinellas
The 3-county engagement: coverage answer (497,856 SFH / 98 % in a covered jurisdiction — jurisdiction-membership, not real visibility; honest caveat inline), the full 9-step methodology walkthrough with 5-agent audit verdicts, and the ranked 15K list. Production state (commit a31bdbc, 188 feat post 6 structural fixes + insurance MVP): model-quality lift 6.88× fold-1 / 6.44× cross-fold mean (range 5.88–6.86×); deliverable lift (after 50/25/25 quota) 5.66× / 1,654 caught / 11.03 % precision (174 feat after synth_v4 drop, parsimony win). Status: HOLD per 4-of-4 external AI reviewer consensus (round 4) — re-anchor to 2026-05-31 + renegotiate quota required before shipping to CallZeke.
CallZeke list audit (#97) →
⚠ Round-4 external AI review (2026-05-22 PM, 4 independent agents): HOLD. 4-of-4 reviewers (ChatGPT + Gemini + Claude + others) converged on the same verdict after reading the v5 FULL dossier (DOSSIER_v5_FULL_for_external_ai.md, 55 KB self-contained):

2 ship-blockers: (1) Stale 2025-05-31 anchor — the 6-month forward window (Jun–Nov 2025) has already transpired; SDRs would call homeowners who may have already reroofed in late 2025; crosses the SB-76 / Helene / Milton regime change. Re-score at 2026-05-31 required (AWS-cost gated, REGLA DE ORO). (2) Quota cannibalization — 50/25/25 county allocation costs -1.31× lift, -373 caught positives. Pinellas is 67 % tile/metal (35–50 yr life) while Pasco is 99 % asphalt (15–25 yr); model correctly ranks Pinellas low but quota forces 1,939 below-natural-threshold rows in. Renegotiate with CallZeke before shipping next deliverable.

New critical/major findings to act on (quick wins, ≤1 day each):
  1. Action-zone calibration (top-1k/5k/15k + per-county + per-material) — global ECE 0.0003 hides top-decile -6.7 % relative under-call
  2. Triple-baseline lift report (full-pop / gated-pool / age-cohort-matched) — gate alone delivers ~1.75× of headline 7.0×, model-on-top is ~4.0×
  3. 10-seed insurance-wall validation — single-seed +0.41× may be Milton-driven, cross-fold +0.04× still within σ
  4. Block bootstrap CI on +0.50× cumulative gain — folds correlated, effective n < 6
  5. Jurisdiction match-rate as explicit feature — closes gate-fallback circularity (reviewer A3)
  6. Synth_v4 ablation given v5 material-window — test for age-cohort leakage
  7. HOA fee residualization on AVM + zip-income — disentangle wealth proxy vs roof signal
  8. Soft-label NA at weight=0.5 — definitive close of D21 / Q20
  9. Material-aware quota simulation — recovers est. +0.3-0.5× of the 1.31× quota cost

What reviewers said NOT to do: defer full Citizens depopulation external data until 10-seed validates insurance-wall MVP; kill Q19 (FEMA reclass / 4-point inspection chasing); kill per-storm re-scoring; do not drop fold-6 from cross-fold mean even though it shows brittleness (it's the honest stress test for distribution shift, 4 of 4 reviewers agree).
🌙 Autopilot overnight + 🔍 triple-audit follow-up — 2026-05-22 (earlier in day). Mandate: "audit audit audit, scientific method, get BEST model" → "now challenge with domain expert + Apollo cross-pollination lens." Outcome: new production stack = v3_5layer + synth_v4 (subject-exclusion bug fixed) + synth_v5 + POS=REPLACEMENT (199 feats), cross-fold mean lift 6.51× (range 5.78-7.04×), fold-1 anchor 7.1× on gated ranked-15k list. +0.55× vs the pre-autopilot 5.96× baseline (2 iteration cycles total, +9 % relative).

Triple agents synthesised: (1) DS expert found a synth_v4 subject-self-contamination bug + collinearity in roof-age cluster; (2) FL roofing domain expert called out insurance-renewal signals + sign-flipped distress composites; (3) Apollo cross-pollination identified Tier F micro (BLS+ACS+FHFA) and V2.1 regime interactions as free wins. Acted on: synth_v4 fix + synth_v5 layer (sign-flipped distress composites, material-adjusted replacement window 15-25 / 35-50 / 40-60 / 12-20 yr by cover code, Apollo-port equity/absentee features) shipped. Tier F micro built + tested + DROPPED (0 features in top-40; Apollo a-priori warning confirmed). New top-2 by gain: in_replacement_window_by_material (#1, 13.8 %) and pct_of_useful_life (#2, 8.8 %), a clean substitution of the asphalt-hardcoded version.

Master reports: triple_audit_consolidated.md · autopilot_overnight.md · master_summary.md · v5_champion_6fold.md.
🔬 10-agent Opus 4.7 audit + cross-anchor re-verification — 2026-05-22 (autonomous weekend run, later in day). Mandate: "iterate until 10 agents agree" → re-anchored to 2025-10-31 produced fold-1 lift@15k 9.24×. 10 parallel agents (Critic, Model QA, Reality Checker, Scientist, Architect, Code Reviewer, Security, Verifier, Tracer, Doc Specialist) dispatched. 8/10 converged on HOLD verdict: the 9.24× is materially inflated by right-censoring + base-rate arithmetic, not a real model improvement.

Empirical re-verification. Same 174-feature v3 stack re-evaluated at 3 fully-observed anchors (forward windows fully covered by gold 2026-05-12 vintage):
| T0 anchor | forward window | base rate | lift@15k | recall | prec |
|---|---|---|---|---|---|
| 2025-10-31 (right-censored) | 2025-11..2026-04 | 1.21% | 9.24× | 27.8% | 11.2% |
| 2024-11-30 (fully observed) | 2024-12..2025-05 | 3.27% | 6.53× | 19.7% | 21.4% |
| 2024-05-31 (fully observed) | 2024-06..2024-11 | 2.56% | 5.99× | 18.0% | 15.3% |
| 2023-11-30 (fully observed) | 2023-12..2024-05 | 2.08% | 6.61× | 19.9% | 13.8% |
3-anchor (fully-observed) mean: 6.38× ± 0.34 — essentially equal to the prior 6.34× cross-fold mean. The 9.24× was real ON the eval but the eval was biased: base rate deflated 1.21% (vs ~2.6% norm = ×1.6 lift inflation by mechanics) + Hernando 2026-Feb/Mar permits truncated (116 / 91 vs ~500/mo baseline). Pasco + Pinellas aggregate stayed flat → only Hernando is right-censored.
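The table's columns are tied together by the usual definition of lift: precision in the top-15k list divided by the population base rate. The sketch below recomputes lift from the base-rate and precision columns and re-derives the 3-anchor mean. The tuple layout and function name are illustrative, not taken from the pipeline code; small differences from the table come from the rounded inputs.

```python
import statistics

# (T0 anchor, base rate, precision@15k, reported lift@15k, fully observed?)
# Values copied from the cross-anchor table; the layout is illustrative.
ANCHORS = [
    ("2025-10-31", 0.0121, 0.112, 9.24, False),
    ("2024-11-30", 0.0327, 0.214, 6.53, True),
    ("2024-05-31", 0.0256, 0.153, 5.99, True),
    ("2023-11-30", 0.0208, 0.138, 6.61, True),
]


def lift_at_k(precision: float, base_rate: float) -> float:
    """Lift over random = precision inside the list / population base rate."""
    return precision / base_rate


for t0, base, prec, reported, observed in ANCHORS:
    recomputed = lift_at_k(prec, base)
    print(f"{t0}  recomputed {recomputed:.2f}x  reported {reported:.2f}x  "
          f"fully_observed={observed}")

# Headline statistic: mean and sample SD over the fully observed anchors only.
observed_lifts = [row[3] for row in ANCHORS if row[4]]
mean_lift = statistics.mean(observed_lifts)
sd_lift = statistics.stdev(observed_lifts)
print(f"3-anchor mean: {mean_lift:.2f}x ± {sd_lift:.2f}")
```

Because the base rate sits in the denominator, a truncated forward window (fewer observed positives) pushes lift up even when precision is unremarkable, which is the mechanism the audit flags for the 2025-10-31 anchor.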

Convergent BLOCKERS surfaced (will-fix list):
  • B1 · Stale county multipliers (1.127 / 1.212 / 0.729 hardcoded from 2025-05-31 anchor → applied to 2025-10-31 ranking) — flagged by Model QA, Architect, Verifier.
  • B5 · Confirmed leak: canceled_roof used LATEST_STATUS string equality (current status, not as-of-T0) — flagged by Security. FIXED 2026-05-22 via LATEST_STATUS_DATE ≤ T0 gate in build_synth_v3.py:265-300. Expected lift drop: small (canceled_roof's marginal contribution was modest).
  • B6 · is_listed 100% identical across two T0s 6mo apart (497,692 / 497,856) — REM single-snapshot leakage suspected. Under investigation.
  • Clobber bug: v3_metrics.json + fold1_scored_v3.parquet overwritten by parallel anchor runs. FIXED 2026-05-22: filename now embeds EVAL_T0 when RUN_TAG empty.
  • 5 sidecars missing at 2025-10-31: equity_macro, micro_local, synth_coverage, synth_v4, temporal_distress. SKIP_LAYERS default protects training, but explicit rebuild required if those families are ever revived at this anchor.
Recommendation: hold the 6.38× mean as canonical headline. The 2025-10-31 anchor stays useful as a forward-looking projection but should NOT be cited as evidence of a +3× model jump. Re-derive county multipliers at the canonical anchor before re-publishing per-county probabilities.

Master report: 2026-05-22_10agent_audit_synthesis.md · data: data/sandbox/model/cross_anchor_audit_2026-05-22.json.
✅ 2026-05-22 ROUND-3+4 CONSENSUS — final headline locked at n=6 anchors. After 6 fix cycles (canceled_roof leak · clobber bug · Y/N letter-code dead-feature · flood-zone dead-feature · multipliers anchor-parameterized · circular-validation leak via temporal holdout) + expansion from n=3 to n=6 fully-observed anchors:

Cross-anchor mean lift@15k = 6.66× ± 0.33, 95% CI [6.31, 7.01] (t df=5). Six anchors all post-2022, all fully observed by gold vintage 2026-05-12: [7.08, 6.64, 6.07, 6.74, 6.80, 6.62] at T0 ∈ {2025-05-31, 2024-11-30, 2024-05-31, 2023-11-30, 2023-05-31, 2022-11-30}. Critically the 95% CI does not cross 6.0× — the model is statistically distinguishable from a 6× baseline.
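The interval above can be reproduced from the six per-anchor lifts with a plain Student-t interval. This is a minimal check using only the standard library; the critical value 2.571 is the standard two-sided 95% t value for 5 degrees of freedom, hardcoded here rather than computed.

```python
import math
import statistics

# lift@15k at the six fully observed anchors quoted above.
LIFTS = [7.08, 6.64, 6.07, 6.74, 6.80, 6.62]

# Two-sided 95% critical value of Student's t with df = 5 (table value).
T_CRIT_DF5 = 2.571


def t_interval(values, t_crit):
    """Return (mean, sample SD, (lower, upper)) for a t-based interval."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator
    half_width = t_crit * sd / math.sqrt(n)
    return mean, sd, (mean - half_width, mean + half_width)


mean, sd, (lower, upper) = t_interval(LIFTS, T_CRIT_DF5)
print(f"{mean:.2f} ± {sd:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# prints: 6.66 ± 0.33, 95% CI [6.31, 7.01]
```

The interval treats the six anchors as independent draws; overlapping training data between anchors would make the effective sample size smaller and the interval wider.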

Per-county lifts (raw, no quota): Pasco 6.76× / Hernando 5.42× / Pinellas 4.29×. One-way ANOVA F=14.45, p≈0.005 — the per-county gap is REAL, not noise (driven by roof-material lifespan differences: asphalt 15-25yr in Pasco vs tile/metal 35-50yr in Pinellas).

Production deliverable: callzeke_ranked_15k_v3.csv scored at SCORE_DATE=2025-05-31 (most-recent fully-observed anchor; lift 7.08×). Per-county multipliers loaded from data/sandbox/calibration/county_multipliers_2024-05-31.json — fit at strict temporal holdout (forward window observed by Dec 2024, applied 5 months later) to eliminate the circular-validation leak Security R3 found.

Round-3 + R4 agent verdicts (final):
  • Critic R3: SHIP-WITH-RESERVATIONS · Model QA R3: SHIP (conditional) · Reality Check R3: SHIP-OK
  • Scientist R3: STATISTICALLY VALID · Architect R3: PRODUCTION-READY · Security R3→R4: LEAK CLOSED
  • All 6 round-3+ agents SHIP. All 10 round-1 agents' concerns addressed by R2/R3 fixes.
Commits: dffcb7a · f8109e0 · aa09475 · 3b50764 · ceb9802 · 21e3a1f · 838d432 · 667153e. Master report: per_county_lift_postYN · county_material_calibration (holdout fit).
🔬 2026-05-24 multi-seed validation + transferability infrastructure.

1. Multi-seed 3-seed × 8-anchor canonical replaces single-seed headline. Rerun all 8 anchors at seeds 42 and 1337 (originals were 20260521). Per-anchor 3-seed means:
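| T0 | seed=1 | seed=42 | seed=1337 | 3-seed mean | within-SD |
|---|---|---|---|---|---|
| 2025-05-31 (prod) | 7.08 | 7.97 | 8.07 | 7.71 | 0.55 |
| 2025-02-28 | 7.63 | 7.64 | 7.71 | 7.66 | 0.04 |
| 2024-11-30 | 6.64 | 7.31 | 7.42 | 7.12 | 0.42 |
| 2024-08-31 | 6.72 | 6.61 | 6.62 | 6.65 | 0.06 |
| 2024-05-31 | 6.07 | 6.10 | 6.14 | 6.10 | 0.04 |
| 2023-11-30 | 6.74 | 6.88 | 6.77 | 6.80 | 0.07 |
| 2023-05-31 | 6.80 | 7.07 | 6.84 | 6.90 | 0.15 |
| 2022-11-30 | 6.62 | 6.15 | 6.47 | 6.41 | 0.24 |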
Canonical: 6.92× ± 0.56 (95% CI [6.45, 7.39], t df=7). seed=1 was 0.13× LOW. Within-anchor seed-noise SD ≈ 0.20×. p vs H0(mean=6.0)=0.0024; p vs H0(mean=7.0)=0.70. Honest framing: "~7× lift @ 15k with 95% CI [6.45, 7.39]".
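The canonical figures follow directly from the 24 per-run lifts. The sketch below recomputes the per-anchor means, the within-anchor seed SD, the cross-anchor mean and SD, the seed=1 shortfall, and the one-sample t statistics behind the two p-values. Variable names are illustrative; p-values are left out because the standard library has no t distribution.

```python
import math
import statistics

# lift@15k per anchor for (seed=1, seed=42, seed=1337), copied from the table.
RUNS = {
    "2025-05-31": (7.08, 7.97, 8.07),
    "2025-02-28": (7.63, 7.64, 7.71),
    "2024-11-30": (6.64, 7.31, 7.42),
    "2024-08-31": (6.72, 6.61, 6.62),
    "2024-05-31": (6.07, 6.10, 6.14),
    "2023-11-30": (6.74, 6.88, 6.77),
    "2023-05-31": (6.80, 7.07, 6.84),
    "2022-11-30": (6.62, 6.15, 6.47),
}

anchor_means = {t0: statistics.mean(v) for t0, v in RUNS.items()}
within_sd = {t0: statistics.stdev(v) for t0, v in RUNS.items()}

grand_mean = statistics.mean(list(anchor_means.values()))
between_sd = statistics.stdev(list(anchor_means.values()))
mean_within_sd = statistics.mean(list(within_sd.values()))
seed1_mean = statistics.mean([v[0] for v in RUNS.values()])


def t_statistic(h0_mean: float) -> float:
    """One-sample t statistic of the anchor means against a hypothesised mean."""
    standard_error = between_sd / math.sqrt(len(anchor_means))
    return (grand_mean - h0_mean) / standard_error


print(f"canonical lift: {grand_mean:.2f}x ± {between_sd:.2f}")
print(f"seed=1 shortfall: {grand_mean - seed1_mean:.2f}x")
print(f"mean within-anchor seed SD: {mean_within_sd:.2f}x")
print(f"t vs 6.0: {t_statistic(6.0):.2f}   t vs 7.0: {t_statistic(7.0):.2f}")
```

A large positive t against 6.0 and a near-zero t against 7.0 (df = 7) is what the "~7× lift" framing rests on.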

2. National-county transferability infrastructure SHIPPED, sanity-test BLOCKED.
  • SHIPPED: T0_ONLY env var (17 builders) · model.save_model (.lgb + .feats.json) · score_county.py · COUNTIES env-var refactor · Maricopa silver_rem 2022-05..2026-04 pulled from S3 (~5GB).
  • BLOCKED: Maricopa coverage_decisions has only 27 FLAG rows (no INCLUDE) → enrichment universe = 0 properties → can't build/score. Need coverage pipeline run for Maricopa (separate workstream) OR INCLUDE-bypass env that uses all FA SFH from silver_rem.
  • FIPS naming gotcha: Maricopa stored as "4013" (4-digit) in S3 + coverage_run/silver/ but "04013" (5-digit zero-pad) elsewhere. Resolved via _COUNTIES_NAMES dict in build_enrichment.py but still needs coverage data.
  • FL enrichment at 2024-11-30 was BACKED UP + RESTORED during the attempt (no contamination of canonical artifacts).
Recommendation: defer national sanity to a Coverage-team sprint; or add INCLUDE-bypass env var for sanity-only use (2-3 hr work). Multi-seed validation alone delivered the headline upgrade; national sanity remains blocked until coverage data lands.

Master reports: multiseed_validation · national_sanity_blocker. Commits: cd3387a (multi-seed + COUNTIES env) · 62310fc (T0_ONLY refactor) · 0af7b68 (model save + score_county.py).
✅ 2026-05-25 v4 SHIPPED — pipeline exclusions enforced · Action Plan v1 wired · exact 15,000.

v9 SHIPPED (2026-05-25 23:48 · cycle 5 · 6-county Path A model): Retrain landed in <30 minutes (much faster than 10h estimate · prior sidecars built earlier). Output model_v3_fl6_pathA_full.lgb · 189 features (synth + behavioral + baseline). Per-county AUC: Hernando 0.852 · Pasco 0.884 · Pinellas 0.714 · Duval 0.834 · Orange 0.854 · Sarasota 0.834. Lift@15K on full 6-county pool: 9.03× vs v8 baseline 8.26× = +9.3% improvement. Caught 3,721 / 15K vs v8's 3,405 = +316 reroofs. Isotonic-calibrated scored parquet at fold1_scored_v3_fl6_pathA_full_calibrated.parquet. v9 client CSV at data/sandbox/model/callzeke_PROD_15k_2026-04-30_v9_client.csv (15,000 × 30, all R1-R7 + caps preserved). v8 ∩ v9 overlap = 70.4% (4,435 row swap from re-rank). ETA was ~10h. Post-train: isotonic calibration on top 50K + per-county AUC emission + re-score CZ counties at PROD anchor 2026-04-30 → v9 client CSV. Expected +5-15% lift over v8 (per Path A Maricopa precedent: 189 → 287 feat = +0.072 AUC).

v8 CURRENT SHIP (2026-05-25 cycle 4 · 6-county 189-feat + all R1-R7 + caps fixed): data/sandbox/model/callzeke_PROD_15k_2026-04-30_v8_client.csv (15,000 × 30, exact quota, all 11 ship-blockers from manual audit closed). DM/CC caps now actually enforce (was broken in v5/v6). R5/R6/R7 wired into pipeline (recent roof from gold · mailing opt-out · brand-new cutoff at 2021). +14.6% reroofs caught vs v5 (6-county model) + 2,058 audit-driven drops backfilled from scored pool.

v6 (superseded): data/sandbox/model/callzeke_PROD_15k_2026-04-30_v6_client.csv (15,000 × 30 cols, client-ready) · internal v6_full.csv (15,000 × 43 cols with all flags). v5 superseded after 6-county Path A retrain (FL+Orange+Duval+Sarasota) demonstrated **+14.6% catch improvement** on the same 3 CallZeke counties at backtest @ 2024-11-30 (2,668 → 3,058 reroofs caught at the 15K quota). Pinellas was the biggest beneficiary (+37% lift), Pasco +13%, Hernando +8%. 5,036 rows (33.6% of list) re-ranked vs v5; tier composition essentially identical.
v3 (14,788 rows, dropped post-hoc) is superseded; v4 oversamples the scored pool and backfills so the final 15,000 is exact AFTER all hard exclusions.

Hard exclusion rules (now part of pipeline · enforced before quota slice):
  • R1 SFH-only (drop 199 non-SFH)
  • R2 mailable: non-blank MAILING address+city+state (drop 174)
  • R3 EMV > 0 (drop 155, Pinellas FA blanks)
  • R4 USPS-deliverable via Smarty DPV ∈ {Y, S, D} (drop 161)
From 27,000 oversampled pool → 26,447 survived → quota-sliced to 15,000.
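The ordering matters: exclusions run on the oversampled pool first, and the slice to the final size happens only on survivors, so the list stays exact. The sketch below illustrates that ordering. The `Row` fields and function names are hypothetical and do not mirror `pipeline_filters.py`; each dropped row is attributed to the first rule it fails.

```python
from dataclasses import dataclass


@dataclass
class Row:
    # Hypothetical field names, chosen for illustration only.
    parcel_id: str
    is_sfh: bool
    mail_address: str
    mail_city: str
    mail_state: str
    emv: float
    dpv_code: str  # Smarty DPV match code
    score: float


DELIVERABLE_DPV = {"Y", "S", "D"}

# (rule name, predicate that returns True when the row is KEPT)
RULES = [
    ("R1_sfh_only", lambda r: r.is_sfh),
    ("R2_mailable", lambda r: all(
        s.strip() for s in (r.mail_address, r.mail_city, r.mail_state))),
    ("R3_emv_positive", lambda r: r.emv > 0),
    ("R4_usps_deliverable", lambda r: r.dpv_code in DELIVERABLE_DPV),
]


def apply_exclusions(rows):
    """Return (survivors, drop counts keyed by the first rule each row failed)."""
    drops = {name: 0 for name, _ in RULES}
    survivors = []
    for row in rows:
        failed = next((name for name, keep in RULES if not keep(row)), None)
        if failed is None:
            survivors.append(row)
        else:
            drops[failed] += 1
    return survivors, drops


def top_n(rows, n):
    """Slice the n highest-scoring rows from an already filtered pool."""
    return sorted(rows, key=lambda r: r.score, reverse=True)[:n]
```

Attributing each drop to the first failed rule keeps the per-rule counts additive; counting every failed rule independently would double-count rows that fail more than one.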

Action Plan tiering (replaces hardcoded "30 days" constant): 1,354 High (30d) · 4,647 Medium (60d) · 8,999 Low (90d) after demoting 149 vacant/military/recent-roof rows. 12-week cadence schedules per tier. CC capped 1 per owner (14,843 unique owners) AND DM capped 2 per mailbox (14,696 unique targets). Total touches: 35,726 DM + 2,682 CC.

Spec + pipeline: Action Plan System (visual + flowchart) · finding #79 · builder scripts/roofing/build_15k_with_exclusions.py · filters lib scripts/roofing/lib/pipeline_filters.py · Smarty client scripts/roofing/lib/smarty_client.py (with JSONL cache → free re-runs).
🔍 Replacement Profiles — who actually replaces their roof? (2026-05-25)

Empirical analysis of 362,643 gold REPLACEMENT events across the 3 CallZeke counties (Hernando · Pasco · Pinellas) joined with FA silver. Answers the question "is it unlikely for a roof to be replaced under 15 years?" → No — 9.93% of all replacements happen on roofs <15yr old (storm damage, insurance claims, premature material failure). Cutoff for the client list was loosened from "<15yr drop" to "<5yr drop only" (R7 max_year_built=2020) based on this finding.

Age at replacement (empirical · 362,643 events)

| Age bracket | Events | % of all | Interpretation |
|---|---|---|---|
| 0–5 yr | 21,330 | 5.97% | Mostly data artifacts + new-build permit re-issuance |
| 5–10 yr | 3,551 | 0.99% | Premature failure / storm-forced |
| 10–15 yr | 10,609 | 2.97% | Coastal humidity + hurricane forced |
| 15–20 yr | 44,727 | 12.52% | Sweet spot for asphalt-shingle |
| 20–25 yr | 39,177 | 10.96% | Late asphalt cycle |
| 25–30 yr | 26,599 | 7.44% | Mid-life tile / heavy asphalt |
| 30–40 yr | 67,116 | 18.78% | Tile cycle peak |
| 40–60 yr | 99,861 | 27.94% | Late tile + older stock first-replacement |
| 60–120 yr | 44,399 | 12.42% | Vintage stock modernization |

Six personas

| # | Persona | N | % | Median (age @ replace · year built · EMV · sqft · LTV) |
|---|---|---|---|---|
| P1 | Classic FL Owner-Occupier · Asphalt Cycle (15–25yr asphalt · owner-occupied · individual) | 21,046 | 5.89% | 18yr · 1992 · $268K · 1,640 sqft · 21% |
| P2 | Tile Roof · Late-Life Cycle (25–40yr tile/concrete · higher-dollar replacement) | 35,848 | 10.03% | 31yr · 1982 · $341K · 2,090 sqft · 13% |
| P3 | Storm-Damage Early Replace (Irma cohort) (<15yr roof · replaced 2017–2019 · forced) | 3,205 | 0.90% | 9yr · 2007 · $245K · 1,830 sqft · 35% |
| P4 | Institutional Investor Portfolio (LLC / Trust / Corp · OOS mailing · scheduled cap-ex) | 37,106 | 10.38% | 26yr · 1988 · $228K · 1,520 sqft · 22% |
| P5 | Aging Affordable Stock · Long-Overdue (30–60yr · EMV <$200K · owner-occupied · deferred maintenance) | 28,470 | 7.97% | 42yr · 1971 · $134K · 1,210 sqft · 31% |
| P6 | High-Value Pre-Listing Upgrade (5–15yr · EMV >$400K · low LTV · prep-for-sale) | 4,194 | 1.17% | 10yr · 2010 · $580K · 2,470 sqft · 24% |

Implications for the list

  • Sweet spot is 15–30yr (P1 + P2 + first half of P5 ≈ 30% of all replacements). The model already weights this via roof_age_months_est.
  • P3 storm-damage (sub-15yr) is real but small (0.9%). Loosened R7 cutoff to 5yr not 15yr — drops only brand-new homes.
  • P4 institutional (10%) = scheduled cap-ex, not lead-driven. CC outreach is wasted on portfolio managers. Recommend DM-only for these.
  • P6 high-value (1.2%) = rare but high-LTV roofer's dream. Worth a separate tier in future versions.
Full tab + per-county tile rendering: CallZeke deck → Replacement Profiles tab · script scripts/roofing/audit_replacement_profiles.py · machine-readable JSON notes/Roofing/audits/2026-05-25_replacement_profiles.json.
✅ 2026-05-24 SESSION CLOSE — Production deliverable shipped + R5 RETRACT.

Production list shipped: data/sandbox/model/callzeke_PROD_15k_2026-04-30_quota_50_25_25.csv (superseded by v4 on 2026-05-25; see panel above)
  • SCORE_DATE = 2026-04-30 (forward window May-Oct 2026; predicts NEXT 6 months)
  • Quota: Hernando 7,500 / Pasco 3,750 / Pinellas 3,750 (50/25/25 strategic intent)
  • Eligible pool 217,121 (post 15-yr recent-roof gate)
  • Score range 0.20-0.92, median 0.32
  • Expected 32% recall, 4.70× lift, ~1,940 catches (per 2025-05-31 backtest)
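A per-county quota and a pure score ranking generally pick different rows, and the difference is what the reviewers call quota cannibalization. The sketch below slices a scored pool both ways and lists the rows that only the quota admits. It assumes rows are `(parcel_id, county, score)` tuples; that layout and the function names are illustrative, not the pipeline's.

```python
from collections import defaultdict

# 50/25/25 allocation quoted above.
QUOTA = {"Hernando": 7500, "Pasco": 3750, "Pinellas": 3750}


def quota_slice(scored, quota):
    """Take the top quota[county] rows by score within each county."""
    by_county = defaultdict(list)
    for row in scored:
        by_county[row[1]].append(row)
    picked = []
    for county, n in quota.items():
        ranked = sorted(by_county[county], key=lambda r: r[2], reverse=True)
        picked.extend(ranked[:n])
    return picked


def natural_slice(scored, n):
    """Take the top n rows by score, ignoring county."""
    return sorted(scored, key=lambda r: r[2], reverse=True)[:n]


def forced_in(scored, quota):
    """Rows the quota admits that a same-size pure ranking would leave out."""
    natural_ids = {r[0] for r in natural_slice(scored, sum(quota.values()))}
    return [r for r in quota_slice(scored, quota) if r[0] not in natural_ids]
```

Running `forced_in` per county on a backtest anchor gives the count of below-natural-threshold rows, which is the quantity a quota renegotiation would try to shrink.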

🚨 R5 RETRACT (national transferability): Critic R5 caught a smoking gun — naive year_built ASC ranking catches EXACTLY the SAME 11 Maricopa positives as the 189-feature FL-trained model. Model adds ZERO signal over trivial age-sort on Maricopa. Transferability claim RETRACTED in both audit MD and CALLZEKE deck. Honest framing: model is FL-validated; non-FL transferability requires naive-baseline tests in each market (only Maricopa tested so far → failed; Harris pending).

FL signal IS REAL: the same naive-baseline test on FL @ 2025-05-31 shows model beats naive year_built by +369% to +1,132% across all quotas. The model is NOT just an age heuristic in FL — it's a genuine 189-feature ranker that uses material, refi propensity, building condition, storm exposure, and per-county calibration to find replacement-ready properties.
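The naive-baseline test amounts to ranking the same universe twice, once by model score and once by `year_built` ascending, and comparing which positives land in the top k. The sketch below shows that comparison. The dict keys (`parcel_id`, `score`, `year_built`, `label`) are assumed for illustration and are not the audit script's schema.

```python
def top_k_positives(rows, k, key, reverse=False):
    """Ids of the positives among the first k rows after sorting by `key`."""
    ranked = sorted(rows, key=key, reverse=reverse)
    return {r["parcel_id"] for r in ranked[:k] if r["label"] == 1}


def naive_baseline_test(rows, k):
    """Compare the model ranking against an oldest-home-first ranking at list size k."""
    model_hits = top_k_positives(rows, k, key=lambda r: r["score"], reverse=True)
    naive_hits = top_k_positives(rows, k, key=lambda r: r["year_built"])
    if naive_hits:
        relative_gain = (len(model_hits) - len(naive_hits)) / len(naive_hits)
    else:
        relative_gain = None  # undefined when the naive list catches nothing
    return {
        "model_caught": len(model_hits),
        "naive_caught": len(naive_hits),
        "same_positives": model_hits == naive_hits,
        "relative_gain": relative_gain,
    }
```

`same_positives` being true is the Maricopa failure mode described above; a large positive `relative_gain` at every quota is the Florida result.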

Prior 25K (2026-04 DM) overlap with new 50/25/25 prod list:
  • 3,088 of 25,000 (12.4%): model rebuilds list from scratch vs Apollo era
  • Hernando: 2,889 / 16,519 (17.5% kept)
  • Pasco: 165 / 6,935 (2.4% kept)
  • Pinellas: 34 / 1,546 (2.2% kept)
  • Median current rank of prior list = 273,729 (54% of pop = essentially random)

Score @ recall thresholds (eligible pool):
  • 50% recall: score ≥ 0.269, list size 24,273 (11.0% of pool)
  • 70% recall: score ≥ 0.178, list size 46,191 (21.0% of pool)
  • 80% recall: score ≥ 0.134, list size 65,084 (29.5% of pool)
  • 90% recall: score ≥ 0.098, list size 95,530 (43.4% of pool)

15K list caps at ~32% recall mathematically. If client wants 80% recall they need a 65K list.
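The recall thresholds come from walking down the ranked pool until the cumulative share of positives reaches the target. The sketch below does that on `(score, label)` pairs from a backtest anchor; the function name and input layout are illustrative, and tied scores are not handled specially.

```python
def list_size_for_recall(scored, target_recall):
    """Return (score threshold, list size) of the shortest top-ranked list
    whose recall reaches target_recall.

    scored: list of (score, label) pairs with label in {0, 1}.
    """
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")
    ranked = sorted(scored, key=lambda r: r[0], reverse=True)
    total_positives = sum(label for _, label in ranked)
    if total_positives == 0:
        raise ValueError("no positives in the pool")
    caught = 0
    for size, (score, label) in enumerate(ranked, start=1):
        caught += label
        if caught / total_positives >= target_recall:
            return score, size
    raise RuntimeError("unreachable: full list always has recall 1.0")
```

Read the other way around, the same walk gives the recall a fixed list size can reach, which is why a 15K list tops out near 32% on this pool.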

Session totals (2 days, 2026-05-22 → 2026-05-24): 33 commits · 5 audit rounds (10 + 5 + 5 + 5 + 1 agents) · 6 bug fixes shipped (canceled_roof leak, clobber, Y/N dead-feature, flood-zone dead-feature, circular-validation leak, dead-code) · 3 framework refactors (T0_ONLY env, model save + score_county.py, JSON multipliers).

Master reports: 15k_list_quota_analysis · national_sanity_3county (R5 RETRACT) · SESSION_HANDOFF.md (full handoff). Final commits: 4ba5f25 · 183c7ee · 816edda · 574d07f · ef81b7e · f242a9d · 9a6f0c5 · d8ec057.
🔬 2026-05-24 multi-seed + transferability infrastructure. Two upgrades shipped: (1) 3-seed cross-anchor validation replacing the single-seed headline; (2) per-county transferability infrastructure (T0_ONLY env + saved model artifact + score_county.py) so future runs can test the model on new geographies without retraining.

3-seed × 8-anchor canonical headline:
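| T0 | seed=1 | seed=42 | seed=1337 | 3-seed mean | within-SD |
|---|---|---|---|---|---|
| 2025-05-31 (production) | 7.08 | 7.97 | 8.07 | 7.71 | 0.55 |
| 2025-02-28 (Helene post) | 7.63 | 7.64 | 7.71 | 7.66 | 0.04 |
| 2024-11-30 | 6.64 | 7.31 | 7.42 | 7.12 | 0.42 |
| 2024-08-31 (Helene in) | 6.72 | 6.61 | 6.62 | 6.65 | 0.06 |
| 2024-05-31 | 6.07 | 6.10 | 6.14 | 6.10 | 0.04 |
| 2023-11-30 | 6.74 | 6.88 | 6.77 | 6.80 | 0.07 |
| 2023-05-31 | 6.80 | 7.07 | 6.84 | 6.90 | 0.15 |
| 2022-11-30 | 6.62 | 6.15 | 6.47 | 6.41 | 0.24 |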
3-seed mean lift@15k = 6.92× ± 0.56 (95% CI [6.45, 7.39], t df=7). seed=1 (the original headline source) was 0.13× LOW. Within-anchor seed-noise SD ≈ 0.20× — meaningful relative to anchor variance. Storm-affected anchors (2025-05, 2024-11) show highest seed sensitivity (SD 0.42-0.55).

Statistical tests: p vs H0(mean=6.0) = 0.0024 (CI excludes 6×). p vs H0(mean=7.0) = 0.70 (model statistically INDISTINGUISHABLE from 7× lift). Honest framing: "~7× lift @ 15k with 95% CI [6.45, 7.39]".

Transferability infrastructure shipped:
  • T0_ONLY env var across 17 builders — filter anchor iteration to a single T0 for cheap per-county sanity work.
  • Model save in retrain_with_v3.py — writes model_v3_<TAG>.lgb + .feats.json per run.
  • score_county.py — loads model + scores a new universe (different COUNTIES) without retraining; computes lift against gold-derived labels.
  • Maricopa silver_rem pulled from S3 (all 10 anchor periods, ~5GB).
  • National sanity Maricopa-FL transfer test: running at this writing (background bqdkk8ags).
Master reports: multiseed_validation · national_sanity_blocker (now unblocked). Commits: cd3387a (multi-seed + COUNTIES env) · 62310fc (T0_ONLY refactor) · 0af7b68 (model save + score_county.py).
All 9 steps RUN end-to-end for the 3-county engagement (2026-05-21). Steps 1-3 + 4-9 executed on real pulled data (gold + silver, download-only): 8.4M permits labeled → 497,856 SFH enriched as-of T0 (leakage-clean) → walk-forward folds (1:9 case-control) → global LightGBM (top feature: roof age) → log-odds calibration (reliability monotonic, observed≈predicted) → recent-roof + VENDOR_BLIND gate → backtest. Result: ranked 15,000-home list at 5.5× lift over random, ~5× consistent across all 6 folds.

Per-step audits: folds · model · calibration+rank · independent triple-audit (2026-05-22): folds→train→calibrate→rank + leakage, verdict SHIP.

F4+F10 ablation (2026-05-22): closed both. Dropping NA-as-positive raises lift +0.37× to 5.65× (clean-label story, not inflation); dropping §7 features Δ=+0.02× (noise, non-load-bearing).

Feature audit: 58-feature catalog, all internal (FA static + REM + Gold permits).

Synthetic-features feature engineering wave (2026-05-22): 5-layer stack lifted fold 1 from 5.28× → 6.47× (+1.19×, +23% relative). Layers:
  • (1) synth +22 cols → 6.10× (NOAA hurricane + SUBDIVISION-or-ZIP neighbour density + business-sense flags + value ratios).
  • (2) permit_context +27 cols → 6.13× (repair history + 11 adjacent trades 24/36m; only 2 of 27 cracked top-30, repair signal too sparse).
  • (3) recency+distress +22 cols → 6.32× (tight 3/6/12m bands + months_since_last_X_permit per trade + distress segmentation).
  • (4) synth_v3 +39 cols → 6.47× (cleaned JOB_VALUE / REM extras: RoofCoverCode, BuildingConditionCode, BuildingQualityCode, HOA1FeeValue, FA propensity scores, flood-zone / IBTrACS-2024 NOAA-v2 with parcel-grain lat/lon, which surfaces Helene + Milton 2024 storms / cancelled-roof-permit signal / out-of-zip mailing).
v3 winners: roof_cover_code #4 (gain 58K, beat fips), nearest_storm_name+_km #14-15, building_condition_code #19, hoa1_fee_value #23.

Key learnings: #1 by gain is is_in_replacement_window (15-25yr bool); RECENCY (continuous months_since) beats COUNT (n_permits_Nm) decisively, with bucketed counts dropping out of the top-35 once continuous recency was added; distress flags carry near-zero signal for 6-mo reroof in the 3-county sample (sale/financial/net segmentation all failed). F9 closed (retrain w/ cap=1500 lost lift; baseline cap=600 correct). 2 MEDIUM still open (F5 static INCLUDE→as-of-T0, F18 50/25/25 allocation cost).

Caveats: 3-county only, no hurricane Tier F' yet, scores at the 2025-05 anchor.

Why labeling before coverage DECIDED 2026-05-20

Coverage = "what fraction of SFH stock has at least one valid roof permit?" The numerator depends on the definition of "valid roof permit" — that's labeling. So labeling has to be locked first; coverage is downstream.

Full ADR (rationale, reversal cost, what was deliberately not touched): decisions/2026-05-20_labeling_before_coverage.md.

AWS operating rules — read before any AWS call LOCKED 2026-05-20

Every step that touches AWS (Step 1.1 national rerun · Step 6 training · Step 9 deploy + monitor) must follow the operational context the platform team set. Hard requirements:

  • Tag every resource with Project=ClaudeCode-Ignacio (local override of original Proyecto=Roofing-Ignacio — generic tag used across all Claude Code work, not just Roofing).
  • Partition pruning is mandatory in dev: filter by FIPS (+ Period for Silver) BEFORE any wide read. Verify with df.explain() that PartitionFilters appears in the physical plan.
  • Clone existing EMR templates (`EduDS` / `DiegoDE`) instead of creating from scratch. Base family r5 / r6g · more expensive instances require explicit confirmation.
  • DO NOT TOUCH Security Groups · VPC · subnets · NACLs · VPC Endpoints · S3 bucket policies. 95 % of "network" problems are IAM or path.
  • REGLA DE ORO (golden rule), above all of the above: before any billable call, confirm with Ignacio. If the job is small, run it locally on main or mini.

Full doc (Hudi paths · skew handling · Lamborghini rule · agent behavior contract): notes/Roofing/aws_operational_context.md.

Dev-layer strategy — Layer A / B / C LOCKED 2026-05-20

All Step 1 iteration runs on local raw Gold to keep AWS cost bounded. Three layers, two confirmed AWS pulls, then unlimited local iteration.

| Layer | FIPS | Local path | Size | Use |
|---|---|---|---|---|
| A · fast local | 5 counties (FL+AZ+PA+TX+JAX): 12086 · 04013 · 42101 · 48201 · 12031 + Pinellas 50K sample (12103) | data/sandbox/roofing_audit/gold_<FIPS>/ | 2.7 GB | Sub-second keyword tweaks · MECE assertion |
| B · variability | 10 stratified FIPS across NE/MW/S/W census regions | same dir | ~3–8 GB | Cross-AHJ keyword drift · catches Pinellas-blind rules · keeps local iteration fast |
| C · national-local | All 1,421 in-coverage FIPS | same dir (streamed per-FIPS) | ~50–150 GB | Final pre-prod replay · zero AWS after pull |

Maricopa stored as 4-digit (4013/) in S3 — local naming is canonical 5-digit (gold_04013/). Hardware: M3 Pro · 18 GB RAM · ample disk. Layers A+B pulled — 31 FIPS on disk (2026-05-20). The classifier's fast-iteration subset (DEV_FIPS in classify_v5.py / audit_v5.py) is 10 counties. ADR: decisions/2026-05-20_step1_restructure_variables_and_type_subtype.md.
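The 4-digit vs 5-digit mismatch is a dropped leading zero, so one normalisation helper applied at every read boundary avoids silent join misses. The sketch below is a minimal version; the function names are hypothetical, and only the local directory pattern `gold_<FIPS>/` comes from the table above.

```python
def canonical_fips(raw) -> str:
    """Normalise a county FIPS code to the canonical 5-digit, zero-padded string.

    Accepts ints or strings, with or without a trailing slash (S3 prefix style).
    """
    digits = str(raw).strip().rstrip("/")
    if not digits.isdigit() or len(digits) not in (4, 5):
        raise ValueError(f"not a county FIPS code: {raw!r}")
    return digits.zfill(5)


def local_gold_dir(raw) -> str:
    """Local raw-Gold directory for a county, using the canonical 5-digit name."""
    return f"data/sandbox/roofing_audit/gold_{canonical_fips(raw)}/"


print(canonical_fips("4013/"))   # Maricopa as stored in S3
print(local_gold_dir(12103))     # Pinellas
```

Rejecting anything that is not 4 or 5 digits keeps a state-only code or a malformed prefix from being padded into a plausible-looking county.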

Define the universe of valid roof permits

DS audit · 2026-05-21 — 2 SERIOUS findings · does not ship as written.
  • S1 · label quality unvalidated. The "is_roofing recall ≈ 99 %" headline is circular — audit_v5.py measures the roof_fn stratum as v4=ROOFING AND v5≠ROOFING, i.e. v5-vs-v4 agreement, not recall against an independent ground truth. No stratum samples roofs both versions miss; the true false-negative rate is unknown (the project_type_investigation already found a 2.21 % vocabulary gap — epdm / sbs / single ply).
  • S2 · event_date rule frozen but not materialized. 1.5 is FROZEN, but labels.parquet carries no event_date column and steps/06 still names the legacy roofing_label.py (old broad_classify, pre-restructure) as owner. The Step 3 T0 anchor and the Step 4 label window both need event dates with the +90 d cap — no current-design implementation produces them.

Also MEDIUM: roof_action = NA for ~54 % of ROOFING items (12086) — gutted Step-4 positive set unless AMBIGUOUS absorbs them; gold_vintage string inconsistent across scripts. Scope-confusion clean.

audits/2026-05-21_step1_ds_audit.md

For every Gold permit row, the Step 1.2 classifier emits one MECE column — permit_scope, a list of {type, action} items covering every (object, verb) the permit touches: 20 permit_type values × 7 permit_action values (see 1.2). The old flat 14-category enum (ROOFING_REPLACEMENT, … with a separate ROOFING sub-class) was superseded by this two-axis model in the v5 restructure — see the archived 14-cat spec. Output: data/sandbox/classify_v5/<FIPS>/labels.parquet; contract in 1.4.

1.1 · Variable inventory DONE

Enumerate every Gold column and assign each a role: LABEL_SIGNAL (feeds 1.2 classifier) / FEATURE (downstream Step 5) / METADATA (provenance only) / DROP (all-null or noise). Restructure ratified 2026-05-20 (ADR 2026-05-20_step1_restructure_variables_and_type_subtype.md). Done: all 54 Gold parquet columns inventoried across the 31 FIPS → data/gold_variable_inventory.md (7 LABEL_SIGNAL · 23 FEATURE · 22 METADATA · 2 DROP), generated by scripts/roofing/inventory_v3.py.

steps/01_variable_inventory.md

1.1.1 · Schema dump DONE

Per-FIPS column inventory: name, dtype, originator. 54 columns, all present in 31/31 FIPS (no AHJ-specific columns). Recorded in the master role table of gold_variable_inventory.md (cols dtype, originator, n_fips).

1.1.2 · Null + cardinality DONE

Per-FIPS null rate + distinct count per column. gold_variable_inventory.md: master table with null_p50/null_max/dist_p50 + a complete per-FIPS null matrix (54 cols × 31 FIPS).

1.1.3 · Role assignment DONE

Every column has a role: 7 LABEL_SIGNAL (incl. BUSINESS_NAME, promoted from auxiliary), 23 FEATURE, 22 METADATA, 2 DROP (STORIES 96 % null, UNITS 99 % null). Persisted in notes/Roofing/data/gold_variable_inventory.md.

1.2 · Permit Gold taxonomy v5.3.3

Classifies every Gold permit into a single MECE column — permit_scope, a list of {type, action} items covering every (object, verb) the permit touches. A permit can do several things at once (58 % carry 2+ trades), so the taxonomy is multi-label by design; the type↔action pairing is intrinsic (same struct), not positional. is_roofing / roof_action are NOT taxonomy columns — they are roofing-project projections, derived downstream (Step 5) from permit_scope. Path: v5.0.0 single-label lost ~200 K real roofs to a tie-break bug → v5.1.0 multi-label dissolved it (is_roofing ≈ 99 %, types ≈ 93 %, 7-iteration autonomous audit loop) → v5.3.0 collapsed the schema to the single MECE column permit_scope → v5.3.1-v5.3.3 are precision/recall fixes from the 2026-05-21 audit: ROOFING keyword expansion, AHJ inspection-boilerplate stripping, and a location guard for " roof ". v5.3.3 audit: MECE pass, both defects closed. Independent recall check (vs BuildZoom PROJECT_TYPE, the S1 audit fix) measures is_roofing recall at 96.3 % on the 30-FIPS dev layer — an optimistic upper bound, ~198 K misses surfaced; the prior "≈ 99 %" was circular (v5-vs-v4 agreement). Pending next version: drop PROJECT_TYPE Tier-3 + add an explicit bn_roofer tier (needs sign-off). v4.0.1 stays shipped pending Ignacio/Eduardo sign-off.

permit_type — 20 values (the object): ROOFING · SOLAR · HVAC · ELECTRICAL · PLUMBING · POOL · FIRE · GENERATOR · WINDOWS · DOORS · GARAGE · SHED · DECK · FENCE · FOUNDATION · SIGN · SITEWORK · BUILDING · OTHER · UNKNOWN.

permit_action — 7 values (the verb): NEW · REPLACEMENT · REPAIR · ADDITION · ALTERATION · DEMOLITION · NA.

| Permit (DESCRIPTION) | permit_scope |
| "Tear off & re-roof, class-A shingles" | [{ROOFING, REPLACEMENT}] |
| "Install roof-mounted PV system 6.2 kW" | [{SOLAR, NEW}] — not roofing; the roof is only the location |
| "Tear off shingles, re-roof, install solar PV" | [{ROOFING, REPLACEMENT}, {SOLAR, NEW}] — one permit, two items |
| "Remove existing solar panels" | [{SOLAR, DEMOLITION}] |
| "Repair roof leak" | [{ROOFING, REPAIR}] |
| "Demolish detached garage" | [{GARAGE, DEMOLITION}] |
| "New single-family residence" | [{BUILDING, NEW}] |

Result preview — 50 random permits, before (v4.0.1) vs after (v5.3.3). Review aid; reproducible via scripts/roofing/before_after_sample.py 50 0.809817.

50 before/after examples — v4.0.1 → v5.3.3, random (seed 0.809817)
| # | FIPS | TYPE | DESCRIPTION | BEFORE v4.0.1 | AFTER v5.3.3 permit_scope |
| 1 | 26163 | Building | Exterior alterations per documents. (subject to field ap… | RENOVATION / NA | BUILDING:ALTERATION |
| 2 | 26163 | Commercial | Sidewalk sale june 6 thru 8 2014 | SITEWORK / NA | OTHER:NA |
| 3 | 09001 | Plu | New 2 family dwelling. Install U.G. plumbing waster line… | NEW_CONSTRUCTION / NEW | PLUMBING:NEW |
| 4 | 09001 | Building | Tent | RENOVATION / NA | BUILDING:ALTERATION |
| 5 | 17031 | Renovation/alteration | Rm-5 a2-multiunit: replace existing open wood rear porch… | RENOVATION / NA | OTHER:NA |
| 6 | 17031 | Driveway repair residential | Remove the old concrete and put new concrete | SITEWORK / NA | SITEWORK:REPAIR |
| 7 | 17031 | New construction | Self certification review: erect masonry 3 D.U. Building… | NEW_CONSTRUCTION / NEW | BUILDING:NEW |
| 8 | 17031 | Permit – express permit program | Replace current antenna c4 with one (1) tenxc bsa-da65-1… | ELECTRICAL / NA | ELECTRICAL:REPLACEMENT |
| 9 | 17031 | plumbing | | PLUMBING / NA | PLUMBING:NA |
| 10 | 36029 | Residence. Alteration | | RENOVATION / NA | OTHER:NA |
| 11 | 49035 | HVAC | Water heater | HVAC / NA | HVAC:NA / PLUMBING:NA |
| 12 | 49035 | Commercial plumbing | Building | PLUMBING / NA | PLUMBING:NA |
| 13 | 49035 | Coo | | NEW_CONSTRUCTION / NEW | OTHER:NA |
| 14 | 49035 | Building/permit/commercial/na | Building | OTHER / NA | BUILDING:NA |
| 15 | 48201 | Electrical permit | Expired*03-15-1997*occ. Report/lounge uk code | ELECTRICAL / NA | ELECTRICAL:NA |
| 16 | 48201 | Legacy building - electrical res… | 1 reconnect | ELECTRICAL / NA | ELECTRICAL:NA |
| 17 | 48201 | Structural overtime | S.f. Residence w/att. Garage (1-2-5-r3-b) 12 irc/15 iecc | NEW_CONSTRUCTION / NEW | BUILDING:NEW |
| 18 | 48201 | Fire marshal alarm permit | Enclose building for gymnasium 1-1-2-a3-n O.L. = 122 | FIRE / NA | FIRE:NA |
| 19 | 48201 | Electrical permit | Remodel for coffee carry-out kiosk to floor 1 lobby area… | ELECTRICAL / NA | ELECTRICAL:ALTERATION |
| 20 | 48201 | Mechanical permit | Residential mechanical permit | HVAC / NA | HVAC:NEW |
| 21 | 04013 | Drinking water site - drinking w… | | OTHER / NA | OTHER:NA |
| 22 | 04013 | Residential_addition | Addition of a living room patio with cover with electr… | ELECTRICAL / NA | OTHER:NA |
| 23 | 12031 | Electrical permit | 12264747/06replace 100 amp panel | ELECTRICAL / NA | ELECTRICAL:REPLACEMENT |
| 24 | 12031 | Mechanical permit | 1 kitchen range hood1 exh fan 4c836 dayton | HVAC / NA | HVAC:NA |
| 25 | 12031 | Mechanical permit | 1 heat pump 2twr3042 trane 3.5 tons 1 air handler 2tec3f… | HVAC / NA | HVAC:NA |
| 26 | 12031 | Electrical permit | $ 5.00 repaired water heater circuitwith 2 wire nuts & a… | ELECTRICAL / NA | ELECTRICAL:REPAIR / PLUMBING:REPAIR |
| 27 | 53033 | Construction permit | Tenant improvements to research + development laboratory… | INTERIOR / NA | BUILDING:NEW |
| 28 | 53033 | Side sewer permit | | PLUMBING / NA | PLUMBING:NA |
| 29 | 06073 | Electrical pmt | Electrical pmt:1402/k | ELECTRICAL / NA | ELECTRICAL:NA |
| 30 | 37119 | Mechanical permit | | HVAC / NA | HVAC:NA |
| 31 | 37119 | Mechanical permit | | HVAC / NA | HVAC:REPLACEMENT |
| 32 | 06037 | Bldg-alter/repair | Re-frame portions of (e) garage (all walls & footings to… | GARAGE / NA | OTHER:NA |
| 33 | 06037 | Bl: aircraft for hire unit 311 | | OTHER / NA | OTHER:NA |
| 34 | 06037 | Bldg-alter/repair | Supplemental to permit 21010-10000-05624 to capture the … | RENOVATION / NA | OTHER:NA |
| 35 | 06037 | Bldg-alter/repair | Cantilever steel deck | DECK / NA | OTHER:NA |
| 36 | 06037 | Bldg-alter/repair | Guardrail replacement on the 2nd floor of existing 2-sto… | RENOVATION / NA | BUILDING:REPLACEMENT |
| 37 | 47037 | Capl - plumbing permit | Inside plumbing | NEW_CONSTRUCTION / NEW | PLUMBING:NA |
| 38 | 47037 | Electrical permit | ********2 gas heater feeds 11 high bays all on ex. Svc. … | ELECTRICAL / NA | ELECTRICAL:ADDITION / PLUMBING:ADDITION |
| 39 | 27053 | Ele exst multi family | Replace old 60 amp services with new 100 amps in each un… | ELECTRICAL / NA | BUILDING:REPLACEMENT |
| 40 | 13121 | Residential new | New duplex residence | NEW_CONSTRUCTION / NEW | OTHER:NA |
| 41 | 32003 | Residential electric | Main panel change out with new 200 amp main and panel wi… | SOLAR / NA | SOLAR:REPLACEMENT / ELECTRICAL:REPLACEMENT |
| 42 | 32003 | plumbing single family | Galgano residence | NEW_CONSTRUCTION / NEW | PLUMBING:NA |
| 43 | 12086 | Building | Single family res-clust-zero lot-town | OTHER / NA | BUILDING:NA |
| 44 | 12086 | Building | Single family res-clust-zero lot-town | ROOFING / NA | ROOFING:NA |
| 45 | 12086 | Electrical - master tv antenna | | NEW_CONSTRUCTION / NEW | ELECTRICAL:NEW |
| 46 | 12086 | Electrical | Office buildings | ELECTRICAL / NA | ELECTRICAL:NA |
| 47 | 34003 | plumbing | Water heater | PLUMBING / NA | PLUMBING:NA |
| 48 | 51059 | Residential addition/alteration | Screen porch with shed roof on concrete slab per ffx cou… | ADDITION / NA | OTHER:NA |
| 49 | 12095 | Electrical Permit | Replace Expire Permit E05002204. For Unit 1328-A Duplex.… | ELECTRICAL / NA | ELECTRICAL:REPLACEMENT |
| 50 | 18097 | plumbing permit-non-residential-… | | PLUMBING / NA | PLUMBING:ALTERATION |
RULES_REFERENCE.html#step1 — plain-language labeling rules · steps/02_permit_taxonomy_v5.md · audits/2026-05-21_v5.3.3_classifier_audit.md · audits/2026-05-21_project_type_investigation.md · audits/_v5.3.3_audit_data/before_after_sample.md — 50 before/after examples (v4.0.1 → v5.3.3)

1.4 · Output contract DONE

Contract for the labels parquet, joined by building_permit_id: schema is building_permit_id · fips · permit_scope (list<struct<type, action>>) · spec_v · gold_vintage. Consumers derive is_roofing / roof_action from permit_scope — never re-classify. Versioned by spec_v semver + gold_vintage. Locked 2026-05-21 to the v5.3.3 artifact; provenance, cache_manifest and the partitioned Platinum-lake path are deferred to the national publish.
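A minimal sketch of the consumer-side projection described above, assuming each permit_scope item is read as a plain dict with `type` and `action` keys. The precedence used to pick one roof_action when a permit carries several ROOFING items is a hypothetical choice for illustration; the contract itself does not specify it.

```python
from typing import Dict, List, Optional

# One scope item, mirroring struct<type, action> in the labels parquet.
ScopeItem = Dict[str, str]

# Hypothetical precedence for permits with several ROOFING items (an assumption,
# not part of the contract): the most consequential action wins.
_ACTION_PRECEDENCE = [
    "REPLACEMENT", "NEW", "REPAIR", "ADDITION", "ALTERATION", "DEMOLITION", "NA",
]


def is_roofing(permit_scope: List[ScopeItem]) -> bool:
    """True when any scope item has type ROOFING."""
    return any(item["type"] == "ROOFING" for item in permit_scope)


def roof_action(permit_scope: List[ScopeItem]) -> Optional[str]:
    """Action of the ROOFING item(s), or None when the permit is not roofing."""
    actions = {item["action"] for item in permit_scope if item["type"] == "ROOFING"}
    for action in _ACTION_PRECEDENCE:
        if action in actions:
            return action
    return None


# "Tear off shingles, re-roof, install solar PV": one permit, two items.
reroof_plus_solar = [
    {"type": "ROOFING", "action": "REPLACEMENT"},
    {"type": "SOLAR", "action": "NEW"},
]
# "Install roof-mounted PV system 6.2 kW": the roof is only the location.
solar_only = [{"type": "SOLAR", "action": "NEW"}]
```

Both functions only read permit_scope, which matches the "never re-classify" rule: no description text is parsed on the consumer side.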

steps/05_labels_output_contract.md

1.5 · Canonical event date rule FROZEN

event_date = MIN(non-null status dates), capped at today + 90d to defuse future-date corruption (Pinellas anomaly). Downstream feature anchoring is T0-relative, not event-relative — see Step 3 (ADR 2026-05-21_step3_t0_anchor.md).
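The rule can be sketched as below, under a literal reading: take the earliest non-null status date, then clamp it to today + 90 days. The function name and signature are illustrative assumptions, not the code in the permit reader.

```python
from datetime import date, timedelta
from typing import Iterable, Optional


def canonical_event_date(
    status_dates: Iterable[Optional[date]],
    today: date,
    cap_days: int = 90,
) -> Optional[date]:
    """Earliest non-null status date, capped at today + cap_days."""
    candidates = [d for d in status_dates if d is not None]
    if not candidates:
        return None  # no usable status date on the permit
    cap = today + timedelta(days=cap_days)
    return min(min(candidates), cap)
```

With this reading, a permit whose only status date is far in the future (for example 2055) collapses to the cap instead of propagating the corrupt date.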

steps/06_event_date_rule.md

Combine vendor coverage with reality · per-muni training-inclusion rule

DS audit · 2026-05-21 — 3 SERIOUS findings · does not ship as written.
  • S3 · training-universe leakage. coverage_decision is one static label per tuple, computed over the full 1900→2026-03 permit history (match_rate numerator + first_permit span the whole record). steps/07 consumes Step 2 as a frozen input, so a walk-forward fold standing at T0 = 2021 trains on a universe whose INCLUDE membership was decided by 2022-2026 permits — inflates the Step 9 backtest. Fix is code-level: make coverage_decision a function of fold T0.
  • S4 · match_rate invalid as a decision metric. Numerator is pre-v5 (LABELS_SPEC_V="coverage_pipeline_pre_v5", not the frozen v5.3.3 permit_scope); the 50 % / 25 % cutoffs are self-admitted Zoom-call engineering judgment; the metric cannot separate "vendor has no coverage" from "our address join failed".
  • S5 · permanent selection-bias trap. "When in doubt EXCLUDE" → 47 % SFH EXCLUDE / 43 % FLAG / 9.6 % INCLUDE, and the 2.4 elbow that would recover geographies is circularly blocked on Steps 4-9 which only ever see the INCLUDE set. No in-pipeline path back.
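To illustrate the S3 fix, here is a hedged sketch of a match rate that only sees permits dated on or before the fold's T0. The function name, the tuple layout, and the use of property ids are assumptions for illustration; the real numerator and denominator definitions live in the coverage pipeline.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple


def match_rate_as_of(
    sfh_ids: List[str],
    roof_permits: Iterable[Tuple[str, date]],
    t0: date,
) -> Optional[float]:
    """Share of SFH with at least one matched roof permit dated <= t0."""
    if not sfh_ids:
        return None
    # Permits after T0 are invisible to a fold standing at T0.
    seen = {property_id for property_id, event_date in roof_permits if event_date <= t0}
    matched = sum(1 for property_id in sfh_ids if property_id in seen)
    return matched / len(sfh_ids)
```

A decision computed from this rate changes with T0, so a 2021 fold cannot inherit INCLUDE membership that was only earned by 2022-2026 permits.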

Also MEDIUM: Gate 2 hard-EXCLUDEs 1,524 tuples on an empty 2024 label even when the reason says "year not ingested" and earlier years = Yes; INCLUDE-set base-rate shift uncorrected; ~35 % of SFH gets a county-level decision presented as city-level.

audits/2026-05-21_step2_ds_audit.md
3-layer audit · 2026-05-21 — PASS (per layer: 4 rounds × 3 agents × 50 random cases = 600 graded; 3 layers = 1,800 graded).
  • L1 · FA Municipality cleaning — final non-LA error rate ~1.3 % (R1 5.3 % → R4 1.3 % after 10 classifier patches: bare-word BASIN/LIGHTING/IMPROVEMENT-DIST/TRANSPORTATION tokens, SERVICES? plural, BORO + (TOV) suffix strips, UN-?INCORP hyphen tolerance, HTS→HEIGHTS, digit-prefix stub, DIST. period tolerance).
  • L2 · FA↔BZ match: 0 % error across 600 graded cases.
  • L3 · coverage_decision: 0 % error across 600 graded cases (gate 3 FLAG-only verified live).

Residual ~1.3 % L1 = LA County (06037) tax-area garble (98 % of 06037 Municipality strings, no decision impact — all fall to city_under_county (CA_Los Angeles); deferred per the original brief item 4) + ~3 bare-township names (need a gazetteer to catch).

audits/2026-05-21_step2_3layer_audit_summary.md — full 4-round trace · STEP2_RUNBOOK.md — ops manual (run, audit, schemas, known limitations)

For every (FIPS, jurisdiction, fa_muni), produce coverage_decision ∈ {INCLUDE, EXCLUDE, FLAG}. Only INCLUDED tuples feed training and delivery. False negatives are the dangerous failure mode → when in doubt, EXCLUDE. Output: evidence/sources/coverage/coverage_decisions.parquet.

2.0 · First American Municipality standardization DONE 2026-05-21

The match-table spine starts from FA's Municipality field — defined by FA as the legal jurisdiction, "not necessarily the property city". Each value is sorted into 4 status buckets: city_named (resolve to a city), unincorporated (FA confirms no incorporated place → county AHJ, confident), district_code (school/fire/tax district — no AHJ signal), unknown (NULL/junk). Only city_named carries a city; the other 3 set city = NULL. NULL is not treated as unincorporated — FA has a separate explicit "UNINCORPORATED" value. Classifier scripts/roofing/classify_fa_municipality.py built + audited on the local layer (72.76M parcels, 1,420 FIPS, no AWS) over 4 cycles × 4 AI agents (85 % → 87 % → 92 % → 93 % → ~96 %). Distribution (non-NULL strings): city_named 89.3 % · unincorporated 7.7 % · district_code 2.0 % · unknown 1.0 %.

RULES_REFERENCE.html#step2 — plain-language coverage rules (FA & BZ municipality → canonical, the 4 gates) · data/fa_municipality_dictionary.md · audits/2026-05-21_fa_municipality_4bucket_audit.md

2.0b · BuildZoom jurisdiction → canonical DONE 2026-05-21

The BZ side of the two-sided standardization. BuildZoom names a jurisdiction STATE_County_City / STATE_County; the provider coverage CSV and the permit feed share this vocabulary (2,135 of ~2,300 strings overlap exactly), so one normalizer serves both. scripts/roofing/normalize_bz_jurisdiction.py parses each of the 2,497 distinct strings → canonical key (state, county_fips, canonical_place, place_type) — the same shape the FA side emits, so the two can be matched. county→FIPS uses the complete Census 2024 counties gazetteer (3,222 counties; county_master only held our 1,419-county set). Fixes: CT legacy counties (CT dropped counties for planning regions in 2022), diacritic folding, NYC/GA/MD/DC aliases. Result: county→FIPS 99.7 % (0 unmatched), status resolved 80.0 % / county_level 19.7 % / malformed 0.3 %. Audited 4 cycles × 4 AI agents = 404 row-checks, 100 % all 4 cycles.
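A minimal sketch of the string-parsing half of that normalizer, assuming names never contain an underscore inside a component and that "resolved" means a city component is present. It stops at the county name; the county→FIPS lookup against the Census gazetteer and the alias fixes are not reproduced here. All names below are illustrative, not the API of normalize_bz_jurisdiction.py.

```python
import unicodedata
from typing import NamedTuple, Optional


class ParsedJurisdiction(NamedTuple):
    state: Optional[str]
    county: Optional[str]
    place: Optional[str]
    status: str  # "resolved" | "county_level" | "malformed"


def fold(text: str) -> str:
    """Strip diacritics, collapse whitespace, upper-case."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.upper().split())


def parse_bz_jurisdiction(raw: str) -> ParsedJurisdiction:
    """Parse STATE_County_City or STATE_County into its components."""
    parts = [p.strip() for p in raw.split("_")]
    well_formed = len(parts) in (2, 3) and all(parts) and len(parts[0]) == 2
    if not well_formed:
        return ParsedJurisdiction(None, None, None, "malformed")
    state, county = parts[0].upper(), fold(parts[1])
    if len(parts) == 2:
        return ParsedJurisdiction(state, county, None, "county_level")
    return ParsedJurisdiction(state, county, fold(parts[2]), "resolved")
```

Folding both sides with the same function is what lets the FA and BZ keys compare equal despite accents or casing differences.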

audits/2026-05-21_bz_canonical_audit.md

2.0c · FA ↔ BZ canonical match DONE 2026-05-21

Joins the two standardized sides on the shared canonical key (county_fips, canonical_place): scripts/roofing/match_fa_bz.py, 32,179 FA (fips, Municipality) rows. Outcome per FA municipality: city_matched (FA city ↔ BZ city jurisdiction — 31.1 % of SFH), city_under_county (FA city, BZ covers the county not the city — 22.6 %), county_matched (FA unincorporated/district/unknown ↔ BZ county jurisdiction — 12.9 %), no_bz (no BZ jurisdiction for the county — 33.3 %). 66.7 % of FA SFH falls under a BuildZoom jurisdiction. This match table supersedes the v2.7 one-sided match_table_v2 as the coverage spine: build_match_table_v3.py materializes it in the v2-compatible schema (enriched with measured sfh_with_permit + provider labels) and build_coverage_decisions.py now consumes match_table_v3. Known gap: FIPS 48113 (Dallas) silver is a dangling symlink — excluded until the silver layer is repaired.
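The four outcomes can be sketched as a small decision function. This is an illustration under one simplifying assumption: any BZ jurisdiction in a county counts as BZ covering that county. The argument names and the two lookup sets are hypothetical, not the interface of match_fa_bz.py.

```python
from typing import Optional, Set, Tuple


def match_outcome(
    fa_status: str,                     # city_named | unincorporated | district_code | unknown
    county_fips: str,
    canonical_place: Optional[str],     # None unless fa_status == "city_named"
    bz_city_keys: Set[Tuple[str, str]], # (county_fips, canonical_place) with a BZ city jurisdiction
    bz_counties: Set[str],              # county_fips with any BZ jurisdiction (assumption)
) -> str:
    """Classify one FA (fips, Municipality) row against the BZ canonical keys."""
    if county_fips not in bz_counties:
        return "no_bz"
    if fa_status == "city_named":
        if (county_fips, canonical_place) in bz_city_keys:
            return "city_matched"
        return "city_under_county"
    # unincorporated / district_code / unknown carry no city, so they match at county level
    return "county_matched"
```

Because the checks run in a fixed order and each row returns exactly once, the four outcomes are mutually exclusive and exhaustive in this sketch.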

audits/2026-05-21_fa_bz_match.md

2.1 · Vendor coverage labels INGESTED

BuildZoom publishes per-(jurisdiction × year) labels in {Yes, Some, None, empty} — what the provider claims, not what we measure. Join key is COLLECTION_POINT_3PART (not "jurisdiction"). Weekly call with BZ counterpart pending Camilo intro.

steps/03_geographic_coverage.md

2.2 · Our match-rate diagnostic SHIPPED NATIONALLY

Per-FIPS pure match rate (excluding unit-numbered SFH from the denominator). National avg = 14.98 % (~15 %). Shipped 2026-05-18 to 8020roof-coverage.web.app. Will refresh once Step 1.1 national rerun lands. Honest caveat (2026-05-22): the numerator is the all-time SFH-with-≥1-matched-roof-permit count under the pre-v5 roof filter (labels_spec_v=coverage_pipeline_pre_v5, DS audit S4), with no date cap — so the Pinellas future-dated permits (last_permit 2055/2060/2066) and decades-old permits count equally. It is an optimistic upper bound, not a current-state visibility figure. FL averages 62.2 % (54 fips); rest-of-US 10.7 %. The model-relevant coverage (clean v5.3.3 roof event in a usable recent window) is lower and needs the v5.3.3 + event_date-cap national rerun (blocked AWS).

steps/03_geographic_coverage.md

2.3 · Per-tuple inclusion decision tree DONE 2026-05-21

Four gates, first-fail-decides: (1) tuple in BZ provider list, (2) provider label ∈ {Yes, Some}, (3) match_rate (FLAG-only): ≥ 50 % → INCLUDE, else FLAG (gate 3 never hard-EXCLUDEs: match_rate can't tell "no vendor coverage" from "our address join failed" — DS audit S4), (4) sub-muni veto. Implemented in scripts/roofing/build_coverage_decisions.py over all 32,179 (fips, jurisdiction, fa_muni) tuples. Gate 4: 86 confirmed non-AHJ special-district vetoes (USD / FD / SD# / MSTU / CDD / EMS / water / road / ambulance / metro district) → coverage_rules/submuni_veto_list.csv. Thresholds are documented defaults pending the 2.4 elbow.
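The gate logic above can be sketched as follows. It assumes gates 1, 2 and 4 hard-EXCLUDE while gate 3 only chooses between INCLUDE and FLAG; the reason strings and argument names are hypothetical, and a missing match_rate is treated as FLAG.

```python
from typing import Optional, Tuple


def coverage_decision(
    in_provider_list: bool,
    provider_label: Optional[str],
    match_rate: Optional[float],
    is_submuni_veto: bool,
    include_threshold: float = 0.50,
) -> Tuple[str, str]:
    """Return (coverage_decision, coverage_decision_reason) for one tuple."""
    if not in_provider_list:
        return "EXCLUDE", "gate1_not_in_provider_list"
    if provider_label not in {"Yes", "Some"}:
        return "EXCLUDE", "gate2_provider_label"
    if is_submuni_veto:
        # Gate 3 can never fail, so the veto is the only remaining hard exclusion.
        return "EXCLUDE", "gate4_submuni_veto"
    if match_rate is not None and match_rate >= include_threshold:
        return "INCLUDE", "gate3_match_rate_ok"
    return "FLAG", "gate3_match_rate_low_or_missing"
```

Checking the veto before the match-rate branch gives the same decision as checking it afterwards, since a veto excludes regardless of the gate 3 result.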

coverage_rules/coverage_inclusion_rules.md · audits/2026-05-21_coverage_decisions.md · audits/2026-05-21_gate4_veto_before_after.md — 81 gate-4 vetoes, before/after

2.4 · Municipality-level training-inclusion rule SPEC

"Only those that pass the test are used for training" — per-FA-muni threshold over a trailing-5-year window, using roof_sub_class ∈ {REPLACEMENT, AMBIGUOUS}. Default threshold ≥ 25 % until backtest sweep argues otherwise (elbow on held-out recall).

steps/03_geographic_coverage.md

2.5 · Output contract DONE 2026-05-21

Materialized to evidence/sources/coverage/coverage_decisions.parquet — one row per (fips, canonical_jurisdiction, fa_muni), 32,179 rows. Columns: coverage_decision, coverage_decision_reason, match_rate, muni_match_rate, provider labels, provider_first_year, back-pointers labels_spec_v + gold_vintage, last_evaluated_at. Materialized run (post DS-audit + 3-layer-audit patches): 1,006 INCLUDE / 9,445 FLAG / 21,728 EXCLUDE tuples (9.6 % / 46.4 % / 44.0 % of SFH). Covers 1,420 of 1,421 FIPS — 48113 Dallas excluded (dangling FA silver symlink, needs an AWS re-pull). A standing coverage_recovery_queue.csv (275 jurisdictions / 9.20 M SFH) lists geographies whose measured evidence contradicts a conservative decision — the non-circular input to the 2.4 elbow (S5). Dashboard wiring of coverage_decision_reason still pending.

steps/03_geographic_coverage.md
Hard rule. False negatives are the dangerous failure. If a muni's match rate sits at 24 % and we can't tell whether BZ doesn't cover it or our matching is off, we EXCLUDE. We can add back later when backtest evidence supports it; we cannot undo a client trust hit.

As-of enrichment — property state at T0 DONE (3 counties) 2026-05-21

Step 3 COMPLETE for the 3-county engagement (Hernando/Pasco/Pinellas), 2026-05-21. As-of T0 enrichment over 497,856 coverage-INCLUDE SFH × 7 fold anchors (2022-05 → 2025-05), leakage-clean (permit features filtered event_date ≤ T0 with 0-violation assertion; behavioral recency features 0 negative = no future dates). 57-column enriched_full.parquet per anchor. Two halves:
  • Local (build_enrichment.py) — FA physical (year_built, property_age@T0, lot/living/building sqft) + prior-permit-history roofing core (roof_age_months, n_wholeroof≤T0, n_roof_repair_24m, n_hvac/solar). 70-76 % carry a prior whole-roof permit.
  • Behavioral (build_enrichment_behavioral.py) — unblocked by the silver-data-lake IAM grant; reuses the FA REM feature vocabulary (the data that fed Apollo, NOT the Apollo model). 23 distress flags + count, owner/occupancy (absentee/high_equity/vacant/is_listed), point-in-time valuation (AssdValue/MarketValue/CurrentAVM), leverage (CLTV/LTV/DaysOwnership), recency (months_since_prev_sale, mortgage_age — §7 audit-pending). REM-matched 96-99 %, any-distress 6-10 %, median AVM $396K.
Remaining family: hurricane/storm (Tier F', NOAA — separate pull); the roof_age × months_since_hurricane interaction.

audits/2026-05-21_enrichment.md (local) · audits/2026-05-21_enrichment_behavioral.md (REM)

The training unit is the pair (property_id, T0) — not "a property with an event". Features anchor on T0 (the prediction moment) plus trend anchors T0−3 / T0−6; every feature value must be knowable at T0. The permit event never sets the feature anchor — it is only read forward from T0 to set the label (Step 4): y = 1 iff a qualifying roof event in (T0, T0+6]. As-of join takes the latest FA / Silver REM snapshot ≤ each anchor.
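A small sketch of the two mechanics in the paragraph above: the forward-looking label and the as-of join. It assumes months are represented as first-of-month dates and snapshots arrive sorted by date; the helper names are illustrative.

```python
from bisect import bisect_right
from datetime import date
from typing import Any, List, Optional, Sequence, Tuple


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def label(t0: date, roof_event_months: Sequence[date], horizon: int = 6) -> int:
    """y = 1 iff a qualifying roof event falls in (T0, T0 + horizon]."""
    end = add_months(t0, horizon)
    return int(any(t0 < month <= end for month in roof_event_months))


def as_of(snapshots: List[Tuple[date, Any]], anchor: date) -> Optional[Any]:
    """Value of the latest snapshot dated <= anchor (snapshots sorted by date)."""
    dates = [snapshot_date for snapshot_date, _ in snapshots]
    i = bisect_right(dates, anchor)
    return snapshots[i - 1][1] if i else None
```

The interval is open at T0, so an event in the T0 month itself does not make the row positive; it is history, not outcome.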

Leakage rule. "Past data" is not automatically safe. A feature is leaky if dated after that row's T0, even if still before today. The earlier "T-1 too close" framing was wrong: a refi one month pre-permit is a real, serve-time-available signal, not leakage — true leakage is a post-T0 feature or an FA snapshot vintage backfill. No artificial freshest-month exclusion.

Repair vs replacement. Different drivers — handled distinctly. Positive label = roof replacement (permit_action ∈ {REPLACEMENT, AMBIGUOUS}); REPAIR-only permits do not count as positives (they dilute the target). Prior repair history (n_roof_REPAIR_24m) is carried as a feature — a roof patched twice is near end of life.

Hurricane signal. Primary exogenous driver — storm → insurance claim → reroof, lagged months. Carry months_since_hurricane / in_hurricane_corridor; the lift is in the roof_age × months_since_hurricane interaction (seeded in Step 5).

Feature source. Internal feature families — property physical (incl. year built, roof age), distress, equity, owner state, sale recency, tenure — are reused from the Apollo macro-model feature builder. Shared dependency is the feature pipeline, NOT the Alpha model (roofing does not compete against Alpha). Reused features re-anchor to the roofing T0 and inherit the CLAUDE.md §7 leakage audit.

Decisions to lock: anchor set ({T0, T0−3, T0−6} · add T0−12?) · FA snapshot vintage audit (point-in-time integrity — gates whether near-T0 features need a guard) · negatives matched on month + FIPS, K-ratio (Step 4 cross-link).

steps/12_as_of_enrichment.md · decisions/2026-05-21_step3_t0_anchor.md

Walk-forward folds & negative sampling PARAMS LOCKED 2026-05-21

Locked: horizon H = 6 months (label window (T0, T0+6]) · embargo 6 months = H (a shorter embargo leaves adjacent folds' label windows overlapping) · fold cadence 6 months (eval windows disjoint → honest variance estimate).

Fold layout — backward from the latest observation. latest_obs = most recent month with complete data. Fold x eval anchor E_x = latest_obs − 6·x; eval window (E_x, E_x+6]; fold x trains on all rows with T0_train ≤ E_x − 6. The first validatable fold is x=1 (latest_obs − 6) — latest_obs itself is the production scoring point, not a fold. Run as many folds as history allows (~6-8); fold x=1 is the most production-like. e.g. latest_obs = May 2026 → fold 1 eval (Nov 2025, May 2026], fold 2 (May 2025, Nov 2025], …
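The layout above can be sketched as a small generator. It assumes months are first-of-month dates; the function and key names are illustrative, not those of the fold builder.

```python
from datetime import date
from typing import Dict, List


def add_months(d: date, months: int) -> date:
    """Shift a first-of-month date by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def fold_layout(
    latest_obs: date,
    n_folds: int,
    horizon: int = 6,
    embargo: int = 6,
    cadence: int = 6,
) -> List[Dict[str, object]]:
    """Build folds backward from latest_obs; fold 1 is the most recent."""
    folds = []
    for x in range(1, n_folds + 1):
        e_x = add_months(latest_obs, -cadence * x)
        folds.append({
            "fold": x,
            "eval_open": e_x,                            # exclusive bound of (E_x, E_x + H]
            "eval_close": add_months(e_x, horizon),      # inclusive bound
            "train_t0_max": add_months(e_x, -embargo),   # train on T0_train <= E_x - embargo
        })
    return folds
```

With latest_obs = May 2026 this reproduces the example in the text: fold 1 evaluates (Nov 2025, May 2026] and trains on T0 up to May 2025.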

Negatives. Case-control: per positive, K negatives at the same FIPS + T0, from the coverage-INCLUDE SFH universe. Positive label = roof replacement (permit_action ∈ {REPLACEMENT, AMBIGUOUS}); a property supplies positive and negative rows at different T0s (discrete-time panel). K pending base-rate measurement.
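A hedged sketch of the case-control draw: for each (FIPS, T0) cell, sample up to K negatives per positive, without replacement, from candidate rows in the same cell. Rows are (property_id, fips, t0) tuples here, and the candidate list is assumed to be already restricted to the coverage-INCLUDE SFH universe; both are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Row = Tuple[str, str, str]  # (property_id, fips, t0)


def sample_negatives(positives: List[Row], candidates: List[Row], k: int, seed: int = 0) -> List[Row]:
    """Draw up to k negatives per positive from the same (fips, t0) cell."""
    rng = random.Random(seed)
    positive_rows = set(positives)
    positives_per_cell = Counter((fips, t0) for _, fips, t0 in positives)

    pool = defaultdict(list)
    for row in candidates:
        if row not in positive_rows:  # a positive row cannot be its own negative
            pool[(row[1], row[2])].append(row)

    negatives: List[Row] = []
    for cell, n_pos in sorted(positives_per_cell.items()):
        available = pool.get(cell, [])
        negatives.extend(rng.sample(available, min(k * n_pos, len(available))))
    return negatives
```

The exclusion is per row, not per property, so the same property can still appear as a negative at a different T0, as the discrete-time panel requires.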

Held-out final test stays carved out. The roofing model is judged on a walk-forward backtest (Step 9): lift over a random list of equal size. Baseline = the random list; roofing does not compete against Alpha (the "March 2025 vs Alpha" gate in CLAUDE.md is the macro sales model). Dev folds must not touch the backtest window's cells ± embargo. Open: pick a window old enough that its permit data is fully settled.

Decisions to lock: K negatives per positive (measure base rate first) · backtest window — old enough for settled permit data · positives-per-(FIPS, fold) count feeding the Step 6 per-FIPS-vs-global call.

steps/07_walk_forward_folds.md

Synthetic features · Internal × External SPEC · audited 2026-05-21

Feature columns on the enriched (property_id, T0) anchors from Step 3. Internal = property + owner (property physical incl. year built / roof age, distress, equity, tenure, sale recency, prior-permit history) — reused from the Apollo feature builder, not the Alpha model. External = market + environment.

Feature selection — expert pre-filter, then let the model judge. ~320 candidate columns need disciplined selection. Three steps: (1) expert pre-filter — drop obvious non-signal + no-variability columns, high confidence only; keep the "plausible but I doubt it" bucket. (2) quick model + permutation importance — multivariate, captures interactions; do NOT filter on univariate correlation, which misses roof_age × hurricane-type effects; drop importance ≈ 0. (3) train on the survivors. The distress family (~276 columns) is the main pre-filter target — collapse to a count + one trend + top-N flags. Drop the 2nd-diff for the first model.

External features split by whether they vary within (FIPS, T0). Step 4 matches negatives in the same FIPS + T0, so any feature constant across that cell — county macro (FRED rates, HPI, unemployment), a county-level hurricane flag — is identical for the positive and its negatives; the case-control sampling removes its signal entirely. Those do NOT go in the per-FIPS model — they belong at calibration / threshold tuning (Step 7 / 9). Useful External = features that vary within the cell: sub-county neighbour permit density, parcel-level storm exposure.
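One way to screen External features against this rule is to check whether a column takes more than one value inside any (FIPS, T0) cell. The sketch below assumes rows are dicts with `fips` and `t0` keys; the feature names in the example are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def varies_within_cell(rows: List[Dict[str, Any]], feature: str) -> bool:
    """True if the feature takes more than one value in at least one (fips, t0) cell."""
    values_by_cell = defaultdict(set)
    for row in rows:
        values_by_cell[(row["fips"], row["t0"])].add(row[feature])
    return any(len(values) > 1 for values in values_by_cell.values())


rows = [
    {"fips": "12103", "t0": "2024-05", "mortgage_rate": 6.9, "parcel_peak_wind": 71.0},
    {"fips": "12103", "t0": "2024-05", "mortgage_rate": 6.9, "parcel_peak_wind": 48.0},
    {"fips": "12101", "t0": "2024-05", "mortgage_rate": 6.9, "parcel_peak_wind": 55.0},
]
```

In this toy data `mortgage_rate` is constant inside every cell and would be routed to calibration, while `parcel_peak_wind` varies within a cell and stays a model feature.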

Hurricane must be parcel-level — peak wind / distance to track at the parcel, not a county-month flag (a county flag is wasted under T0-matched case-control). Neighbour permit density is a leakage trap — the rolling window must end at T0 and exclude the subject property.

Interactions — let the model do it. Gradient-boosted trees learn interactions automatically; hand-coding a boolean AND is redundant. Skip hand-coded interactions for the first model; reserve hand-coding for ratios.

Decisions to lock: feature budget number (after Step 4 reports positives-per-FIPS) · distress collapse set · insurance / claim climate source — sub-county grain, highest-value gap · FA ROOF_COVER availability for roof material.

steps/08_synthetic_features.md

Model training SPEC · audited 2026-05-21

One global model, not per-FIPS by default. Roof replacement is a rare event — a per-FIPS model on a small county (~100-300 positives/fold) re-learns the basics from thin data, while a global model learns them once from millions of rows. Default = one global gradient-boosted model with FIPS + region as features (the tree carves FIPS-specific behaviour where data supports it, pools where it does not). A dedicated per-FIPS model is a carve-out, justified only on evidence — not tied to the 5 gold-tier dev FIPS.

Algorithm. LightGBM (native categorical for FIPS) — confirm against HistGB / logistic on a roofing-specific arch sweep. The prior arch matrix was the sales model, not roofing.

Metrics — on the true base rate, not the case-control eval. The Step 4 eval is 1:K downsampled; AUC-PR read off it is inflated and top-decile recall is optimistic. Primary metrics must be computed on a holdout with the true population base rate. Success = lift over a random list of equal size on a walk-forward backtest (Step 9) — roofing does not compete against Alpha, the REI sales heuristic.

No re-weighting. The training table arrives already downsampled 1:K from Step 4 — Step 6 does not re-sample or re-weight; prior correction is Step 7's job. Tune hyperparameters once on pooled data and reuse — not a fresh 50-trial search per FIPS (overfits thin data). Early-stopping validation is carved from train, never the eval window.

Decisions to lock: global+FIPS-feature vs per-FIPS carve-out (+ row-count threshold) · algorithm (after a roofing arch sweep) · true-prior holdout for metric reporting · backtest window for the held-out evaluation (Step 9).

steps/09_model_training.md

Calibration — turn rankers into forecasters SPEC · audited 2026-05-21

Map raw model scores to probabilities that mean what they claim. Critical — calibrate against the true base rate, not the case-control prior. The model is trained on the Step 4 table downsampled 1:K, so its scores reflect a ~1 % sample prior, not the real (rare) population rate. Calibrating against a slice of that 1:K pool yields probabilities wrong by 1-2 orders of magnitude ("23 %" when the truth is ~0.3 %). Fix: calibrate on a true-prior hold-out, or apply an explicit log-odds prior correction. This is the prior-correction Step 4 and Step 6 defer here — measure the true base rate, do not assume it.
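The explicit log-odds prior correction mentioned above can be sketched as follows. It is the standard shift for case-control sampling, valid when rows were sampled on the label only; the function name and the example priors are illustrative, and the true base rate must be measured rather than taken from this sketch.

```python
import math


def correct_prior(p_sample: float, sample_prior: float, true_prior: float) -> float:
    """Move a probability from the sampled prior to the true prior in log-odds space."""
    for value in (p_sample, sample_prior, true_prior):
        if not 0.0 < value < 1.0:
            raise ValueError("probabilities must lie strictly between 0 and 1")

    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    shifted = logit(p_sample) - logit(sample_prior) + logit(true_prior)
    return 1.0 / (1.0 + math.exp(-shifted))
```

The correction is monotonic, so it changes the probabilities without changing the ranking; a score equal to the sample prior maps exactly onto the true prior.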

Method + scope. Platt (sigmoid) default — stable on thin data; isotonic is data-hungry, use only where the calibration sample is large. Calibrate globally (matches the global model from Step 6) — per-FIPS isotonic on ~1,421 counties starves on data; split by region only if markets genuinely differ.

Acceptance — measured on the true-prior population: calibration error within a relative tolerance (absolute percentage points are the wrong frame for a rare event), reliability diagram monotonic, Brier improves. A region that fails downgrades to "ranker only — no probability claim" (score still ranks; no fake probability shown). The ±15 % in the legacy sketch is the sales model's number — set roofing's own.

Decisions to lock: measured true base rate (shared with Step 4) · roofing tolerance (value + relative vs absolute) · scope global vs per-region · method.

steps/10_calibration.md

Output filter / sanity GATE SPEC · audited 2026-05-21

The 25K-list incident (finding 67) is fixed here, at the gate, not in the label. After calibration, apply hard rules: recent-roof exclusion (last_ROOFING_age < 15 yr), VENDOR_BLIND (property outside an INCLUDE tuple — Step 2), dedup (one mailpiece per household), confidence (combines coverage match-rate + calibration).

The gate is also a model-health detector — by design. The 15-year exclusion is intentional: if the model scores a recently re-roofed property high, the model is probably wrong. The gate catches it before the client sees it — and the catch itself is the signal. So the gate's firing rate is a monitored metric (Step 9), not silent plumbing — a rising count of high-scored recent roofs means fix the model, not the gate. A tighter very-recent band (~2-3 yr + high score) is the sharpest model-error tell, tracked separately.

Never overwrite the score. calibrated_probability is kept intact for every row; the gate only sets delivery_eligible + delivery_reason_excluded. That is what lets the recent-roof catch tell us the model is wrong — "model said 0.8, gate excluded for RECENT_ROOF". Overwriting the score to 0 / NULL would erase that evidence.
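A minimal sketch of a gate that annotates rows instead of rewriting them. Only the recent-roof and VENDOR_BLIND rules are shown; dedup and confidence are omitted. The `in_include_tuple` key, the assumption that `last_ROOFING_age` is in years, and the rule precedence are all illustrative.

```python
from typing import Any, Dict


def apply_gate(row: Dict[str, Any], recent_roof_years: float = 15.0) -> Dict[str, Any]:
    """Return a copy of the row with delivery flags set; the score is never modified."""
    out = dict(row)  # calibrated_probability is carried through untouched
    reason = None
    roof_age = row.get("last_ROOFING_age")
    if roof_age is not None and roof_age < recent_roof_years:
        reason = "RECENT_ROOF"
    elif not row["in_include_tuple"]:
        reason = "VENDOR_BLIND"
    out["delivery_eligible"] = reason is None
    out["delivery_reason_excluded"] = reason
    return out


scored = {"calibrated_probability": 0.8, "last_ROOFING_age": 3.0, "in_include_tuple": True}
gated = apply_gate(scored)
```

Here `gated` keeps the 0.8 score next to the RECENT_ROOF reason, which is the "model said 0.8, gate excluded" evidence the monitoring step needs.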

Decisions to lock: very-recent alarm band threshold + high-score cutoff · dedup unit (household, not mailing address — handle multi-property owners) · 3-level confidence definition · soft vs hard filter at the 15-yr boundary.

steps/11_output_filter.md

Deployment + monitoring SPEC · audited 2026-05-21

Value demonstration — the walk-forward backtest. The client-facing proof: stand at a past T0, train on ≤ T0, generate the list we would have delivered, then check the (T0, T0+6] outcome. "If you'd been a client in January, here's the list — of all the roofs Jan–Jun, it caught X %." Honest only with strict T0 discipline and a backtest window old enough that permit data is fully settled (permits lag — a too-recent window undercounts).

The value metric. Recall (capture rate) at a fixed list size N — recall alone is gamed by enlarging the list. The honest headline is lift over a random list of equal size: "our 15 000-property list caught 85 % of the roofs — 11× better than random." Show precision@N too (the client works the list) and a gains curve so the client picks N to capacity. Baseline = the random list — this closes the held-out-test open item; roofing does not compete against Alpha.
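The three headline numbers can be computed from one ranked list. This sketch assumes binary labels on the full scored population and defines lift as precision@N divided by the population base rate, which equals the recall of the list divided by the recall a random list of the same size would expect.

```python
from typing import Dict, Sequence


def list_metrics(scores: Sequence[float], labels: Sequence[int], n: int) -> Dict[str, float]:
    """Recall@N, precision@N and lift over a random list of equal size."""
    total_pos = sum(labels)
    if not 0 < n <= len(labels) or total_pos == 0:
        raise ValueError("need 0 < n <= population size and at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    caught = sum(label for _, label in ranked[:n])
    precision = caught / n
    base_rate = total_pos / len(labels)  # expected precision of a random list
    return {
        "recall": caught / total_pos,
        "precision": precision,
        "lift": precision / base_rate,
    }
```

Sweeping `n` over a range of list sizes yields the gains curve the client can use to pick N to capacity.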

Ground truth is the roof permit, not the CRM. Measure recall / precision against new permit data (the label's own source). The CRM (Clients_Deals_RBB_V1) measures business conversion — a narrower, different outcome; track it separately. Avoid the feedback loop: score the full population each cycle, not just the delivered list, or keep a randomised control.

Monitoring — fast leading signals vs slow ground truth. Fast/daily: input-feature drift (PSI on features, not just score), score drift, the Step 8 gate firing rate. Slow/lagging: recall / precision vs permits — resolves ~6 months late, so it is confirmation, not a weekly trigger. Thresholds tune per-region on the calibrated probability (calibration already fixed score incomparability — no per-FIPS needed). Hybrid retraining: 90-day floor + fast-signal ceiling.
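For the fast drift signal, a population stability index over fixed bins can be sketched as below. The bin edges would normally come from the reference (training-time) distribution; the edges, the epsilon floor for empty bins, and the function name are assumptions for illustration.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population stability index: sum over bins of (a - e) * ln(a / e)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v > edge for edge in edges)] += 1  # index of the bin holding v
        return [max(c / len(values), eps) for c in counts]  # floor avoids log(0)

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Every term is non-negative, so the index is zero only when the binned distributions agree; running it per feature as well as on the score covers the "PSI on features, not just score" point.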

Decisions to lock: backtest window (settled permit data) · debias method (full-population scoring vs randomised control) · alert routing / on-call.

steps/13_deploy_monitor.md

Status summary

| Step | Sub-step | Status | Blocker |
| STEP 1 · LABELING | 1.1.1 Schema dump (54 cols · 31 FIPS) | DONE | |
| | 1.1.2 Null + cardinality | DONE | 1.1.1 |
| | 1.1.3 Role assignment | DONE | 1.1.2 |
| | 1.2 Permit Gold taxonomy — permit_scope classifier v5.3.3 | DONE | 1.1.3 |
| | 1.4 Labels parquet output contract | DONE | 1.2 validation |
| | 1.5 Canonical event date rule | FROZEN | |
| STEP 2 · COVERAGE | 2.1 Vendor CSV ingested | DONE | |
| | 2.2 Match-rate diagnostic (national) | DONE | Refresh after 1.1 rerun |
| | 2.3 Decision tree (gates 1-4) + 2.5 output contract | DONE | coverage_decisions.parquet materialized |
| | 2.4 Muni-level threshold elbow | TODO · binding | Blocked: needs backtest → Steps 4-9 |
| STEP 3 · ENRICHMENT | As-of T0 anchor + FA / Silver REM joins | SPEC · design locked 2026-05-21 | Impl: Steps 1 + 2 frozen |
| STEP 4 · FOLDS | Walk-forward folds (H=6m, embargo=6m, cadence=6m) + negatives | SPEC · params locked 2026-05-21 | Impl: Step 3 frozen |
| STEP 5 · FEATURES | Internal × External · feature budget · case-control-aware External split | SPEC · audited 2026-05-21 | Impl: Step 4 frozen |
| STEP 6 · TRAINING | Global GBM + FIPS feature · true-prior metrics | SPEC · audited 2026-05-21 | Impl: Step 5 frozen |
| STEP 7 · CALIBRATION | Global calibrator · true-prior correction | SPEC · audited 2026-05-21 | Impl: Step 6 frozen |
| STEP 8 · OUTPUT GATE | Recent-roof gate (model-health detector) · VENDOR_BLIND · dedup | SPEC · audited 2026-05-21 | Impl: Step 7 frozen |
| STEP 9 · DEPLOY + MONITOR | Walk-forward backtest (lift vs random) · permit-truth monitoring · retraining cadence | SPEC · audited 2026-05-21 | Impl: Step 8 frozen |

Next-action queue

DS audit follow-ups (2026-05-21) — 5 SERIOUS findings, all APPLIED

From the Step 1 + Step 2 audits. Detail: step1 audit · step2 audit.

  1. DONE S1 · independent recall check. measure_recall_independent.py — recall vs BuildZoom PROJECT_TYPE = 96.3 % (was a circular "≈ 99 %"). artifact. Remaining: hand-grade the ~198 K FN cell + a v5-not-ROOFING random sample (true ground truth).
  2. DONE S2 · event_date relocated. steps/06 rewritten — the rule lives in the Step 3/4 permit reader, read forward from T0 only; legacy roofing_label.py marked legacy.
  3. DONE S3 · T0-relative coverage — spec'd. coverage_inclusion_rules.md / steps/03 / steps/07 now mandate coverage_decision(tuple, T0). Remaining: the per-fold-anchor materialization is code, to land before Step 4 consumes the parquet.
  4. DONE S4 · match_rate fixed. Gate 3 is now FLAG-only (never hard-EXCLUDE) — code shipped + re-run. Remaining: the v5.3.3 numerator refresh needs the Step 1.1 national rerun (AWS — REGLA DE ORO).
  5. DONE S5 · selection-bias trap broken. coverage_recovery_queue.csv emitted every run (274 jur / 9.17 M SFH); the 2.4 backtest re-scoped to evaluate the full SFH population, not INCLUDE-only.

Next week

  1. Design the backtest threshold sweep (held-out period, success metric).
  2. BZ counterpart weekly call (Camilo intro pending) — clarify "Some" semantics.
  3. Wire coverage_decision_reason through to evidence dashboard muni profile.
  4. Spec build_label_universe.py + build_delivery_list.py consumers.

Blocked / pending external

  1. 1.1 national rerun — AWS confirm (REGLA DE ORO) + Otata scaling.
  2. 2.1 Camilo → BZ counterpart intro (BZ contact on vacation this week).
  3. Refund proposal numbers — pending coverage decision retro-applied to prior 25K list.

Champion audit · v22 / V22.1

The 9 steps above are the rebuild spec (implementation not started). This section is about the incumbent the rebuild will replace: v22_super_fl7_v3, the shipped production champion behind the CallZeke lists (180 features, FL-7, OO-strict + Individual-only). In May 2026 it was audited against the 7-failure-families roof-pipeline-audit doctrine: leakage, case-control prior, p≫n explosion, metric validity, selection-bias feedback, scope confusion. The audit produced seven parallel findings (F1-F7). Bottom line: the champion is sound, but one serious leakage risk (F3) is unresolved and blocks the one validated upgrade (V22.1) from shipping.

Detail: v2_model/README.md · V22.1_DEPLOYMENT.md (decision matrix) · V2_MULTISEED.md (3-way 3-seed) · F5_RESULTS.md

Seven-finding verdict

# | Finding | Verdict | What it means
F1 | roof_age_months_est leakage (#1 feature, 829k gain) | CLEAN | Computable from T0 snapshot alone. No fix needed.
F2 | area_* neighborhood-window features | CLEAN | Windows respect T0 boundary. No fix needed.
F3 | FA snapshot vintage (Real_Estate_Master_V1 partition semantics) | SERIOUS · BLOCKER | 10-40 mo forward-leak risk on 11 features incl. roof_cover_code (#2). Awaiting data-team confirmation of partition vintage. Blocks V22.1 promotion + F5 replacements.
F4 | canceled-roof family (row support) | DENSIFY | Replaced 3 sparse cols with dense canceled_roof_recent_12m bool. Lift-neutral (doesn't crack top-40). Shipped.
F5 | fips + nearest_storm_name memorization | PARTIAL · KEEP | Dropping both tightens cross-fold variance −13.8% but costs −1.6% mean lift. They encode time-varying real signal (hurricane corridors, county insurance regime), not pure memorization. Keep; replace with continuous (county_base_rate_36m, drop storm name keep physics).
F6 | roof-age redundancy (r=1.0 between #1 and #3) | RETRACTED | Theory correct (correlation real) but dropping the 5 "redundant" cols cost 2-4% lift on multi-seed: LightGBM used them as marginal split candidates. Do not drop by correlation alone.
F7 | insurance-pressure design (renewal cycle) | VALIDATED | F7-b roof_age_at_purchase_anniv_y12 ranks top-5 every seed; recovers ~0.7-0.8 lift pts. Ships as the V22.1 add. F7-a falsified, F7-c deferred.

V22.1 = v22 + F7-b · 6-fold cross-val (FL-7 OO-strict, eval anchors 2023-05 → 2025-10)

Statistical tie with a structural era-split. Mean lift@15K is 8.40× (V22.1) vs 8.48× (v22), within seed noise. But V22.1 (the v22 + F7-b hybrid) wins all 3 older folds, while v22 wins all 3 recent folds. V22.1 carries a 14% tighter cross-fold spread (stdev 1.024 vs 1.184), which means more predictable list quality. The recent-fold loss is the F3 leak surfacing: F7-b reads CurrentSaleRecordingDate, whose FA-commit-vs-T0 gap widens on recent anchors.

Eval T0 | v22 L@15K | V22.1 L@15K | Δ | Era
2025-10-31 (prod eval) | 9.68× | 9.51× | −1.7% | Recent · v22 +2.7%
2025-05-31 | 9.96× | 9.70× | −2.6% | Recent
2024-11-30 | 8.75× | 8.41× | −3.8% | Recent
2024-05-31 | 7.00× | 7.16× | +2.3% | Older · V22.1 +1.2%
2023-11-30 | 7.55× | 7.63× | +1.0% | Older
2023-05-31 | 7.97× | 7.99× | +0.3% | Older
mean | 8.48× | 8.40× | −1.0% | stdev 1.184 → 1.024 (−14%)
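The summary row can be reproduced from the six per-fold values. This is a minimal sketch, assuming the reported stdev is the sample standard deviation (n−1), which matches the 1.184 figure for v22; because it uses the rounded table values, the V22.1 stdev agrees only to within rounding. Variable and function names are illustrative.

```python
from statistics import mean, stdev

# Lift@15K per eval anchor, copied from the table above (newest first).
v22 = [9.68, 9.96, 8.75, 7.00, 7.55, 7.97]
v22_1 = [9.51, 9.70, 8.41, 7.16, 7.63, 7.99]

def summarize(lifts):
    """Return (mean, sample stdev) of a list of per-fold lifts."""
    return mean(lifts), stdev(lifts)

v22_mean, v22_sd = summarize(v22)
v221_mean, v221_sd = summarize(v22_1)

# Count the folds where V22.1 beats v22 (the three older anchors).
v221_wins = sum(new > old for old, new in zip(v22, v22_1))

print(f"v22   mean={v22_mean:.3f} sd={v22_sd:.3f}")
print(f"V22.1 mean={v221_mean:.3f} sd={v221_sd:.3f}")
print(f"V22.1 wins {v221_wins}/6 folds")
```

With only six folds, the stdev is itself a noisy estimate, so the 14% reduction is better read as directional than as a precise figure.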

v2_oostrict (full F4+F6+F7) RETRACTED. 3-way 3-seed at anchor 2025-10-31: v22 9.60±0.15× · V22.1 9.46±0.08× · v2_oostrict 9.38±0.06× (L@15K). v2_oostrict has separated losses (−3.9% L@10K, −2.2% L@15K) — the F6 drops were net-negative. The earlier "v2 wins +12.6% AUC-PR vs v21" framing is a universe-mismatch artifact (FL-3, different filter version) and is superseded.

Trim feature-set · finding 85 (concurrent · super model)

Separate experiment on the FL-6/FL-7 super model, validated + committed (76f13b0). Tested whether "lifestyle" permits (POOL/DECK/GARAGE/WINDOWS/FENCE — owner-enjoyment capex as OO-proxy + renovate-then-sell signal) and FA garage cols help. The full 11-feat lifestyle set was near-flat; the 3-feature trim won: n_pool_permits_lifetime + garage_sqft + last_lifestyle_permit_months. Dropping the 11 noise feats freed LightGBM split capacity for the 3 orthogonal signals (noise-saturation effect).

Arm | AUC-PR Δ | lift@5K Δ | lift@15K Δ
Lifestyle (11 feats) | +3.4% | flat
Garage (4 feats) | +1.7% | −0.5%
Trim (3 feats) | +4.5% | +6.8% | +3.5%

Cross-fold confirmed (EVAL_T0=2026-04-30): trim 10.05× vs 9.32× base @15K (+7.8%), recall 17.8% → 19.2%. Feature ranks 3 / 10 / 25 of 201 (7.03% gain). Caveat: the win is N=15K-specific; the AUC-PR % is noise-inflated (287 eval positives). Open: FL-7 super_v3 go/no-go (~9h retrain). Detail: finding 85.

Production decision — stays v22 until F3 lands

  1. NOW Keep v22 in production. Recent folds (what next-quarter lists look like) favor v22 by 2.7%. V22.1 is spec'd + 6-fold validated as standby.
  2. BLOCKER F3 — resolve Real_Estate_Master_V1 partition vintage (data-team Q). Path A (vintage-frozen) → V22.1 promotes directly. Path B (current-as-of-commit) → add a 1-day vintage gate to the hybrid build.
  3. THEN Re-validate post-F3 — falsifiable: V22.1 wins ≥4/6 folds, mean ≥+1.5%. If it holds → promote to V22.1. A/B test v22 vs V22.1 on a real CallZeke build, measure 30/60/90-day conversion (lift-on-labels is a proxy; business outcome is truth).
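The falsifiable re-validation criterion in step 3 can be written as a small check. This is a sketch only: it assumes "mean ≥+1.5%" means the relative change in mean lift@15K of V22.1 over v22, and the function name, signature, and defaults are hypothetical.

```python
from statistics import mean

def v221_promotes(v22_lifts, v221_lifts, min_wins=4, min_mean_gain=0.015):
    """True if V22.1 wins at least `min_wins` folds and its mean lift
    exceeds v22's by at least `min_mean_gain` (relative)."""
    if len(v22_lifts) != len(v221_lifts) or not v22_lifts:
        raise ValueError("need the same non-empty set of folds for both models")
    wins = sum(new > old for old, new in zip(v22_lifts, v221_lifts))
    rel_gain = mean(v221_lifts) / mean(v22_lifts) - 1.0
    return wins >= min_wins and rel_gain >= min_mean_gain

# Pre-F3 numbers from the 6-fold table: 3/6 wins and about -1.0% mean,
# so the gate does not pass on today's results.
v22 = [9.68, 9.96, 8.75, 7.00, 7.55, 7.97]
v22_1 = [9.51, 9.70, 8.41, 7.16, 7.63, 7.99]
print(v221_promotes(v22, v22_1))  # False
```

Both conditions are required together, so a model that wins many folds by a hair, or wins the mean on one outlier fold, would not promote under this reading.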

References

Markdown files (AI-detail layer) — every link in the boxes above resolves to one of these:

File | Step | Content
steps/01_variable_inventory.md | 1.1 | Variable inventory · role assignment (LABEL_SIGNAL / FEATURE / METADATA / DROP)
steps/02_type_subtype_mece.md | 1.2 | Type + subtype MECE rule table · priority cascade · anti-FP guards
data/gold_variable_inventory.md | 1.1.3 | Master per-column role table + per-FIPS null/cardinality matrix (output of 1.1)
steps/01_bz_roof_filter.md | 1.1 (archived) | Pre-restructure binary classifier spec; kept read-only as historical reference
steps/02_permit_category_mece.md | 1.2 (archived) | Pre-restructure 14-cat enum spec; kept read-only as historical reference
steps/04_repair_vs_replace.md | 1.3 (folded) | Pre-restructure repair/replace sub-class; now folded into permit_subtype in 1.2
steps/05_labels_output_contract.md | 1.4 | Platinum parquet schema · spec_v versioning policy
steps/06_event_date_rule.md | 1.5 | Canonical event date · anti-corruption guard · T-3 anchor
steps/03_geographic_coverage.md | 2.x | Decision tree, output schema, hard rule (false-negatives) · referenced from 2.1 / 2.2 / 2.4 / 2.5
coverage_rules/coverage_inclusion_rules.md | 2.3 | Canonical inclusion gates · sub-muni veto list
steps/12_as_of_enrichment.md | 3 | T0-relative anchor · as-of join method · leakage definition · repair/replace · hurricane (design locked 2026-05-21)
steps/07_walk_forward_folds.md | 4 | H=6m · embargo=6m · cadence=6m · backward fold layout · case-control negatives (params locked 2026-05-21)
steps/08_synthetic_features.md | 5 | Internal × External · feature budget · case-control-aware External split · interactions (audited 2026-05-21)
steps/09_model_training.md | 6 | Global model + FIPS feature · algorithm · true-prior metrics · no re-weighting (audited 2026-05-21)
steps/10_calibration.md | 7 | True-prior calibration · Platt default · global scope · ranker-only downgrade (audited 2026-05-21)
steps/11_output_filter.md | 8 | Recent-roof gate as model-health detector · score never overwritten · VENDOR_BLIND · dedup · finding 67 fix (audited 2026-05-21)
steps/13_deploy_monitor.md | 9 | Walk-forward backtest value metric (lift vs random) · permit-truth closed loop · feature drift · retraining (audited 2026-05-21)
decisions/2026-05-20_labeling_before_coverage.md | | ADR · rationale for the 2026-05-20 renumber (labeling before coverage)
decisions/2026-05-20_step1_restructure_variables_and_type_subtype.md | 1.1 / 1.2 | ADR · 2026-05-20 Step 1 restructure into variables + type+subtype MECE · dev-layer strategy
decisions/2026-05-21_step3_t0_anchor.md | 3 | ADR · 2026-05-21 Step 3 T0-relative anchor · leakage definition · negatives · repair/replace · hurricane
INDEX.md | | Folder map + status table mirror

Legacy pipeline HTML. The pre-renumber pipeline doc lives at notes/roofing_model_design/roofing_pipeline_scientific_method_2026-05-19.html. Marked ARCHIVED at the top of the doc; retained read-only as the historical sketch source for the Step 3-9 stubs above. Do not edit — migrate detail into the corresponding steps/*.md file as each phase becomes active.

Deep-dives · further reading

Documents that are not the source of truth for any pipeline step but that you should read to understand the data + ETL layers underneath the steps. Keep these handy when you want to dig past the "what" into the "how" and "why".

BuildZoom ETL walkthrough (2026-05-19)

The data's journey from raw provider CSVs through Bronze → Silver → Gold, including the FA-address match cascade that produces the rows our Step 1 classifier reads. Authoritative for understanding where the data we label actually comes from.

Section | What it covers | Relevant to pipeline step
§1 Architecture | Swim-lane view of the ETL · all four Bronze tables · joins | Background for all steps
§2 Bronze | Raw ingestion · four tables · one job each | Upstream of Step 1 (labels read Bronze-derived rows)
§3 Silver | Fact-table join · 3-join recipe | Upstream of Step 1 + Step 3 enrichment
§4 Gold | FA address matching · 3-condition cascade | Direct input to Step 1.1 + Step 2 match-rate
§4b Temporal contract | event_date + observation_period rule | Direct source for Step 1.5 canonical event date rule
§5 QA gate | Same-shape audit at every layer | Pattern to mirror in Step 1 audits
§7 Issues identified | Bugs / surprises surfaced by the audit | Worth scanning before every Step 1 / Step 2 change
§8 Opportunities | Proposed improvements (Bronze → Platinum) | Several proposals (permit_category enum, status timeline, canonical event_date) are now Step 1 sub-tasks
§8b Missing cleaning layer | Proposal for a clean layer between Bronze and Gold | Relevant to Step 1 + Step 3 enrichment design
§9 Where Platinum fits | Why a separate tier: our Step 1.4 output IS the Platinum tier proposal materialized | Direct source for Step 1.4 output contract
§11 Deep technical audit | 3 specialist passes: critical bugs · perf wins · silent-failure DQ issues | Read before any Step 1 spec change · 13 HIGH bugs catalogued

Other deep-dive references

Doc | What's in it
aws_operational_context.md | AWS ops rules (already linked from the AWS callout at the top of this page) · tag · partition pruning · EMR cloning · networking-don't-touch
findings_index.md | Cross-reference to every notes/findings/*roof*.md entry; chronological evidence log
audits/ | Per-step empirical audit reports (e.g. 2026-05-20_v2_classifier_duval_audit.md · 25/25 ✓)
hypotheses/ | H1-H4 + experimental hypotheses (new owner × old roof · climate zone × material · hurricane wave · silver equity)
data/ | Field dictionaries (BZ, FA, providers, external macro)
callzeke/ | CallZeke client engagement archive (2026-05-13 call · 5 objections diagnosed 2026-05-15 · 9 distilled L1-L9 learnings for the roofing pipeline). Ex-REI hub HTMLs converted to AI-readable markdown 2026-05-25. Start at README.md → 00_context.md → objections.md → learnings.md.
Memory entries | project_buildzoom_etl_audit · project_coverage_universe · project_coverage_verified · reference_aws_operational_context; agent-side facts persisted across sessions

Suggested reading order (if you're starting fresh)

  1. This cuaderno end-to-end (you are here) — get the 9-step macro.
  2. ETL §1-4 (architecture + Bronze + Silver + Gold) — understand where the rows we label come from.
  3. ETL §4b temporal contract — why event_date = MIN(non-null status dates).
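  4. Step 1.1 + 1.2 MDs (steps/01_variable_inventory.md, steps/02_type_subtype_mece.md): the labeling rules in detail. The pre-restructure specs (steps/01_bz_roof_filter.md, steps/02_permit_category_mece.md) are archived and kept only as historical reference.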
  5. ETL §11 deep audit — 13 HIGH bugs already catalogued; useful to know what's pending DE-side.
  6. AWS ops context — before touching any cluster.
  7. Step 2 (steps/03_geographic_coverage.md) + ADR — how coverage decides the training cohort.
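Item 3 of the reading order cites the rule event_date = MIN(non-null status dates). A minimal sketch of that rule follows; the status field names are hypothetical placeholders (the real columns are defined in ETL §4b), and the anti-corruption guard mentioned for Step 1.5 is not modeled here.

```python
from datetime import date
from typing import Optional

def canonical_event_date(status_dates: dict[str, Optional[date]]) -> Optional[date]:
    """Earliest non-null status date, or None when every date is missing."""
    present = [d for d in status_dates.values() if d is not None]
    return min(present) if present else None

# Hypothetical permit row: field names are illustrative only.
permit = {
    "applied_date": None,
    "issued_date": date(2024, 3, 18),
    "completed_date": date(2024, 6, 2),
}
print(canonical_event_date(permit))  # 2024-03-18
```

Returning None rather than a sentinel date keeps rows with no usable status date easy to count and exclude downstream.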

v22 — How it works (field guide)

Migrated from v2_model/V22_EXPLAINED.html — the v22 model field guide as a changelog entry

8020ROOF
May 2026
Field Guide

How the model finds the next roof to replace.

A look inside the 163-feature, 2.27-million-row machine that scores every single-family home across seven Florida counties on its odds of pulling a roof permit in the next six months.

Most of the model's brain is the same brain a twenty-year Florida roofer carries to a job site. It looks at how old the roof is, what it's made of, whether the neighbors have been pulling permits, whether a hurricane recently came through, and whether the owner is in fix-things-up mode. The model just does it across 1.10 million properties at once instead of one driveway at a time. On a list of fifteen thousand homes, it catches roof replacements at 9.7× the rate of a random draw from that full scored universe on the current production anchor, and at 8.5× averaged across six historical evaluation windows.

Driver clusters · LightGBM gain, aggregated by signal

What the model actually weights.

Roof age 51% · Material 10% · Place 19% · Owner 12% · $ 8%
Approximate share of the top-40 features' gain, by signal cluster · 100% total
01 · Roof age & lifecycle · ≈ 51%
The dominant driver. Six features all encode "how long until this roof needs replacing" — directly (estimated roof age, months since the last whole-roof permit), as fallback (house age, year built), or as a learned threshold (the replacement-window flag). Highly correlated; individual rankings are not independent.
roof_age_months_est · roof_age_months · property_age_yr · year_built · months_since_any_roof · is_in_replacement_window
02 · Roof material · ≈ 10%
A single FA assessor field (shingle / tile / metal / flat) that interacts strongly with age. The model learns whatever per-material replacement rates exist in the training data; we have not independently audited those rates against industry conventions.
roof_cover_code
03 · Place (county, neighborhood, storms) · ≈ 19%
Three nested location signals: county regime (insurance + climate, via the county code), subdivision herd effect (are neighbors reroofing?), and parcel-specific storm exposure (named storm, distance to track, peak wind). They overlap — Sarasota's high county reroof rate is partly the Ian effect — but encode distinct geographic scales.
fips · area_pct_wholeroofs_36m · nearest_storm_name · nearest_storm_km · nearest_storm_wind_kt · months_since_nearest_storm
04 · Owner activity · ≈ 12%
Recent permit history on this parcel for any non-roof trade — HVAC, plumbing, electrical, building, windows, solar, pool. The composite "any non-roof permit recently" carries the bulk of this signal; the model cannot pinpoint which specific trade matters most.
months_since_last_any_non_roof_permit · months_since_last_hvac_permit · months_since_last_building_permit · months_since_last_windows_permit
05 · Capacity & value · ≈ 8%
Can they pay, will the HOA enforce, what's the property worth. Physical size, market and AVM value, value per square foot, HOA tier. Smaller share but consistent across folds.
market_total_value · lot_sqft · building_sqft · avm_per_living_sqft · hoa1_fee_value
On feature counts. LightGBM reports gain per column. When multiple columns encode the same underlying signal — roof age in six different forms, for example — the gain is partitioned across them. Reading the model by individual feature rank overstates independent contribution. Reading by signal cluster is the honest version. Cluster shares above are computed over the top-40 features, which carry the bulk of total gain.
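Reading by signal cluster amounts to summing per-feature gain within each cluster and normalizing. The sketch below shows that aggregation under stated assumptions: the feature-to-cluster map is a hypothetical subset built from the feature names listed above, and the gain values are made up for illustration, not the model's real numbers.

```python
from collections import defaultdict

# Hypothetical feature -> cluster map (subset of the features named above).
CLUSTER_OF = {
    "roof_age_months_est": "roof_age",
    "roof_age_months": "roof_age",
    "year_built": "roof_age",
    "roof_cover_code": "material",
    "fips": "place",
    "area_pct_wholeroofs_36m": "place",
    "months_since_last_any_non_roof_permit": "owner",
    "market_total_value": "capacity",
}

def cluster_shares(gain_by_feature, cluster_of):
    """Sum per-feature gain within each cluster and normalize to shares."""
    totals = defaultdict(float)
    for feature, gain in gain_by_feature.items():
        totals[cluster_of.get(feature, "other")] += gain
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("total gain must be positive")
    return {cluster: g / grand_total for cluster, g in totals.items()}

# Made-up gains for illustration only.
example_gain = {
    "roof_age_months_est": 400.0,
    "roof_age_months": 150.0,
    "year_built": 50.0,
    "roof_cover_code": 120.0,
    "fips": 100.0,
    "area_pct_wholeroofs_36m": 80.0,
    "months_since_last_any_non_roof_permit": 60.0,
    "market_total_value": 40.0,
}
shares = cluster_shares(example_gain, CLUSTER_OF)
for cluster, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cluster:10s} {share:.1%}")
```

With a trained LightGBM booster, the input dictionary could be built by pairing `Booster.feature_name()` with `Booster.feature_importance(importance_type="gain")`; restricting it to the top-40 features first would mirror how the shares above are described.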
Headline numbers

How well it actually works.

9.7× · Lift @ 15K · production anchor (eval at 2025-10-31)
8.5× · Lift @ 15K · 6-fold mean (honest cross-period)
13.2× · Lift @ 5K · top of list (tightest selection)
1.15% · Universe base rate (6-mo permit hit rate)

Of 1.10 million scored homes across the seven counties, only 12,631 (1.15%) will pull a roof permit in the next six months. A random 15,000 from that universe catches about 170. The model picks 15,000 and catches 1,665 — a 9.7× edge.
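The arithmetic behind that edge is list precision divided by the universe base rate. A minimal sketch using the figures from the paragraph above; the universe size is taken as the rounded 1,100,000, so results are approximate, and the function name is illustrative.

```python
def lift_at_n(hits_in_list, list_size, universe_positives, universe_size):
    """Precision of the list divided by the universe base rate."""
    if min(list_size, universe_size, universe_positives) <= 0:
        raise ValueError("sizes and positive counts must be positive")
    precision = hits_in_list / list_size
    base_rate = universe_positives / universe_size
    return precision / base_rate

# Figures from the paragraph above; 1_100_000 is the rounded "1.10 million".
UNIVERSE = 1_100_000
POSITIVES = 12_631
LIST_SIZE = 15_000
HITS = 1_665

base_rate = POSITIVES / UNIVERSE
expected_random_hits = LIST_SIZE * base_rate
lift = lift_at_n(HITS, LIST_SIZE, POSITIVES, UNIVERSE)
print(f"base rate {base_rate:.2%}, random hits ~{expected_random_hits:.0f}, lift {lift:.1f}x")
```

The same function makes the baseline question concrete: passing a narrower universe with a higher base rate (such as the buy-box) lowers the multiple even when the list and its hits are unchanged.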

Which baseline? The 9.7× / 8.5× figures compare against a random draw from the full 1.10M scored universe. A client mails the narrower buy-box (single-family, age 10+, no roof permit in 10 years), which already screens out the obvious non-candidates — against that harder baseline the v21 reconciliation measured roughly 4–5×. Same model, smaller multiple; the denominator is what changes.

Cross-fold stability

Six historical windows, 7-to-10× lift.

Lift @ 15K by anchor (chart, y-axis to 10×): 2023·05 7.97× · 2023·11 7.55× · 2024·05 7.00× · 2024·11 8.75× · 2025·05 9.96× · 2025·10* 9.68×
Lift @ 15,000 across six walk-forward evaluation anchors. Strongest on 2025 anchors (Helene + Milton + Ian in-window); weakest on the quieter 2024-05 window. *Current production anchor. The lowest fold still beats random by 7×.
Audit findings

What we checked, changed, and left open.

Clean · Roof-age derivation and neighborhood-window features pass T0-strict leakage audit. No code changes required.
Validated · Insurance-renewal proximity (purchase-anniversary × roof age) ranks top-5 by gain. It is the validated V22.1 upgrade candidate, not yet in production v22. Continuous county base-rate is proposed (not yet built). One signal confirmed, one pending.
Validated · Three-feature "lifestyle trim" (pool-permit history, garage size, last improvement-permit recency) adds +3.5% lift@15K, confirmed cross-fold. Candidate for the next retrain.
Pending · FA assessor snapshot vintage: 10-to-40-month forward-leak risk on 11 features including the #2 driver. Awaiting data-team confirmation. Production stays on v22 until this resolves.
Closed · County + storm-name memorization ablation: partially validated. Variance tightened 14% but mean lift dropped 1.6%; the categoricals carry real recent-storm signal, so they stay.
Anti-signals

What pulls a score down.

Removed before scoring
  • Absentee owner
  • Non-individual owner
  • Mobile / manufactured
  • Vacant or no mailable address
Roof says "wait"
  • Recently reroofed (roof_age_months_est)
  • Long-life tile (roof_cover_code)
  • Prior canceled permit (months_since_canceled_roof)
Quiet surroundings
  • Few neighbors reroofing (area_pct_wholeroofs_36m)
  • No recent storm (months_since_nearest_storm)
  • Low-rate county (fips)

Two stages. The first group ("Removed before scoring") is a hard filter: absentees, non-individual owners, mobile homes, vacants, missing EMV, and addresses that fail Smarty DPV are cut before the model ever scores. The other two groups are in-model signals whose adverse values pull a score down (directions inferred from feature behavior, not yet SHAP-audited). The 15-year recent-roof rule is a separate hard exclusion at the output gate.
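The two-stage structure can be sketched as a filter applied before any scoring. This is an illustration under assumptions: every field name below is a hypothetical stand-in for the real columns, the score function is a placeholder, and the 15-year recent-roof output gate is deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class Home:
    # All field names are hypothetical stand-ins for the real columns.
    absentee_owner: bool
    individual_owner: bool
    mobile_home: bool
    vacant: bool
    has_emv: bool
    passes_dpv: bool

def passes_hard_filter(home: Home) -> bool:
    """Stage 1: drop homes that are cut before the model ever scores."""
    return (
        not home.absentee_owner
        and home.individual_owner
        and not home.mobile_home
        and not home.vacant
        and home.has_emv
        and home.passes_dpv
    )

def score_universe(homes, score_fn):
    """Stage 2: score only the homes that survive the hard filter."""
    return [(home, score_fn(home)) for home in homes if passes_hard_filter(home)]

homes = [
    Home(False, True, False, False, True, True),  # kept
    Home(True, True, False, False, True, True),   # absentee owner: removed
    Home(False, True, True, False, True, True),   # mobile home: removed
]
scored = score_universe(homes, score_fn=lambda h: 0.5)  # placeholder score
print(len(scored))  # 1
```

Keeping the filter separate from the score means a removed home never receives a score at all, which is different from an in-model signal that merely lowers one.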

Bottom line

Against a random draw from the scored universe the model is roughly seven-to-ten times more efficient across six historical windows, 8.5× on average (about 4–5× against the tighter buy-box a client actually mails). Physical roof age dominates (≈ 51% of signal), followed by place, owner activity, material, and capacity, mirroring the questions a twenty-year Florida roofer would ask. What the model can't see: insurance-carrier data, owner intent before permits surface, and any market outside the seven Florida counties it was trained on. Production stays on v22 until the FA-vintage question (the largest open lever) is resolved.

v22 super · FL-7 · 163 feat · build 5295029 · eval 2025-10-31