
Roofing pipeline — end to end

The nine MECE steps that turn raw permits into a calibrated, gated roof-replacement lead list. Each step has a locked contract; each box below is faithful to its spec.

The roof-replacement model is built as nine mutually-exclusive, collectively-exhaustive steps: Labeling → Coverage → Enrichment → Folds → Features → Training → Calibration → Output gate → Deploy. Every step has one owning spec and a status pill. This is the engineering pipeline — the conceptual field guide lives at How it works; the upstream permit ETL is documented in the BuildZoom walkthrough.

Labeling is locked before coverage — deliberately. Coverage asks "what fraction of the SFH stock has at least one valid roof permit?" Its numerator depends entirely on the definition of "valid roof permit" — and that definition is the labeling step. So labeling has to be frozen first; coverage is strictly downstream of it (decided 2026-05-20).

Status, honestly. Steps 1-2 are active workstreams (Labeling DONE, Coverage WIP). Steps 3-9 are SPEC — designed and audited 2026-05-21 (per-step + integration), implementation not started. Step 3 has a completed 3-county pilot (Hernando / Pasco / Pinellas) but its national spec is still SPEC. Numbers below that come from that pilot are marked as such; do not read specs as shipped.

The nine steps

Step	Status	Summary
1 · Labeling	DONE	Permit → permit_scope. Defines "valid roof permit". → /data
2 · Coverage	WIP	Per-muni INCLUDE / EXCLUDE / FLAG. Where can we see? → /data
3 · Enrichment	SPEC	As-of-T0 property state. Where the leakage discipline lives.
4 · Folds	SPEC	Walk-forward folds + case-control negative sampling.
5 · Features	SPEC	Internal × external columns; let the model judge.
6 · Training	SPEC	One global GBM with FIPS as a feature.
7 · Calibration	SPEC	Ranker → forecaster, corrected to the true base rate.
8 · Output gate	SPEC	Recent-roof exclusion; never overwrite the score.
9 · Deploy + monitor	SPEC	Backtest, value metric, drift, hybrid retraining.

Step 1 · Labeling DONE

The foundation. Every permit is classified into a permit_scope — a list of (type, action) structs — so downstream code derives is_roofing / roof_action and never re-classifies. The positive label the model is trained on is a roof replacement: permit_action ∈ {REPLACEMENT, AMBIGUOUS}; REPAIR-only permits are not positives. The canonical event_date = MIN(non-null status dates) is capped at today + 90d to defuse future-date corruption. Labeling is locked first because coverage's numerator depends on this definition.
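The event-date and positive-label rules above can be sketched in a few lines. This is a minimal illustration, not the Step 1 implementation: the function names are hypothetical, and it assumes "capped" means a date beyond today + 90d is clamped to the cap rather than nulled.

```python
from datetime import date, timedelta

FUTURE_CAP_DAYS = 90                      # from the text: capped at today + 90d
POSITIVE_ACTIONS = {"REPLACEMENT", "AMBIGUOUS"}


def canonical_event_date(status_dates, today):
    """MIN over non-null status dates, clamped to today + 90 days.

    Assumption: a permit with no status dates gets no event_date (None).
    """
    non_null = [d for d in status_dates if d is not None]
    if not non_null:
        return None
    cap = today + timedelta(days=FUTURE_CAP_DAYS)
    return min(min(non_null), cap)


def is_positive_label(roof_actions):
    """True if any roofing action on the permit is a replacement positive.

    REPAIR-only permits return False.
    """
    return any(action in POSITIVE_ACTIONS for action in roof_actions)


today = date(2024, 6, 1)
normal = canonical_event_date([None, date(2024, 3, 1), date(2024, 1, 15)], today)
corrupt = canonical_event_date([date(2099, 1, 1)], today)
```

Here `normal` resolves to the earliest status date, while `corrupt` is pulled back to the cap instead of carrying a far-future date downstream.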

Full detail → Step 1 — Labeling and Permit classification. The taxonomy is documented there, not duplicated here.

Step 2 · Coverage WIP

For every (FIPS, jurisdiction, fa_muni) tuple, produce a coverage_decision ∈ {INCLUDE, EXCLUDE, FLAG}. Only INCLUDE tuples feed training and delivery. False negatives (properties that look like non-events only because their roof permits are not visible to us) are the dangerous failure mode, so when in doubt, EXCLUDE. Per the locked Step 4 contract, the decision is a function of fold T0: applying today's INCLUDE set to a past fold would leak future permits into that fold's universe.

Full detail → Step 2 — Coverage. The address-match mechanics that feed the diagnostic are in How we match.

Step 3 · Enrichment SPEC

Bottom line: build the enriched property-state table every training and scoring row reads — one row per (property_id, T0) pair, describing the property state as-of T0. This is where the leakage discipline lives.

The training unit is the pair (property_id, T0), not "a property with an event". The same T0 is shared by a positive and the negatives matched to it. Features describe the state as-of T0 plus trend anchors before it; the label (Step 4) is computed by looking forward from T0. The event date never sets the feature anchor — it is only ever read forward to set the label.

Anchor set

Anchor	Role
T0	Primary — level state at the prediction moment.
T0−3	Short trend, Δ(T0, T0−3).
T0−6	Medium trend, Δ(T0, T0−6).
T0−12 (optional)	Long trend / seasonal baseline. Drop if permutation importance is flat.
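As an illustration of the anchor set, the sketch below derives the anchor dates for one T0 and a simple level difference for the trend. It assumes anchors are month-granular and represented as first-of-month dates; `shift_months`, `anchor_set` and `trend` are hypothetical names, and the spec may define Δ differently (for example as a ratio).

```python
from datetime import date


def shift_months(d, months):
    """Shift a first-of-month date by a signed number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def anchor_set(t0, include_long=False):
    """Return the anchors T0, T0-3, T0-6 and optionally T0-12."""
    offsets = [0, 3, 6] + ([12] if include_long else [])
    return {("T0" if k == 0 else f"T0-{k}"): shift_months(t0, -k) for k in offsets}


def trend(level_at_t0, level_at_prior):
    """Delta between the T0 level and an earlier anchor; None if either is missing."""
    if level_at_t0 is None or level_at_prior is None:
        return None
    return level_at_t0 - level_at_prior


anchors = anchor_set(date(2025, 5, 1), include_long=True)
```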

Two halves, both reading ≤ T0

Local — FA physical (year_built, property_age, derived roof_age, living / lot / building sqft) plus prior-permit history self-joined strictly ≤ T0 (last_ROOFING_age, n_roof_REPAIR_24m, n_HVAC_24m, n_SOLAR_24m).
Behavioral — the FA / Silver REM feature vocabulary (the data that fed Apollo, not the Apollo model): 23 distress trajectories + count, owner / occupancy, point-in-time valuation, leverage (CLTV / LTV), recency.

The leakage rule (locked). "Past data" is not automatically safe data. A feature is leaky if it is dated after that row's T0, even when it is still in the past relative to today. Permit features are filtered to event_date ≤ T0. The earlier "T-1 too close" framing was wrong: a refi one month pre-permit is a real, serve-time-available signal — not leakage. True leakage is a post-T0 feature or an FA snapshot vintage backfill. There is no artificial freshest-month exclusion.
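A minimal sketch of the rule as a filter plus a violation count, in the spirit of the 0-violation assertion. The row layout (`event_date`, `t0`, `source_dates`) and the function names are assumptions made for illustration only.

```python
from datetime import date


def permits_as_of(permits, t0):
    """Keep only permits whose event_date is known and on or before t0."""
    return [p for p in permits if p["event_date"] is not None and p["event_date"] <= t0]


def count_leakage_violations(feature_rows):
    """Count rows where any date feeding the features falls after the row's own t0."""
    return sum(
        1 for row in feature_rows
        if any(d > row["t0"] for d in row["source_dates"])
    )


t0 = date(2024, 5, 1)
permits = [
    {"kind": "HVAC", "event_date": date(2023, 9, 12)},     # well before T0: usable
    {"kind": "ROOFING", "event_date": date(2024, 4, 2)},   # one month before T0: still usable
    {"kind": "ROOFING", "event_date": date(2024, 8, 20)},  # after T0: only ever read for the label
]
usable = permits_as_of(permits, t0)
rows = [{"t0": t0, "source_dates": [p["event_date"] for p in usable]}]
assert count_leakage_violations(rows) == 0
```

Note that the permit dated one month before T0 survives the filter: closeness to T0 is not the test, only which side of T0 the date falls on.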

Open / under audit. The FA vintage audit (are monthly snapshots point-in-time, no backfill?) is still open — treat near-T0 features with caution until verified. Three reused features — months_since_last_sale, listing_duration_months, mortgage_age_months — are under the CLAUDE.md §7 leakage audit and inherit it here; do not cite their contribution as validated until the ablation closes.

The 3-county pilot (Hernando / Pasco / Pinellas, 2026-05-21) enriched 497,856 coverage-INCLUDE SFH across 7 fold anchors (2022-05 → 2025-05), leakage-clean with a 0-violation assertion on event_date ≤ T0. REM-matched 96–99%; any-distress 6–10%; 70–76% carry a prior whole-roof permit; median AVM $396K. National implementation has not started.

Spec: notes/Roofing/steps/12_as_of_enrichment.md

Step 4 · Folds & negative sampling SPEC

Bottom line: turn the enriched (property_id, T0) rows into walk-forward train / eval folds with case-control negative sampling — laid out from the most recent data backward so no fold ever trains on its own future.

Locked parameters (2026-05-21)

Lever	Value	Note
Horizon H	6 months	Label window `(T0, T0+6]` — matches the reroof process (~3-6 mo) plus the post-hurricane permit wave.
Embargo	6 months (= H)	Gap between newest train label window and the eval window. Must equal H — a shorter embargo (the prior 3-month value) left label windows overlapping; that was a bug.
Fold cadence	6 months	Eval windows step back 6 mo → disjoint, non-overlapping → honest variance estimate.
Negatives per positive (K)	TBD	Measure the base rate first. Matched on FIPS + T0. K recorded in the manifest for Step 7 prior-correction.

Why walk-forward

Each fold simulates a real prediction made at its eval anchor E_x = latest_obs − 6·x, training on an expanding window of rows whose label windows end at or before E_x — realising the 6-month embargo. Eval windows are disjoint and adjacent, so each fold is an independent estimate. The sandbox (2021-01 onward) yields roughly 6-8 folds; fold x=1 is the most production-like. latest_obs itself is not a validatable fold (its label window runs past the data) — it is the production scoring point.
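The fold layout can be sketched as a small schedule generator. This is an illustration under stated assumptions: anchors are first-of-month dates, the embargo is read as "the newest train T0 is E_x − 6, so its label window closes at E_x", and the `latest_obs` value used below is hypothetical.

```python
from datetime import date


def shift_months(d, months):
    """Shift a first-of-month date by a signed number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)


def fold_schedule(latest_obs, sandbox_start, horizon=6):
    """Walk backward from latest_obs, one fold every `horizon` months.

    Stops when the newest allowed train anchor would fall before the sandbox start.
    """
    folds = []
    x = 1
    while True:
        eval_anchor = shift_months(latest_obs, -horizon * x)   # E_x
        max_train_t0 = shift_months(eval_anchor, -horizon)     # train label windows end <= E_x
        if max_train_t0 < sandbox_start:
            break
        folds.append({
            "x": x,
            "eval_anchor": eval_anchor,
            "eval_window_end": shift_months(eval_anchor, horizon),  # label window (E_x, E_x+6]
            "max_train_t0": max_train_t0,
        })
        x += 1
    return folds


# Hypothetical latest observation month; the sandbox start is the 2021-01 from the text.
folds = fold_schedule(latest_obs=date(2025, 11, 1), sandbox_start=date(2021, 1, 1))
```

Each fold's eval window ends exactly where the next-newer fold's eval window begins, which is what makes the windows disjoint and adjacent.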

Case-control negative sampling

A negative = a (property_id, T0) pair with no qualifying roof-replacement event in (T0, T0+6].
Per positive, draw K negatives at the same T0 and FIPS — removes county and calendar confounders. Do not match on anything that is also a predictive feature (roof age, distress) — that nulls the signal.
The negative pool is the full coverage-INCLUDE SFH universe, evaluated as of each fold's own T0 (not one static national set).
A property supplies positive rows for the T0s leading to its event and negative rows for other T0s — a discrete-time panel. Dedup is enforced with a ±embargo so a row never appears on both sides of a fold boundary.
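The matching rule in the list above can be sketched as follows. This is a simplified illustration: the tuple layout and `sample_negatives` are hypothetical, K is a placeholder because the real value is TBD, and the sketch does not implement the ±embargo dedup or prevent the same negative being drawn for two positives in one cell.

```python
import random
from collections import defaultdict


def sample_negatives(positives, negative_pool, k, seed=0):
    """Draw up to k negatives per positive from the same (FIPS, T0) cell.

    positives / negative_pool: lists of (property_id, t0, fips) tuples, where the
    pool holds pairs with no qualifying event in (t0, t0+6].
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for pid, t0, fips in negative_pool:
        by_cell[(fips, t0)].append(pid)

    rows = []
    for pid, t0, fips in positives:
        rows.append({"property_id": pid, "t0": t0, "fips": fips, "label": 1})
        candidates = [c for c in by_cell.get((fips, t0), []) if c != pid]
        for neg in rng.sample(candidates, min(k, len(candidates))):
            rows.append({"property_id": neg, "t0": t0, "fips": fips, "label": 0})
    return rows


positives = [("p1", "2024-05", "12101")]
pool = [(f"n{i}", "2024-05", "12101") for i in range(5)]       # same cell as p1
pool += [(f"m{i}", "2024-05", "12103") for i in range(3)]      # other county: never matched
training_rows = sample_negatives(positives, pool, k=3)
```

Matching only on FIPS and T0 leaves roof age, distress and every other candidate feature free to differ between positives and negatives, so the model can still learn from them.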

Baseline = a random list of equal size. The held-out evaluation (Step 9) measures lift over a random list on a walk-forward backtest. Roofing does not compete against Alpha — Alpha is the REI sales heuristic, a different model with a different target; the "March 2025 vs Alpha" gate in CLAUDE.md belongs to the macro sales model, not here. Open item: pick a backtest window old enough that its permit data is fully reported.

Spec: notes/Roofing/steps/07_walk_forward_folds.md

Step 5 · Features SPEC

Bottom line: build clean, leakage-free, low-dimensional columns on the Step 3 anchors — two families, Internal (property + owner) and External (market + environment) — and let the gradient-boosted model learn the interactions itself.

Internal × external families

Internal (property-keyed, as-of T0)	External (split by within-cell variability)
Property physical: year_built, property_age, derived `roof_age`, living_area, use_type, roof_material.	Useful, varies within (FIPS, T0): sub-county neighbour permit density, parcel-level storm exposure, parcel / sub-county insurance-claim signal.
Owner state: tenure_months, is_llc, months_since_last_sale, sales_in_prev_24m.	Wasted for within-cell ranking, constant within (FIPS, T0): county-level macro (FRED mortgage rate, fed funds, HPI, CPI, unemployment), county-level "hurricane happened" flag. These do not go in the model: case-control sampling removes their association with the label. They feed the base rate at calibration / threshold tuning (Step 7 / 9).
Distress: collapsed to `distress_count_active` + one trend + a short list of the most-predictive flags, not all 23 × 4 stats × 3 anchors.
Valuation / equity: AVM, CLTV, equity_$.
Prior permit history (self-join ≤ T0): `last_ROOFING_age`, `n_roof_REPAIR_24m`, `n_HVAC_24m`, `n_SOLAR_24m`.

Selection — expert pre-filter, then let the model judge

Expert pre-filter (floor only) — drop obvious non-signal (IDs, free text) and no-variability columns (constant, >99% one value, >99% null). Do not pre-drop the "plausible but I doubt it" bucket.
Permutation importance — train a fast first model, rank columns by how much the metric degrades when each is shuffled, drop importance ≈ 0, retrain. This is multivariate — it captures interactions, unlike univariate correlation filtering.
The distress family (23 trajectories × 4 stats × 3 anchors ≈ 276 columns) is the main explosion risk — collapse it at the pre-filter, and drop the 2nd-diff "acceleration" for the first model.
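The no-variability part of the expert pre-filter is mechanical enough to sketch. The example below is an assumption-laden illustration: `low_variability_columns` is a hypothetical helper, rows are plain dicts, and the ">99% one value" share is computed over non-null values, which the spec may define differently. Permutation importance itself needs a trained model and is not shown.

```python
from collections import Counter


def low_variability_columns(rows, threshold=0.99):
    """Return {column: reason} for columns the floor-only pre-filter would drop."""
    n = len(rows)
    columns = sorted({col for row in rows for col in row})
    dropped = {}
    for col in columns:
        non_null = [row[col] for row in rows if row.get(col) is not None]
        null_share = 1 - len(non_null) / n
        if not non_null or null_share > threshold:
            dropped[col] = "mostly_null"          # >99% null
        elif len(set(non_null)) == 1:
            dropped[col] = "constant"             # a single distinct value
        elif Counter(non_null).most_common(1)[0][1] / len(non_null) > threshold:
            dropped[col] = "dominant_value"       # >99% one value
    return dropped


# Toy table: only roof_age carries any variability.
rows = [
    {
        "roof_age": i % 7,
        "country": "US",
        "rare_note": "x" if i == 0 else None,
        "rare_flag": 1 if i == 0 else 0,
    }
    for i in range(200)
]
dropped = low_variability_columns(rows)
```

Columns in the "plausible but I doubt it" bucket pass through this filter untouched, which matches the floor-only intent.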

Two locked feature rules. (1) The replacement-window / roof-age signal is central: roof_age alone is weak, but roof_age × months_since_hurricane (old roof + recent storm) is strong — which is exactly why selection must be multivariate, not univariate. (2) Neighbour permit density is a leakage trap — the rolling window must end at T0, exclude the subject property, and count only other parcels' permits; a window crossing T0 leaks the label. Hurricane / storm exposure must be parcel- or sub-county-granular to carry signal under T0-matched case-control; a county-month flag is wasted. The highest-value open gap is a sub-county insurance-claim feed (source TBD). Skip hand-coded interactions for the first model — the trees learn them.
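The three conditions on neighbour permit density (window ends at T0, subject excluded, other parcels only) fit in one counting function. This is a sketch: the 365-day window length and the function name are hypothetical, and the input is assumed to be already restricted to the subject's neighbourhood.

```python
from datetime import date, timedelta


def neighbour_permit_density(subject_id, t0, neighbourhood_permits, window_days=365):
    """Count other parcels' permits in the window (t0 - window_days, t0].

    neighbourhood_permits: iterable of (property_id, event_date) pairs.
    The window never crosses t0 and the subject's own permits are never counted.
    """
    window_start = t0 - timedelta(days=window_days)
    return sum(
        1
        for pid, event_date in neighbourhood_permits
        if pid != subject_id and window_start < event_date <= t0
    )


t0 = date(2024, 5, 1)
permits = [
    ("subject", date(2024, 3, 1)),    # the subject's own permit: excluded
    ("nbr_a", date(2024, 2, 10)),     # inside the window: counted
    ("nbr_b", date(2024, 6, 15)),     # after T0: would leak the label period
    ("nbr_c", date(2022, 1, 5)),      # before the window: ignored
]
density = neighbour_permit_density("subject", t0, permits)
```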

Spec: notes/Roofing/steps/08_synthetic_features.md

Step 6 · Training SPEC

Bottom line: train one global gradient-boosted model that ranks properties by roof-replacement probability over the next 6 months — with FIPS as a feature, not a per-county model.

Architecture defaults

Choice	Default	Why
Algorithm	LightGBM	Handles a high-cardinality `FIPS` categorical natively. Confirm against HistGB / logistic on a roofing-specific arch sweep — the prior arch matrix was the sales model; do not assume HistGB wins here.
Scope	Global + FIPS feature	Reroof is a rare event; a per-FIPS model on ~100-300 positives re-learns "old roof matters" from thin data. A GBM carves FIPS-specific behaviour where rows allow and falls back to the global pattern otherwise — automatic partial pooling.
Class weighting	None	The training table arrives downsampled 1:K from Step 4. Step 6 does not re-sample or re-weight — prior correction back to the true base rate is Step 7's job.
Early stopping	Validation slice carved from train	Time-separated with embargo — never the fold eval window (that would tune to the eval metric).
Hyperparameters	Tune once on pooled data	Fix and reuse. No fresh 50-trial search per FIPS — that overfits thin per-FIPS data.

True-prior handling. The Step 4 eval set is case-control 1:K, so its base rate is not the real-world rate. AUC-PR read off it is inflated; top-decile recall (the business metric) must be computed on a holdout with the true population base rate, not the downsampled eval. AUC-ROC is base-rate-invariant and useful for cross-fold comparison. A per-FIPS carve-out is justified only when there is evidence that it beats the global+FIPS model for that metro; the row-count threshold is an open question.

Spec: notes/Roofing/steps/09_model_training.md

Step 7 · Calibration SPEC

Bottom line: turn the raw ranker scores into probabilities that mean what they claim — "this property has an X% chance of a roof replacement in the next 6 months" — corrected back to the true base rate.

A trained ranker is not a forecaster: raw GBM scores rank well but do not read as true probabilities. Calibration maps scores → probabilities against held-out outcomes.

Method defaults

Choice	Default	Why
Method	Platt (sigmoid)	2 parameters, stable on thin data; a well-trained GBM's miscalibration is largely monotonic. Isotonic is data-hungry and overfits small samples — use only where the calibration sample is large.
Scope	Global	Matches the global model. Per-FIPS isotonic on ~1,421 counties starves on data; split by region only if markets genuinely differ.
Hold-out	True-prior sample	Drawn from the full population at its real positive rate — not a slice of the 1:K training pool. Recent, time-separated from eval and the held-out final test.

Critical — calibrate against the TRUE base rate. The model's raw scores reflect a sample prior of ~1/(K+1), not the rare real-world rate. Calibrating against a slice of the 1:K pool produces probabilities wrong by one to two orders of magnitude — "23%" when the truth is ~0.3%. This is the prior-correction Steps 4 (record K) and 6 defer here: either calibrate on a true-prior hold-out, or apply an explicit log-odds correction logit(p_true) = logit(p_model) + log(π_true/(1−π_true)) − log(π_sample/(1−π_sample)). π_true must be measured, not assumed. If a region fails calibration, downgrade it to "ranker only — no probability claim".
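The log-odds correction can be checked numerically with a short sketch. The values of K and π_true below are placeholders chosen only to illustrate the scale of the effect; the text is explicit that π_true must be measured and that K is still TBD.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-z))


def prior_correct(p_model, pi_true, pi_sample):
    """Apply logit(p_true) = logit(p_model) + logit(pi_true) - logit(pi_sample)."""
    return sigmoid(logit(p_model) + logit(pi_true) - logit(pi_sample))


K = 9                          # placeholder: negatives per positive
pi_sample = 1 / (K + 1)        # sample prior implied by 1:K downsampling
pi_true = 0.003                # placeholder: must be measured, not assumed

p_corrected = prior_correct(0.23, pi_true, pi_sample)
```

Two sanity properties follow directly from the formula: a score equal to the sample prior maps to the true prior, and when the two priors coincide the score is unchanged.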

Tolerance is open. The ±15% figure in the legacy sketch came from the macro sales model's win condition — it is not automatically the roofing tolerance. The roofing tolerance (value, and relative vs absolute framing) is an open item to set explicitly.

Spec: notes/Roofing/steps/10_calibration.md

Step 8 · Output gate SPEC

Bottom line: a sanity gate over the calibrated list that excludes recently reroofed and vendor-blind properties and dedups to one mailpiece per household. By design, it doubles as a model-health detector.

Hard rules

Rule	Threshold	Effect
Recent-roof exclusion	last_ROOFING_age < 180m (15 yr)	`delivery_eligible = false`, reason `RECENT_ROOF`. A conservative universal delivery floor — a roof with useful life left is not a good lead.
Very-recent alarm band	< ~36m (≈3 yr) + high score	Flag for model-health review — the sharpest model-error tell. Tracked separately from the 15-yr exclusion.
VENDOR_BLIND	coverage_decision ≠ INCLUDE	`delivery_eligible = false`, reason `VENDOR_BLIND` (Step 2 output).
Dedup	one mailpiece per household	Over delivery-eligible rows only. Dedup unit (household vs absentee multi-property owner) is an open question.

The gate is a model-health detector, and it never overwrites the score. The 15-year exclusion serves two purposes at once: it filters the delivery list, and a rising count of high-scored, recently reroofed properties means the model has stopped using roof_age correctly. In that case fix the model (Step 5 / 6), not the gate. The gate's firing rate is therefore a monitored metric (Step 9), not silent plumbing. calibrated_probability is kept intact for every row, always; the gate only sets delivery_eligible and delivery_reason_excluded. Overwriting the score to 0 / NULL would erase exactly the evidence that the catch ("model said 0.8, gate excluded for RECENT_ROOF") provides.
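A minimal sketch of the hard rules as a row-level function, showing that the score passes through untouched. Several details are assumptions: the `health_alarm` column name, the 0.5 "high score" cut-off, and the precedence of VENDOR_BLIND over RECENT_ROOF when both apply. Dedup is left out because its unit is still open.

```python
RECENT_ROOF_MONTHS = 180   # 15-year delivery floor from the rules table
ALARM_MONTHS = 36          # very-recent alarm band (approximate in the text)


def apply_gate(row, high_score=0.5):
    """Return a copy of the row with gate columns added; the score is never modified."""
    out = dict(row)
    age = row["last_ROOFING_age"]          # months since last roofing permit, or None
    reason = None
    if row["coverage_decision"] != "INCLUDE":
        reason = "VENDOR_BLIND"
    elif age is not None and age < RECENT_ROOF_MONTHS:
        reason = "RECENT_ROOF"
    out["delivery_eligible"] = reason is None
    out["delivery_reason_excluded"] = reason
    # Model-health tell: a very recent roof that still scored high.
    out["health_alarm"] = (
        age is not None
        and age < ALARM_MONTHS
        and row["calibrated_probability"] >= high_score
    )
    return out


gated = apply_gate({
    "property_id": "p1",
    "calibrated_probability": 0.8,
    "last_ROOFING_age": 24,
    "coverage_decision": "INCLUDE",
})
```

The resulting row keeps `calibrated_probability` at 0.8 alongside the exclusion reason, which is the evidence trail the gate is meant to preserve.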

Spec: notes/Roofing/steps/11_output_filter.md

Step 9 · Deploy + monitor SPEC

Bottom line: put the model into production, prove its value with a walk-forward backtest, monitor fast leading signals and slow ground truth separately, and retrain on a hybrid 90-day-floor / fast-signal-ceiling cadence.

The value metric

Metric	Definition	Why
Recall @ list size N	of all roof replacements in the window, the fraction on our list of N	the capture rate the client cares about
Lift over random	recall(model, N) ÷ recall(random, N)	the inflation-proof headline — recall alone is gamed by enlarging the list
Precision @ N	of the N delivered, the fraction that reroofed	the client works the list — wasted mailers matter
Gains curve	recall and precision across a sweep of N	one chart; the client picks N to their capacity
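The first three metrics in the table can be computed from a scored population and the set of properties that actually reroofed. The sketch below is illustrative: `list_metrics` is a hypothetical name, it assumes every positive is inside the scored population, and it uses N ÷ population size as the expected recall of a random list of size N.

```python
def list_metrics(scored, positives, n):
    """Recall, precision and lift over random for the top-n list.

    scored: list of (property_id, score) over the full scored population.
    positives: set of property_ids with a roof replacement in the label window.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    delivered = [pid for pid, _ in ranked[:n]]
    hits = sum(1 for pid in delivered if pid in positives)
    recall = hits / len(positives)
    precision = hits / len(delivered)
    random_recall = len(delivered) / len(scored)   # expected recall of a random list
    return {
        "recall": recall,
        "precision": precision,
        "lift": recall / random_recall,
    }


# Toy population of 10 properties with descending scores; three of them reroofed.
scored = [(f"p{i}", 1.0 - i / 10) for i in range(10)]
positives = {"p0", "p1", "p7"}
metrics = list_metrics(scored, positives, n=2)
gains_curve = [list_metrics(scored, positives, n) for n in range(1, len(scored) + 1)]
```

Sweeping `n` as in `gains_curve` gives the points of the gains chart; at full list size recall reaches 1 and lift falls back to 1, which is why lift rather than recall is the headline.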

The client-facing proof is one walk-forward fold told as a story: stand at a past T0, train only on data ≤ T0, generate the list we would have delivered, then check what actually happened over (T0, T0+6]. Honest only with T0 discipline (reconstruct what we actually would have produced, not a better label spec from today) and a settled window (permits lag; a too-recent window undercounts and makes recall look worse than it is). The baseline is a random list of equal size.

Monitoring — fast vs slow

Fast / leading (daily) — input-feature drift (PSI), score-distribution drift per region, and the output-gate firing rate (RECENT_ROOF + very-recent band; a rising count = model regression).
Slow / lagging (per cycle) — recall / precision / lift vs new permit data, once the 6-month window resolves. Ground truth is the roof permit, not the CRM (a property can reroof without becoming a CRM deal).
Score the full population each cycle — not just the delivered list — to avoid a closed feedback loop where the model reinforces its own blind spots.

Hybrid retraining. Floor — retrain every 90 days minimum (slow drift not caught daily). Ceiling — retrain when a fast leading signal fires: input-feature PSI > 0.25 or a sustained gate-firing-rate rise. Precision is a lagging confirmation (resolves ~6 months late), not a weekly trigger. Whichever fires first triggers a Step 6 re-run; Steps 1-5 re-run only on their own triggers.
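The hybrid trigger can be sketched as a PSI calculation plus a three-way check. This uses the standard population stability index over pre-binned shares; the binning, the `eps` guard against empty bins, and the way a "sustained" gate-rate rise is detected are all assumptions (the last is passed in as a ready-made boolean).

```python
import math

RETRAIN_FLOOR_DAYS = 90
PSI_THRESHOLD = 0.25


def psi(expected_shares, actual_shares, eps=1e-6):
    """Population stability index between two binned distributions (shares sum to 1)."""
    total = 0.0
    for e, a in zip(expected_shares, actual_shares):
        e, a = max(e, eps), max(a, eps)      # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(days_since_train, max_feature_psi, gate_rate_rising):
    """Whichever fires first: the 90-day floor or a fast leading signal."""
    return (
        days_since_train >= RETRAIN_FLOOR_DAYS
        or max_feature_psi > PSI_THRESHOLD
        or gate_rate_rising
    )


stable = psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
shifted = psi([0.5, 0.5], [0.9, 0.1])
decision = should_retrain(days_since_train=10, max_feature_psi=shifted, gate_rate_rising=False)
```

Precision does not appear as an input, in line with the text: it confirms a regression after the window resolves but does not trigger the re-run.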

Spec: notes/Roofing/steps/13_deploy_monitor.md

How it works

The conceptual field guide — the why, in plain language.

Model card

Scope, metrics, intended use, and limitations.

Rules

The hard business rules the delivery list obeys.

Rendered from the 9-step spine in notes/Roofing/PROGRESS_NOTEBOOK.html and the step specs notes/Roofing/steps/07–13 · status pills per the notebook (Steps 1-2 active, Steps 3-9 SPEC, audited 2026-05-21) · pilot numbers from the 3-county Hernando / Pasco / Pinellas enrichment run.