Roofing pipeline — end to end
The nine MECE steps that turn raw permits into a calibrated, gated roof-replacement lead list. Each step has a locked contract; each box below is faithful to its spec.
The roof-replacement model is built as nine mutually-exclusive, collectively-exhaustive steps: Labeling → Coverage → Enrichment → Folds → Features → Training → Calibration → Output gate → Deploy. Every step has one owning spec and a status pill. This is the engineering pipeline — the conceptual field guide lives at How it works; the upstream permit ETL is documented in the BuildZoom walkthrough.
Labeling is locked before coverage — deliberately. Coverage asks "what fraction of the SFH stock has at least one valid roof permit?" Its numerator depends entirely on the definition of "valid roof permit" — and that definition is the labeling step. So labeling has to be frozen first; coverage is strictly downstream of it (decided 2026-05-20).
Status, honestly. Steps 1-2 are active workstreams (Labeling DONE, Coverage WIP). Steps 3-9 are SPEC — designed and audited 2026-05-21 (per-step + integration), implementation not started. Step 3 has a completed 3-county pilot (Hernando / Pasco / Pinellas) but its national spec is still SPEC. Numbers below that come from that pilot are marked as such; do not read specs as shipped.
The nine steps
Step 1 · Labeling DONE
The foundation. Every permit is classified into a permit_scope — a list of (type, action) structs — so downstream code derives is_roofing / roof_action and never re-classifies. The positive label the model is trained on is a roof replacement: permit_action ∈ {REPLACEMENT, AMBIGUOUS}; REPAIR-only permits are not positives. The canonical event_date = MIN(non-null status dates) is capped at today + 90d to defuse future-date corruption. Labeling is locked first because coverage's numerator depends on this definition.
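A minimal sketch of the two Step 1 derivations described above, not the locked implementation. The function names, the tuple shape of permit_scope, and the "ROOFING" type string are assumptions for illustration. The cap is read here as clamping to today + 90 days rather than dropping the row.

```python
from datetime import date, timedelta

# REPAIR-only permits are not positives.
REPLACEMENT_ACTIONS = {"REPLACEMENT", "AMBIGUOUS"}

def canonical_event_date(status_dates, today):
    """MIN of the non-null status dates, clamped to today + 90 days."""
    present = [d for d in status_dates if d is not None]
    if not present:
        return None  # no usable date on this permit
    return min(min(present), today + timedelta(days=90))

def is_roof_replacement(permit_scope):
    """permit_scope is assumed to be a list of (type, action) pairs."""
    return any(
        ptype == "ROOFING" and action in REPLACEMENT_ACTIONS
        for ptype, action in permit_scope
    )
```

Downstream code would call these once at labeling time and read the stored result, in line with the rule that nothing downstream re-classifies.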
Full detail → Step 1 — Labeling and Permit classification. The taxonomy is documented there, not duplicated here.
Step 2 · Coverage WIP
For every (FIPS, jurisdiction, fa_muni) tuple, produce a coverage_decision ∈ {INCLUDE, EXCLUDE, FLAG}. Only INCLUDE tuples feed training and delivery. False negatives are the dangerous failure mode, so when in doubt, EXCLUDE. The decision is, per the locked Step 4 contract, a function of fold T0 — applying today's INCLUDE set to a past fold would leak future permits into that fold's universe.
Full detail → Step 2 — Coverage. The address-match mechanics that feed the diagnostic are in How we match.
Step 3 · Enrichment SPEC
Bottom line: build the enriched property-state table every training and scoring row reads — one row per (property_id, T0) pair, describing the property state as-of T0. This is where the leakage discipline lives.
The training unit is the pair (property_id, T0), not "a property with an event". The same T0 is shared by a positive and the negatives matched to it. Features describe the state as-of T0 plus trend anchors before it; the label (Step 4) is computed by looking forward from T0. The event date never sets the feature anchor — it is only ever read forward to set the label.
Anchor set
| Anchor | Role |
|---|---|
| T0 | Primary — level state at the prediction moment. |
| T0−3 | Short trend, Δ(T0, T0−3). |
| T0−6 | Medium trend, Δ(T0, T0−6). |
| T0−12 (optional) | Long trend / seasonal baseline. Drop if permutation importance is flat. |
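The anchor set can be sketched as plain month arithmetic. This assumes anchors are month-start dates and that a trend is the level at T0 minus the level k months earlier; both conventions, and the function names, are illustrative rather than taken from the spec.

```python
from datetime import date

def shift_months(d, months):
    """Move a month-start date by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)

def anchor_dates(t0, include_long=False):
    """Map month offset k to the anchor date T0-k."""
    offsets = [0, 3, 6] + ([12] if include_long else [])
    return {k: shift_months(t0, -k) for k in offsets}

def trend(values_by_offset, k):
    """Delta(T0, T0-k): level at T0 minus level k months earlier."""
    return values_by_offset[0] - values_by_offset[k]
```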
Two halves, both reading ≤ T0
- Local: FA physical (year_built, property_age, derived roof_age, living / lot / building sqft) plus prior-permit history self-joined strictly ≤ T0 (last_ROOFING_age, n_roof_REPAIR_24m, n_HVAC_24m, n_SOLAR_24m).
- Behavioral: the FA / Silver REM feature vocabulary (the data that fed Apollo, not the Apollo model): 23 distress trajectories + count, owner / occupancy, point-in-time valuation, leverage (CLTV / LTV), recency.
The leakage rule (locked). "Past data" is not automatically safe data. A feature is leaky if it is dated after that row's T0, even when it is still in the past relative to today. Permit features are filtered to event_date ≤ T0. The earlier "T-1 too close" framing was wrong: a refi one month pre-permit is a real, serve-time-available signal — not leakage. True leakage is a post-T0 feature or an FA snapshot vintage backfill. There is no artificial freshest-month exclusion.
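A sketch of the rule as a filter plus a violation count, assuming each permit carries an event_date and each feature row records the dates of the sources it read (the source_dates field is hypothetical). Note that a record one month before T0 passes: the rule excludes only what is dated after the row's own T0.

```python
from datetime import date

def permits_as_of(permits, t0):
    """Keep only permits dated on or before this row's T0."""
    return [p for p in permits if p["event_date"] <= t0]

def count_leakage_violations(rows):
    """A row violates the rule if any source it read is dated after its T0."""
    return sum(
        1 for r in rows
        if any(d > r["t0"] for d in r["source_dates"])
    )
```

A pipeline would typically assert that the violation count is 0 before writing the enriched table.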
Open / under audit. The FA vintage audit (are monthly snapshots point-in-time, no backfill?) is still open — treat near-T0 features with caution until verified. Three reused features — months_since_last_sale, listing_duration_months, mortgage_age_months — are under the CLAUDE.md §7 leakage audit and inherit it here; do not cite their contribution as validated until the ablation closes.
The 3-county pilot (Hernando / Pasco / Pinellas, 2026-05-21) enriched 497,856 coverage-INCLUDE SFH across 7 fold anchors (2022-05 → 2025-05), leakage-clean with a 0-violation assertion on event_date ≤ T0. REM-matched 96–99%; any-distress 6–10%; 70–76% carry a prior whole-roof permit; median AVM $396K. National implementation has not started.
Spec: notes/Roofing/steps/12_as_of_enrichment.md
Step 4 · Folds & negative sampling SPEC
Bottom line: turn the enriched (property_id, T0) rows into walk-forward train / eval folds with case-control negative sampling — laid out from the most recent data backward so no fold ever trains on its own future.
Locked parameters (2026-05-21)
| Lever | Value | Note |
|---|---|---|
| Horizon H | 6 months | Label window (T0, T0+6] — matches the reroof process (~3-6 mo) plus the post-hurricane permit wave. |
| Embargo | 6 months (= H) | Gap between newest train label window and the eval window. Must equal H — a shorter embargo (the prior 3-month value) left label windows overlapping; that was a bug. |
| Fold cadence | 6 months | Eval windows step back 6 mo → disjoint, non-overlapping → honest variance estimate. |
| Negatives per positive (K) | TBD | Measure the base rate first. Matched on FIPS + T0. K recorded in the manifest for Step 7 prior-correction. |
Why walk-forward
Each fold simulates a real prediction made at its eval anchor E_x = latest_obs − 6·x, training on an expanding window of rows whose label windows end at or before E_x — realising the 6-month embargo. Eval windows are disjoint and adjacent, so each fold is an independent estimate. The sandbox (2021-01 onward) yields roughly 6-8 folds; fold x=1 is the most production-like. latest_obs itself is not a validatable fold (its label window runs past the data) — it is the production scoring point.
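The fold layout above can be sketched as follows. The function name, the dictionary keys, and the stopping rule (stop once no training T0 fits inside the sandbox) are assumptions; the arithmetic follows E_x = latest_obs − 6·x with training limited to rows whose label window (T0, T0+6] ends at or before E_x.

```python
from datetime import date

H = 6  # horizon = embargo = fold cadence, in months

def shift_months(d, months):
    """Move a month-start date by a whole number of months."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)

def walk_forward_folds(latest_obs, earliest_obs):
    """Lay folds out from the most recent data backward, starting at x = 1."""
    folds, x = [], 1
    while True:
        eval_anchor = shift_months(latest_obs, -H * x)  # E_x
        # A training row needs T0 + H <= E_x, so its T0 is at most E_x - H.
        train_max_t0 = shift_months(eval_anchor, -H)
        if train_max_t0 < earliest_obs:
            break  # nothing left to train on
        folds.append({
            "x": x,
            "eval_anchor": eval_anchor,
            "eval_window_end": shift_months(eval_anchor, H),
            "train_max_t0": train_max_t0,
        })
        x += 1
    return folds
```

x = 0 is skipped on purpose: its label window would run past latest_obs, which matches the note that latest_obs is the production scoring point rather than a fold.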
Case-control negative sampling
- A negative = a (property_id, T0) pair with no qualifying roof-replacement event in (T0, T0+6].
- Per positive, draw K negatives at the same T0 and FIPS, which removes county and calendar confounders. Do not match on anything that is also a predictive feature (roof age, distress); that nulls the signal.
- The negative pool is the full coverage-INCLUDE SFH universe, evaluated as of each fold's own T0 (not one static national set).
- A property supplies positive rows for the T0s leading to its event and negative rows for other T0s — a discrete-time panel. Dedup is enforced with a ±embargo so a row never appears on both sides of a fold boundary.
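The matching rule in the bullets above can be sketched like this. Row shapes, field names, and the choice to sample without replacement across positives are assumptions; K itself is still TBD, so it is a parameter here.

```python
import random
from collections import defaultdict

def sample_negatives(positives, negative_pool, k, seed=0):
    """Draw up to k negatives per positive from the same (FIPS, T0) cell."""
    rng = random.Random(seed)
    pool_by_cell = defaultdict(list)
    for row in negative_pool:
        pool_by_cell[(row["fips"], row["t0"])].append(row)
    for cell in pool_by_cell.values():
        rng.shuffle(cell)  # popping from a shuffled list is a random draw

    sampled = []
    for pos in positives:
        cell = pool_by_cell.get((pos["fips"], pos["t0"]), [])
        for _ in range(min(k, len(cell))):
            neg = cell.pop()  # without replacement across positives
            sampled.append({**neg, "label": 0, "matched_to": pos["property_id"]})
    return sampled
```

Matching uses only FIPS and T0. Roof age, distress, and other predictive features are deliberately absent from the cell key.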
Baseline = a random list of equal size. The held-out evaluation (Step 9) measures lift over a random list on a walk-forward backtest. Roofing does not compete against Alpha — Alpha is the REI sales heuristic, a different model with a different target; the "March 2025 vs Alpha" gate in CLAUDE.md belongs to the macro sales model, not here. Open item: pick a backtest window old enough that its permit data is fully reported.
Spec: notes/Roofing/steps/07_walk_forward_folds.md
Step 5 · Features SPEC
Bottom line: build clean, leakage-free, low-dimensional columns on the Step 3 anchors — two families, Internal (property + owner) and External (market + environment) — and let the gradient-boosted model learn the interactions itself.
Internal × external families
| Internal (property-keyed, as-of T0) | External (split by within-cell variability) |
|---|---|
| Property physical: year_built, property_age, derived roof_age, living_area, use_type, roof_material. | Useful (varies within (FIPS, T0)): sub-county neighbour permit density, parcel-level storm exposure, parcel / sub-county insurance-claim signal. |
| Owner state: tenure_months, is_llc, months_since_last_sale, sales_in_prev_24m. | Wasted for within-cell ranking (constant within (FIPS, T0)): county-level macro (FRED mortgage rate, fed funds, HPI, CPI, unemployment), county-level "hurricane happened" flag. These do not go in the model, because case-control sampling removes their association with the label. They feed the base rate at calibration / threshold tuning (Step 7 / 9). |
| Distress: collapsed to distress_count_active + one trend + a short list of the most-predictive flags, not all 23 × 4 stats × 3 anchors. | |
| Valuation / equity: AVM, CLTV, equity_$. | |
| Prior permit history (self-join ≤ T0): last_ROOFING_age, n_roof_REPAIR_24m, n_HVAC_24m, n_SOLAR_24m. | |
Selection — expert pre-filter, then let the model judge
- Expert pre-filter (floor only) — drop obvious non-signal (IDs, free text) and no-variability columns (constant, >99% one value, >99% null). Do not pre-drop the "plausible but I doubt it" bucket.
- Permutation importance — train a fast first model, rank columns by how much the metric degrades when each is shuffled, drop importance ≈ 0, retrain. This is multivariate — it captures interactions, unlike univariate correlation filtering.
- The distress family (23 trajectories × 4 stats × 3 anchors = 276 columns) is the main explosion risk: collapse it at the pre-filter, and drop the 2nd-diff "acceleration" for the first model.
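To make the permutation-importance step concrete, here is a small pure-Python sketch. The toy_predict function stands in for a fitted model, accuracy stands in for whatever ranking metric the spec settles on, and all names are illustrative. The toy model fires only on an interaction (old roof and recent storm), which is the case a univariate filter would miss.

```python
import random

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, labels, metric, n_repeats=5, seed=0):
    """Mean drop in the metric when one column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = metric(labels, [predict(r) for r in rows])
    importance = {}
    for col in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [{**r, col: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - metric(labels, [predict(r) for r in permuted]))
        importance[col] = sum(drops) / n_repeats
    return importance

# Toy stand-in for a fitted model: positive only on old roof AND recent storm.
def toy_predict(row):
    return int(row["old_roof"] and row["recent_storm"])

rows = [
    {"old_roof": i % 2, "recent_storm": (i // 2) % 2, "row_id": i}
    for i in range(40)
]
labels = [toy_predict(r) for r in rows]
importance = permutation_importance(toy_predict, rows, labels, accuracy)
```

Here row_id gets an importance of exactly 0 because the model never reads it, so it would be dropped; both interacting columns keep a positive importance even though neither predicts the label well on its own.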
Two locked feature rules. (1) The replacement-window / roof-age signal is central: roof_age alone is weak, but roof_age × months_since_hurricane (old roof + recent storm) is strong — which is exactly why selection must be multivariate, not univariate. (2) Neighbour permit density is a leakage trap — the rolling window must end at T0, exclude the subject property, and count only other parcels' permits; a window crossing T0 leaks the label. Hurricane / storm exposure must be parcel- or sub-county-granular to carry signal under T0-matched case-control; a county-month flag is wasted. The highest-value open gap is a sub-county insurance-claim feed (source TBD). Skip hand-coded interactions for the first model — the trees learn them.
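A sketch of a leakage-safe neighbour density count under the three constraints in rule (2). The function signature is hypothetical, and the window length and the definition of "neighbour" are left to the caller because the source does not fix them.

```python
from datetime import date

def neighbour_permit_density(permits, subject_id, neighbour_ids, window_start, t0):
    """Count other parcels' permits in the window (window_start, t0]."""
    return sum(
        1 for p in permits
        if p["property_id"] != subject_id           # exclude the subject property
        and p["property_id"] in neighbour_ids       # only parcels in the neighbourhood
        and window_start < p["event_date"] <= t0    # window ends at T0, never after
    )
```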
Spec: notes/Roofing/steps/08_synthetic_features.md
Step 6 · Training SPEC
Bottom line: train one global gradient-boosted model that ranks properties by roof-replacement probability over the next 6 months — with FIPS as a feature, not a per-county model.
Architecture defaults
| Choice | Default | Why |
|---|---|---|
| Algorithm | LightGBM | Handles a high-cardinality FIPS categorical natively. Confirm against HistGB / logistic on a roofing-specific arch sweep — the prior arch matrix was the sales model; do not assume HistGB wins here. |
| Scope | Global + FIPS feature | Reroof is a rare event; a per-FIPS model on ~100-300 positives re-learns "old roof matters" from thin data. A GBM carves FIPS-specific behaviour where rows allow and falls back to the global pattern otherwise — automatic partial pooling. |
| Class weighting | None | The training table arrives downsampled 1:K from Step 4. Step 6 does not re-sample or re-weight — prior correction back to the true base rate is Step 7's job. |
| Early stopping | Validation slice carved from train | Time-separated with embargo — never the fold eval window (that would tune to the eval metric). |
| Hyperparameters | Tune once on pooled data | Fix and reuse. No fresh 50-trial search per FIPS — that overfits thin per-FIPS data. |
True-prior handling. The Step 4 eval set is case-control 1:K, so its base rate is not the real-world rate. AUC-PR read off it is inflated; top-decile recall (the business metric) must be computed on a holdout with the true population base rate, not the downsampled eval. AUC-ROC is base-rate-invariant and useful for cross-fold comparison. A per-FIPS carve-out is justified only when evidence beats the global+FIPS model for that metro — the row-count threshold is an open question.
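The base-rate effect can be checked with a few lines of arithmetic. At a fixed score threshold, the true-positive and false-positive rates do not depend on prevalence (which is why AUC-ROC is base-rate-invariant), but precision does. The rates and the 0.3% prevalence below are illustrative numbers, not measurements.

```python
def precision_at_threshold(tpr, fpr, prevalence):
    """Precision implied by fixed TPR / FPR at a given positive rate."""
    true_pos = prevalence * tpr
    false_pos = (1 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)

K = 4                                   # illustrative; the real K is TBD
sample_prevalence = 1 / (K + 1)         # positive rate in the 1:K eval set
true_prevalence = 0.003                 # illustrative real-world rate

inflated = precision_at_threshold(0.6, 0.1, sample_prevalence)
realistic = precision_at_threshold(0.6, 0.1, true_prevalence)
```

The same classifier looks far more precise on the downsampled set than it would on the real population, which is why precision-type metrics need a true-prior holdout.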
Spec: notes/Roofing/steps/09_model_training.md
Step 7 · Calibration SPEC
Bottom line: turn the raw ranker scores into probabilities that mean what they claim — "this property has an X% chance of a roof replacement in the next 6 months" — corrected back to the true base rate.
A trained ranker is not a forecaster: raw GBM scores rank well but do not read as true probabilities. Calibration maps scores → probabilities against held-out outcomes.
Method defaults
| Choice | Default | Why |
|---|---|---|
| Method | Platt (sigmoid) | 2 parameters, stable on thin data; a well-trained GBM's miscalibration is largely monotonic. Isotonic is data-hungry and overfits small samples — use only where the calibration sample is large. |
| Scope | Global | Matches the global model. Per-FIPS isotonic on ~1,421 counties starves on data; split by region only if markets genuinely differ. |
| Hold-out | True-prior sample | Drawn from the full population at its real positive rate — not a slice of the 1:K training pool. Recent, time-separated from eval and the held-out final test. |
Critical — calibrate against the TRUE base rate. The model's raw scores reflect a sample prior of ~1/(K+1), not the rare real-world rate. Calibrating against a slice of the 1:K pool produces probabilities wrong by one to two orders of magnitude — "23%" when the truth is ~0.3%. This is the prior-correction Steps 4 (record K) and 6 defer here: either calibrate on a true-prior hold-out, or apply an explicit log-odds correction logit(p_true) = logit(p_model) + log(π_true/(1−π_true)) − log(π_sample/(1−π_sample)). π_true must be measured, not assumed. If a region fails calibration, downgrade it to "ranker only — no probability claim".
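The explicit log-odds correction above translates directly into code. This is a sketch of the formula as written; the function names are illustrative, and pi_true must come from a measured base rate, as the text requires.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def prior_correct(p_model, pi_true, pi_sample):
    """logit(p_true) = logit(p_model) + logit(pi_true) - logit(pi_sample)."""
    return sigmoid(logit(p_model) + logit(pi_true) - logit(pi_sample))

def sample_prior(k):
    """Positive rate of a 1:K case-control sample."""
    return 1 / (k + 1)
```

A useful sanity check is that a score equal to the sample prior maps to the true prior, and that the correction is the identity when the two priors are equal.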
Tolerance is open. The ±15% figure in the legacy sketch came from the macro sales model's win condition — it is not automatically the roofing tolerance. The roofing tolerance (value, and relative vs absolute framing) is an open item to set explicitly.
Spec: notes/Roofing/steps/10_calibration.md
Step 8 · Output gate SPEC
Bottom line: a sanity gate over the calibrated list that excludes recently-reroofed and vendor-blind properties, dedups to one mailpiece per household and, by design, doubles as a model-health detector.
Hard rules
| Rule | Threshold | Effect |
|---|---|---|
| Recent-roof exclusion | last_ROOFING_age < 180m (15 yr) | delivery_eligible = false, reason RECENT_ROOF. A conservative universal delivery floor — a roof with useful life left is not a good lead. |
| Very-recent alarm band | < ~36m (≈3 yr) + high score | Flag for model-health review — the sharpest model-error tell. Tracked separately from the 15-yr exclusion. |
| VENDOR_BLIND | coverage_decision ≠ INCLUDE | delivery_eligible = false, reason VENDOR_BLIND (Step 2 output). |
| Dedup | one mailpiece per household | Over delivery-eligible rows only. Dedup unit (household vs absentee multi-property owner) is an open question. |
The gate is a model-health detector, and it never overwrites the score. The 15-year exclusion serves two purposes at once: it filters the delivery list, and it exposes model regression. A rising count of high-scored recently-reroofed properties means the model has stopped using roof_age correctly; in that case fix the model (Step 5 / 6), not the gate. So the gate's firing rate is a monitored metric (Step 9), not silent plumbing. calibrated_probability is kept intact for every row, always; the gate only sets delivery_eligible and delivery_reason_excluded. Overwriting the score to 0 / NULL would erase exactly the evidence that the catch ("model said 0.8, gate excluded for RECENT_ROOF") provides.
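A sketch of the hard rules as a row-level function. The precedence between VENDOR_BLIND and RECENT_ROOF, the model_health_flag field, and the high_score cutoff are assumptions; dedup is left out because its unit is still an open question.

```python
RECENT_ROOF_MONTHS = 180  # 15-year delivery floor
ALARM_BAND_MONTHS = 36    # very-recent alarm band (about 3 years)

def apply_output_gate(row, high_score=0.5):
    """Set delivery fields only; calibrated_probability is never modified."""
    out = dict(row)
    age = row.get("last_ROOFING_age")  # months; None if no prior roofing permit

    reason = None
    if row["coverage_decision"] != "INCLUDE":
        reason = "VENDOR_BLIND"
    elif age is not None and age < RECENT_ROOF_MONTHS:
        reason = "RECENT_ROOF"

    out["delivery_eligible"] = reason is None
    out["delivery_reason_excluded"] = reason
    # Tracked separately from the exclusion: a high score on a very new roof.
    out["model_health_flag"] = (
        age is not None
        and age < ALARM_BAND_MONTHS
        and row["calibrated_probability"] >= high_score
    )
    return out
```

The firing rate that Step 9 monitors would then be the share of scored rows with reason RECENT_ROOF, plus the share with the model-health flag set.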
Spec: notes/Roofing/steps/11_output_filter.md
Step 9 · Deploy + monitor SPEC
Bottom line: put the model into production, prove its value with a walk-forward backtest, monitor fast leading signals and slow ground truth separately, and retrain on a hybrid 90-day-floor / fast-signal-ceiling cadence.
The value metric
| Metric | Definition | Why |
|---|---|---|
| Recall @ list size N | of all roof replacements in the window, the fraction on our list of N | the capture rate the client cares about |
| Lift over random | recall(model, N) ÷ recall(random, N) | the inflation-proof headline — recall alone is gamed by enlarging the list |
| Precision @ N | of the N delivered, the fraction that reroofed | the client works the list — wasted mailers matter |
| Gains curve | recall and precision across a sweep of N | one chart; the client picks N to their capacity |
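The first three metrics in the table can be computed from one scored population. This sketch assumes a list of (score, reroofed) pairs covering every scored property, and uses the fact that a random list of size N is expected to capture N / population of the positives.

```python
def metrics_at_n(scored, n):
    """scored: (score, reroofed) pairs over the full scored population."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y for _, y in scored)
    if total_positives == 0 or n == 0:
        raise ValueError("need at least one positive and a non-empty list")
    hits = sum(y for _, y in ranked[:n])
    recall = hits / total_positives
    random_recall = n / len(scored)  # expected recall of a random list of size N
    return {
        "recall": recall,
        "precision": hits / n,
        "lift": recall / random_recall,
    }

def gains_curve(scored, list_sizes):
    """Recall and precision across a sweep of N."""
    return {n: metrics_at_n(scored, n) for n in list_sizes}
```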
The client-facing proof is one walk-forward fold told as a story: stand at a past T0, train only on data ≤ T0, generate the list we would have delivered, then check what actually happened over (T0, T0+6]. Honest only with T0 discipline (reconstruct what we actually would have produced, not a better label spec from today) and a settled window (permits lag; a too-recent window undercounts and makes recall look worse than it is). The baseline is a random list of equal size.
Monitoring — fast vs slow
- Fast / leading (daily) — input-feature drift (PSI), score-distribution drift per region, and the output-gate firing rate (RECENT_ROOF + very-recent band; a rising count = model regression).
- Slow / lagging (per cycle) — recall / precision / lift vs new permit data, once the 6-month window resolves. Ground truth is the roof permit, not the CRM (a property can reroof without becoming a CRM deal).
- Score the full population each cycle — not just the delivered list — to avoid a closed feedback loop where the model reinforces its own blind spots.
Hybrid retraining. Floor: retrain at least every 90 days (this catches slow drift that the daily signals miss). Ceiling: retrain sooner when a fast leading signal fires, meaning input-feature PSI > 0.25 or a sustained gate-firing-rate rise. Precision is a lagging confirmation (it resolves ~6 months late), not a weekly trigger. Whichever fires first triggers a Step 6 re-run; Steps 1-5 re-run only on their own triggers.
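A sketch of the hybrid trigger. PSI is computed with the standard formula over binned proportions, sum of (actual − expected) · ln(actual / expected). The binning, the epsilon guard for empty bins, and passing "sustained gate-rate rise" in as a precomputed boolean are assumptions, since the source does not define them.

```python
import math
from datetime import date

def psi(expected_props, actual_props, eps=1e-6):
    """Population stability index over matching bins of proportions."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

def should_retrain(today, last_trained, max_feature_psi, gate_rate_sustained_rise):
    """Retrain on the 90-day schedule or when a fast signal fires, whichever is first."""
    schedule_due = (today - last_trained).days >= 90
    fast_signal = max_feature_psi > 0.25 or gate_rate_sustained_rise
    return schedule_due or fast_signal
```

Lagging metrics such as precision are intentionally not inputs here; they confirm a regression after the fact rather than trigger the retrain.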
Spec: notes/Roofing/steps/13_deploy_monitor.md
Related
How it works
The conceptual field guide — the why, in plain language.
Model card
Scope, metrics, intended use, and limitations.
Rules
The hard business rules the delivery list obeys.
Rendered from the 9-step spine in notes/Roofing/PROGRESS_NOTEBOOK.html and the step specs notes/Roofing/steps/07–13 · status pills per the notebook (Steps 1-2 active, Steps 3-9 SPEC, audited 2026-05-21) · pilot numbers from the 3-county Hernando / Pasco / Pinellas enrichment run.