REI pipeline
The engineering spine of Apollo — from the monthly First American snapshot to a calibrated, county-ranked sale-probability list that replaces Alpha.
Apollo predicts a single thing: the probability a property sells within the next six months (y_sold = 1 iff a sale is recorded in T0+1..T0+6). It is the supervised replacement for Alpha, 8020REI's hand-weighted 25-signal heuristic, at step 4 of the Gaia ETL. This page is the engineering process — seven stages, each with one owning source and a status pill. The conceptual field guide lives at How it works; the performance table lives in the Model card. The sibling roof-replacement pipeline is at Roofing pipeline.
Apollo is in active build (status: building). Stages below are the as-designed spine; most carry a SPEC/WIP pill. The locked March-2025 head-to-head test is untouched.
Built on a 5-county pilot set: 12086 Miami-Dade FL · 04013 Maricopa AZ · 42101 Philadelphia PA · 48201 Harris TX · 29095 Jackson MO. Feature tiers, horizon, T0 convention and the modeling ladder all trace to notes/MASTER_PLAN.md; the hard rules below are from CLAUDE.md.
T0 is a YYYY-MM month-end boundary; every feature must be computable from the T0 snapshot alone, never from future data.

Internal property and owner state is already in the sandbox at (FIPS, PropertyId, Period) granularity. It is streamed from the First American silver layer in its Hudi period/FIPS layout and deduped at read by QUALIFY ROW_NUMBER() … ORDER BY _hoodie_commit_time DESC = 1. The feature builder reads only the as-of-T0 columns (base_globs = _globs([t0], fips)), so a row's state cannot draw on snapshots dated after its own T0.
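The dedupe-at-read rule can be illustrated in plain Python. This is a minimal sketch, not the code in src/new_model/data.py: the row dictionaries, the `dedupe_latest` and `as_of` helpers, the `avm` field and the sample commit times are all hypothetical. Only the key (FIPS, PropertyId, Period) and the ordering by `_hoodie_commit_time` come from the text, and commit times are assumed to be fixed-width strings that sort chronologically.

```python
def dedupe_latest(rows):
    """Keep one row per (FIPS, PropertyId, Period): the latest commit wins."""
    latest = {}
    for row in rows:
        key = (row["FIPS"], row["PropertyId"], row["Period"])
        kept = latest.get(key)
        # Assumption: commit times are fixed-width strings, so string
        # comparison matches chronological order.
        if kept is None or row["_hoodie_commit_time"] > kept["_hoodie_commit_time"]:
            latest[key] = row
    return list(latest.values())


def as_of(rows, t0):
    """Keep only rows from the T0 period, so later snapshots cannot leak in."""
    return [r for r in rows if r["Period"] == t0]


# Hypothetical sample: two commits for 2024-01 and one later period.
rows = [
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-01",
     "_hoodie_commit_time": "20240201", "avm": 400},
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-01",
     "_hoodie_commit_time": "20240215", "avm": 410},
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-02",
     "_hoodie_commit_time": "20240301", "avm": 415},
]
snapshot = as_of(dedupe_latest(rows), "2024-01")
```

Here `snapshot` holds a single row, the later of the two 2024-01 commits; the 2024-02 row is excluded because it postdates T0.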
src/new_model/data.py · src/new_model/feature_cache.py · src/new_model/address_normalizer.py · MASTER_PLAN §2 (data availability)

y_sold DONE

y_sold = 1 iff any sale is recorded in the 6-month window (T0, T0+6], else 0. Usable T0s stop at T_today − 6 months because labels for later T0s are not yet observable.

This phase predicts ANY sale. The arms-length document-type filter is intentionally OFF: every sale event counts as y_sold = 1 (foreclosures, quit-claims, probate, divorce included). This is scope, not an oversight (CLAUDE.md hard rule #3); re-running with the filter is a planned second pass once Eduardo signs off. Do not read this as a bug.
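The label rule can be sketched at month granularity. This is an illustrative sketch under stated assumptions, not the implementation in src/new_model/features.py: the helper names (`month_index`, `parse_t0`, `y_sold`, `latest_labelable_t0`) are hypothetical, and a sale dated anywhere inside the T0 month is treated as falling at or before T0, so it does not count.

```python
from datetime import date


def month_index(year, month):
    """Map a calendar month to a running integer so months can be subtracted."""
    return year * 12 + (month - 1)


def parse_t0(t0):
    """Parse a 'YYYY-MM' T0 string into a month index."""
    year, month = t0.split("-")
    return month_index(int(year), int(month))


def y_sold(t0, sale_dates, horizon_months=6):
    """Return 1 if any sale falls in months T0+1..T0+horizon, else 0.

    No document-type filter is applied: every sale event counts.
    """
    start = parse_t0(t0)
    for d in sale_dates:
        offset = month_index(d.year, d.month) - start
        if 1 <= offset <= horizon_months:
            return 1
    return 0


def latest_labelable_t0(today_t0, horizon_months=6):
    """Latest T0 whose full label window has already been observed."""
    idx = parse_t0(today_t0) - horizon_months
    return f"{idx // 12:04d}-{idx % 12 + 1:02d}"
```

For example, with T0 = 2024-01 a sale on 2024-07-31 is labeled 1 and a sale on 2024-08-01 is labeled 0, and with the current month at 2025-03 the latest labelable T0 is 2024-09.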
src/new_model/features.py (label window) · MASTER_PLAN §3 (label & horizon) · CLAUDE.md hard rule #3

| Tier | Family | Note |
|---|---|---|
| A | Property physical | Parcels, size, use type, year built. |
| B | Owner + distress | 23 distress trajectories, absentee level, leverage. The behavioral spine. |
| C | Valuation + activity | AVM, appreciation, days-ownership, listing activity. |
| D | Date-diffs | listing_duration_months, months_since_prev_sale, mortgage_age_months — UNDER LEAKAGE AUDIT. |
| E | National macro (FRED) | Mortgage rate, Fed funds, HPI, CPI, unemployment. Same value for every property within a T0 month. |
| F | Local market | BLS county unemployment, ACS income, FHFA state HPI. Being wired in. |
Three Tier-D features are under active leakage audit — listing_duration_months, months_since_prev_sale, mortgage_age_months (CLAUDE.md hard rule #7). Do not cite their contribution as validated until the ablation closes. The exact feature count differs across pages — see finding 45 and the model card for the curated set rather than asserting one number here.
src/new_model/features.py · src/new_model/macro.py (Tier E) · src/new_model/micro.py + src/new_model/demographics.py (Tier F) · MASTER_PLAN §4 · CLAUDE.md §7

Horizon H = 6 months; the eval window is embargoed by horizon_months so train-label and eval windows do not overlap. Finding 32 caught the original layout leaking a 5-month label-window overlap (eval started one month after train-end with H=6); the default now pushes eval start to train_t0s[-1] + horizon + 1 month. Six walk-forward folds run on the 5-county set; the seventh fold is the locked March-2025 test.
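The embargo arithmetic can be checked with a few lines of month math. This is a sketch under assumptions, not the code in src/new_model/walkforward.py: `add_months`, `label_window`, `eval_start` and `overlap_months` are hypothetical helpers, and the train T0s are made up for illustration.

```python
def add_months(t0, n):
    """Shift a 'YYYY-MM' string by n months."""
    year, month = t0.split("-")
    idx = int(year) * 12 + (int(month) - 1) + n
    return f"{idx // 12:04d}-{idx % 12 + 1:02d}"


def label_window(t0, horizon_months=6):
    """Months T0+1..T0+horizon, i.e. the window (T0, T0+H]."""
    return {add_months(t0, k) for k in range(1, horizon_months + 1)}


def eval_start(train_t0s, horizon_months=6):
    """Embargoed eval start: last train T0 + horizon + 1 month."""
    return add_months(train_t0s[-1], horizon_months + 1)


def overlap_months(train_t0, eval_t0, horizon_months=6):
    """Number of months shared by the two label windows."""
    shared = label_window(train_t0, horizon_months) & label_window(eval_t0, horizon_months)
    return len(shared)


# Hypothetical fold: train ends at 2023-12.
train_t0s = ["2023-10", "2023-11", "2023-12"]
leaky = overlap_months(train_t0s[-1], add_months(train_t0s[-1], 1))
embargoed_start = eval_start(train_t0s)
clean = overlap_months(train_t0s[-1], embargoed_start)
```

With eval starting one month after train-end the two label windows share 5 months, which matches the overlap reported in Finding 32; with the embargoed start (2024-07 in this example) they share none.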
src/new_model/walkforward.py (Fold defs, embargo, dedupe_contaminated_property_ids) · scripts/train_fold.py · MASTER_PLAN §3

The modeling ladder (MASTER_PLAN §6) climbs three rungs: logistic regression (interpretable floor) → LightGBM (production candidate) → state-stratified / hierarchical, each with a go/no-go gate over Alpha. HistGB is the current default per the ablation; treat the matrix as the authority for which architecture wins each county rather than a single verdict here. Class weighting is none: the table arrives downsampled 1:K, and prior-correction back to the true base rate is the calibration step's job (stage 6).
scripts/train_fold_arch.py (lgbm / logistic / histgb / random_forest) · scripts/train_fold.py · MASTER_PLAN §6 (modeling ladder)

A trained ranker is not a forecaster: raw boosted scores rank well but do not read as calibrated probabilities. Calibration maps scores → probabilities against a held-out, non-downsampled, true-prior slice so the prior-correction the downsampling deferred is applied here. The ±15% calibration tolerance is the macro-model win-condition target (MASTER_PLAN §12), not a measured result; Apollo is still in build.
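One common form of prior-correction is shown below. It is a sketch, not necessarily what scripts/train_fold.py does: it assumes every positive is kept and each negative is kept with probability `beta` (a hypothetical name for the downsampling rate). Under that assumption the sampled odds equal the true odds divided by `beta`, so the true probability is `beta * p / (beta * p + 1 - p)`. The isotonic step named in the source is not reproduced here.

```python
def prior_correct(p_sampled, beta):
    """Map a probability learned on negative-downsampled data back to the true prior.

    Assumption: all positives kept, negatives kept with probability beta (0 < beta <= 1).
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    return beta * p_sampled / (beta * p_sampled + 1.0 - p_sampled)


# Hypothetical scores from a model trained on the downsampled table.
sampled_scores = [0.10, 0.50, 0.90]
corrected = [prior_correct(p, beta=0.1) for p in sampled_scores]
```

The mapping is monotone, so it changes the probabilities without changing the within-county ranking; a sampled score of 0.5 with `beta = 0.1` corrects to 1/11, about 0.091.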
src/new_model/features.py + scripts/train_fold.py (isotonic + prior-correction) · MASTER_PLAN §6 / §12

Cross-county scores are not directly comparable: each county is ranked against itself, so a sidecar (per-county base rate / calibration context) is required before any cross-county comparison. The Alpha head-to-head is read at the top of each county's ranked list. The win condition (top-decile recall ≥ Alpha and ≥ Camilo, with 30/60/90-day calibration within ±15%) is a target on the locked March-2025 test, which remains untouched until Eduardo + Camilo sign off in writing (CLAUDE.md hard rule #4).
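Per-county ranking and the top-decile read can be sketched as follows. This assumes top-decile recall means the share of a county's actual sales that land in the top 10% of that county's own ranked list; the function name, the tuple layout and the sample scores are hypothetical, not taken from scripts/train_fold.py.

```python
import math
from collections import defaultdict


def top_decile_recall_by_county(rows):
    """rows: iterable of (fips, score, y_sold). Returns {fips: recall in that county's top 10%}."""
    by_county = defaultdict(list)
    for fips, score, y in rows:
        by_county[fips].append((score, y))

    recall = {}
    for fips, items in by_county.items():
        # Rank each county against itself, highest score first.
        items.sort(key=lambda t: t[0], reverse=True)
        k = max(1, math.ceil(len(items) / 10))
        total_sales = sum(y for _, y in items)
        hits = sum(y for _, y in items[:k])
        recall[fips] = hits / total_sales if total_sales else float("nan")
    return recall


# Hypothetical data: the two counties sit on very different score scales,
# which is why each one is ranked separately.
rows_a = [("12086", 1.0 - i / 10, 1 if i in (0, 4) else 0) for i in range(10)]
rows_b = [("04013", (20 - i) / 1000, 1 if i in (0, 1) else 0) for i in range(20)]
recall = top_decile_recall_by_county(rows_a + rows_b)
```

In this toy data the first county captures one of its two sales in its top decile (recall 0.5) and the second captures both (recall 1.0), even though every score in the second county is lower than every positive score in the first.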
src/new_model/alpha.py (incumbent reproduction) · scripts/train_fold.py (per-county ranking) · MASTER_PLAN §12 (success criteria)

Status, honestly
Apollo is building, not shipped. The locked March-2025 head-to-head test is GATED: it stays untouched until Eduardo + Camilo sign off in writing (hard rule #4). The win condition is a target, not a result. Three Tier-D date-diff features are under active leakage audit (hard rule #7). The arms-length filter is intentionally OFF this phase: scope, not a bug (hard rule #3). Stages 1–6 are active workstreams; stage 7's cross-county sidecar is SPEC. Read the Rules page for the hard business rules in full, and the REI overview for the model surface.
Spine from notes/MASTER_PLAN.md + src/new_model + CLAUDE.md conventions · REI findings 1–54 · status: building.