REI pipeline

The engineering spine of Apollo — from the monthly First American snapshot to a calibrated, county-ranked sale-probability list that replaces Alpha.

Apollo predicts a single thing: the probability a property sells within the next six months (y_sold = 1 iff a sale is recorded in T0+1..T0+6). It is the supervised replacement for Alpha, 8020REI's hand-weighted 25-signal heuristic, at step 4 of the Gaia ETL. This page is the engineering process — seven stages, each with one owning source and a status pill. The conceptual field guide lives at How it works; the performance table lives in the Model card. The sibling roof-replacement pipeline is at Roofing pipeline.

Apollo is in active build (status: building). Stages below are the as-designed spine; most carry a SPEC/WIP pill. The locked March-2025 head-to-head test is untouched.

Built on a 5-county pilot set: 12086 Miami-Dade FL · 04013 Maricopa AZ · 42101 Philadelphia PA · 48201 Harris TX · 29095 Jackson MO. Feature tiers, horizon, T0 convention and the modeling ladder all trace to notes/MASTER_PLAN.md; the hard rules below are from CLAUDE.md.

1 · Data & silver
2 · Label (y_sold)
3 · Features (tiers A–F)
4 · Walk-forward folds
5 · Training / architecture
6 · Calibration
7 · Ranking & output

Data & silver — the T0 snapshot WIP

A monthly First American parcel snapshot plus 23 owner-distress trajectories, read as-of the T0 month-end. T0 is a YYYY-MM month-end boundary; every feature must be computable from the T0 snapshot alone — never future data.

Internal property and owner state is already in the sandbox at (FIPS, PropertyId, Period) granularity, streamed from the First American silver layer in its Hudi period/FIPS layout and deduplicated at read time by QUALIFY ROW_NUMBER() … ORDER BY _hoodie_commit_time DESC = 1. The feature builder reads only the as-of-T0 columns (base_globs = _globs([t0], fips)), so a row's state cannot draw on snapshots dated after its own T0.
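The read rule above can be sketched in plain Python. This is a minimal illustration, not the code in src/new_model/data.py: the function name read_as_of_t0, the dict row shape, the avm column, and the choice of (FIPS, PropertyId, Period) as the dedup key are all assumptions, since the PARTITION BY clause is elided in the original.

```python
def read_as_of_t0(rows, t0):
    """Keep only rows from the T0 period, one per key, newest commit wins.

    Hypothetical helper: mirrors ROW_NUMBER() ... ORDER BY
    _hoodie_commit_time DESC = 1 under an assumed partition key.
    """
    latest = {}
    for row in rows:
        if row["Period"] != t0:
            continue  # later (or earlier) snapshots are never visible to a T0 row
        key = (row["FIPS"], row["PropertyId"], row["Period"])
        kept = latest.get(key)
        # Commit times are compared as equal-length timestamp strings.
        if kept is None or row["_hoodie_commit_time"] > kept["_hoodie_commit_time"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-06",
     "_hoodie_commit_time": "20240701", "avm": 300},
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-06",
     "_hoodie_commit_time": "20240705", "avm": 310},  # later rewrite of the same row
    {"FIPS": "12086", "PropertyId": 1, "Period": "2024-07",
     "_hoodie_commit_time": "20240801", "avm": 999},  # future snapshot, must be ignored
]
snapshot = read_as_of_t0(rows, "2024-06")
```

Here the 2024-07 row is dropped by the period filter and the 2024-06 duplicate is resolved in favor of the later commit, so the snapshot holds a single row.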

Source: src/new_model/data.py · src/new_model/feature_cache.py · src/new_model/address_normalizer.py · MASTER_PLAN §2 (data availability)

25-silver-variable-universe 34-silver-columns-round2 35-silver-columns-final 8-infrastructure 29-cleaning-protocol

Label — y_sold DONE

The binary target: y_sold = 1 iff any sale is recorded in the 6-month window (T0, T0+6], else 0. Usable T0s stop at T_today − 6 months because labels for any later T0 are not yet fully observable.

This phase predicts ANY sale. The arms-length document-type filter is intentionally OFF — every sale event counts as y_sold = 1 (foreclosures, quit-claims, probate, divorce included). This is scope, not an oversight (CLAUDE.md hard rule #3); re-running with the filter is a planned second pass once Eduardo signs off. Do not read this as a bug.
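A minimal sketch of the label window, assuming months are written as 'YYYY-MM' strings and that each property carries a list of recorded sale months. The helper names (month_index, y_sold, latest_labelable_t0) are illustrative, not the functions in src/new_model/features.py. No document-type filter is applied, matching the scope stated above.

```python
def month_index(ym: str) -> int:
    """Map 'YYYY-MM' to a running month count so months can be compared."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def y_sold(t0: str, sale_months, horizon: int = 6) -> int:
    """1 iff any sale falls in (T0, T0 + horizon], else 0. Every sale type counts."""
    start = month_index(t0)
    return int(any(start < month_index(m) <= start + horizon for m in sale_months))


def latest_labelable_t0(today: str, horizon: int = 6) -> str:
    """Last T0 whose full label window has already elapsed."""
    idx = month_index(today) - horizon
    return f"{idx // 12:04d}-{idx % 12 + 1:02d}"


in_window = y_sold("2024-06", ["2024-12"])      # last month of the window
at_t0 = y_sold("2024-06", ["2024-06"])          # sale in the T0 month itself
past_window = y_sold("2024-06", ["2025-01"])    # one month too late
cutoff = latest_labelable_t0("2025-03")
```

The window is open at T0 and closed at T0+6, so a sale recorded in the T0 month does not count while a sale in month T0+6 does.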

Source: src/new_model/features.py (label window) · MASTER_PLAN §3 (label & horizon) · CLAUDE.md hard rule #3

12-investor-deal-criteria 18-deal-oracle-v1 37-oracle-v2 39-oracle-v1-vs-v2-ab 13-deal-pattern-analysis 19-distress-forensics

Features — tiers A–F WIP

One canonical feature set, partitioned into six tiers per MASTER_PLAN §4 / CLAUDE.md. The trees learn the interactions; selection is multivariate, not univariate.

Tier	Family	Note
A	Property physical	Parcels, size, use type, year built.
B	Owner + distress	23 distress trajectories, absentee level, leverage. The behavioral spine.
C	Valuation + activity	AVM, appreciation, days-ownership, listing activity.
D	Date-diffs	`listing_duration_months`, `months_since_prev_sale`, `mortgage_age_months` — UNDER LEAKAGE AUDIT.
E	National macro (FRED)	Mortgage rate, Fed funds, HPI, CPI, unemployment. Same value for every property within a T0 month.
F	Local market	BLS county unemployment, ACS income, FHFA state HPI. Being wired in.

Three Tier-D features are under active leakage audit — listing_duration_months, months_since_prev_sale, mortgage_age_months (CLAUDE.md hard rule #7). Do not cite their contribution as validated until the ablation closes. The exact feature count differs across pages — see finding 45 and the model card for the curated set rather than asserting one number here.

Source: src/new_model/features.py · src/new_model/macro.py (Tier E) · src/new_model/micro.py + src/new_model/demographics.py (Tier F) · MASTER_PLAN §4 · CLAUDE.md hard rule #7

2-feature-importance 14-feature-iteration-journal 22-variable-sanity-audit 7-external-variables 17-external-alignment 45-feature-subtraction

Walk-forward folds + case-control negatives WIP

Train on everything up to a fold, evaluate the next fold, advance six months, repeat — never random K-fold (that leaks time). The T0 boundary is enforced structurally so no fold trains on its own future.

Horizon H = 6 months; the eval window is embargoed by horizon_months so train-label and eval windows do not overlap. Finding 32 caught the original layout leaking a 5-month label-window overlap (eval started one month after train-end with H=6); the default now pushes eval start to train_t0s[-1] + horizon + 1 month. Six walk-forward folds run on the 5-county set; the seventh fold is the locked March-2025 test.
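The overlap arithmetic from finding 32 can be checked with a few lines. This is a sketch under the conventions stated above (label window (T0, T0+H], months counted as integers); the function names are hypothetical and are not the API of src/new_model/walkforward.py.

```python
def month_index(ym: str) -> int:
    """Map 'YYYY-MM' to a running month count so months can be added."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def label_window(t0: int, horizon: int) -> set:
    """Months whose sales decide the label of a row at t0: (t0, t0 + horizon]."""
    return set(range(t0 + 1, t0 + horizon + 1))


def eval_start(train_t0s, horizon: int, embargo: bool = True) -> int:
    """First eval T0: right after train-end (old layout) or past the embargo."""
    last = train_t0s[-1]
    return last + horizon + 1 if embargo else last + 1


def overlap_months(train_t0s, eval_t0: int, horizon: int) -> int:
    """Label months shared by the last train T0 (the worst case) and the eval T0."""
    return len(label_window(train_t0s[-1], horizon) & label_window(eval_t0, horizon))


H = 6
train_t0s = [month_index(m) for m in ("2023-01", "2023-02", "2023-03")]
leaky = overlap_months(train_t0s, eval_start(train_t0s, H, embargo=False), H)
fixed = overlap_months(train_t0s, eval_start(train_t0s, H, embargo=True), H)
print(leaky, fixed)  # 5 0
```

With eval starting one month after train-end, the two label windows share five months; with the embargoed start they share none.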

Source: src/new_model/walkforward.py (Fold defs, embargo, dedupe_contaminated_property_ids) · scripts/train_fold.py · MASTER_PLAN §3

5-regime-folds 32-fold-embargo-analysis 1-five-county-fold1

Training / architecture WIP

Gradient-boosted trees on the case-control-downsampled training table. The architecture is chosen from a 5×4 ablation matrix (five pilot counties × four architectures: HistGB, LightGBM, logistic regression, random forest), not assumed.

The modeling ladder (MASTER_PLAN §6) climbs three rungs — logistic regression (interpretable floor) → LightGBM (production candidate) → state-stratified / hierarchical — each with a go/no-go gate over Alpha. HistGB is the current default per the ablation; treat the matrix as the authority for which architecture wins each county rather than a single verdict here. Class weighting is none — the table arrives downsampled 1:K, and prior-correction back to the true base rate is the calibration step's job (stage 6).
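To make the 1:K downsampling concrete, here is a small sketch. It assumes "1:K" means K negatives kept per positive with all positives retained; the function name and the seed handling are illustrative and are not taken from scripts/train_fold.py.

```python
import random


def downsample_case_control(labels, k: int, seed: int = 0):
    """Return sorted row indices: every positive plus at most k negatives per positive."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(negatives), k * len(positives))
    return sorted(positives + rng.sample(negatives, n_keep))


labels = [1] * 10 + [0] * 990          # toy population: 1% base rate
kept = downsample_case_control(labels, k=5)
true_rate = sum(labels) / len(labels)
sampled_rate = sum(labels[i] for i in kept) / len(kept)
```

In this toy table the positive rate rises from 1% in the population to 10/60 in the training sample. A model trained without class weights on the sample therefore scores against an inflated prior, which is why the correction back to the true base rate is left to calibration.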

Source: scripts/train_fold_arch.py (lgbm / logistic / histgb / random_forest) · scripts/train_fold.py · MASTER_PLAN §6 (modeling ladder)

3-architecture-ablation 47-lightgbm-vs-histgb 4-jackson-deep-dive 44-model-saturation-metric

Calibration WIP

Turn the raw ranker scores into probabilities that read true. Isotonic regression, fit on a held-out slice drawn at the true population base rate — not the downsampled 1:K training pool.

A trained ranker is not a forecaster: raw boosted scores rank well but do not read as calibrated probabilities. Calibration maps scores → probabilities against a held-out, non-downsampled, true-prior slice so the prior-correction the downsampling deferred is applied here. The ±15% calibration tolerance is the macro-model win-condition target (MASTER_PLAN §12), not a measured result — Apollo is still in build.
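The mapping from scores to probabilities can be illustrated with a pure-Python pool-adjacent-violators fit, the algorithm behind isotonic regression. This is a teaching sketch with hypothetical function names, not the calibration code in scripts/train_fold.py, and the six-row slice is toy data. In practice a library implementation would normally be used.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function score -> probability (PAV)."""
    # Pool tied scores first so each distinct score gets a single value.
    pooled = {}
    for s, y in zip(scores, labels):
        total, count = pooled.get(s, (0.0, 0))
        pooled[s] = (total + y, count + 1)
    blocks = []  # each block: [label_sum, row_count, highest_score_in_block]
    for s in sorted(pooled):
        total, count = pooled[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(uppers, values, score):
    """Look up the block covering the score; clamp above the highest block."""
    i = bisect_left(uppers, score)
    return values[min(i, len(values) - 1)]


# Toy held-out slice, assumed to be drawn at the true base rate.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
labels = [0, 0, 1, 0, 1, 1]
uppers, values = fit_isotonic(scores, labels)
calibrated = [calibrate(uppers, values, s) for s in scores]
```

On the slice it was fit on, the calibrated outputs average to the slice's own positive rate. That property is why the slice has to sit at the true population base rate: fitting on the 1:K training pool would reproduce the inflated prior instead of correcting it.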

Source: src/new_model/features.py + scripts/train_fold.py (isotonic + prior-correction) · MASTER_PLAN §6 / §12

10-prior-ratio-calibration 46-isotonic-held-out 36-calibration-analysis 6-calibration-brier 43-recalibration-attempt

Ranking & output SPEC

Rank properties by calibrated sale probability within each county. The output contract matches Alpha: a 0–100 score per property, scoped to its county.

Cross-county scores are not directly comparable — each county is ranked against itself, so a sidecar (per-county base rate / calibration context) is required before any cross-county comparison. The Alpha head-to-head is read at the top of each county's ranked list. The win condition — top-decile recall ≥ Alpha and ≥ Camilo, with 30/60/90-day calibration within ±15% — is a target on the locked March-2025 test, which remains untouched until Eduardo + Camilo sign off in writing (CLAUDE.md hard rule #4).
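A sketch of per-county scoring and a top-decile recall check. The source does not define how the 0–100 score is derived, so the percentile-rank mapping here is an assumption, as are the function names, the (fips, property_id, prob) tuple layout, and the recall definition (share of all actual sales that land in the top 10% of the ranked list).

```python
from collections import defaultdict


def county_scores(rows):
    """rows: (fips, property_id, prob). Returns {(fips, property_id): 0-100 score}.

    Assumed mapping: percentile rank within the county, ties broken by property id.
    """
    by_county = defaultdict(list)
    for fips, property_id, prob in rows:
        by_county[fips].append((prob, property_id))
    scores = {}
    for fips, items in by_county.items():
        items.sort()
        n = len(items)
        for rank, (_, property_id) in enumerate(items):
            scores[(fips, property_id)] = 100.0 * rank / (n - 1) if n > 1 else 100.0
    return scores


def top_decile_recall(probs, labels):
    """Fraction of all positives found in the highest-scoring 10% of rows."""
    n_top = max(1, len(probs) // 10)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    hits = sum(labels[i] for i in order[:n_top])
    total = sum(labels)
    return hits / total if total else 0.0


rows = [
    ("12086", 1, 0.02), ("12086", 2, 0.10), ("12086", 3, 0.05),
    ("04013", 9, 0.50), ("04013", 8, 0.40),
]
scores = county_scores(rows)

probs = [i / 20 for i in range(20)]
labels = [1 if i in (0, 5, 18, 19) else 0 for i in range(20)]
recall = top_decile_recall(probs, labels)
```

In the toy rows, the Miami-Dade property with probability 0.10 scores 100 while the Maricopa property with probability 0.40 scores 0. That is the non-comparability described above: the score encodes rank within a county, not an absolute probability.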

Source: src/new_model/alpha.py (incumbent reproduction) · scripts/train_fold.py (per-county ranking) · MASTER_PLAN §12 (success criteria)

41-alpha-head-to-head 11-h4-transfer 54-property-age-stratification 48-3-agent-audit

Status, honestly

Apollo is building, not shipped. The locked March-2025 head-to-head test is GATED: it stays untouched until Eduardo + Camilo sign off in writing (hard rule #4). The win condition is a target, not a result. Three Tier-D date-diff features are under active leakage audit (hard rule #7). The arms-length filter is intentionally OFF this phase, which is scope, not a bug (hard rule #3). Stages 1–6 are active workstreams; stage 7's cross-county sidecar is SPEC. Read the Rules page for the hard business rules in full, and the REI overview for the model surface.

Spine from notes/MASTER_PLAN.md + src/new_model + CLAUDE.md conventions · REI findings 1–54 · status: building.