Scraping catalog
Stage 1 · discovery only
County-portal source discovery — where the permit, pre-foreclosure, distress and court-record signals live, and whether each is reachable.
Why this exists
BuildZoom-aggregated permits cover 14.86% of properties nationally with a ~6-month lag (see Data sources). Direct county scraping is the gap-fill + freshness play: pre-foreclosure (Notice of Default / Lis Pendens) is a leading distress signal absent from current gold data, and distress + court records add owner-trajectory signals beyond aggregated property records. The Data section documents the existing permit ↔ property join that scraped sources would feed.
The stage plan
Stage 1: Catalog every (county × datatype) source, prove reachability, score viability 1–5, save one probe per cell, run a QA gate. This page renders Stage 1.
Stage 2: Pull 30–90 days of data from 3–5 highest-viability sources (viability_score ≥ 4, cleanest ToS, structured CSV/JSON). Gated on counsel review + storage/Firecrawl budget + a cross-validation harness vs BuildZoom. Pass gates include ≥90% field completeness and ≥80% overlap vs BZ where both exist.
Stage 3: Extend pilot scrapers to all viable sources across the top counties; add weekly cron + failure alerting. Triggered only when Stage 2 hits pass gates on ≥3 of 5 pilots with no legal escalation.
Stage 4: Integrate into the gaia ETL, write silver/gold parquet, surface a permit_scrape_coverage_pct metric, reconcile daily vs BZ. Only here do features wire into the model (Tier B distress).
Scope of the discovery
The top-10 counties cataloged: Los Angeles 06037, Cook 17031, Harris 48201, Maricopa 04013, San Diego 06073, Orange CA 06059, Kings/Brooklyn 36047, Miami-Dade 12086, Dallas 48113, Riverside 06065. A top-200 extension (190 more counties, rank 11–200) exists for permits and pre-foreclosure via a two-tier method: deep-dive for rank 11–50, state-vendor-pattern inheritance for rank 51–200.
Permits
Top-10 verdict: 5 viable (score ≥ 4), 3 doable-but-heavy (score 3), 2 effectively blocked (≤ 2). Every county that returned a parseable record showed a roofing surface. Top-3 pilots are Los Angeles (Socrata pi9x-tg5x), San Diego (seshat flat CSV), and Brooklyn/Kings (Socrata DOB NOW rbx6-tga4).
| FIPS | County | Top sub-juris / vendor | Format | Viab | Blocker |
|---|---|---|---|---|---|
| 06037 | Los Angeles, CA | City of LA · Socrata | json_api | 5 | — |
| 17031 | Cook, IL | City of Chicago · Socrata | json_api | 5 | — |
| 06073 | San Diego, CA | City of SD · seshat CDN | csv_download | 5 | — |
| 36047 | Kings / Brooklyn, NY | NYC DOB · Socrata DOB NOW | json_api | 5 | legacy dataset frozen at 2020 — use rbx6-tga4 |
| 04013 | Maricopa, AZ | City of Phoenix · custom Kendo | html_search | 3 | Open Data aggregate-only; row-level only on live PDD |
| 06059 | Orange, CA | Anaheim +4 cities · Accela | html_search | 3 | 34 cities × 1 portal each — fragmentation |
| 12086 | Miami-Dade, FL | RER · custom ASP.NET | html_search | 3 | session-only menu, no bulk endpoint |
| 48113 | Dallas, TX | City of Dallas · Accela | html_search | 3 | Socrata dataset frozen at 2019 |
| 48201 | Harris, TX | City of Houston · custom + Cloudflare WAF | html_search | 2 | Cloudflare blocks direct fetch |
| 06065 | Riverside, CA | RCTLMA · unknown | html_search | 2 | no working portal URL found in 5 probes |
No portal in this set carried an explicit "no scraping" clause. NYC Open Data and San Diego data are the most permissive; Socrata portals carry a Crawl-delay: 1. Top-200 extension adds 36 viable counties (score ≥ 4, 18.9% of the 190), 121 mid (score 3), 33 fragmented (score 2), 0 blocked; new pilot picks are Philadelphia 42101, Detroit/Wayne 26163, Travis 48453, Allegheny 42003, Fairfax 51059. Accela is the single dominant vendor at the 200-county scale (~47.5%).
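A minimal sketch of how a Socrata json_api source could be paged while honoring the Crawl-delay: 1 noted above. The $limit / $offset / $order parameters are standard SODA query parameters; the page size, the helper names, and the injectable fetch and sleep arguments are assumptions for illustration, not the project's scraper. Any domain passed in (for example the City of LA portal) is also an assumption to confirm against the catalog row.

```python
import json
import time
from typing import Callable, Iterator, List
from urllib.parse import urlencode
from urllib.request import urlopen

CRAWL_DELAY_S = 1.0  # mirrors the Crawl-delay: 1 reported for Socrata portals


def soda_url(domain: str, dataset_id: str, limit: int, offset: int) -> str:
    """Build the SODA resource URL for one page of rows."""
    # Ordering by the system field :id keeps paging stable between requests.
    query = urlencode({"$limit": limit, "$offset": offset, "$order": ":id"})
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


def http_get_json(url: str) -> List[dict]:
    """Default fetcher: plain GET, body parsed as a JSON array of rows."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def iter_rows(
    domain: str,
    dataset_id: str,
    page_size: int = 1000,
    fetch: Callable[[str], List[dict]] = http_get_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield rows page by page, pausing CRAWL_DELAY_S between requests."""
    offset = 0
    while True:
        page = fetch(soda_url(domain, dataset_id, page_size, offset))
        yield from page
        if len(page) < page_size:  # a short page means the dataset is exhausted
            return
        offset += page_size
        sleep(CRAWL_DELAY_S)
```

Passing fetch and sleep as arguments lets the paging logic be exercised against canned pages before any live endpoint is touched.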
Pre-foreclosure
Top-10 verdict: 2 directly scrape-viable today (Dallas, Kings/NYSCEF), 4 require browser mode or two-hop enrichment (Maricopa, Miami-Dade, San Diego, Orange CA), 3 effectively non-viable (LA, Cook, Harris), and 1 explicitly forbidden via robots.txt (Riverside). Viability splits along the judicial / non-judicial regime axis, not by vendor. The national aggregator Auction.com is OFF the table — robots.txt = Disallow: / plus an explicit anti-scraping ToS.
| FIPS | County | Regime | Source | Robots | Viab |
|---|---|---|---|---|---|
| 36047 | Kings / Brooklyn, NY | judicial | NYSCEF guest | allow | 5 |
| 48113 | Dallas, TX | non-judicial | dallascounty.org + publicsearch.us | mostly allow | 5 |
| 04013 | Maricopa, AZ | non-judicial | recorder.maricopa.gov (doc-code NS) | silent | 4 |
| 12086 | Miami-Dade, FL | judicial | MFOS + Foreclosure Registry | silent | 4 |
| 17031 | Cook, IL | judicial | Clerk of Court | allow | 3 |
| 06073 | San Diego, CA | non-judicial | arcc-acclaim.sdcounty.ca.gov | allow | 3 |
| 06059 | Orange, CA | non-judicial | cr.occlerkrecorder.gov | silent | 3 |
| 06037 | Los Angeles, CA | non-judicial | lavote.gov / LexisNexis | silent | 1 |
| 48201 | Harris, TX | non-judicial | cclerk.hctx.net | Incapsula + forms blocked | 1 |
| 06065 | Riverside, CA | non-judicial | rivcoacr.org | disallow Claude/AI | 1 |
| — | Auction.com (national aggregator) | n/a | auction.com | Disallow: / | 1 |
Top-200 extension adds 22 viability-5 + 36 viability-4 entries, powered by three high-coverage state portals: NYSCEF (NY), publicsearch.us (TX), and per-county Clerk-of-Court (FL). The Michigan / Rhode Island / Mississippi "publish-in-newspaper" regime is the dominant gap, suppressing ~12 entries to viability 1–2.
Other distress — tax-delinquent · code violations · liens
Three subdomains, 10 cells each. The dominant pattern for tax-delinquent is a state-level aggregator, not per-county scraping. Liens are overwhelmingly negative — county recorders gate-keep behind paid lookup — with NYC ACRIS the rare US exception. The highest-signal cells expose code-violation → recorded-lien escalation inside a single feed.
| Subdomain | Mean viab | Top-3 pilots | Note |
|---|---|---|---|
| Tax-delinquent | 4.2 | Cook 17031 (5) · Harris+Dallas 48201/48113 via LGBS (5) · Maricopa 04013 (4) | TX/CA/FL/AZ statutory regimes differ — encode as entered_default_year + auction_status, not is_at_auction |
| Code violations | 3.7 | Chicago 17031 (5) · Miami-Dade RER 12086 (5) · NYC HPD+DOB ECB 36047 (5) | Miami-Dade RER includes LN_RFRL* lien-referral fields = built-in violation→lien graph |
| Liens | 2.4 | NYC ACRIS 36047 (5) · Miami-Dade RER 12086 (5 effective, via code feed) · no third | 8 of 10 counties at viability 2 — do NOT scrape county-recorder liens; use commercial feed or code-enforcement byproduct |
Stage-2 recommendation covers ~30M people across 5 sources: Chicago + NYC violations (Socrata), Miami-Dade RER (ArcGIS REST, lien escalation compound signal), Cook tax-sale results (Socrata), and LGBS TX tax sales (one scraper, two cells, ~30 TX counties). Riverside 06065 code-enforcement is blocked (Cloudflare 520); San Diego city is stale (last update 2018).
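The tax-delinquent note in the table above recommends encoding entered_default_year + auction_status rather than a single is_at_auction flag. A hedged sketch of what such a record could look like follows; the class name and the AuctionStatus vocabulary are hypothetical, since the catalog does not define the allowed status values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuctionStatus(str, Enum):
    # Hypothetical vocabulary; the real values depend on each state's regime.
    NOT_SCHEDULED = "not_scheduled"
    SCHEDULED = "scheduled"
    SOLD = "sold"
    REDEEMED = "redeemed"


@dataclass(frozen=True)
class TaxDelinquency:
    county_fips: str                     # 5-digit zero-padded string
    entered_default_year: Optional[int]  # None when the source does not expose it
    auction_status: AuctionStatus

    def __post_init__(self) -> None:
        fips = self.county_fips
        if not (len(fips) == 5 and fips.isascii() and fips.isdigit()):
            raise ValueError(f"county_fips must be 5 digits, got {fips!r}")
```

Keeping the year and the status as separate fields lets regimes with different timelines be compared without collapsing them into one boolean.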
Court records — evictions · divorce · probate
The most legally-encumbered domain. Of 30 cells (10 counties × 3 subdomains), 20 are HARD-BLOCKED by Terms-of-Use, state rule, or both; 4 are viable via official paid bulk-data channels; the remaining 6 are marginal. Court records are the default-skip lane pending counsel review.
| State / county | Restriction | Effect |
|---|---|---|
| CA — LA / SD / OC / Riverside | Rule of Court 2.503 (bulk prohibited); CCP §1161.2 (eviction index sealed 60 days) | all 4 counties × 3 subdomains blocked |
| NY — Kings 36047 | WebSurrogate ToU §3 — bot ban with criminal/civil penalty language | blocked |
| IL — Cook 17031 | GAO 02-03 + ToS commercial-use ban (no "information bureau") | blocked |
| TX — re:SearchTX (District/Probate) | ToU §2.6 — programmatic/bulk collection for commercial use prohibited | blocked |
| AZ — Maricopa 04013 | eAccess excludes probate/family/juvenile statewide | not published online |
| FL — Miami-Dade 12086 | Commercial Data Services API ($110/folder/mo + $0.20/call, notarized registration) | probate + evictions both viability 5 |
| TX — Harris 48201 (JP) | JP Public Data Extract Service — official CSV/XML/TSV, outside re:SearchTX §2.6 | evictions viability 5 |
Property-linkage quality: evictions STRONG (premises address is a complaint element), probate STRONG with a 60–180-day inventory-filing lag (FL Rule 5.340, TX Estates §309, IL Probate §14-1, CA Prob §8800), divorce WEAK (address only inside sealed documents). Recommended single pilot: Miami-Dade probate + civil eviction via the Commercial Data Services API. Do NOT pursue any CA Superior Court, any NY Surrogate/Civil (Kings), Cook County, or re:SearchTX.
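For budgeting the recommended Miami-Dade pilot, the quoted pricing ($110/folder/mo + $0.20/call) reduces to simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the folder count and call volume in the usage note are assumptions, not figures from the catalog.

```python
from decimal import Decimal

FOLDER_FEE_USD = Decimal("110")   # per folder per month, as quoted above
CALL_FEE_USD = Decimal("0.20")    # per API call, as quoted above


def monthly_cost_usd(folders: int, calls: int) -> Decimal:
    """Estimated monthly spend for the Commercial Data Services API."""
    if folders < 0 or calls < 0:
        raise ValueError("folders and calls must be non-negative")
    return FOLDER_FEE_USD * folders + CALL_FEE_USD * calls
```

Assuming two folders (one each for probate and evictions) and 5,000 calls in a month, monthly_cost_usd(2, 5000) gives 2 × $110 + 5,000 × $0.20 = $1,220.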
QA gates
Two independent QA passes ran over the catalogs before any Stage-2 launch could be approved.
The URL-drift / reachability audit re-probed live endpoints: permits.json 0.0% hard-fail (PASS), preforeclosure.json 0.0% (PASS), court_records.json 13.3% (PASS), permits_extended.json 9.1% (conditional pass). No file exceeded the 20% hard-fail blocking threshold, so Stage-2 launch is not blocked by drift. Verdict: 28 production-ready sources approved (top-10 + 12 URL'd viable permits_extended rows + 4 viable preforeclosure + 4 viable court_records), with a HOLD on 24 high-score-but-no-URL permits_extended rows and 4 compound-URL court_records rows pending schema fixes. This is a pilot-readiness verdict, not a production launch.
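The blocking rule in the drift audit can be sketched as a threshold check over per-file hard-fail rates. The rates below are the ones reported above; the function names are hypothetical, and the sketch models only the 20% blocking threshold, not the separate "conditional pass" label.

```python
from typing import Dict, List

BLOCK_THRESHOLD_PCT = 20.0  # hard-fail rate above which a file blocks Stage 2

REPORTED_HARD_FAIL_PCT: Dict[str, float] = {
    "permits.json": 0.0,
    "preforeclosure.json": 0.0,
    "court_records.json": 13.3,
    "permits_extended.json": 9.1,
}


def hard_fail_pct(hard_fails: int, probed: int) -> float:
    """Percentage of re-probed endpoints that hard-failed."""
    if probed <= 0:
        raise ValueError("probed must be positive")
    return 100.0 * hard_fails / probed


def blocking_files(pcts: Dict[str, float],
                   threshold_pct: float = BLOCK_THRESHOLD_PCT) -> List[str]:
    """Files whose hard-fail rate exceeds the blocking threshold."""
    return sorted(name for name, pct in pcts.items() if pct > threshold_pct)
```

With the reported rates, blocking_files(REPORTED_HARD_FAIL_PCT) is empty, which matches the "not blocked by drift" verdict.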
The catalog-integrity audit over 10 JSON files passed 10/10 existence, 10/10 valid JSON, 10/10 row-count match, 9/10 FIPS clean, 8/8 viability in [1,5]. Two minor fixes flagged: the Auction.com row carries county_fips: "AGGREGATOR" (breaks the 5-digit FIPS contract), and the FIPS field name differs across files (county_fips vs fips5) — both to patch before any automated pipeline consumes the files.
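One way a consumer could tolerate both flagged issues until the files are patched is to read either field name and reject anything that is not a 5-digit code. This is a sketch under assumptions: the function name is hypothetical, and returning None for non-county rows such as the aggregator entry is one possible convention, not a decision recorded in the catalog.

```python
from typing import Optional

FIPS_FIELDS = ("county_fips", "fips5")  # the two field names seen across files


def normalize_fips(row: dict) -> Optional[str]:
    """Return a 5-digit zero-padded FIPS string, or None for non-county rows."""
    for field in FIPS_FIELDS:
        if field in row:
            raw = row[field]
            break
    else:
        raise KeyError(f"row has none of {FIPS_FIELDS}")
    text = str(raw).strip()
    # Accept digit-only values up to 5 characters and restore lost leading zeros.
    if text and text.isascii() and text.isdigit() and len(text) <= 5:
        return text.zfill(5)
    return None  # e.g. "AGGREGATOR" does not satisfy the FIPS contract
```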
Hard rules baked into every agent
| # | Rule |
|---|---|
| 1 | Respect robots.txt. Disallowed → viability_score: 1 and DO NOT scrape a sample. |
| 2 | ToS scrape clause documented verbatim. Prohibited → viability_score: 1. |
| 3 | No login attempts, especially attorney-only court portals (legal risk in CA/NY). |
| 4 | FIPS as 5-digit zero-padded string everywhere. |
| 5 | Court + foreclosure records contain PII — no redistribution; aggregate-only for model features. |
| 6 | Firecrawl budget capped per agent. |
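Rule 1 can be sketched with the standard-library robots.txt parser. The row fields url, robots_allowed, and scrape_sample, the function name, and the user-agent string in the usage note are hypothetical; only viability_score comes from the rules above. The robots.txt text is passed in already fetched so the check itself makes no network call.

```python
from urllib.robotparser import RobotFileParser


def apply_robots_rule(row: dict, robots_txt: str, user_agent: str) -> dict:
    """Return a copy of the catalog row with rule 1 applied."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, row["url"])

    out = dict(row)
    out["robots_allowed"] = allowed
    if not allowed:
        out["viability_score"] = 1   # disallowed sources are pinned to 1
        out["scrape_sample"] = False  # and no sample probe is taken
    return out
```

For a robots.txt of "User-agent: *" followed by "Disallow: /", as reported for Auction.com, any URL on the site comes back with viability_score 1 regardless of the score it carried in.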
Open decisions blocking Stage 2
Counsel review on bulk-scrape ToS for the top-3 pilot sources · storage budget (~1GB/county/year est.) · Firecrawl monthly budget cap · a cross-validation harness design vs BuildZoom · and the explicit decision on whether to include court records in the pilot (default: skip). Stage-2 launch needs an explicit go from Ignacio at a Stage-1 results review.
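As a starting point for the cross-validation harness design, the Stage-2 pass gates from the stage plan (≥90% field completeness, ≥80% overlap vs BZ where both exist, ≥3 of 5 pilots passing with no legal escalation) could be checked roughly as below. The metric definitions are assumptions: completeness is taken as the share of required fields that are non-empty, and overlap as the share of BuildZoom permit keys also present in the scrape. All function names are hypothetical.

```python
from typing import Sequence, Set


def field_completeness(rows: Sequence[dict], required: Sequence[str]) -> float:
    """Share of required fields that are present and non-empty across rows."""
    total = len(rows) * len(required)
    if total == 0:
        return 0.0
    filled = sum(
        1 for row in rows for name in required if row.get(name) not in (None, "")
    )
    return filled / total


def overlap_vs_bz(scraped_keys: Set[str], bz_keys: Set[str]) -> float:
    """Share of BuildZoom keys also found in the scrape (assumed definition)."""
    if not bz_keys:
        return 0.0
    return len(scraped_keys & bz_keys) / len(bz_keys)


def pilot_passes(completeness: float, overlap: float) -> bool:
    """Both gates must hold: completeness >= 90% and overlap >= 80%."""
    return completeness >= 0.90 and overlap >= 0.80


def stage3_triggered(pilot_results: Sequence[bool], legal_escalation: bool) -> bool:
    """At least 3 passing pilots and no legal escalation."""
    return (not legal_escalation) and sum(pilot_results) >= 3
```

The join key used for overlap (permit number, normalized address, or parcel ID) is itself an open design choice and is left to the caller here.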
Rendered from notes/scraping/ (README + STAGE_PLAN + per-datatype source catalogs).