
Scraping catalog Stage 1 · discovery only

County-portal source discovery — where the permit, pre-foreclosure, distress and court-record signals live, and whether each is reachable.

Source discovery only (as of 2026-06-05). This catalog is Stage 1: it records where the data lives, scores each source for reachability, and documents robots.txt / Terms-of-Use posture. Nothing here is scraped or wired into any model today. The Stage-2 pilot scrape is gated on counsel review of bulk-scrape Terms-of-Use and a passing QA gate, and needs an explicit go from Ignacio. Court records default to skip pending legal review. Treat every viability score as a planning estimate, not a production signal.

Why this exists

BuildZoom-aggregated permits cover 14.86% of properties nationally with a ~6-month lag (see Data sources). Direct county scraping is the gap-fill + freshness play: pre-foreclosure (Notice of Default / Lis Pendens) is a leading distress signal absent from current gold data, and distress + court records add owner-trajectory signals beyond aggregated property records. The Data section documents the existing permit ↔ property join that scraped sources would feed.

The stage plan

Stage 1: Source discovery + smoke test (In progress)
Catalog every (county × datatype) source, prove reachability, score viability 1–5, save one probe per cell, run a QA gate. This page renders Stage 1.

Stage 2: Pilot scrape (Gated)
Pull 30–90 days from 3–5 highest-viability sources (viability_score ≥ 4, cleanest ToS, structured CSV/JSON). Gated on counsel review + storage/Firecrawl budget + a cross-validation harness vs BuildZoom (BZ). Pass gates include ≥90% field completeness and ≥80% overlap vs BZ where both exist.

Stage 3: Scale (Spec)
Extend pilot scrapers to all viable sources across the top counties; add weekly cron + failure alerting. Triggered only when Stage 2 hits pass gates on ≥3 of 5 pilots with no legal escalation.

Stage 4: Productionize (Spec)
Integrate into the gaia ETL, write silver/gold parquet, surface a permit_scrape_coverage_pct metric, reconcile daily vs BZ. Only here do features wire into the model (Tier B distress).
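
The Stage 2 pass gates and the Stage 3 trigger reduce to a few threshold checks. The following is a minimal sketch, not the actual harness (whose design is still an open decision). The names PilotResult, passes_stage2 and stage3_triggered are hypothetical, and it assumes completeness and overlap have already been measured as fractions between 0 and 1.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FIELD_COMPLETENESS = 0.90  # Stage 2 gate: >= 90% field completeness
MIN_BZ_OVERLAP = 0.80          # Stage 2 gate: >= 80% overlap vs BuildZoom
MIN_PASSING_PILOTS = 3         # Stage 3 trigger: >= 3 of 5 pilots pass


@dataclass
class PilotResult:
    """Hypothetical summary of one pilot scrape (not a real schema)."""
    source: str
    field_completeness: float
    bz_overlap: Optional[float]  # None where BuildZoom has no coverage
    legal_escalation: bool = False


def passes_stage2(result: PilotResult) -> bool:
    """Apply the two numeric pass gates to a single pilot."""
    if result.field_completeness < MIN_FIELD_COMPLETENESS:
        return False
    # The overlap gate only applies where both sources exist.
    if result.bz_overlap is not None and result.bz_overlap < MIN_BZ_OVERLAP:
        return False
    return True


def stage3_triggered(results: list[PilotResult]) -> bool:
    """Stage 3 needs enough passing pilots and no legal escalation."""
    if any(r.legal_escalation for r in results):
        return False
    passing = sum(1 for r in results if passes_stage2(r))
    return passing >= MIN_PASSING_PILOTS
```

Treating a missing BuildZoom overlap as "gate not applicable" is one reading of "where both exist"; a stricter harness could require overlap on every pilot.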

Scope of the discovery

4 datatypes: permits · pre-foreclosure · other distress · court records
10 top-population counties cataloged per datatype (top-200 extension in flight for permits + pre-foreclosure)
1–5 viability score per source: 5 = scrape-viable today, 1 = blocked by robots / ToS / law
0 production scrapers: Stage 2 not launched (as of 2026-06-05)

The top-10 counties cataloged: Los Angeles 06037, Cook 17031, Harris 48201, Maricopa 04013, San Diego 06073, Orange CA 06059, Kings/Brooklyn 36047, Miami-Dade 12086, Dallas 48113, Riverside 06065. A top-200 extension (190 more counties, rank 11–200) exists for permits and pre-foreclosure via a two-tier method: deep-dive for rank 11–50, state-vendor-pattern inheritance for rank 51–200.
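
The two-tier method is a lookup on population rank. A small sketch, assuming a hypothetical function name and tier labels (only the rank boundaries come from the text above):

```python
def extension_tier(rank: int) -> str:
    """Map a county's population rank to its discovery method.

    Tier labels are illustrative; the rank boundaries follow the catalog.
    """
    if 1 <= rank <= 10:
        return "top10_catalog"
    if 11 <= rank <= 50:
        return "deep_dive"
    if 51 <= rank <= 200:
        return "state_vendor_inheritance"
    raise ValueError(f"rank {rank} is outside the top-200 scope")


# The 190 extension counties split into 40 deep-dive and 150 inherited.
extension_ranks = range(11, 201)
deep_dive_count = sum(1 for r in extension_ranks if extension_tier(r) == "deep_dive")
inherited_count = len(extension_ranks) - deep_dive_count
```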

Permits

Top-10 verdict: 5 viable (score ≥ 4), 3 doable-but-heavy (score 3), 2 effectively blocked (≤ 2). Every county that returned a parseable record showed a roofing surface. Top-3 pilots are Los Angeles (Socrata pi9x-tg5x), San Diego (seshat flat CSV), and Brooklyn/Kings (Socrata DOB NOW rbx6-tga4).

| FIPS | County | Top sub-jurisdiction / vendor | Format | Viability | Blocker |
| 06037 | Los Angeles, CA | City of LA · Socrata | json_api | 5 | |
| 17031 | Cook, IL | City of Chicago · Socrata | json_api | 5 | |
| 06073 | San Diego, CA | City of SD · seshat CDN | csv_download | 5 | |
| 36047 | Kings / Brooklyn, NY | NYC DOB · Socrata DOB NOW | json_api | 5 | legacy dataset frozen at 2020; use rbx6-tga4 |
| 04013 | Maricopa, AZ | City of Phoenix · custom Kendo | html_search | 3 | Open Data aggregate-only; row-level only on live PDD |
| 06059 | Orange, CA | Anaheim +4 cities · Accela | html_search | 3 | 34 cities × 1 portal each: fragmentation |
| 12086 | Miami-Dade, FL | RER · custom ASP.NET | html_search | 3 | session-only menu, no bulk endpoint |
| 48113 | Dallas, TX | City of Dallas · Accela | html_search | 3 | Socrata dataset frozen at 2019 |
| 48201 | Harris, TX | City of Houston · custom + Cloudflare WAF | html_search | 2 | Cloudflare blocks direct fetch |
| 06065 | Riverside, CA | RCTLMA · unknown | html_search | 2 | no working portal URL found in 5 probes |

No portal in this set carried an explicit "no scraping" clause. NYC Open Data and San Diego data are the most permissive; Socrata portals carry a Crawl-delay: 1. Top-200 extension adds 36 viable counties (score ≥ 4, 18.9% of the 190), 121 mid (score 3), 33 fragmented (score 2), 0 blocked; new pilot picks are Philadelphia 42101, Detroit/Wayne 26163, Travis 48453, Allegheny 42003, Fairfax 51059. Accela is the single dominant vendor at the 200-county scale (~47.5%).

Pre-foreclosure

Top-10 verdict: 2 directly scrape-viable today (Dallas, Kings/NYSCEF), 4 require browser mode or two-hop enrichment (Maricopa, Miami-Dade, San Diego, Orange CA), 3 effectively non-viable (LA, Cook, Harris), and 1 explicitly forbidden via robots.txt (Riverside). Viability splits along the judicial / non-judicial regime axis, not by vendor. The national aggregator Auction.com is OFF the table: its robots.txt is Disallow: / and its ToS carries an explicit anti-scraping clause.

| FIPS | County | Regime | Source | Robots | Viability |
| 36047 | Kings / Brooklyn, NY | judicial | NYSCEF guest | allow | 5 |
| 48113 | Dallas, TX | non-judicial | dallascounty.org + publicsearch.us | mostly allow | 5 |
| 04013 | Maricopa, AZ | non-judicial | recorder.maricopa.gov (doc-code NS) | silent | 4 |
| 12086 | Miami-Dade, FL | judicial | MFOS + Foreclosure Registry | silent | 4 |
| 17031 | Cook, IL | judicial | Clerk of Court | allow | 3 |
| 06073 | San Diego, CA | non-judicial | arcc-acclaim.sdcounty.ca.gov | allow | 3 |
| 06059 | Orange, CA | non-judicial | cr.occlerkrecorder.gov | silent | 3 |
| 06037 | Los Angeles, CA | non-judicial | lavote.gov / LexisNexis | silent | 1 |
| 48201 | Harris, TX | non-judicial | cclerk.hctx.net | Incapsula + forms blocked | 1 |
| 06065 | Riverside, CA | non-judicial | rivcoacr.org | disallow Claude/AI | 1 |
| | Auction.com (national aggregator) | n/a | auction.com | Disallow: / | 1 |

Top-200 extension adds 22 viability-5 + 36 viability-4 entries, powered by three high-coverage state portals: NYSCEF (NY), publicsearch.us (TX), and per-county Clerk-of-Court (FL). The Michigan / Rhode Island / Mississippi "publish-in-newspaper" regime is the dominant gap, suppressing ~12 entries to viability 1–2.

Other distress — tax-delinquent · code violations · liens

Three subdomains, 10 cells each. The dominant pattern for tax-delinquent is a state-level aggregator, not per-county scraping. Liens are overwhelmingly negative — county recorders gate-keep behind paid lookup — with NYC ACRIS the rare US exception. The highest-signal cells expose code-violation → recorded-lien escalation inside a single feed.

| Subdomain | Mean viability | Top-3 pilots | Note |
| Tax-delinquent | 4.2 | Cook 17031 (5) · Harris+Dallas 48201/48113 via LGBS (5) · Maricopa 04013 (4) | TX/CA/FL/AZ statutory regimes differ; encode as entered_default_year + auction_status, not is_at_auction |
| Code violations | 3.7 | Chicago 17031 (5) · Miami-Dade RER 12086 (5) · NYC HPD+DOB ECB 36047 (5) | Miami-Dade RER includes LN_RFRL* lien-referral fields = built-in violation→lien graph |
| Liens | 2.4 | NYC ACRIS 36047 (5) · Miami-Dade RER 12086 (5 effective, via code feed) · no third | 8 of 10 counties at viability 2; do NOT scrape county-recorder liens; use commercial feed or code-enforcement byproduct |
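
The tax-delinquent note recommends two fields, entered_default_year and auction_status, over a single is_at_auction boolean. A minimal sketch of what that record shape could look like follows. The field names come from the table; the class name and the specific status values are assumptions for illustration, since the catalog does not enumerate them.

```python
from dataclasses import dataclass
from enum import Enum


class AuctionStatus(Enum):
    """Hypothetical status values; the real set depends on each state regime."""
    NOT_SCHEDULED = "not_scheduled"
    SCHEDULED = "scheduled"
    SOLD = "sold"
    REDEEMED = "redeemed"


@dataclass(frozen=True)
class TaxDelinquency:
    county_fips: str            # 5-digit zero-padded string
    entered_default_year: int   # year the parcel first entered default
    auction_status: AuctionStatus

    def __post_init__(self) -> None:
        fips = self.county_fips
        if not (len(fips) == 5 and fips.isascii() and fips.isdigit()):
            raise ValueError(f"county_fips must be 5 ASCII digits, got {fips!r}")


record = TaxDelinquency("17031", 2023, AuctionStatus.SCHEDULED)
```

A boolean would collapse "in default but not yet scheduled" and "sold" into the same value, which is where regimes with different default-to-auction timelines would become indistinguishable.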

Stage-2 recommendation covers ~30M people across 5 sources: Chicago + NYC violations (Socrata), Miami-Dade RER (ArcGIS REST, lien escalation compound signal), Cook tax-sale results (Socrata), and LGBS TX tax sales (one scraper, two cells, ~30 TX counties). Riverside 06065 code-enforcement is blocked (Cloudflare 520); San Diego city is stale (last update 2018).

Court records — evictions · divorce · probate

The most legally-encumbered domain. Of 30 cells (10 counties × 3 subdomains), 20 are HARD-BLOCKED by Terms-of-Use, state rule, or both; 4 are viable via official paid bulk-data channels; the remaining 6 are marginal. Court records are the default-skip lane pending counsel review.

| State / county | Restriction | Effect |
| CA: LA / SD / OC / Riverside | Rule of Court 2.503 (bulk prohibited); CCP §1161.2 (eviction index sealed 60 days) | all 4 counties × 3 subdomains blocked |
| NY: Kings 36047 | WebSurrogate ToU §3: bot ban with criminal/civil penalty language | blocked |
| IL: Cook 17031 | GAO 02-03 + ToS commercial-use ban (no "information bureau") | blocked |
| TX: re:SearchTX (District/Probate) | ToU §2.6: programmatic/bulk collection for commercial use prohibited | blocked |
| AZ: Maricopa 04013 | eAccess excludes probate/family/juvenile statewide | not published online |
| FL: Miami-Dade 12086 | Commercial Data Services API ($110/folder/mo + $0.20/call, notarized registration) | probate + evictions both viability 5 |
| TX: Harris 48201 (JP) | JP Public Data Extract Service: official CSV/XML/TSV, outside re:SearchTX §2.6 | evictions viability 5 |

Property-linkage quality: evictions STRONG (premises address is a complaint element), probate STRONG with a 60–180-day inventory-filing lag (FL Rule 5.340, TX Estates §309, IL Probate §14-1, CA Prob §8800), divorce WEAK (address only inside sealed documents). Recommended single pilot: Miami-Dade probate + civil eviction via the Commercial Data Services API. Do NOT pursue any CA Superior Court, any NY Surrogate/Civil (Kings), Cook County, or re:SearchTX.

QA gates

Two independent QA passes ran over the catalogs before any Stage-2 launch could be approved.

The URL-drift / reachability audit re-probed live endpoints: permits.json 0.0% hard-fail (PASS), preforeclosure.json 0.0% (PASS), court_records.json 13.3% (PASS), permits_extended.json 9.1% (conditional pass). No file exceeded the 20% hard-fail blocking threshold, so Stage-2 launch is not blocked by drift. Verdict: 28 production-ready sources approved (top-10 + 12 URL'd viable permits_extended rows + 4 viable preforeclosure + 4 viable court_records), with a HOLD on 24 high-score-but-no-URL permits_extended rows and 4 compound-URL court_records rows pending schema fixes. This is a pilot-readiness verdict, not a production launch.
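
The drift check above amounts to a hard-fail rate per catalog file compared against the 20% blocking threshold. A minimal sketch, assuming hypothetical function names and a simple list of per-probe pass/fail booleans as input (the audit's real input format is not documented here):

```python
HARD_FAIL_BLOCKING_PCT = 20.0  # threshold stated in the audit


def hard_fail_pct(probe_ok: list[bool]) -> float:
    """Percentage of re-probed endpoints that hard-failed."""
    if not probe_ok:
        raise ValueError("no probes to evaluate")
    failures = sum(1 for ok in probe_ok if not ok)
    return 100.0 * failures / len(probe_ok)


def blocks_stage2(pct: float) -> bool:
    """A file blocks launch only when it exceeds the threshold."""
    return pct > HARD_FAIL_BLOCKING_PCT


# Illustrative input only: 4 failures in 30 probes (not the audit's raw counts).
example_pct = hard_fail_pct([False] * 4 + [True] * 26)
```

The threshold check alone does not distinguish a "pass" from a "conditional pass"; that verdict evidently depends on other criteria, such as the missing-URL rows placed on HOLD.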

The catalog-integrity audit over 10 JSON files passed 10/10 existence, 10/10 valid JSON, 10/10 row-count match, 9/10 FIPS clean, 8/8 viability in [1,5]. Two minor fixes flagged: the Auction.com row carries county_fips: "AGGREGATOR" (breaks the 5-digit FIPS contract), and the FIPS field name differs across files (county_fips vs fips5) — both to patch before any automated pipeline consumes the files.
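
Both flagged issues can be caught by a per-row check. The sketch below is one possible shape for it, not the audit's actual code: the function names are hypothetical, the field names county_fips, fips5 and viability_score are taken from this page, and each row is assumed to be a plain dict loaded from one of the JSON catalogs.

```python
import re

FIPS_RE = re.compile(r"[0-9]{5}")
FIPS_FIELD_ALIASES = ("county_fips", "fips5")  # name differs across files


def get_fips(row: dict):
    """Return the FIPS value under whichever alias the file uses."""
    for key in FIPS_FIELD_ALIASES:
        if key in row:
            return row[key]
    return None


def row_problems(row: dict) -> list[str]:
    """List contract violations for one catalog row (empty list = clean)."""
    problems = []
    fips = get_fips(row)
    if not isinstance(fips, str) or not FIPS_RE.fullmatch(fips):
        problems.append(f"bad FIPS: {fips!r}")
    score = row.get("viability_score")
    if not isinstance(score, int) or not 1 <= score <= 5:
        problems.append(f"viability out of range: {score!r}")
    return problems
```

Run against a row shaped like the Auction.com entry, `row_problems({"county_fips": "AGGREGATOR", "viability_score": 1})` reports the FIPS violation, while a row keyed on fips5 with a valid code passes.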

Hard rules baked into every agent

1. Respect robots.txt. Disallowed → viability_score: 1 and DO NOT scrape a sample.
2. ToS scrape clause documented verbatim. Prohibited → viability_score: 1.
3. No login attempts, especially attorney-only court portals (legal risk in CA/NY).
4. FIPS as 5-digit zero-padded string everywhere.
5. Court + foreclosure records contain PII: no redistribution; aggregate-only for model features.
6. Firecrawl budget capped per agent.
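
Rule 1 can be sketched with Python's standard-library robots.txt parser. The function names and the user-agent string below are hypothetical, and the robots.txt text is an illustrative string of the "Disallow: /" shape rather than anything fetched. The helper only evaluates robots.txt text already in hand and never requests the target URL.

```python
from urllib.robotparser import RobotFileParser


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate already-downloaded robots.txt text; performs no network I/O."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def apply_robots_rule(score: int, allowed: bool) -> int:
    """Hard rule 1: a disallowed source is forced to viability 1."""
    return score if allowed else 1


# Illustrative robots.txt bodies (assumed, not fetched from any site).
blanket_disallow = "User-agent: *\nDisallow: /\n"
partial_disallow = "User-agent: *\nDisallow: /private/\n"

blocked = robots_allows(blanket_disallow, "catalog-probe", "https://example.org/listing")
open_path = robots_allows(partial_disallow, "catalog-probe", "https://example.org/public/page")
```

Keeping the score override separate from the robots evaluation means the same override could be reused for rule 2, where the input is a documented ToS prohibition instead of a robots.txt result.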

Open decisions blocking Stage 2

Counsel review on bulk-scrape ToS for the top-3 pilot sources · storage budget (~1GB/county/year est.) · Firecrawl monthly budget cap · a cross-validation harness design vs BuildZoom · and the explicit decision on whether to include court records in the pilot (default: skip). Stage-2 launch needs an explicit go from Ignacio at a Stage-1 results review.

Rendered from notes/scraping/ (README + STAGE_PLAN + per-datatype source catalogs).