Permit classification & taxonomy
The two-axis permit_scope classifier (v5.3.3): how every building permit is labelled with WHAT it touches and WHAT is being done to it.
Every building permit we ingest is passed through one classifier that emits a single, MECE
classification column: permit_scope. The v4 classifier emitted one value from a flat
25-entry enum that quietly mixed two unrelated questions — WHAT the permit touches (roof,
pool, HVAC) and WHAT action is taken (new, replace, demolish). Five of the 25 "types" were
actions wearing a type costume: a permit to "demolish a detached garage" forced the cascade to pick
either DEMOLITION or GARAGE and threw the other half away. v5
splits the question into two orthogonal axes — an object axis (permit_type) and an
action axis (permit_action) — so the two halves are never lost. The two axes are
emitted paired inside one struct, and that paired list is permit_scope.
The two axes
Each permit_scope item is a {type, action} struct: what the permit
touches drawn from a 20-value object axis, paired with what is done to it drawn from a 7-value
action axis. Both axes are MECE and both are evaluated by a specificity-descending cascade — the most
specific category that matches wins, the catch-alls come last.
Axis 1 · permit_type — the object
20 values · MECE · cascade evaluated most-specific → most-general
| Value | What it is | Note |
|---|---|---|
| SOLAR | Solar / photovoltaic system | Beats ROOFING — a solar-on-roof permit is solar work, not roof-cover work |
| POOL | Swimming pool / spa | Beats ROOFING — a pool-cage roof is pool work |
| FIRE | Fire alarm / sprinkler / suppression | Beats ELECTRICAL & PLUMBING |
| GENERATOR | Standby / backup generator | Beats ELECTRICAL |
| HVAC | Heating, ventilation, A/C | System type — may match broadly in free text |
| ELECTRICAL | Electrical system / wiring | System type |
| PLUMBING | Plumbing / water heater / drains | System type — "water heater in garage" is PLUMBING, not GARAGE |
| ROOFING | Roof-covering system | Anti-FP guarded, compound-first keywords — see below |
| WINDOWS | Windows | |
| DOORS | Doors | |
| GARAGE | Garage as the structure | Location-prone — Tier-1 or strong compound only |
| SHED | Shed / accessory structure | Location-prone — carport stays anti-FP, never SHED |
| DECK | Deck | Location-prone — Tier-1 or strong compound only |
| FENCE | Fence | |
| SIGN | Signage | Above FOUNDATION (sign footings) |
| FOUNDATION | Foundation / underpinning | Below SIGN; bare "footing" dropped |
| SITEWORK | Grading / site / civil work | |
| BUILDING | Whole-structure permit | Broadest object; absorbs v4 NEW_CONSTRUCTION + STRUCTURAL |
| OTHER | Categorizable but none of the above | Catch-all |
| UNKNOWN | No legible signal | Last resort |
Axis 2 · permit_action — the verb
7 values · cascade first-match-wins (specific → general)
| Value | What it is | Note |
|---|---|---|
| DEMOLITION | Remove / tear down whole structure | Guarded: suppressed when remodel / addition / tenant-improvement words co-occur (interior demo ≠ teardown) |
| NEW | First-time install / new construction | |
| REPLACEMENT | Like-for-like swap of an existing system | The confirmed-reroof verb for ROOFING |
| REPAIR | Fix part of an existing system (leak, patch, partial) | For ROOFING, NOT a positive |
| ADDITION | Extend / add to an existing structure | |
| ALTERATION | Modify / remodel / renovate | Broad catch-all action; absorbs v4 INTERIOR / RENOVATION |
| NA | No action signal legible | The norm, not an edge case — see positive-eligibility below |
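The two tables above can be written down as a typed struct. This is a minimal sketch, not the shipped schema: the class names `PermitType`, `PermitAction`, `ScopeItem` and the helper `parse_item` are hypothetical, and only the enum members come from the tables. The `TYPE:ACTION` string form mirrors the notation used elsewhere on this page (for example `ROOFING:REPLACEMENT`).

```python
from dataclasses import dataclass
from enum import Enum

# Object axis: the 20 values from the permit_type table, in table order.
PermitType = Enum(
    "PermitType",
    "SOLAR POOL FIRE GENERATOR HVAC ELECTRICAL PLUMBING ROOFING WINDOWS DOORS "
    "GARAGE SHED DECK FENCE SIGN FOUNDATION SITEWORK BUILDING OTHER UNKNOWN",
)

# Action axis: the 7 values from the permit_action table, in table order.
PermitAction = Enum(
    "PermitAction",
    "DEMOLITION NEW REPLACEMENT REPAIR ADDITION ALTERATION NA",
)


@dataclass(frozen=True)
class ScopeItem:
    """One (object, verb) pair; hypothetical stand-in for a permit_scope struct."""

    type: PermitType
    action: PermitAction

    def __str__(self) -> str:
        return f"{self.type.name}:{self.action.name}"


def parse_item(text: str) -> ScopeItem:
    """Parse 'TYPE:ACTION'; raises KeyError for a value outside either enum."""
    type_name, action_name = text.split(":")
    return ScopeItem(PermitType[type_name], PermitAction[action_name])
```

Because both halves live in one frozen object, a type can never be stored without its action, which is the pairing property the struct is meant to guarantee.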
Why action replaced subtype. permit_action
replaces the v4 permit_subtype — now every permit gets a sub-classification, not just
roofing/solar/HVAC. INTERIOR, RENOVATION, NEW_CONSTRUCTION and DEMOLITION were all v4 "types" that are
really actions; they moved here.
Cascade order (shipped). The v5 spec lists actions
most-specific first; the shipped classify_v5.py evaluates them
NEW (whole-structure) → DEMOLITION → REPLACEMENT →
REPAIR → ADDITION → NEW (component) → ALTERATION →
NA, first match wins. Precedence matters in mixed text:
"new construction reroof" → REPLACEMENT; "teardown interior" →
ALTERATION (DEMOLITION guard).
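The shipped order can be illustrated with a small first-match-wins cascade. This is a sketch under loud assumptions: every keyword list below is hypothetical (the real lists in classify_v5.py are not shown on this page), and plain substring checks are used only for brevity. The rule order and the DEMOLITION guard follow the text above.

```python
# Hypothetical keyword lists, for illustration only.
# Each rule is (action, trigger keywords, guard keywords that suppress the rule).
RULES = [
    ("NEW", ("new single family", "new dwelling"), ()),           # whole-structure NEW
    ("DEMOLITION", ("demolish", "demolition", "teardown"),
     ("remodel", "addition", "tenant improvement", "interior")),  # guarded
    ("REPLACEMENT", ("reroof", "re-roof", "replace", "change out"), ()),
    ("REPAIR", ("repair", "patch", "leak"), ()),
    ("ADDITION", ("addition", "add to", "extend"), ()),
    ("NEW", ("new ", "install"), ()),                             # component NEW
    ("ALTERATION", ("alter", "remodel", "renovat", "interior"), ()),
]


def classify_action(description: str) -> str:
    """Return the first action whose trigger matches and whose guard does not."""
    text = f" {description.lower()} "
    for action, keywords, guards in RULES:
        matched = any(k in text for k in keywords)
        blocked = any(g in text for g in guards)
        if matched and not blocked:
            return action
    return "NA"  # no action signal legible
```

With these assumed lists, `classify_action("new construction reroof")` returns `REPLACEMENT` because the REPLACEMENT rule is reached before component NEW, and `classify_action("teardown interior")` falls through the guarded DEMOLITION rule to `ALTERATION`.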
Signal-tier scoring
Greedy substring matching was the root cause of every precision bug in both QA audits, so the
classifier weights a keyword by where it appears. A field's provenance sets its tier and its
score. A type joins permit_scope when its accumulated signal score ≥ 3, or when it
is the primary classification.
| Tier | Fields | Originator | Score | Trust |
|---|---|---|---|---|
| Tier 1 | TYPE, SUBTYPE | Building Permit — AHJ categorical label | 6 | Authoritative — any keyword match classifies |
| Tier 2 | DESCRIPTION, PROJECT_NAME | Building Permit — owner source text | 4 | Primary text — specific compound keywords only; outranks Tier 3 |
| Tier 3 | PROJECT_TYPE array | BuildZoom-derived classification | 3 | Supporting — classifies only when Tiers 1-2 are silent; loses to a Tier-2 compound |
| Dropped | BUSINESS_NAME | Contractor name | — | Not used for classification (contractor names hijacked rows). Kept as a feature. |
Scoring must satisfy: TYPE/SUBTYPE > DESCRIPTION compound >
PROJECT_TYPE token > generic fallback. Location-prone types (GARAGE, SHED, DECK) fire only
on a Tier-1 match or a strong Tier-2 compound — never on a bare "garage" mention in DESCRIPTION. System
types (HVAC, PLUMBING) may match broadly in Tier 2 because their keywords are unambiguous.
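A rough sketch of how tier scores could accumulate into scope membership follows. The tier scores (6, 4, 3), the threshold of 3, and the location-prone rule come from the text above; the `Signal` record, the `compound` flag, the function names, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

TIER_SCORE = {1: 6, 2: 4, 3: 3}          # scores from the tier table
SCOPE_THRESHOLD = 3                       # a type joins permit_scope at score >= 3
LOCATION_PRONE = {"GARAGE", "SHED", "DECK"}


@dataclass(frozen=True)
class Signal:
    """Hypothetical record of one keyword hit: which type, found in which tier."""

    permit_type: str
    tier: int
    compound: bool = False  # True for a strong multi-word keyword


def score_types(signals: List[Signal]) -> Dict[str, int]:
    scores: Dict[str, int] = defaultdict(int)
    for s in signals:
        # Location-prone types need a Tier-1 hit or a strong Tier-2 compound.
        strong = s.tier == 1 or (s.tier == 2 and s.compound)
        if s.permit_type in LOCATION_PRONE and not strong:
            continue
        scores[s.permit_type] += TIER_SCORE[s.tier]
    return dict(scores)


def select_scope_types(signals: List[Signal]) -> List[str]:
    """Types ordered by score; the first entry is the primary classification."""
    scores = score_types(signals)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # tie-break is assumed
    return [t for t in ranked if scores[t] >= SCOPE_THRESHOLD or t == ranked[0]]
```

For a permit with a DESCRIPTION reroof compound, an HVAC token in PROJECT_TYPE, and a bare "garage" mention in DESCRIPTION, this sketch keeps ROOFING (score 4) as primary, keeps HVAC (score 3) as a secondary item, and drops GARAGE.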
The v5.0.1 re-rank — why PROJECT_TYPE was demoted Tier 1 → Tier 3
v5.0.0 treated PROJECT_TYPE as authoritative, in Tier 1 alongside TYPE/SUBTYPE and ranked above the DESCRIPTION source text.
PROJECT_TYPE has Originator = BuildZoom (a derived classification), while TYPE/SUBTYPE/DESCRIPTION have Originator = Building Permit (primary source text). A derived field must not outrank the source text it was derived from.
Roughly 200K real re-roofs that were co-tagged HVAC in PROJECT_TYPE lost their ROOFING classification, because PROJECT_TYPE's HVAC token outranked the DESCRIPTION reroof compound. v5.0.1 demoted PROJECT_TYPE to Tier 3 (score 3): it now classifies only when Tiers 1-2 are silent, and always loses to a DESCRIPTION compound.
ROOFING: anti-false-positive guards
ROOFING is the highest-stakes type and the most error-prone, because roof is a substring
of many unrelated words and a roof is the location of much non-roofing work. ROOFING therefore uses
compound-first keywords (re-roof, roof replacement, shingle,
metal roofing, tile roof, tpo, built-up roof) and is
blocked by anti-FP guards unless a genuine-reroof compound fires the STRONG_ROOF rescue.
Blocked by anti-FP
- solar / PV on roof
- patio cover
- screen room / porch
- pool cage / enclosure
- sunroom / lanai
- carport
- pergola
- -proofing words: waterproofing, weatherproofing, soundproofing, fireproofing, dampproofing, rustproofing
- roof deck / roof truss / roof drain
STRONG_ROOF rescue
- A genuine reroof compound (e.g. tear off & re-roof, roof replacement) overrides the anti-FP block.
- So "tear off shingles, re-roof, and install solar PV" yields both ROOFING:REPLACEMENT and SOLAR:NEW: one permit, two real items.
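The interaction between the compound keywords, the anti-FP block and the STRONG_ROOF rescue can be sketched as a three-step check. The compound keywords are the ones listed above; the anti-FP list is abbreviated, and the membership of `STRONG_ROOF` is assumed from the two examples given. The function name `roofing_allowed` is hypothetical.

```python
import re


def _pattern(words):
    # Leading \b only: "shingle" also matches "shingles",
    # while "tpo" does not match inside a word such as "outpost".
    return re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + ")",
                      re.IGNORECASE)


ROOF_COMPOUND = _pattern(["re-roof", "roof replacement", "shingle", "metal roofing",
                          "tile roof", "tpo", "built-up roof"])
# Abbreviated, illustrative subset of the blocked contexts.
ANTI_FP = _pattern(["solar", "patio cover", "screen room", "pool cage", "sunroom",
                    "lanai", "carport", "pergola", "waterproofing",
                    "roof deck", "roof truss", "roof drain"])
# Assumed membership, based on the examples in the text.
STRONG_ROOF = _pattern(["re-roof", "roof replacement"])


def roofing_allowed(description: str) -> bool:
    """True when the text should yield a ROOFING item under this sketch."""
    if not ROOF_COMPOUND.search(description):
        return False                       # no roofing keyword at all
    if STRONG_ROOF.search(description):
        return True                        # rescue overrides the anti-FP block
    return not ANTI_FP.search(description)
```

Under these assumptions the combined tear-off-plus-solar permit keeps its ROOFING item, while a solar install that merely mentions a shingle roof does not gain one.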
Bare " roof " location guard v5.3.3
- Bare space-padded " roof " is suppressed when an equipment-location phrase (ROOF_LOCATION: "on roof", "rooftop", "roof mounted", "roof area") is the only context.
- Kills "change out 3-ton AC on roof" and "rooftop packaged unit" false positives, which previously added a spurious ROOFING:NA item.
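One way to express "is the only context" is to delete the location phrases first and then look for a surviving bare " roof ". The sketch below assumes that reading; the four phrases are the ones quoted above, while the punctuation normalisation and the function name `bare_roof_counts` are choices made for this example.

```python
import re

# The four ROOF_LOCATION phrases from the text, matched as whole space-delimited
# phrases. Lookarounds keep the surrounding spaces so adjacent phrases still match.
ROOF_LOCATION = re.compile(r"(?<= )(?:on roof|rooftop|roof mounted|roof area)(?= )")


def bare_roof_counts(description: str) -> bool:
    """True when a bare ' roof ' remains after removing equipment-location phrases."""
    # Assumed normalisation: lowercase, punctuation to spaces ("roof-mounted" -> "roof mounted").
    normalized = re.sub(r"[^a-z0-9]+", " ", description.lower())
    padded = f" {normalized} "
    remaining = ROOF_LOCATION.sub("", padded)
    return " roof " in remaining
```

A description that mentions the roof both as a location and as the work object, such as "Replace roof, AC unit on roof", still counts, because one bare mention survives the removal.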
Boilerplate stripping v5.3.2
Some jurisdictions append a long standard inspection checklist to every permit DESCRIPTION
("separate permits are required for electrical, plumbing, heating…", "footing and foundation inspection",
"Gopher State One Call prior to digging"). That boilerplate names every trade by design, so the classifier
scored it and stuffed the scope with phantom items that were never the permit's actual work. v5.3.2
truncates DESCRIPTION before the earliest of 8 boilerplate markers
(BOILERPLATE_MARKERS) — only the real work-description prefix is scored.
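Truncating before the earliest marker is a single leftmost regex search. In this sketch the three markers are taken from the quoted examples above and stand in for the full list of 8, which is not reproduced here; the function name `strip_boilerplate` is hypothetical.

```python
import re

# Illustrative subset only: the shipped BOILERPLATE_MARKERS has 8 entries.
BOILERPLATE_MARKERS = (
    "separate permits are required",
    "footing and foundation inspection",
    "gopher state one call",
)

_MARKER_RE = re.compile("|".join(re.escape(m) for m in BOILERPLATE_MARKERS),
                        re.IGNORECASE)


def strip_boilerplate(description: str) -> str:
    """Keep only the text before the earliest boilerplate marker, if any."""
    match = _MARKER_RE.search(description)   # search() returns the leftmost match
    if match is None:
        return description
    return description[:match.start()].rstrip()
```

Only the returned prefix would then be passed to keyword scoring, so trades named in the checklist never reach the scorer.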
Output contract recap
The classifier (scripts/roofing/classify_v5.py) writes one row per Gold permit. The only
taxonomy column is:
permit_scope list<struct<type, action>> — every (object, verb) the permit covers
type ∈ the 20 object values, action ∈ the 7 action values. The pairing is
intrinsic (one struct), so type and action can never desync. permit_scope is never
empty: a scopeless permit is [{OTHER, NA}] or [{UNKNOWN, NA}]. No derived
columns (is_roofing, roof_action, a primary permit_type) are baked
into the taxonomy — they are non-MECE projections the consumer derives:
- is_roofing = 'ROOFING' is among the permit_scope types
- roof_action = the action of the ROOFING item
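On the consumer side, both projections are one-liners over the scope list. The sketch assumes a permit_scope value arrives as a list of dicts with `type` and `action` keys and that a permit carries at most one ROOFING item; neither detail is confirmed by this page.

```python
from typing import Dict, List, Optional

ScopeItem = Dict[str, str]  # assumed in-memory shape: {"type": ..., "action": ...}


def is_roofing(scope: List[ScopeItem]) -> bool:
    """True when ROOFING is among the permit_scope types."""
    return any(item["type"] == "ROOFING" for item in scope)


def roof_action(scope: List[ScopeItem]) -> Optional[str]:
    """Action of the first ROOFING item, or None when the permit has none."""
    return next((item["action"] for item in scope if item["type"] == "ROOFING"), None)
```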
Positive-eligibility rules (the label contract)
| ROOFING item | Treatment | Why |
|---|---|---|
| ROOFING:REPLACEMENT | Positive | Confirmed reroof |
| ROOFING:NA | Positive-eligible | The "ambiguous reroof" bucket — a roof permit defaults to a replacement absent a contrary verb. ~53% of ROOFING items carry NA; dropping them would gut >50% of roof positives. |
| ROOFING:REPAIR | NOT a positive | Patch, not replacement; carried only as a prior-repair feature |
| ROOFING:NEW + BUILDING:NEW | Excluded | New-construction roof, not a reroof event |
| ROOFING:{ADDITION, ALTERATION, DEMOLITION} | Excluded | Not a reroof event |
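The table above maps directly onto a small lookup. This is a sketch: the return labels are hypothetical strings, the scope is assumed to be a list of dicts, and the case the table does not cover (ROOFING:NEW without BUILDING:NEW) is returned as "unspecified" rather than guessed.

```python
from typing import Dict, List, Optional

EXCLUDED_ACTIONS = {"ADDITION", "ALTERATION", "DEMOLITION"}


def roofing_treatment(scope: List[Dict[str, str]]) -> Optional[str]:
    """Label-contract treatment of a permit's ROOFING item (None if it has none)."""
    pairs = [(item["type"], item["action"]) for item in scope]
    action = next((a for t, a in pairs if t == "ROOFING"), None)
    if action is None:
        return None
    if action == "REPLACEMENT":
        return "positive"            # confirmed reroof
    if action == "NA":
        return "positive_eligible"   # ambiguous-reroof bucket
    if action == "REPAIR":
        return "not_positive"        # kept only as a prior-repair feature
    if action in EXCLUDED_ACTIONS:
        return "excluded"
    if action == "NEW" and ("BUILDING", "NEW") in pairs:
        return "excluded"            # new-construction roof
    return "unspecified"             # not covered by the table
```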
The full schema, join semantics and consumer contract live on the parent step: Step 1 — Labeling.
Audit headline numbers
Audited 2026-05-21 · v5.3.3 · 31-FIPS local layer · 98,946,381 permits
is_roofing recall vs an independent BuildZoom reference (30-FIPS dev layer; an optimistic upper bound)
is_roofing precision (prior audits)
MECE check passed: 98,946,381 rows (== raw Gold), 0 empty/null scope, every
type in the 20-enum and action in the 7-enum; 16.6% of rows are multi-item.
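The structural checks in that audit line (no empty scope, every value inside its enum, share of multi-item rows) could be reproduced with a pass like the one below. It is a sketch over an in-memory iterable of scopes, not the audit's actual code; the function name and the dict-based row shape are assumptions.

```python
from typing import Dict, Iterable, List

TYPES = frozenset(
    "SOLAR POOL FIRE GENERATOR HVAC ELECTRICAL PLUMBING ROOFING WINDOWS DOORS "
    "GARAGE SHED DECK FENCE SIGN FOUNDATION SITEWORK BUILDING OTHER UNKNOWN".split()
)
ACTIONS = frozenset("DEMOLITION NEW REPLACEMENT REPAIR ADDITION ALTERATION NA".split())


def audit_scopes(scopes: Iterable[List[Dict[str, str]]]) -> Dict[str, float]:
    """Count rows, empty scopes, out-of-enum rows, and the multi-item share."""
    total = empty = invalid = multi = 0
    for scope in scopes:
        total += 1
        if not scope:                      # None or [] both count as empty
            empty += 1
            continue
        if any(i.get("type") not in TYPES or i.get("action") not in ACTIONS
               for i in scope):
            invalid += 1
        if len(scope) > 1:
            multi += 1
    share = multi / total if total else 0.0
    return {"rows": total, "empty": empty, "invalid": invalid,
            "multi_item_share": share}
```

A passing run in the sense of the audit line would report `empty == 0` and `invalid == 0`, with `rows` equal to the raw Gold row count.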
Version history
| Date | Version | Change | Impact |
|---|---|---|---|
| 2026-05-21 | v5.0.0 | Built + audited (2 reviewers, 365-row sample) → don't ship | Two-axis model live but PROJECT_TYPE over-trusted |
| 2026-05-21 | v5.0.1 | Signal-tier re-rank: PROJECT_TYPE demoted Tier 1 → Tier 3 | Stopped relabelling ~200K real re-roofs co-tagged HVAC |
| 2026-05-21 | v5.1.0 | Multi-label — emit all categories with signal + primary + is_roofing | Dissolves the single-label tie-break bug class (58% of permits are multi-trade) |
| 2026-05-21 | v5.3.0 | Collapse output to one MECE column permit_scope (intrinsic type↔action pairing) | Drops non-MECE derived columns; they become Step-5 projections |
| 2026-05-21 | v5.3.1 | ROOFING keyword expansion (single ply, sbs, cap sheet, standing seam, torch down…) | +11,169 ROOFING on 9-FIPS, 0 removed (monotone) |
| 2026-05-21 | v5.3.2 | AHJ-boilerplate stripping — truncate DESCRIPTION before the inspection checklist | Minneapolis boilerplate rows 3.84 → 2.16 avg scope items |
| 2026-05-21 | v5.3.3 | " roof " location guard — suppress incidental "AC on roof" mentions | Removed ~144K incidental-location FPs; 0 guard-caused FN |
permit_scope over-tag, slated for Phase 2) and a small EPDM/membrane
keyword gap (v5.4 candidate). The ROOFING:NA → positive-eligible default is documented as an
assumption pending a human hand-grade of that bucket.
Rendered from notes/Roofing/steps/02_permit_taxonomy_v5.md + 05_labels_output_contract.md · audit notes/Roofing/audits/2026-05-21_v5.3.3_classifier_audit.md · classifier scripts/roofing/classify_v5.py (v5.3.3). This page is Step 1.2 of the roofing pipeline.