Stage0 Filter
The fast-rejection layer that eliminates low-quality documents before expensive downstream processing.
Stage0 is the first filtering layer in every INDW pipeline. It runs lightweight, high-throughput checks to reject obviously bad documents before they consume CPU time in semantic cleaning or extraction stages.
How it works
Stage0 applies a sequence of deterministic gates in process_stage0_batch. A document is rejected as soon as any gate fails it, and no further gates run for that document. Gates are ordered cheapest-first to maximize throughput.
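The short-circuit behavior can be sketched as follows. This is a minimal illustration, not the body of process_stage0_batch: the Gate type, the helper names, and the too_long and exact_duplicate reason codes are assumptions made for the example.

```python
import hashlib
from typing import Callable, Iterable, Optional

# Hypothetical gate signature: return a rejection reason code, or None to pass.
Gate = Callable[[str], Optional[str]]


def make_length_gate(min_chars: int, max_chars: int) -> Gate:
    """Cheap gate: compares the character count against configured bounds."""
    def gate(text: str) -> Optional[str]:
        if len(text) < min_chars:
            return "too_short"
        if len(text) > max_chars:
            return "too_long"  # assumed code, not taken from the source
        return None
    return gate


def make_exact_dedup_gate() -> Gate:
    """Rejects a document whose content hash has already been seen."""
    seen: set[str] = set()

    def gate(text: str) -> Optional[str]:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            return "exact_duplicate"  # assumed code, not taken from the source
        seen.add(digest)
        return None
    return gate


def first_rejection(text: str, gates: Iterable[Gate]) -> Optional[str]:
    """Run gates in order and stop at the first one that rejects."""
    for gate in gates:
        reason = gate(text)
        if reason is not None:
            return reason  # later gates never run for this document
    return None
```

Because evaluation stops at the first rejection, a document that fails a cheap gate never reaches the more expensive ones.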
Built-in gates
| Gate | What it checks | Configurable |
|---|---|---|
| Language | Detected language vs allowlist | Yes — language section |
| Exact doc dedup | Content hash against seen set | Yes — dedup.exact |
| Length | Min/max character count | Yes — thresholds.min_chars, max_chars |
| Structural | Alpha ratio, entropy, HTML score | Yes — thresholds section |
| PII | Email, phone, SSN patterns | Yes — pii section |
| Toxicity | Keyword blocklist + heuristics | Yes — toxicity section |
| Licensing | Source domain policy | Yes — licensing section |
Stage0 content filters
run_stage0_content_filters applies structural checks using document_gate_raw features.
Rejection reasons are string codes (e.g. too_short, low_alpha_ratio, high_html_score) logged to the reject log when observability is enabled.
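As a rough sketch of how such features and reason codes might fit together, the example below computes an alpha ratio, a character entropy, and an HTML score, then maps them to the reason codes named above. The feature definitions, function names, and threshold defaults are all assumptions; the actual document_gate_raw features may be computed differently.

```python
import math
import re
from collections import Counter
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")  # crude tag matcher, for illustration only


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic."""
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0


def char_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def html_score(text: str) -> float:
    """Fraction of characters that sit inside tag-like spans."""
    if not text:
        return 0.0
    return sum(len(m.group(0)) for m in TAG_RE.finditer(text)) / len(text)


def structural_features(text: str) -> dict:
    return {
        "n_chars": len(text),
        "alpha_ratio": alpha_ratio(text),
        "entropy": char_entropy(text),
        "html_score": html_score(text),
    }


def structural_reject_reason(
    text: str,
    min_chars: int = 200,      # hypothetical default
    min_alpha: float = 0.6,    # hypothetical default
    max_html: float = 0.3,     # hypothetical default
) -> Optional[str]:
    """Return the first matching reason code, or None if the document passes."""
    f = structural_features(text)
    if f["n_chars"] < min_chars:
        return "too_short"
    if f["alpha_ratio"] < min_alpha:
        return "low_alpha_ratio"
    if f["html_score"] > max_html:
        return "high_html_score"
    return None
```

Entropy is computed as a feature here but not gated on, since the source names no reason code or threshold for it.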
PCI admission
After Stage0 gates pass, evaluate_admission assigns a PCI tier based on meaningful character count. Tier assignment determines whether the document enters heavy stage pools or is terminated.
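One way to picture tier assignment is a threshold lookup on the meaningful character count. The sketch below is illustrative only: the definition of "meaningful" (alphanumeric characters), the tier boundaries, and the rule that tier 0 means termination are assumptions, not the behavior of evaluate_admission.

```python
from bisect import bisect_right

# Hypothetical tier boundaries on meaningful character count.
TIER_THRESHOLDS = (500, 2000)


def meaningful_char_count(text: str) -> int:
    """Assumed definition: count alphanumeric characters only."""
    return sum(ch.isalnum() for ch in text)


def assign_tier(count: int, thresholds=TIER_THRESHOLDS) -> int:
    """Tier index equals the number of thresholds the count has reached."""
    return bisect_right(thresholds, count)


def enters_heavy_pool(tier: int) -> bool:
    """Assumed rule: tier 0 is terminated, higher tiers continue."""
    return tier > 0
```

With these assumed boundaries, a document with 499 meaningful characters lands in tier 0 and is terminated, while one with 500 lands in tier 1 and continues.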
Configuration
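Stage0 gates run in a fixed engine order: language → dedup → structural → admission. Because the order cannot be changed, throughput is tuned through configuration: set the thresholds for the cheapest configurable checks (length, language) so that they reject as many bad documents as possible before the costlier gates run.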
Audit Stage0 throughput
The stage0_production_verify.py script measures per-1k-document latency and rejection distribution.
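The two metrics can be approximated with a small harness like the one below. It is a sketch of the measurement idea, not the contents of stage0_production_verify.py; the audit_stage0 name, the classify callback, and the accepted label are hypothetical.

```python
import time
from collections import Counter
from typing import Callable, Iterable, Optional


def audit_stage0(
    docs: Iterable[str],
    classify: Callable[[str], Optional[str]],
) -> dict:
    """Time a classifier over docs and tally its rejection reasons.

    classify returns a reason code for rejected documents, or None if accepted.
    """
    reasons: Counter = Counter()
    start = time.perf_counter()
    for doc in docs:
        reason = classify(doc)
        reasons[reason if reason is not None else "accepted"] += 1
    elapsed_s = time.perf_counter() - start

    total = sum(reasons.values())
    # seconds per document -> milliseconds per 1,000 documents
    ms_per_1k = (elapsed_s / total) * 1000 * 1000 if total else 0.0
    return {
        "docs": total,
        "ms_per_1k_docs": ms_per_1k,
        "distribution": dict(reasons),
    }
```

Running it on a representative sample before and after a threshold change shows how the rejection mix and the per-1k latency shift.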
Writing a custom gate
Extend the structural filter by adding checks to a custom quality profile, or implement the gate logic in the cleaning pipeline's document_gate stage.
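A custom gate might look like the sketch below, which rejects documents dominated by repeated lines, such as navigation menus or boilerplate footers. The function signature, the high_repeated_line_ratio reason code, and the default threshold are assumptions; how a gate is registered with a quality profile or the document_gate stage is not shown, because the source does not describe it.

```python
from collections import Counter
from typing import Optional


def repeated_line_gate(text: str, max_repeat_ratio: float = 0.3) -> Optional[str]:
    """Reject when too large a share of non-empty lines are duplicates.

    Returns a reason code on rejection, or None if the document passes.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None  # nothing to judge; leave empty input to the length gate

    counts = Counter(lines)
    # Count every occurrence of any line that appears more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    if repeated / len(lines) > max_repeat_ratio:
        return "high_repeated_line_ratio"  # hypothetical reason code
    return None
```

Returning a string code on rejection and None on success keeps the gate consistent with the reason codes used elsewhere in Stage0.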