
Stage0 Filter

The fast-rejection layer that eliminates low-quality documents before expensive downstream processing.

Stage0 is the first filtering layer in every INDW pipeline. It runs lightweight, high-throughput checks to reject obviously bad documents before they consume CPU time in semantic cleaning or extraction stages.

Per 1,000 docs: < 2s
Typical rejection rate: 60–80%
Stage marks: 5
ML models required: 0

How it works

Stage0 applies a sequence of deterministic gates in process_stage0_batch. A document fails as soon as any gate rejects it, and no further checks run for that document. Gates are ordered cheapest-first to maximize throughput.
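The short-circuit behavior described above can be sketched as follows. This is a minimal illustration, not the body of process_stage0_batch: the gate functions, the BLOCKLIST contents, the blocked_keyword reason code, and the 150-character cutoff are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence

# A gate returns a rejection reason code, or None when the document passes.
Gate = Callable[[str], Optional[str]]

BLOCKLIST = {"spamword"}  # hypothetical blocklist, for illustration only


def length_gate(text: str) -> Optional[str]:
    # Cheap check: a single length lookup.
    return "too_short" if len(text) < 150 else None


def blocklist_gate(text: str) -> Optional[str]:
    # Costlier check: tokenizes the whole document.
    words = set(text.lower().split())
    return "blocked_keyword" if words & BLOCKLIST else None


def run_gates(text: str, gates: Sequence[Gate]) -> Optional[str]:
    # Stop at the first gate that rejects; later gates never run.
    for gate in gates:
        reason = gate(text)
        if reason is not None:
            return reason
    return None


GATES = [length_gate, blocklist_gate]  # ordered cheapest-first
```

With this ordering, a short document that also contains a blocked keyword is rejected as too_short, and the tokenizing gate is never called for it.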

Built-in gates

| Gate | What it checks | Configurable |
| --- | --- | --- |
| Language | Detected language vs allowlist | Yes: language section |
| Exact doc dedup | Content hash against seen set | Yes: dedup.exact |
| Length | Min/max character count | Yes: thresholds.min_chars, max_chars |
| Structural | Alpha ratio, entropy, HTML score | Yes: thresholds section |
| PII | Email, phone, SSN patterns | Yes: pii section |
| Toxicity | Keyword blocklist + heuristics | Yes: toxicity section |
| Licensing | Source domain policy | Yes: licensing section |
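As an illustration of the exact-dedup row, checking a content hash against a seen set could look like the sketch below. The choice of SHA-256, the function names, and the in-memory set are assumptions; the hash and storage that INDW actually uses are not stated here.

```python
import hashlib


def content_hash(text: str) -> str:
    # Hash the exact UTF-8 bytes; any normalization is left out of this sketch.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_exact_duplicate(text: str, seen: set) -> bool:
    # Return True if this content was already seen; otherwise record it.
    digest = content_hash(text)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Because the hash covers the raw bytes, two documents that differ only by a trailing space count as distinct in this sketch.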

Stage0 content filters

run_stage0_content_filters applies structural checks using document_gate_raw features:

python
from indw.filter.stage0.engine import run_stage0_content_filters, worker_gate_policy
 
# text, meaningful_chars, and raw_features are assumed to come from earlier pipeline steps.
pol = worker_gate_policy()
reject = run_stage0_content_filters(
    text,
    meaningful_chars=meaningful_chars,
    raw=raw_features,
    pol=pol,
    normalized=True,
)

Rejection reasons are string codes (e.g. too_short, low_alpha_ratio, high_html_score) logged to the reject log when observability is enabled.
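To make the structural signals and reason codes more concrete, here is a rough sketch of how an alpha ratio and a character entropy might be computed and turned into a reason code. The function names and exact formulas are assumptions and may differ from what document_gate_raw produces; the default thresholds mirror the Configuration section.

```python
import math
from collections import Counter
from typing import Optional


def alpha_ratio(text: str) -> float:
    # Share of characters that are alphabetic.
    return sum(ch.isalpha() for ch in text) / len(text) if text else 0.0


def char_entropy(text: str) -> float:
    # Shannon entropy of the character distribution, in bits per character.
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def structural_reject(text: str, min_chars: int = 150,
                      min_alpha_ratio: float = 0.45) -> Optional[str]:
    # Return a reason code for the first failed check, or None on a pass.
    if len(text) < min_chars:
        return "too_short"
    if alpha_ratio(text) < min_alpha_ratio:
        return "low_alpha_ratio"
    return None
```

The entropy helper is shown only as a signal; the threshold applied to it, if any, is not documented here, so the sketch does not gate on it.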

PCI admission

After Stage0 gates pass, evaluate_admission assigns a PCI tier based on meaningful character count. Tier assignment determines whether the document enters heavy stage pools or is terminated.

python
from indw.filter.stage0.admission import evaluate_admission
 
decision = evaluate_admission(meaningful_chars=meaningful_chars)
# decision.tier → TIER0 or TIER1
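The tiering idea can be sketched with a single cutoff on meaningful character count. Everything below is hypothetical: the Tier enum, the AdmissionDecision shape, the sketch_admission name, the 500-character cutoff, and the assumption that higher counts map to TIER1. The real evaluate_admission may use different rules and fields.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    TIER0 = 0
    TIER1 = 1


@dataclass(frozen=True)
class AdmissionDecision:
    tier: Tier
    meaningful_chars: int


def sketch_admission(meaningful_chars: int,
                     tier1_min_chars: int = 500) -> AdmissionDecision:
    # Assumption: counts at or above the cutoff map to TIER1, the rest to TIER0.
    tier = Tier.TIER1 if meaningful_chars >= tier1_min_chars else Tier.TIER0
    return AdmissionDecision(tier=tier, meaningful_chars=meaningful_chars)
```

A caller would then branch on decision.tier to decide whether the document continues to the heavier stages.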

Configuration

yaml
thresholds:
  min_chars: 150
  max_chars: 100000
  min_alpha_ratio: 0.45
  max_html_score: 0.15
  max_repetition: 0.65
 
language:
  enabled: true
  fast_detector_only: true
 
pii:
  enabled: true
  mode: structural_only
 
toxicity:
  enabled: true
  mode: rules_only
 
licensing:
  enabled: true
  reject_proprietary: true
  reject_paywalled: true
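
Before running a pipeline, it can help to sanity-check the thresholds section. The sketch below works on a plain dict that mirrors the YAML above. The validate_thresholds helper is hypothetical, and the assumption that the ratio and score values lie between 0 and 1 is inferred from the example values rather than stated by INDW.

```python
def validate_thresholds(t: dict) -> list:
    # Collect human-readable problems instead of raising on the first one.
    problems = []
    if t["min_chars"] >= t["max_chars"]:
        problems.append("min_chars must be smaller than max_chars")
    for key in ("min_alpha_ratio", "max_html_score", "max_repetition"):
        # Assumption: these values are fractions in the range 0 to 1.
        if not 0.0 <= t[key] <= 1.0:
            problems.append(f"{key} must be between 0 and 1")
    return problems


thresholds = {
    "min_chars": 150,
    "max_chars": 100000,
    "min_alpha_ratio": 0.45,
    "max_html_score": 0.15,
    "max_repetition": 0.65,
}
```

An empty list from validate_thresholds(thresholds) means none of these basic checks failed.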
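
Ordering matters:

Stage0 gates run in a fixed engine order: language → dedup → structural → admission. Because that order is fixed, the way to maximize throughput is through configuration: set the cheapest configurable checks (length, language) strictly enough that they reject most bad documents before the costlier gates run.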

Audit Stage0 throughput

bash
indw audit --kind stage0 --workers 4

This runs stage0_production_verify.py to measure per-1k-document latency and rejection distribution.
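
For a quick local estimate of the same two quantities, per-1k-document latency and rejection distribution, a rough timing harness might look like this. It is not what stage0_production_verify.py does internally; the measure_gate name and the returned keys are made up for the example, and the gate is any callable that returns a reason code or None.

```python
import time
from collections import Counter
from typing import Callable, Iterable, Optional


def measure_gate(gate: Callable[[str], Optional[str]],
                 docs: Iterable[str]) -> dict:
    # Time one pass over the documents and tally rejection reasons.
    reasons = Counter()
    count = 0
    start = time.perf_counter()
    for doc in docs:
        count += 1
        reason = gate(doc)
        if reason is not None:
            reasons[reason] += 1
    elapsed = time.perf_counter() - start
    rejected = sum(reasons.values())
    return {
        "docs": count,
        "seconds_per_1k": elapsed / count * 1000 if count else 0.0,
        "rejection_rate": rejected / count if count else 0.0,
        "reasons": dict(reasons),
    }
```

Single-process timings like this ignore worker parallelism and I/O, so treat them as a rough guide rather than a substitute for the audit command.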

Writing a custom gate

Extend the structural filter by adding checks to a custom quality profile or implementing gate logic in the cleaning pipeline's document_gate stage:

python
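from indw.clean.gate.evaluate import document_gate_raw


def custom_gate(text):
    # Illustrative wrapper name; returns a reason code, or None to pass.
    features = document_gate_raw(text)
    if features.get("custom_signal", 0) > 0.8:
        return "custom_reject_reason"
    return None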