Filtering
Configure Stage0 gates, quality thresholds, PII detection, toxicity scoring, and licensing policies.
INDW filtering spans Stage0 fast rejection and downstream quality gates. This guide covers configuring rejection thresholds, policy modes, and adaptive calibration for production corpora.
Filtering layers
| Layer | When | Cost | Typical rejection |
|---|---|---|---|
| Stage0 structural | Before heavy stages | Very low | 60–80% |
| Language gate | Stage0, first mark | Low | 10–20% of remainder |
| Quality gate | After cleaning | Medium | 5–15% |
| Balance caps | At apply | Low | Domain-dependent |
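As a rough worked example, the rejection ranges in the table can be compounded to estimate how much of a corpus reaches the apply step. This sketch assumes each rate applies to the documents remaining after the previous layer (the table states this only for the language gate) and leaves out balance caps because their rate is domain-dependent.

```python
# Rejection ranges copied from the table above.
# Balance caps are omitted because their rate is domain-dependent.
LAYERS = [
    ("Stage0 structural", 0.60, 0.80),
    ("Language gate", 0.10, 0.20),
    ("Quality gate", 0.05, 0.15),
]


def survival_range(layers):
    """Return (worst_case, best_case) fraction of documents that survive.

    Assumption: every rate applies to the documents remaining after the
    previous layer, not to the original input.
    """
    worst = best = 1.0
    for _name, low, high in layers:
        worst *= 1.0 - high  # highest rejection rate at every layer
        best *= 1.0 - low    # lowest rejection rate at every layer
    return worst, best


worst, best = survival_range(LAYERS)
print(f"{worst:.1%} to {best:.1%} of input documents survive")
```

Under these assumptions, roughly 13.6% to 34.2% of input documents pass all three gates, which is a useful sanity check when sizing a raw crawl for a target corpus size.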
Quality thresholds
QualityThresholds controls signal-based rejection after cleaning.
Set high_quality_only: true for premium corpora that require top-percentile documents only.
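The effect of a stricter mode can be illustrated with a small stand-in. The class below is hypothetical and is not INDW's QualityThresholds: the field names min_quality_score and premium_quality_score and their values are assumptions, and a fixed cutoff is used in place of a true percentile for brevity. Only high_quality_only comes from this guide.

```python
from dataclasses import dataclass


@dataclass
class ThresholdSketch:
    """Hypothetical stand-in; not the real QualityThresholds class."""

    min_quality_score: float = 0.35      # assumed field name and value
    premium_quality_score: float = 0.80  # assumed cutoff used in premium mode
    high_quality_only: bool = False      # option named in this guide


def passes_quality(score: float, thresholds: ThresholdSketch) -> bool:
    """Accept a document whose quality score meets the active floor."""
    if thresholds.high_quality_only:
        floor = thresholds.premium_quality_score
    else:
        floor = thresholds.min_quality_score
    return score >= floor


# A mid-scoring document passes the default gate but not the premium one.
print(passes_quality(0.5, ThresholdSketch()))
print(passes_quality(0.5, ThresholdSketch(high_quality_only=True)))
```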
PII filtering
| Mode | Detection | Speed |
|---|---|---|
| structural_only | Regex patterns (email, phone, SSN) | Fast |
| ner | Named entity recognition | Slower, more accurate |
Documents whose PII score exceeds thresholds.max_pii_score are rejected.
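To make the structural_only idea concrete, here is a minimal regex-based sketch. The patterns are simplified and US-centric, and the scoring rule (matches per 100 tokens) is an assumption for illustration; INDW's actual patterns and score definition may differ. Only the max_pii_score name comes from this guide.

```python
import re

# Simplified patterns for illustration only; real detectors need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def pii_score(text: str) -> float:
    """Matches per 100 whitespace-separated tokens (an assumed scoring rule)."""
    tokens = max(len(text.split()), 1)
    hits = sum(len(pattern.findall(text)) for pattern in PII_PATTERNS.values())
    return 100.0 * hits / tokens


def reject_for_pii(text: str, max_pii_score: float) -> bool:
    """Reject when the score exceeds the configured ceiling."""
    return pii_score(text) > max_pii_score


sample = "Contact alice@example.com or call 555-123-4567 today"
print(pii_score(sample), reject_for_pii(sample, max_pii_score=5.0))
```

Regex matching of this kind is cheap but misses names and addresses, which is the trade-off the ner mode addresses at higher cost.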
Toxicity filtering
rules_only uses keyword blocklists and heuristics and requires no ML model. Set classifier_enabled to use model-based toxicity scoring when the embedding extra is installed.
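A blocklist scorer in the spirit of rules_only might look like the sketch below. The blocklist terms are placeholders and the scoring rule (flagged tokens divided by total tokens) is an assumption, not INDW's heuristic.

```python
import re

# Placeholder terms; a production blocklist would be curated per language.
BLOCKLIST = frozenset({"badword", "slur_example"})


def toxicity_score(text: str, blocklist=BLOCKLIST) -> float:
    """Fraction of tokens that appear in the blocklist (assumed scoring rule)."""
    tokens = re.findall(r"[a-z0-9_']+", text.lower())
    if not tokens:
        return 0.0
    flagged = sum(1 for token in tokens if token in blocklist)
    return flagged / len(tokens)


print(toxicity_score("this is a badword here"))
```

Exact-token matching like this is fast and needs no model, but it cannot catch obfuscated spellings or context-dependent toxicity, which is where a classifier helps.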
Licensing policy
LicensePolicyConfig classifies documents by source domain and URL patterns. Proprietary and paywalled content is rejected when configured.
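Classification by source domain and URL pattern can be sketched as follows. The rule tables, license labels, and function names here are hypothetical and do not reflect LicensePolicyConfig's real schema; the sketch only shows the lookup order (URL pattern first, then domain).

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rules for illustration; not LicensePolicyConfig's schema.
DOMAIN_LICENSES = {
    "example-news.com": "paywalled",
    "example-wiki.org": "open",
}
URL_PATTERN_LICENSES = [
    ("*/premium/*", "paywalled"),
    ("*/internal/*", "proprietary"),
]
REJECTED_LICENSES = {"proprietary", "paywalled"}


def classify_license(url: str) -> str:
    """Return a license label; URL patterns take priority over the domain."""
    for pattern, label in URL_PATTERN_LICENSES:
        if fnmatchcase(url, pattern):
            return label
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return DOMAIN_LICENSES.get(host, "unknown")


def is_rejected(url: str) -> bool:
    """Reject only documents classified under a disallowed license."""
    return classify_license(url) in REJECTED_LICENSES


print(classify_license("https://www.example-wiki.org/page"))
print(is_rejected("https://example-wiki.org/premium/article"))
```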
Adaptive calibration
Adaptive calibration adjusts rejection thresholds based on corpus distribution during the warmup phase.
During warmup, INDW samples quality scores into a reservoir. After warmup, documents below the anchor percentile are downranked. This prevents over-rejection on diverse corpora and under-rejection on homogeneous ones.
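The warmup behaviour described above can be sketched with classic reservoir sampling (Algorithm R) and a nearest-rank percentile. The class name, parameter names, and the choice to freeze the threshold once warmup ends are assumptions for illustration; INDW's internals are not documented here.

```python
import random


class AdaptiveThresholdSketch:
    """Hypothetical calibrator: sample scores in warmup, then flag low scorers."""

    def __init__(self, warmup_docs, reservoir_size, anchor_percentile, seed=0):
        self.warmup_docs = warmup_docs
        self.reservoir_size = reservoir_size
        self.anchor_percentile = anchor_percentile
        self.seen = 0
        self.reservoir = []
        self.threshold = None  # set once warmup completes
        self._rng = random.Random(seed)

    def observe(self, score):
        """Return True if the document should be downranked."""
        if self.threshold is None:
            self._sample(score)
            if self.seen >= self.warmup_docs:
                self._freeze()
            return False  # nothing is downranked during warmup
        return score < self.threshold

    def _sample(self, score):
        # Algorithm R: every score seen so far is kept with equal probability.
        self.seen += 1
        if len(self.reservoir) < self.reservoir_size:
            self.reservoir.append(score)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.reservoir_size:
                self.reservoir[j] = score

    def _freeze(self):
        # Nearest-rank estimate of the anchor percentile over the reservoir.
        ordered = sorted(self.reservoir)
        rank = int(len(ordered) * self.anchor_percentile / 100)
        self.threshold = ordered[min(rank, len(ordered) - 1)]


calibrator = AdaptiveThresholdSketch(
    warmup_docs=100, reservoir_size=100, anchor_percentile=10
)
for i in range(100):
    calibrator.observe(i / 100)  # warmup: scores 0.00 .. 0.99
print(calibrator.threshold, calibrator.observe(0.05), calibrator.observe(0.5))
```

Because the threshold is derived from the observed distribution rather than fixed in advance, a corpus with uniformly low scores gets a lower cutoff and a uniformly strong corpus gets a higher one.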
Synthetic defense
Synthetic defense detects machine-generated content and repeated span patterns common in SEO spam and synthetic data contamination.
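One simple signal for repeated spans is the share of word n-grams that duplicate an earlier n-gram in the same document. The sketch below covers only that repetition signal, not machine-generated text detection in general; the function name and the default n are assumptions.

```python
from collections import Counter


def repeated_span_ratio(text: str, n: int = 5) -> float:
    """Fraction of word n-grams that repeat an earlier n-gram in the text."""
    words = text.lower().split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    # Each n-gram's first occurrence is original; the rest are repeats.
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


spam = "buy cheap widgets online now " * 10
prose = "the quick brown fox jumps over the lazy dog near a quiet river bank"
print(repeated_span_ratio(spam), repeated_span_ratio(prose))
```

Templated spam scores close to 1.0 on this measure while ordinary prose stays near 0.0, so a single cutoff separates the two extremes reasonably well.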
QualityGate API
Start with quality_fast_first.yaml for initial runs. Tighten thresholds incrementally and run indw validate after each change to verify hash stability on your acceptance corpus.