Semantic Cleaning
ACIM artifact discovery, HTML normalization, boilerplate removal, and structure recovery.
The cleaning subsystem (indw.clean) transforms raw HTML and noisy text into clean, structured documents. Semantic cleaning runs after Stage0 and before knowledge extraction.
Cleaning pipeline
CorpusCleaningPipeline orchestrates the full cleaning sequence:
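As a rough illustration of what orchestrating a cleaning sequence means, the sketch below chains stages so that each one receives the previous stage's output. It does not use the real CorpusCleaningPipeline API, which is not shown here; SketchCleaningPipeline and both toy stages are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A stage takes document text and returns transformed text.
Stage = Callable[[str], str]


@dataclass
class SketchCleaningPipeline:
    """Hypothetical stand-in for CorpusCleaningPipeline; not the INDW API."""

    stages: List[Stage] = field(default_factory=list)

    def run(self, text: str) -> str:
        # Apply each stage in order, feeding one stage's output into the next.
        for stage in self.stages:
            text = stage(text)
        return text


def strip_nav_lines(text: str) -> str:
    # Toy artifact removal: drop lines that look like a navigation menu.
    return "\n".join(
        line for line in text.splitlines() if not line.startswith("Home |")
    )


def collapse_blank_lines(text: str) -> str:
    # Toy normalization: drop empty lines.
    return "\n".join(line for line in text.splitlines() if line.strip())


pipeline = SketchCleaningPipeline(stages=[strip_nav_lines, collapse_blank_lines])
cleaned = pipeline.run("Home | About | Contact\n\nActual content.\n")
```

Keeping stages as plain functions makes it easy to reorder or drop them per corpus type, which is the property an orchestrator of this kind usually needs.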
Configuration
ACIM artifact discovery
ACIM (Artifact Content Identification Model) discovers and trims structural noise (navigation menus, cookie banners, comment sections, and template boilerplate) while preserving substantive content.
Shadow mode logs artifact scores without trimming. Use shadow mode when calibrating thresholds on a new corpus type.
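The difference between shadow mode and normal trimming can be sketched as follows. This is a minimal illustration under stated assumptions, not ACIM itself: trim_artifacts, toy_score, the 0.5 threshold, and the marker words are all hypothetical, and the real scoring model is not described here.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("acim_sketch")


def trim_artifacts(
    blocks: List[str],
    score: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical value; calibrate per corpus
    shadow: bool = False,
) -> List[str]:
    """Drop blocks scoring at or above threshold; in shadow mode, only log."""
    kept = []
    for block in blocks:
        s = score(block)
        if shadow:
            # Shadow mode: record the score but never trim.
            logger.info("artifact_score=%.2f block=%r", s, block[:40])
            kept.append(block)
        elif s < threshold:
            kept.append(block)
    return kept


def toy_score(block: str) -> float:
    # Toy heuristic: fraction of words that are known boilerplate markers.
    markers = ("cookie", "subscribe", "menu")
    words = block.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!") in markers)
    return hits / len(words)


blocks = ["Accept cookie", "The pipeline cleans text."]
trimmed = trim_artifacts(blocks, toy_score)
shadowed = trim_artifacts(blocks, toy_score, shadow=True)
```

Comparing the logged scores from a shadow run against a manual review of the same documents is one way to pick a threshold before enabling trimming.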
HTML processing
INDW uses trafilatura for HTML extraction with custom post-processing:
- Metadata removal (author, date, tags)
- UI noise removal (nav, footer, sidebar)
- Code fence preservation
- Link text normalization
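Two of the post-processing steps above, code fence preservation and link text normalization, can be sketched together. The example assumes the extractor output is Markdown-like text, which may not match INDW's internal representation; normalize_links_outside_fences is a hypothetical helper and does not call trafilatura.

```python
import re

# Three backticks, built programmatically to keep this example fence-safe.
FENCE = "`" * 3

# Matches Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")


def normalize_links_outside_fences(text: str) -> str:
    """Replace [text](url) with text, leaving fenced code untouched."""
    out = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            # Toggle fence state and keep the delimiter line verbatim.
            in_fence = not in_fence
            out.append(line)
        elif in_fence:
            # Inside a code fence: preserve the line exactly.
            out.append(line)
        else:
            out.append(LINK_RE.sub(r"\1", line))
    return "\n".join(out)


sample = "\n".join(
    ["See [the docs](https://example.com).", FENCE, "x = [a](b)", FENCE]
)
result = normalize_links_outside_fences(sample)
```

Tracking fence state line by line keeps code samples byte-for-byte intact, which matters because link-like syntax inside code is usually meaningful.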
Semantic embedded thresholds
Fine-tune semantic cleaning aggressiveness:
Document gate
After cleaning, document_gate validates the cleaned document against post-clean thresholds:
Documents failing the document gate are rejected before entering extraction.
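A post-clean gate of this kind can be sketched as a pure predicate over the cleaned text. The thresholds below (minimum length and minimum share of alphabetic characters) are assumptions chosen for illustration; the actual checks and default values used by document_gate are not listed here, and GateThresholds and passes_gate are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical post-clean thresholds; not the INDW defaults."""

    min_chars: int = 200
    min_alpha_ratio: float = 0.6


def passes_gate(text: str, t: GateThresholds = GateThresholds()) -> bool:
    stripped = text.strip()
    # Reject empty or too-short documents first (also avoids dividing by zero).
    if not stripped or len(stripped) < t.min_chars:
        return False
    # Reject documents dominated by digits, punctuation, or whitespace.
    alpha = sum(ch.isalpha() for ch in stripped)
    return alpha / len(stripped) >= t.min_alpha_ratio


accepted = passes_gate("word " * 100)
rejected_short = passes_gate("short")
rejected_noise = passes_gate("1234 " * 100)
```

Only documents for which the predicate returns True would continue to extraction.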
Structure recovery
Enable structure recovery alongside cleaning:
Structure recovery identifies sections (headings, paragraphs, code blocks, Q&A pairs) and routes them to indw.extract for downstream processing.
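Labelling sections of the kinds listed above can be sketched with a toy classifier over blank-line-separated blocks. It assumes Markdown-like input with "#" headings, backtick fences, and "Q:"/"A:" prefixes; those conventions, along with classify_block and split_sections, are hypothetical and are not taken from the INDW implementation.

```python
from typing import List, Tuple

# Three backticks, built programmatically to keep this example fence-safe.
FENCE = "`" * 3


def classify_block(block: str) -> str:
    """Return a section kind based on the block's first line."""
    first = block.lstrip().splitlines()[0] if block.strip() else ""
    if first.startswith("#"):
        return "heading"
    if first.startswith(FENCE):
        return "code"
    if first.startswith("Q:") and "\nA:" in block:
        return "qa_pair"
    return "paragraph"


def split_sections(text: str) -> List[Tuple[str, str]]:
    # Toy splitter: blocks are separated by blank lines, so a code block
    # that itself contains a blank line would be split incorrectly.
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return [(classify_block(b), b.strip()) for b in blocks]


doc = (
    "# Title\n\nSome prose.\n\nQ: Why?\nA: Because.\n\n"
    + FENCE + "\nprint(1)\n" + FENCE
)
sections = split_sections(doc)
kinds = [kind for kind, _ in sections]
```

A downstream consumer could then dispatch on the kind label, for example sending code blocks and Q&A pairs to different extractors.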
Set cleaning.minimal: true for pre-cleaned corpora. This skips artifact discovery and aggressive boilerplate removal while retaining document gate validation.
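The effect of the minimal flag can be sketched as stage selection driven by configuration. Only the cleaning.minimal key comes from the text above; select_stages and the stage labels are hypothetical, and stages not mentioned in the description of minimal mode are left out of the sketch.

```python
from typing import Any, Dict, List


def select_stages(config: Dict[str, Any]) -> List[str]:
    """Return the stage labels that would run for a given config."""
    minimal = config.get("cleaning", {}).get("minimal", False)
    stages: List[str] = []
    if not minimal:
        # Skipped for pre-cleaned corpora.
        stages += ["artifact_discovery", "boilerplate_removal"]
    # Document gate validation is retained in both modes.
    stages.append("document_gate")
    return stages


minimal_stages = select_stages({"cleaning": {"minimal": True}})
default_stages = select_stages({})
```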