Semantic Selection
Section-aware content selection, LCI routing, and semantic quality scoring.
Semantic selection identifies the highest-value content within documents using section-aware routing, semantic scoring, and LCI (Learning Content Intelligence) guided promotion.
Section-aware selection
When semantic_selection.section_mode is enabled, INDW routes document sections (headings, paragraphs, code blocks, lists) through independent quality assessments.
Section classification in indw.extract.sections assigns each section a type and routes it as follows:
| Section type | Routing |
|---|---|
| Heading | Structure recovery |
| Paragraph | Quality scoring |
| Code block | Code preservation path |
| Q&A pair | QA extraction |
| List | Density assessment |
| Reference | Citation handling |
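The routing table above can be sketched as a lookup keyed by section type. This is a minimal illustration only: the classifier heuristics, the `ROUTES` labels, and the function names are assumptions made for the example, not the behaviour or API of indw.extract.sections.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this example readable

# Hypothetical route labels mirroring the table; not real INDW identifiers.
ROUTES = {
    "heading": "structure_recovery",
    "paragraph": "quality_scoring",
    "code_block": "code_preservation",
    "qa_pair": "qa_extraction",
    "list": "density_assessment",
    "reference": "citation_handling",
}


def classify_section(text: str) -> str:
    """Naive, assumption-laden heuristics that pick one section type."""
    stripped = text.strip()
    lines = stripped.splitlines()
    if stripped.startswith(FENCE):
        return "code_block"
    if re.match(r"#{1,6}\s", stripped):
        return "heading"
    if re.match(r"(Q:|Question:)", stripped) and re.search(
        r"^(A:|Answer:)", stripped, re.M
    ):
        return "qa_pair"
    if re.match(r"\[\d+\]\s", stripped):
        return "reference"
    if lines and all(re.match(r"\s*([-*+]|\d+\.)\s", ln) for ln in lines):
        return "list"
    # Anything unrecognised falls back to the paragraph path.
    return "paragraph"


def route_section(text: str) -> str:
    """Return the (hypothetical) processing path for one section."""
    return ROUTES[classify_section(text)]
```

Because each section is routed on its own, a noisy list does not affect how a code block in the same document is handled.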
Semantic cleaning thresholds
Embedded semantic thresholds control line-level noise removal.
During semantic cleaning, each line is scored individually and compared against these thresholds. Lines that score above the removal threshold are stripped; lines that score below it are preserved.
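Per-line thresholding might look like the sketch below. The `noise_score` function, the `removal_threshold` default of 0.5, and the choice to keep lines that score exactly at the threshold are all assumptions for illustration; the real scorer and threshold values are not given here.

```python
from typing import Callable


def noise_score(line: str) -> float:
    """Toy noise score in [0, 1]: share of characters that are neither
    alphanumeric nor whitespace. A stand-in for a real semantic scorer."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    noisy = sum(1 for ch in stripped if not (ch.isalnum() or ch.isspace()))
    return noisy / len(stripped)


def clean_lines(
    text: str,
    removal_threshold: float = 0.5,  # hypothetical default
    score: Callable[[str], float] = noise_score,
) -> str:
    """Strip lines scoring above the threshold; keep everything else."""
    kept = [ln for ln in text.splitlines() if score(ln) <= removal_threshold]
    return "\n".join(kept)
```

Passing the scorer as a parameter keeps the thresholding logic separate from whatever model or heuristic produces the score.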
LCI routing
LCI (Learning Content Intelligence) routes documents to appropriate heavy stage pools based on content fingerprints and PCI scores.
When orchestration is enabled, indw.schedule.intel.router evaluates document fingerprints and promotes high-value documents to priority processing tiers. Documents with low intelligence scores may skip expensive extraction stages.
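As a rough sketch of the promotion logic, the example below reduces fingerprint evaluation to a single intelligence score in [0, 1]. The `RoutingDecision` type, the tier names, and both cut-off values are hypothetical and do not describe the actual interface of indw.schedule.intel.router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    tier: str                   # hypothetical tier label
    run_heavy_extraction: bool  # whether expensive stages should run


def route_document(
    intel_score: float,
    promote_at: float = 0.8,   # assumed promotion cut-off
    skip_below: float = 0.2,   # assumed skip cut-off
) -> RoutingDecision:
    """Map an intelligence score to a processing tier."""
    if not 0.0 <= intel_score <= 1.0:
        raise ValueError("intel_score must be in [0, 1]")
    if intel_score >= promote_at:
        # High-value documents are promoted to priority processing.
        return RoutingDecision("priority", True)
    if intel_score < skip_below:
        # Low-scoring documents skip the expensive extraction stages.
        return RoutingDecision("standard", False)
    return RoutingDecision("standard", True)
```

For example, `route_document(0.9)` yields a priority decision, while `route_document(0.1)` stays on the standard tier with heavy extraction disabled.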
Document type assessment
indw.extract.assess classifies documents by type (web, wiki, code, reasoning, conversation, docs, qa) and applies type-specific quality metrics.
Document type feeds into balance caps and curriculum weighting.
Curriculum weighting
Curriculum weighting adjusts effective quality scores by document type, promoting reasoning and code content in training mixtures.
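One simple way to picture curriculum weighting is as a per-type multiplier on the base quality score. The weights below are invented for illustration, as are the function name and the decision to clamp the result to 1.0; only the set of document types and the direction of the adjustment (reasoning and code are promoted) come from the text above.

```python
# Hypothetical multipliers: reasoning and code are boosted, others are neutral.
CURRICULUM_WEIGHTS = {
    "web": 1.0,
    "wiki": 1.0,
    "code": 1.25,
    "reasoning": 1.5,
    "conversation": 1.0,
    "docs": 1.0,
    "qa": 1.0,
}


def effective_quality(quality: float, doc_type: str) -> float:
    """Scale a base quality score by its document-type weight, capped at 1.0."""
    weight = CURRICULUM_WEIGHTS.get(doc_type, 1.0)  # unknown types stay neutral
    return min(1.0, quality * weight)
```

With these assumed weights, a reasoning document with base quality 0.6 ends up ranked above a web document with base quality 0.8.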
Semantic diversity
Synthetic defense uses semantic diversity scoring to detect homogeneous or machine-generated content.
Documents with low semantic diversity (repetitive templates, generated filler) are rejected even if individual quality metrics pass.
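The rejection rule can be illustrated with a deliberately simple proxy. The sketch below uses the ratio of distinct word trigrams as a lexical stand-in for semantic diversity; an actual implementation would presumably compare embeddings or similar semantic features. All names and thresholds here are assumptions.

```python
def distinct_ngram_ratio(text: str, n: int = 3) -> float:
    """Distinct n-grams divided by total n-grams; 1.0 means no repetition."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 1.0  # too short to judge; treated as diverse
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def accept_document(
    text: str,
    quality: float,
    min_quality: float = 0.5,    # hypothetical quality gate
    min_diversity: float = 0.5,  # hypothetical diversity gate
) -> bool:
    """A document must pass both gates; good quality alone is not enough."""
    if quality < min_quality:
        return False
    return distinct_ngram_ratio(text) >= min_diversity
```

A templated page that repeats the same phrase is rejected here even when its quality score is high, which mirrors the behaviour described above.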
Role-based extraction
indw.extract.roles provides domain-specific extraction for education, publication, and other content roles. Role detection influences section routing and entity extraction schemas.