Semantic Selection

Section-aware content selection, LCI routing, and semantic quality scoring.

Semantic selection identifies the highest-value content within documents using section-aware routing, semantic scoring, and LCI (Learning Content Intelligence) guided promotion.

Section-aware selection

When semantic_selection.section_mode is enabled, INDW routes document sections (headings, paragraphs, code blocks, lists) through independent quality assessments:

yaml
semantic_selection:
  enabled: true
  section_mode: true

Section classification in indw.extract.sections identifies each section's type and routes it accordingly:

| Section type | Routing |
| --- | --- |
| Heading | Structure recovery |
| Paragraph | Quality scoring |
| Code block | Code preservation path |
| Q&A pair | QA extraction |
| List | Density assessment |
| Reference | Citation handling |
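
The mapping from section type to routing path can be pictured as a simple lookup. The following is a minimal sketch, not INDW's implementation: the key names, the path labels, the `route_section` and `route_document` helpers, and the fallback to quality scoring for unknown types are all assumptions made for illustration.

```python
# Hypothetical routing table mirroring the section types listed above.
# Keys and path labels are illustrative, not INDW identifiers.
ROUTES = {
    "heading": "structure_recovery",
    "paragraph": "quality_scoring",
    "code_block": "code_preservation",
    "qa_pair": "qa_extraction",
    "list": "density_assessment",
    "reference": "citation_handling",
}


def route_section(section_type: str) -> str:
    """Return the routing path for a section type.

    Assumption: unknown types fall back to quality scoring.
    """
    return ROUTES.get(section_type, "quality_scoring")


def route_document(sections: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Pair each section's text with its routing path, preserving order."""
    return [(route_section(kind), text) for kind, text in sections]


sections = [
    ("heading", "Install"),
    ("code_block", "pip install example"),
    ("list", "- a\n- b"),
]
routed = route_document(sections)
```

Because each section is routed independently, a single document can send its code blocks down a preservation path while its paragraphs are scored for quality.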

Semantic cleaning thresholds

Embedded semantic thresholds control line-level noise removal:

yaml
cleaning:
  semantic_embedded:
    url_line_ratio: 0.35
    lead_noise_remove: 0.48
    edge_line_remove: 0.55
    suffix_threshold_fence: 0.72
    suffix_threshold_plain: 0.58

These thresholds are applied to per-line scores during semantic cleaning. Lines that score above the removal threshold are stripped; lines that score below it are preserved.
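
The strip-or-preserve rule can be sketched as a filter over scored lines. This is only an illustration under stated assumptions: the `strip_noisy_lines` helper and the sample scores are hypothetical, the scorer itself is not shown, and a line scoring exactly at the threshold is assumed to be preserved (the source does not say which way ties go).

```python
from typing import Iterable

# Values copied from the configuration above.
THRESHOLDS = {
    "lead_noise_remove": 0.48,
    "edge_line_remove": 0.55,
}


def strip_noisy_lines(
    scored_lines: Iterable[tuple[str, float]], threshold: float
) -> list[str]:
    """Keep lines whose noise score does not exceed the threshold.

    Assumption: a line scoring exactly at the threshold is preserved.
    """
    return [line for line, score in scored_lines if score <= threshold]


# Hypothetical per-line noise scores; a real scorer would produce these.
scored = [
    ("Subscribe to our newsletter!", 0.91),
    ("Semantic selection ranks sections by value.", 0.12),
    ("Share this page", 0.60),
]
kept = strip_noisy_lines(scored, THRESHOLDS["edge_line_remove"])
```

With the `edge_line_remove` value of 0.55, the two promotional lines are dropped and only the substantive sentence remains in `kept`.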

LCI routing

LCI (Learning Content Intelligence) routes documents to appropriate heavy stage pools based on content fingerprints and PCI scores:

yaml
orchestration:
  enabled: true

When orchestration is enabled, indw.schedule.intel.router evaluates document fingerprints and promotes high-value documents to priority processing tiers. Documents with low intelligence scores may skip expensive extraction stages.
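
As a rough illustration of score-based promotion, the sketch below maps an intelligence score to a processing tier. The cut-off values, the tier names, and the `RoutingDecision` and `route_by_score` names are hypothetical; the actual logic in indw.schedule.intel.router is not documented here and may differ.

```python
from dataclasses import dataclass

# Hypothetical cut-offs chosen for illustration; not INDW defaults.
PRIORITY_MIN = 0.75
SKIP_HEAVY_MAX = 0.30


@dataclass
class RoutingDecision:
    tier: str
    run_heavy_stages: bool


def route_by_score(intel_score: float) -> RoutingDecision:
    """Promote high scorers and let low scorers skip expensive stages."""
    if intel_score >= PRIORITY_MIN:
        return RoutingDecision("priority", True)
    if intel_score <= SKIP_HEAVY_MAX:
        return RoutingDecision("standard", False)
    return RoutingDecision("standard", True)


high = route_by_score(0.90)
low = route_by_score(0.10)
```

Here `high` lands in the priority tier with heavy stages enabled, while `low` stays in the standard tier and skips them.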

Document type assessment

indw.extract.assess classifies documents by type (web, wiki, code, reasoning, conversation, docs, qa) and applies type-specific quality metrics:

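python
from indw.extract.assess.doc_type import classify_document_type

# Placeholder inputs for illustration: the document body and its source URL.
text = "Semantic selection identifies the highest-value content within documents."
url = "https://example.com/guides/semantic-selection"

doc_type = classify_document_type(text, url=url)
# One of "web", "wiki", "code", "reasoning", "conversation", "docs", "qa"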

Document type feeds into balance caps and curriculum weighting.

Curriculum weighting

yaml
curriculum:
  enabled: true
  stage: core
  stage_weights:
    reasoning: 1.2
    code: 1.1
    docs: 1.0
    wiki: 0.95
    web: 0.85
    conversation: 0.8
    qa: 0.85
  min_stage_score: 0.42

Curriculum weighting adjusts effective quality scores by document type, promoting reasoning and code content in training mixtures.
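
To make the effect of the weights concrete, here is a small worked sketch. It assumes the weight is applied multiplicatively to a base quality score, that min_stage_score is compared against the weighted result, and that unlisted document types get a neutral weight of 1.0. None of these details are confirmed by the configuration itself, and the function names are hypothetical.

```python
# Weights and minimum copied from the configuration above.
STAGE_WEIGHTS = {
    "reasoning": 1.2,
    "code": 1.1,
    "docs": 1.0,
    "wiki": 0.95,
    "web": 0.85,
    "conversation": 0.8,
    "qa": 0.85,
}
MIN_STAGE_SCORE = 0.42


def effective_score(quality: float, doc_type: str) -> float:
    """Scale a base quality score by its document-type weight.

    Assumption: multiplicative weighting, neutral weight for unknown types.
    """
    return quality * STAGE_WEIGHTS.get(doc_type, 1.0)


def passes_stage(quality: float, doc_type: str) -> bool:
    """Assumption: the minimum applies to the weighted score."""
    return effective_score(quality, doc_type) >= MIN_STAGE_SCORE


# The same base quality of 0.40 fares differently by type:
reasoning_ok = passes_stage(0.40, "reasoning")  # 0.40 * 1.2 is about 0.48
web_ok = passes_stage(0.40, "web")              # 0.40 * 0.85 is about 0.34
```

Under these assumptions a reasoning document with base quality 0.40 clears the 0.42 minimum, while a web document with the same base quality does not.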

Semantic diversity

Synthetic defense uses semantic diversity scoring to detect homogeneous or machine-generated content:

yaml
synthetic_defense:
  enabled: true
  min_semantic_diversity: 0.18
  max_synthetic_score: 0.72

Documents with low semantic diversity (repetitive templates, generated filler) are rejected even if individual quality metrics pass.
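
The two limits can be read as a gate applied after scoring. The sketch below assumes both scores are already computed, that a document must satisfy both limits to pass, and that values exactly at a limit are accepted; the `passes_synthetic_defense` name is hypothetical and the scoring itself is not shown.

```python
# Limits copied from the configuration above.
MIN_SEMANTIC_DIVERSITY = 0.18
MAX_SYNTHETIC_SCORE = 0.72


def passes_synthetic_defense(semantic_diversity: float, synthetic_score: float) -> bool:
    """Accept a document only if it is diverse enough and not too synthetic.

    Assumption: both conditions must hold, and boundary values pass.
    """
    return (
        semantic_diversity >= MIN_SEMANTIC_DIVERSITY
        and synthetic_score <= MAX_SYNTHETIC_SCORE
    )


# Hypothetical scores: a repetitive template versus varied prose.
template_page = passes_synthetic_defense(semantic_diversity=0.09, synthetic_score=0.40)
varied_article = passes_synthetic_defense(semantic_diversity=0.41, synthetic_score=0.25)
```

The templated page is rejected on diversity alone, even though its synthetic score is within the limit, which matches the behavior described above.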

Role-based extraction

indw.extract.roles provides domain-specific extraction for education, publication, and other content roles. Role detection influences section routing and entity extraction schemas.

© 2026 Instant Developers. All rights reserved.