
First Pipeline


This guide builds a production-style, multi-source INDW pipeline. The pipeline ingests several sources, applies a quality profile, and uses mixture weights and balance caps to produce balanced training output.

Project layout

  • corpus-pipeline
    • raw
      • wiki
        • data.jsonl
      • web
        • data.jsonl
      • code
        • data.jsonl
    • configs
      • pipeline.yaml
    • work
    • out
      • filtered.jsonl
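
The layout above can be created by hand or with a few lines of Python. The following is a minimal sketch, not part of INDW: the helper name `create_layout` is hypothetical, and it only creates the directories. The `data.jsonl` files and `configs/pipeline.yaml` still have to be supplied, and `out/filtered.jsonl` is written by the pipeline run.

```python
from pathlib import Path


def create_layout(root: Path, sources=("wiki", "web", "code")) -> list[Path]:
    """Create the directory skeleton shown above and return the directories.

    Hypothetical helper for illustration; it is not an INDW function.
    """
    # One raw/<name>/ directory per source, plus configs/, work/ and out/.
    dirs = [root / "raw" / name for name in sources]
    dirs += [root / "configs", root / "work", root / "out"]
    for d in dirs:
        # exist_ok=True makes the helper safe to run more than once.
        d.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `create_layout(Path("corpus-pipeline"))` from the parent directory produces the skeleton used in the rest of this guide.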

Quality profile

Create configs/pipeline.yaml based on the fast-first profile:

yaml
meta:
  id: corpus-pipeline
  purpose: multi-source english corpus with balance caps
 
cleaning:
  enabled: true
  semantic_cleaning: true
  artifact_discovery: true
  knowledge_extraction: true
 
thresholds:
  min_chars: 150
  max_chars: 120000
  min_alpha_ratio: 0.30
  max_boilerplate_score: 0.45
  max_toxicity: 0.45
  max_pii_score: 0.45
 
dedup:
  exact: true
  fuzzy: true
  fuzzy_threshold: 0.90
 
balance:
  enabled: true
  domain_caps:
    web: 0.55
    wiki: 0.18
    code: 0.25
  language_caps:
    en: 0.75
    other: 0.20
 
licensing:
  enabled: true
  reject_proprietary: true
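
To make the `thresholds` block more concrete, here is a rough sketch of how the length and alphabetic-ratio limits might be applied to a single document. It is an illustration only and not INDW's implementation: the function names, the rejection reason strings, and the definition of the alpha ratio (alphabetic characters divided by total characters) are all assumptions. The boilerplate, toxicity, and PII scores are left out because they depend on scoring models that this page does not describe.

```python
# Values copied from the thresholds block in configs/pipeline.yaml.
THRESHOLDS = {
    "min_chars": 150,
    "max_chars": 120000,
    "min_alpha_ratio": 0.30,
}


def alpha_ratio(text: str) -> float:
    """Fraction of characters that are alphabetic (an assumed definition)."""
    if not text:
        return 0.0
    return sum(ch.isalpha() for ch in text) / len(text)


def passes_thresholds(text: str, thresholds: dict = THRESHOLDS) -> tuple[bool, str]:
    """Return (kept, reason); reason is an empty string when the text is kept."""
    n = len(text)
    if n < thresholds["min_chars"]:
        return False, "too_short"
    if n > thresholds["max_chars"]:
        return False, "too_long"
    if alpha_ratio(text) < thresholds["min_alpha_ratio"]:
        return False, "low_alpha_ratio"
    return True, ""
```

Checking the cheap length limits before the character scan keeps the common rejection path fast, which is one plausible reason to order the checks this way.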

Run with custom config

python
import yaml
from pathlib import Path
from indw.filter.spec.quality import QualityPipelineConfig
from indw.schedule import merge_with_quality
 
# Load the YAML file and build the quality pipeline config from it.
cfg = QualityPipelineConfig.from_dict(
    yaml.safe_load(Path("configs/pipeline.yaml").read_text(encoding="utf-8"))
)
 
# Merge every source under ./raw into a single filtered output file.
result = merge_with_quality(
    Path("./raw"),
    Path("./out/filtered.jsonl"),
    quality_config=cfg,
    work_dir=Path("./work"),
    workers=4,
    fresh=True,
)
 
# The result is dict-like; report how many documents were kept and rejected.
print(f"kept={result.get('kept', 0)} rejected={result.get('rejected', 0)}")
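
The raw counts are easier to compare across runs as a rate. The sketch below assumes only what the example above shows, namely that the result is dict-like with `kept` and `rejected` keys. The helper name `summarize` is hypothetical, and treating `kept + rejected` as the total number of documents seen is an assumption.

```python
def summarize(result: dict) -> dict:
    """Derive totals and a keep rate from the merge result counts."""
    kept = int(result.get("kept", 0))
    rejected = int(result.get("rejected", 0))
    # Assumption: every document seen is counted as either kept or rejected.
    total = kept + rejected
    keep_rate = kept / total if total else 0.0
    return {
        "kept": kept,
        "rejected": rejected,
        "total": total,
        "keep_rate": keep_rate,
    }
```

For example, `summarize(result)["keep_rate"]` gives the fraction of documents that survived filtering in the run above.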

Multi-source interleaving

INDW interleaves sources according to their mixture weights. Place each source under raw/<name>/data.jsonl. The scheduler reads the sources in weighted round-robin order and preserves the record order within each source, so an interrupted run can resume from a checkpoint.
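
The weighted round-robin order described above can be illustrated with a small standalone sketch. This is not INDW's scheduler: it uses the well-known "smooth" weighted round-robin technique, and the function name `interleave` and the example weights are assumptions made for illustration. Each source is consumed through its own iterator, so the order of records within a source is never changed.

```python
from typing import Dict, Iterable, Iterator, Tuple


def interleave(
    sources: Dict[str, Iterable], weights: Dict[str, float]
) -> Iterator[Tuple[str, object]]:
    """Yield (source_name, item) pairs in smooth weighted round-robin order."""
    iters = {name: iter(items) for name, items in sources.items()}
    credit = {name: 0.0 for name in iters}
    while iters:
        total = sum(weights[name] for name in iters)
        # Every live source earns credit in proportion to its weight.
        for name in iters:
            credit[name] += weights[name]
        # The source with the most credit goes next and pays back the total.
        name = max(iters, key=lambda n: credit[n])
        credit[name] -= total
        try:
            item = next(iters[name])
        except StopIteration:
            # Exhausted sources drop out; the remaining ones keep their credit.
            del iters[name]
            del credit[name]
            continue
        yield name, item
```

With `weights={"a": 2, "b": 1}`, source `a` is picked roughly twice as often as `b` while both still have records. Because the schedule is deterministic, replaying it from the same per-source positions yields the same order, which is the property that a checkpoint resume relies on.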

yaml
balance:
  enabled: true
  domain_caps:
    web: 0.55
    wiki: 0.18
    code: 0.25

Domain caps limit the fraction of output from each content domain. High-quality documents can bypass caps when their quality score exceeds quality_cap_bypass_score.
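
The cap-and-bypass rule can be sketched as follows. This is a simplified illustration and not INDW's balancing logic: it assumes a known target output size so that each cap becomes a fixed document budget, and the document fields `domain` and `quality`, the parameter `target_size`, and the choice to leave domains without a configured cap unlimited are all hypothetical. The `bypass_score` argument stands in for `quality_cap_bypass_score`, whose actual value and location in the config are not shown on this page.

```python
from collections import Counter


def apply_domain_caps(docs, caps, target_size, bypass_score):
    """Keep documents while their domain is under budget or quality is high."""
    # Turn each fractional cap into a document budget for the target size.
    limits = {domain: int(cap * target_size) for domain, cap in caps.items()}
    counts = Counter()
    kept = []
    for doc in docs:
        domain = doc["domain"]
        # Assumption: domains without a configured cap are not limited.
        under_cap = domain not in limits or counts[domain] < limits[domain]
        # High-quality documents are kept even when the cap is already full.
        bypass = doc["quality"] > bypass_score
        if under_cap or bypass:
            counts[domain] += 1
            kept.append(doc)
    return kept
```

In this sketch, bypassed documents still count toward their domain's total, so a stream of very high-quality documents can push a domain above its nominal cap.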

Audit the run

After merge completes, inspect pipeline health:

bash
indw audit --kind pipeline --work-dir ./work

For Stage0 throughput verification:

bash
indw audit --kind stage0 --workers 4

Certification

Before promoting the pipeline to production scale, run the validation, benchmark, and production audit checks:

bash
indw validate
indw benchmark
indw audit --kind production
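
In a CI job it can be convenient to run these three commands in order and stop at the first failure. The wrapper below is a minimal sketch that uses only the commands listed above. It assumes that `indw` signals failure with a non-zero exit code, which this page does not state, and the names `CERTIFICATION_STEPS` and `run_steps` are hypothetical.

```python
import subprocess

# The three certification commands from this section, in order.
CERTIFICATION_STEPS = [
    ["indw", "validate"],
    ["indw", "benchmark"],
    ["indw", "audit", "--kind", "production"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order; return the first failing command, or None.

    Assumption: a non-zero return code means the step failed.
    """
    for cmd in steps:
        completed = runner(cmd)
        if completed.returncode != 0:
            return cmd
    return None
```

Calling `run_steps(CERTIFICATION_STEPS)` returns `None` when every step exits with code 0. The `runner` parameter exists so the function can be exercised without `indw` installed.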
© 2026 Instant Developers. All rights reserved.