INDW interleaves sources according to their mixture weights. Place each source under raw/<name>/data.jsonl. The scheduler reads sources in weighted round-robin order and preserves each source's record order, so a run can resume from a checkpoint.
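The interleaving described above can be sketched as follows. This is a minimal illustration, not INDW's actual scheduler: it assumes a "smooth" weighted round-robin variant, and the function name `weighted_round_robin`, the in-memory lists (standing in for the lines of each raw/<name>/data.jsonl), and the example weights are all hypothetical.

```python
from typing import Dict, Iterator, List, Tuple


def weighted_round_robin(
    sources: Dict[str, List[str]],
    weights: Dict[str, float],
) -> Iterator[Tuple[str, str]]:
    """Yield (source_name, record) pairs, interleaved by weight.

    Assumed scheme (smooth weighted round-robin): on every step each
    active source gains credit equal to its weight; the source with the
    most credit emits its next record and pays back the total active weight.
    Records from one source are always emitted in their original order.
    """
    iters = {name: iter(records) for name, records in sources.items()}
    credit = {name: 0.0 for name in sources}
    active = list(sources)

    while active:
        total = sum(weights[name] for name in active)
        for name in active:
            credit[name] += weights[name]

        # max() returns the first maximal entry, so ties break deterministically.
        chosen = max(active, key=lambda name: credit[name])
        try:
            record = next(iters[chosen])
        except StopIteration:
            # Source exhausted: drop it and let the remaining sources continue.
            active.remove(chosen)
            continue

        credit[chosen] -= total
        yield chosen, record


if __name__ == "__main__":
    demo_sources = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2"]}
    demo_weights = {"a": 2, "b": 1}
    for name, record in weighted_round_robin(demo_sources, demo_weights):
        print(name, record)
```

Because the schedule is deterministic and each source is consumed strictly in order, a checkpoint in a scheme like this would only need the per-source read offsets and the current credit values to resume at the same point.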
Domain caps limit the fraction of output drawn from each content domain. A document can bypass its domain's cap when its quality score exceeds quality_cap_bypass_score.
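One way to picture the cap-and-bypass rule is the sketch below. Only the name quality_cap_bypass_score comes from the text above; the function `apply_domain_caps`, the `domain` and `quality` record fields, the `target_size` parameter, and the choice to count bypassing documents toward their domain's tally are assumptions made for illustration, not INDW's documented behavior.

```python
from collections import Counter
from typing import Dict, List


def apply_domain_caps(
    docs: List[dict],
    caps: Dict[str, float],
    quality_cap_bypass_score: float,
    target_size: int,
) -> List[dict]:
    """Select up to target_size docs, in input order, honoring domain caps.

    A cap is read here as the maximum fraction of target_size that one
    domain may fill (an assumption). A doc whose quality exceeds
    quality_cap_bypass_score is kept even if its domain is already full.
    """
    kept: List[dict] = []
    per_domain: Counter = Counter()

    for doc in docs:
        if len(kept) >= target_size:
            break

        domain = doc["domain"]
        # Domains without an explicit cap are treated as uncapped (1.0).
        limit = int(caps.get(domain, 1.0) * target_size)

        under_cap = per_domain[domain] < limit
        bypass = doc["quality"] > quality_cap_bypass_score

        if under_cap or bypass:
            kept.append(doc)
            per_domain[domain] += 1

    return kept
```

With a cap of 0.25 on a hypothetical `news` domain and a target of four documents, only one ordinary `news` document is admitted, while a later `news` document scoring above the bypass threshold is still kept.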
Audit the run
After merge completes, inspect pipeline health:
bash
indw audit --kind pipeline --work-dir ./work
For Stage0 throughput verification:
bash
indw audit --kind stage0 --workers 4
Certification
Before promoting to production scale:
bash
indw validate
indw benchmark
indw audit --kind production