Scheduler
INDW's merge scheduler: admission tiers, ordered apply, checkpoints, and mixture orchestration.
The scheduler (indw.schedule) coordinates the full merge lifecycle: source interleaving, stage pool dispatch, admission tier assignment, dedup stack management, and ordered apply to output. It is the operational core that makes deterministic output possible at any scale.
Merge lifecycle
Bootstrap
bootstrap_merge_run initializes the merge context:
- Loads
QualityPipelineConfigand resolvesPipelineConfigContext - Creates
CorpusCleaningPipeline,QualityGate, and dedup stack - Opens or restores checkpoint state from work directory
- Acquires merge run lock to prevent concurrent writes
Source interleaving
_InterleavedSources reads multiple raw/<source>/data.jsonl files in weighted round-robin order. Each source maintains an independent line offset for checkpoint resume.
Apply coordinator
The apply coordinator writes survivors to output in strict document sequence. Out-of-order results from parallel workers are buffered and applied when their sequence number is reached. This guarantees deterministic ordering regardless of worker completion order.
Admission tiers
PCI admission assigns documents to tiers before heavy processing:
| Tier | Name | Behavior |
|---|---|---|
| TIER0 | Fast pass | Proceeds to heavy stage pools immediately |
| TIER1 | Rejected | Terminal — written to reject log, not output |
evaluate_admission in Stage0 assigns tiers based on meaningful character count and structural signals. Documents rejected at Stage0 receive TIER1.
Checkpoints
MergeCheckpoint tracks per-source progress:
Resume behavior:
- Default: resume from last checkpoint (
resume=True) --fresh: clear checkpoints and start clean- Checkpoint interval configurable via
checkpoint_interval
Two merge processes writing to the same work directory can corrupt checkpoints. The merge run lock prevents this by default. Use --force only when you intend to take over a stale lock.
Dedup stack
The scheduler builds and restores a three-layer dedup stack during bootstrap:
| Layer | Type | Scope |
|---|---|---|
| Exact | SHA-256 content hash | Global across corpus |
| Fuzzy | MinHash LSH | Near-duplicate detection |
| Semantic | Embedding + FAISS | Paraphrase detection |
Dedup state persists in the work directory for resume. restore_merge_dedup_from_output rebuilds indexes from prior output when resuming.
Mixture orchestration
MixtureOrchestrationConfig controls source weighting and domain balance:
build_corpus_mixture_plan computes target mixture fractions. The apply coordinator enforces caps during ordered write, bypassing caps for documents above the quality threshold.
Finalize
run_merge_finalize completes the merge:
- Flushes dedup indexes and checkpoint state
- Computes
sorted_output_hashfor determinism verification - Emits stage profile and reject log summaries
- Releases merge run lock