Instant DevelopersArchitectureExecution Graph

Execution Graph

|||

How INDW resolves, dispatches, and executes the canonical merge stage graph.

The execution graph is INDW's single canonical representation of pipeline stages. Every merge run resolves the same graph regardless of backend — workers execute stage pools in parallel, and the apply coordinator merges results in document sequence order.

Graph structure

Documents enter the graph as rows from interleaved source streams. Each document carries a MergeDocumentContext that accumulates stage marks, rejection reasons, quality scores, and dedup hashes through the pipeline.

Stage marks

Each document context tracks which stages have processed it via stage marks:

MarkStagePurpose
STAGE_FAST_FILTERLanguage gateEarly language detection and rejection
STAGE_DOC_DEDUPExact doc dedupContent hash against seen set
STAGE_STRUCTURAL_FILTERStage0 content filtersLength, entropy, structural checks
STAGE_METADATAIngest enrichmentURL, domain, provenance metadata
STAGE_ADMISSIONPCI admissionTier assignment (TIER0, TIER1)

Stage0 processing in process_stage0_batch runs these marks in order. A rejection at any mark terminates the document with a tier assignment.

Dispatch model

The scheduler dispatches documents in chunks (default 500 per batch) to worker pools:

python
# Conceptual dispatch flow
batch = read_chunk(sources, chunk_size=500)
stage0_result = process_stage0_batch(batch)
clean_result = process_clean_pool(stage0_result.survivors)
# ... extract, dedup, quality gate
apply_coordinator.write_ordered(results)

Worker count and chunk size are resolved from CLI flags, environment, and hardware probes:

bash
indw merge ./raw ./out/filtered.jsonl --workers 8 --chunk-size 500

MergeDocumentContext

The document context is the carrier object through the graph:

python
from indw.schedule.state.context import MergeDocumentContext
 
ctx = MergeDocumentContext.from_survivor_payload(item)
ctx.mark(STAGE_FAST_FILTER)
ctx.reject("language_not_allowed")  # terminates document
ctx.survivor_payload(work_dir=work_dir)  # serializable output

Rejected documents carry _reject_tier and rejection reason. Survivors carry stage0_cleared, quality scores, and dedup hashes for downstream pools.

PipelineRunner

PipelineRunner (indw.schedule.stages.runner) orchestrates stage pool execution. It resolves the active backend, allocates worker pools, and coordinates batch handoff between stage pools without duplicating stage logic per backend.

Rendering diagram...

Parity guarantees

The execution graph is designed for backend parity. Integration tests verify:

  • Local vs multiprocess output hash match
  • Workers=1 vs workers=N hash match
  • Tier admission consistency across backends

Run indw validate to execute the parity suite.

© 2026 Instant Developers. All rights reserved.