Architecture Overview

How INDW's canonical execution graph processes documents from ingest through ordered apply.

INDW runs a single canonical execution graph. Documents flow through ingest and Stage0, pass the PCI admission tiers, enter parallel heavy stage pools, and are written out by an ordered apply coordinator. Heavy stage logic is backend-agnostic: only the transport layer changes between local, thread, multiprocess, and Dask execution.

  • 1 canonical graph
  • 4 execution backends
  • 2 admission tiers
  • 0 logic changes per backend

High-level flow

Ingest → Stage0 → Admission → Heavy pools → Apply → Output
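
The flow above can be read as a chain of stages in which any stage may reject a document. The following is a minimal sketch of that idea, not INDW's code: the `Doc` type, the stage functions, and the length threshold are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Doc:
    seq: int   # position of the document in the input stream
    text: str


# A stage returns the (possibly modified) document, or None to reject it.
Stage = Callable[[Doc], Optional[Doc]]


def stage0(doc: Doc) -> Optional[Doc]:
    """Cheap gate: drop empty documents before any heavy work."""
    return doc if doc.text.strip() else None


def admit(doc: Doc) -> Optional[Doc]:
    """Stand-in admission check using a hypothetical minimum length."""
    return doc if len(doc.text) >= 5 else None


def heavy_clean(doc: Doc) -> Optional[Doc]:
    """Stand-in heavy stage: collapse runs of whitespace."""
    return Doc(doc.seq, " ".join(doc.text.split()))


def run_pipeline(docs: Iterable[Doc], stages: List[Stage]) -> Iterator[Doc]:
    """Pass each document through the stages, stopping at the first rejection."""
    for doc in docs:
        current: Optional[Doc] = doc
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            yield current


corpus = [
    Doc(0, "hello   world"),
    Doc(1, "   "),             # rejected by stage0
    Doc(2, "hi"),              # rejected by the admission check
    Doc(3, "a  longer  text"),
]
survivors = list(run_pipeline(corpus, [stage0, admit, heavy_clean]))
```

Here documents 0 and 3 survive. Cheap gates run first so that rejected documents never reach the expensive stages.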

Subsystems

| Subsystem | Package | Responsibility |
| --- | --- | --- |
| Ingest | indw.ingest | Download, HF datasets, format normalization |
| Filter | indw.filter | Stage0 gates, quality scoring, PII, toxicity, licensing |
| Clean | indw.clean | Semantic cleaning, artifact discovery, structure recovery |
| Extract | indw.extract | Section routing, entity extraction, assessment |
| Dedup | indw.dedup | Exact hash, MinHash fuzzy, embedding semantic |
| Schedule | indw.schedule | Merge graph, admission, dispatch, apply coordinator |
| Store | indw.store | Corpus registry, atomic I/O, JSONL export |

Determinism contract

The apply coordinator writes survivors in strict document sequence. The canonical output hash (sorted_output_hash) is stable across worker counts and execution backends when the quality configuration and input corpus are unchanged.
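
One way to picture this contract is a reorder buffer: workers may finish in any order, but results are released only when the next expected sequence number has arrived. The sketch below is an illustration under assumptions, not INDW's implementation. `ordered_apply` and `sorted_lines_hash` are hypothetical names, and the hash shown (SHA-256 over sorted lines) is a stand-in that may differ from how `sorted_output_hash` is actually computed.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Optional, Tuple

# (sequence number, surviving text or None if the document was rejected)
Result = Tuple[int, Optional[str]]


def ordered_apply(results: Iterable[Result]) -> Iterator[str]:
    """Yield surviving texts in sequence order, buffering early arrivals."""
    buffered: Dict[int, Optional[str]] = {}
    next_seq = 0
    for seq, text in results:
        buffered[seq] = text
        # Release every result that is now contiguous with what was written.
        while next_seq in buffered:
            ready = buffered.pop(next_seq)
            next_seq += 1
            if ready is not None:  # rejected documents only advance the counter
                yield ready


def sorted_lines_hash(lines: Iterable[str]) -> str:
    """Order-independent digest: hash the lines after sorting them."""
    digest = hashlib.sha256()
    for line in sorted(lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Workers finished out of order; document 3 was rejected.
completed = [(2, "gamma"), (0, "alpha"), (3, None), (1, "beta")]
written = list(ordered_apply(completed))
```

In this sketch `written` comes out as alpha, beta, gamma regardless of completion order. Rejected documents still report their sequence number, otherwise the coordinator would wait forever for a gap that never fills.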

Backend abstraction

Execution backends implement a common contract (indw.schedule.backends). The pipeline runner dispatches document batches to worker pools without knowing whether workers are threads, processes, or Dask tasks.

| Backend | Use case |
| --- | --- |
| local | Debugging, single-document tracing |
| thread | I/O-bound workloads |
| multiprocess | CPU-bound cleaning and extraction (default) |
| dask | Cluster-scale corpora (100 GB+) |

Backend selection: the default backend is multiprocess. Set INSTANT_PIPELINE_BACKEND=dask for cluster execution without changing pipeline code.
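
The idea that only the transport changes can be sketched with the standard library's `concurrent.futures`. This is a hedged illustration, not the contract defined in `indw.schedule.backends`: `resolve_backend`, `run_batch`, and `normalize` are hypothetical names, and the Dask branch is left unimplemented rather than guessed at. Only the environment variable name and the four backend names come from the text above.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List, Mapping, Optional

BACKENDS = ("local", "thread", "multiprocess", "dask")


def resolve_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the backend name from the environment, defaulting to multiprocess."""
    env = os.environ if env is None else env
    name = env.get("INSTANT_PIPELINE_BACKEND", "multiprocess")
    if name not in BACKENDS:
        raise ValueError(f"unknown backend {name!r}; expected one of {BACKENDS}")
    return name


def run_batch(
    backend: str,
    fn: Callable[[str], str],
    items: Iterable[str],
    workers: int = 2,
) -> List[str]:
    """Apply the same stage function to a batch; only the transport differs."""
    if backend == "local":
        return [fn(item) for item in items]
    if backend == "thread":
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fn, items))
    if backend == "multiprocess":
        # fn must be defined at module level so it can be pickled.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(fn, items))
    # A Dask transport would submit the same fn to a cluster scheduler.
    raise NotImplementedError(f"backend {backend!r} is not part of this sketch")


def normalize(text: str) -> str:
    """Stand-in heavy stage: collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()
```

A caller would write `run_batch(resolve_backend(), normalize, batch)`. Because `Executor.map` returns results in input order, the stage output is the same list whichever transport ran it.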

Work directory artifacts

Every merge run populates the work directory with operational artifacts:

  • Resolved configuration snapshot
  • Per-source line-offset checkpoints
  • Reject logs and stage profiles (when observability is enabled)
  • Dedup index state for resume
  • JSON audit reports
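
As an illustration of how a per-source checkpoint might support resume, the sketch below writes offsets atomically and skips lines that were already consumed. It rests on assumptions: the JSON layout, the file name, and the function names are hypothetical, and "line offset" is taken to mean the number of lines already processed for a source, which may not match INDW's on-disk format.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Iterator, Tuple


def save_checkpoint(path: Path, offsets: Dict[str, int]) -> None:
    """Write offsets to a temp file, then atomically rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(offsets, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # A reader sees either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def load_checkpoint(path: Path) -> Dict[str, int]:
    """Return saved offsets, or an empty mapping on a fresh run."""
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def resume_lines(lines: Iterable[str], start: int) -> Iterator[Tuple[int, str]]:
    """Yield (line number, line) pairs, skipping lines before the saved offset."""
    for number, line in enumerate(lines):
        if number >= start:
            yield number, line
```

The temp file is created in the same directory as the target so that the rename stays on one filesystem, which is what makes `os.replace` atomic.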
© 2026 Instant Developers. All rights reserved.