Architecture Overview
How INDW's canonical execution graph processes documents from ingest through ordered apply.
INDW runs a single canonical execution graph. Documents flow through ingest and Stage0, pass PCI admission tiers, enter parallel heavy stage pools, and are written by an ordered apply coordinator. Heavy stage logic is backend-agnostic — only the transport layer changes between local, thread, multiprocess, and Dask execution.
High-level flow
Subsystems
| Subsystem | Package | Responsibility |
|---|---|---|
| Ingest | indw.ingest | Download, HF datasets, format normalization |
| Filter | indw.filter | Stage0 gates, quality scoring, PII, toxicity, licensing |
| Clean | indw.clean | Semantic cleaning, artifact discovery, structure recovery |
| Extract | indw.extract | Section routing, entity extraction, assessment |
| Dedup | indw.dedup | Exact hash, MinHash fuzzy, embedding semantic |
| Schedule | indw.schedule | Merge graph, admission, dispatch, apply coordinator |
| Store | indw.store | Corpus registry, atomic I/O, JSONL export |
Determinism contract
The apply coordinator writes survivors in strict document sequence. The canonical output hash (sorted_output_hash) is stable across worker counts and execution backends when the quality configuration and input corpus are unchanged.
Backend abstraction
Execution backends implement a common contract (indw.schedule.backends). The pipeline runner dispatches document batches to worker pools without knowing whether workers are threads, processes, or Dask tasks.
| Backend | Use case |
|---|---|
local | Debugging, single-document tracing |
thread | I/O-bound workloads |
multiprocess | CPU-bound cleaning and extraction (default) |
dask | Cluster-scale corpora (100GB+) |
Default backend is multiprocess. Set INSTANT_PIPELINE_BACKEND=dask for cluster execution without changing pipeline code.
Work directory artifacts
Every merge run populates the work directory with operational artifacts:
- Resolved configuration snapshot
- Per-source line-offset checkpoints
- Reject logs and stage profiles (when observability enabled)
- Dedup index state for resume
- JSON audit reports