Distributed Execution
Scale INDW pipelines to Dask clusters and multiprocess workers without changing pipeline logic.
INDW scales horizontally through execution backends. The same pipeline logic runs on a laptop or a Dask cluster — only the transport layer changes.
Backends
| Backend | Env value | Workers | Best for |
|---|---|---|---|
local | INSTANT_PIPELINE_BACKEND=local | 1 | Debugging, tracing |
thread | INSTANT_PIPELINE_BACKEND=thread | N threads | I/O-bound ingest |
multiprocess | Default | N processes | CPU-bound cleaning (default) |
dask | INSTANT_PIPELINE_BACKEND=dask | Cluster | 100GB+ corpora |
Multiprocess scaling
Increase worker count on a single machine:
Output hash remains stable across worker counts when configuration is unchanged. Verify with:
Dask cluster
Install distributed support:
Connect to a running Dask scheduler:
Or use the standard Dask env var:
Example: local Dask cluster
See examples/merge_dask.py for a complete example.
Backend resolution
Aliases are normalized automatically:
| Input | Resolved |
|---|---|
mp, process, spawn | multiprocess |
threads, threaded | thread |
sync, inline, direct | local |
distributed, cluster | dask |
Chunk size tuning
| Chunk size | Effect |
|---|---|
| 100–250 | Lower latency, higher dispatch overhead |
| 500 (default) | Balanced for most corpora |
| 1000–2000 | Higher throughput, more memory per batch |
Dask audit
Verify Dask integration health:
This runs dask_integration_report.py to test task serialization, worker connectivity, and result collection.
Production certification
Production audit runs scale benchmarks at workers 1 and 2, verifying throughput and hash stability.
Output hash is stable across worker counts and backends. Run indw validate after changing worker configuration to confirm parity on your acceptance corpus.
Zero logic changes
Pipeline stages, gates, and dedup logic are identical across all backends.
Hardware probe
Set INSTANT_MERGE_HW_PROBE=1 to auto-tune workers and chunk size from available CPU and memory.