Instant DevelopersGuidesDistributed Execution

Distributed Execution

|||

Scale INDW pipelines to Dask clusters and multiprocess workers without changing pipeline logic.

INDW scales horizontally through execution backends. The same pipeline logic runs on a laptop or a Dask cluster — only the transport layer changes.

Backends

BackendEnv valueWorkersBest for
localINSTANT_PIPELINE_BACKEND=local1Debugging, tracing
threadINSTANT_PIPELINE_BACKEND=threadN threadsI/O-bound ingest
multiprocessDefaultN processesCPU-bound cleaning (default)
daskINSTANT_PIPELINE_BACKEND=daskCluster100GB+ corpora
bash
indw doctor  # shows resolved backend

Multiprocess scaling

Increase worker count on a single machine:

bash
indw merge ./raw ./out/filtered.jsonl --workers 8 --backend multiprocess

Output hash remains stable across worker counts when configuration is unchanged. Verify with:

bash
indw validate

Dask cluster

Install distributed support:

bash
pip install "indw[distributed]"

Connect to a running Dask scheduler:

bash
export INSTANT_PIPELINE_BACKEND=dask
export INSTANT_DASK_SCHEDULER=tcp://scheduler:8786
indw merge ./raw ./out/filtered.jsonl --workers 16

Or use the standard Dask env var:

bash
export DASK_SCHEDULER_ADDRESS=tcp://scheduler:8786

Example: local Dask cluster

python
from dask.distributed import Client
import os
 
client = Client(n_workers=4, threads_per_worker=2)
os.environ["INSTANT_PIPELINE_BACKEND"] = "dask"
os.environ["INSTANT_DASK_SCHEDULER"] = client.scheduler.address
 
from indw.schedule import merge_with_quality
merge_with_quality("./raw", "./out/filtered.jsonl", workers=8, work_dir="./work")

See examples/merge_dask.py for a complete example.

Backend resolution

python
from indw.schedule.backends.config import pipeline_execution_backend
from indw.schedule.backends.factory import resolve_execution_backend
 
print(pipeline_execution_backend())  # "multiprocess"
print(resolve_execution_backend().name)  # resolved backend instance

Aliases are normalized automatically:

InputResolved
mp, process, spawnmultiprocess
threads, threadedthread
sync, inline, directlocal
distributed, clusterdask

Chunk size tuning

bash
indw merge ./raw ./out/filtered.jsonl --workers 8 --chunk-size 1000
Chunk sizeEffect
100–250Lower latency, higher dispatch overhead
500 (default)Balanced for most corpora
1000–2000Higher throughput, more memory per batch

Dask audit

Verify Dask integration health:

bash
indw audit --kind dask

This runs dask_integration_report.py to test task serialization, worker connectivity, and result collection.

Production certification

bash
indw audit --kind production
indw benchmark

Production audit runs scale benchmarks at workers 1 and 2, verifying throughput and hash stability.

Hash stability:

Output hash is stable across worker counts and backends. Run indw validate after changing worker configuration to confirm parity on your acceptance corpus.

Zero logic changes

Pipeline stages, gates, and dedup logic are identical across all backends.

Hardware probe

Set INSTANT_MERGE_HW_PROBE=1 to auto-tune workers and chunk size from available CPU and memory.

© 2026 Instant Developers. All rights reserved.