
Installation

Install INDW and optional extras for language detection, dedup, embeddings, and distributed execution.

INDW requires Python 3.10 or later. Install the base package from PyPI, then add optional extras for the capabilities your pipeline needs.

PyPI install

bash
pip install indw
indw doctor

indw doctor prints the installed version, resolved execution backend, and availability of key dependencies (orjson, trafilatura, dask).

Development install

Clone the repository and install in editable mode with development dependencies:

bash
git clone https://github.com/InstantAI-Labs/InstantAI-Data-Workflow.git
cd InstantAI-Data-Workflow
pip install -e ".[dev,language]"
indw doctor

Or use Make:

bash
make install-dev
indw doctor

Optional extras

Install extras only when needed. Each extra unlocks a specific subsystem without pulling unnecessary dependencies.

| Extra | Capability | Packages |
|---|---|---|
| language | Language detection | langid |
| dedup | Fuzzy MinHash dedup | datasketch |
| ann | ANN index for semantic dedup | faiss-cpu |
| embedding | Sentence-transformer providers | sentence-transformers |
| distributed | Dask cluster backend | dask, distributed |
| monitor | Prometheus and OpenTelemetry | prometheus_client, opentelemetry |
| compress | ZSTD compression | zstandard |
| all | All optional capabilities | all of the above |

bash
pip install "indw[language,dedup,distributed]"

For full production pipelines with semantic dedup and cluster execution, install every extra. The editable form below must be run from the root of a repository checkout:

bash
pip install -e ".[all]"
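To see which extras are already usable in the current environment, a short script can probe for the packages listed in the extras table. This is a minimal sketch, not part of INDW: the mapping and the helper names (`EXTRA_MODULES`, `missing_modules`, `extras_status`) are hypothetical, and the import names are assumed from each package's usual module name (for example, faiss-cpu is imported as `faiss`).

```python
from importlib.util import find_spec

# Extra name -> top-level import names, following the extras table.
# Import names are assumptions based on each package's usual module name.
EXTRA_MODULES = {
    "language": ["langid"],
    "dedup": ["datasketch"],
    "ann": ["faiss"],
    "embedding": ["sentence_transformers"],
    "distributed": ["dask", "distributed"],
    "monitor": ["prometheus_client", "opentelemetry"],
    "compress": ["zstandard"],
}


def missing_modules(extra: str) -> list[str]:
    """Return the modules for `extra` that cannot be found in this environment."""
    return [name for name in EXTRA_MODULES[extra] if find_spec(name) is None]


def extras_status() -> dict[str, list[str]]:
    """Map every known extra to its list of missing modules (empty means usable)."""
    return {extra: missing_modules(extra) for extra in EXTRA_MODULES}


if __name__ == "__main__":
    for extra, missing in extras_status().items():
        state = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{extra}: {state}")
```

The check only looks for the modules on the import path without importing them, so it stays fast even when heavy packages such as sentence-transformers are installed.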

System requirements

| Resource | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.11 or 3.12 |
| RAM | 4 GB | 16 GB+ for 10 GB+ corpora |
| Disk | 2× corpus size | SSD for work directories |
| CPU | 2 cores | 8+ cores with --workers |

FAISS on ARM:

faiss-cpu wheels may not be available on all ARM platforms. Use exact + fuzzy dedup only, or build FAISS from source.
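The minimums in the requirements table can be checked before a long run. The following is a rough preflight sketch using only the Python standard library; the function names (`corpus_bytes`, `disk_ok`, `preflight`) are hypothetical and not part of INDW. RAM is left out because the standard library has no portable way to query it.

```python
import os
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)  # minimum Python version from the requirements table
MIN_CORES = 2         # minimum CPU cores from the requirements table
DISK_FACTOR = 2       # free disk should be at least 2x the corpus size


def corpus_bytes(corpus_dir: str) -> int:
    """Total size in bytes of all regular files under corpus_dir."""
    return sum(p.stat().st_size for p in Path(corpus_dir).rglob("*") if p.is_file())


def disk_ok(corpus_size: int, free_bytes: int) -> bool:
    """True when free space covers DISK_FACTOR times the corpus size."""
    return free_bytes >= DISK_FACTOR * corpus_size


def preflight(corpus_dir: str, work_dir: str) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        found = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {found} is older than 3.10")
    cores = os.cpu_count() or 1  # cpu_count() may return None
    if cores < MIN_CORES:
        problems.append(f"only {cores} CPU core(s), need at least {MIN_CORES}")
    size = corpus_bytes(corpus_dir)
    free = shutil.disk_usage(work_dir).free
    if not disk_ok(size, free):
        problems.append(f"need {DISK_FACTOR * size} bytes free in {work_dir}, found {free}")
    return problems
```

Calling `preflight("raw", ".")` with your own corpus and work directories returns an empty list when the Python, CPU, and disk minimums are all met.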

Verify installation

bash
indw doctor

Expected output:

text
indw=1.0 python=3.11.x platform=...
backend=multiprocess resolved=multiprocess
orjson=ok
trafilatura=ok
dask=ok

If dask=missing and you need distributed execution, install the distributed extra.
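In CI it can be convenient to fail fast when a dependency is reported as missing. The sketch below parses text shaped like the expected output above; it assumes the output consists of whitespace-separated `key=value` tokens, which is inferred from the sample and not a documented format guarantee. The helper names are hypothetical.

```python
def parse_doctor(output: str) -> dict[str, str]:
    """Parse whitespace-separated key=value tokens into a dict.

    Assumes no value contains spaces; tokens without '=' are ignored.
    """
    fields = {}
    for token in output.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def missing_dependencies(fields: dict[str, str]) -> list[str]:
    """Names of all entries whose reported status is 'missing'."""
    return sorted(key for key, value in fields.items() if value == "missing")


# Sample text modeled on the expected output, with dask reported as missing.
sample = """\
indw=1.0 python=3.11.x platform=...
backend=multiprocess resolved=multiprocess
orjson=ok
trafilatura=ok
dask=missing
"""

fields = parse_doctor(sample)
print(missing_dependencies(fields))  # ['dask']
```

To run this against a real installation, the text could be captured with `subprocess.run(["indw", "doctor"], capture_output=True, text=True).stdout` and passed to `parse_doctor`.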

Docker

Build from the repository Dockerfile for reproducible CI and cluster images:

bash
docker build -t indw:latest .
docker run --rm -v "$(pwd)/raw:/data/raw" -v "$(pwd)/out:/data/out" indw:latest \
  indw merge /data/raw /data/out/filtered.jsonl --work-dir /data/work --fresh