Installation
Install INDW and optional extras for language detection, dedup, embeddings, and distributed execution.
INDW requires Python 3.10 or later. Install the base package from PyPI, then add optional extras for the capabilities your pipeline needs.
PyPI install
indw doctor prints the installed version, resolved execution backend, and availability of key dependencies (orjson, trafilatura, dask).
Development install
Clone the repository and install in editable mode with development dependencies:
Or use Make:
Optional extras
Install extras only when needed. Each extra unlocks a specific subsystem without pulling unnecessary dependencies.
| Extra | Capability | Packages |
|---|---|---|
| language | Language detection | langid |
| dedup | Fuzzy MinHash dedup | datasketch |
| ann | ANN index for semantic dedup | faiss-cpu |
| embedding | Sentence-transformer providers | sentence-transformers |
| distributed | Dask cluster backend | dask, distributed |
| monitor | Prometheus and OpenTelemetry | prometheus_client, opentelemetry |
| compress | ZSTD compression | zstandard |
| all | All optional capabilities | all of the above |
For full production pipelines with semantic dedup and cluster execution:
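One way to assemble the install target for such a pipeline is sketched below. It assumes the package is published on PyPI under the name indw and that extras combine with the standard bracket syntax. The helper name `requirement_for` and the particular selection of extras are illustrative and not taken from the INDW documentation.

```python
"""Build a pip requirement string such as 'indw[ann,dedup]' from extra names."""

# Extra names come from the table above. The distribution name is an assumption.
DISTRIBUTION = "indw"
KNOWN_EXTRAS = {
    "language", "dedup", "ann", "embedding",
    "distributed", "monitor", "compress", "all",
}


def requirement_for(extras):
    """Return a requirement specifier for the base package plus the given extras."""
    unknown = sorted(set(extras) - KNOWN_EXTRAS)
    if unknown:
        raise ValueError(f"unknown extras: {', '.join(unknown)}")
    if not extras:
        return DISTRIBUTION
    # Sort and de-duplicate so the same selection always yields the same string.
    return f"{DISTRIBUTION}[{','.join(sorted(set(extras)))}]"


if __name__ == "__main__":
    # Hypothetical selection for semantic dedup plus cluster execution.
    spec = requirement_for(["dedup", "ann", "embedding", "distributed"])
    print(f'python -m pip install "{spec}"')
```

Quoting the printed specifier keeps shells such as zsh from treating the square brackets as a glob pattern.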
System requirements
| Resource | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.11 or 3.12 |
| RAM | 4 GB | 16 GB+ for 10 GB+ corpora |
| Disk | 2× corpus size | SSD for work directories |
| CPU | 2 cores | 8+ cores with --workers |
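The minimums in this table can be checked before a run with a short preflight script. The sketch below is not part of INDW: the function name `host_problems` and its parameters are hypothetical, and it covers only the Python version, core count, and free disk space. RAM is left out because the standard library has no portable way to read it.

```python
import os
import shutil
import sys

MIN_PYTHON = (3, 10)   # minimum Python version from the table above
MIN_CORES = 2          # minimum CPU cores from the table above
DISK_FACTOR = 2        # free space should be at least 2x the corpus size


def host_problems(corpus_bytes, work_dir=".", python=None, cores=None, free_bytes=None):
    """Return a list of problems; an empty list means the host meets the minimums.

    The optional arguments override the detected values, which is useful in tests.
    """
    python = tuple(sys.version_info[:2]) if python is None else tuple(python)
    cores = (os.cpu_count() or 1) if cores is None else cores
    if free_bytes is None:
        free_bytes = shutil.disk_usage(work_dir).free

    problems = []
    if python < MIN_PYTHON:
        problems.append(f"Python {python[0]}.{python[1]} is older than 3.10")
    if cores < MIN_CORES:
        problems.append(f"{cores} CPU core(s) found, at least {MIN_CORES} expected")
    if free_bytes < DISK_FACTOR * corpus_bytes:
        problems.append("free disk space is below 2x the corpus size")
    return problems


if __name__ == "__main__":
    # Example: a 5 GiB corpus with the current directory as the work directory.
    for line in host_problems(5 * 1024**3) or ["host meets the minimum requirements"]:
        print(line)
```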
faiss-cpu wheels may not be available on all ARM platforms. If no wheel exists for your platform, either skip the ann extra and run exact and fuzzy dedup only, or build FAISS from source.
Verify installation
Expected output:
If dask=missing and you need distributed execution, install the distributed extra.
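The same kind of availability check can be reproduced from Python, for example in a CI step that runs before the pipeline. This sketch does not show how indw doctor is implemented. It probes the three modules named earlier with `importlib.util.find_spec` and prints name=status pairs. The "ok" label and the function names are assumptions; only the `missing` label mirrors the text above.

```python
from importlib.util import find_spec


def dependency_report(modules=("orjson", "trafilatura", "dask")):
    """Map each top-level module name to 'ok' or 'missing' without importing it."""
    return {
        name: "ok" if find_spec(name) is not None else "missing"
        for name in modules
    }


def format_report(report):
    """Render the report as space-separated name=status pairs."""
    return " ".join(f"{name}={status}" for name, status in report.items())


if __name__ == "__main__":
    print(format_report(dependency_report()))
```

Because `find_spec` only locates a module rather than importing it, the check stays fast even for heavy packages.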
Docker
Build from the repository Dockerfile for reproducible CI and cluster images: