Extensions
Optional capability extras for INDW: language detection, dedup, embeddings, distributed execution, and monitoring.
INDW uses optional dependency extras to keep the base install lightweight. Install only the capabilities your pipeline requires.
Installing extras
Development install with all extras:
Extra reference
| Extra | Packages | Unlocks |
|---|---|---|
| language | langid | LanguagePolicyConfig, fast language detection |
| dedup | datasketch | MinHash fuzzy dedup LSH |
| ann | faiss-cpu | FAISS ANN index for semantic dedup |
| embedding | sentence-transformers | Embedding providers for semantic dedup |
| distributed | dask, distributed | dask execution backend |
| monitor | prometheus_client, opentelemetry-* | Metrics and tracing hooks |
| compress | zstandard | ZSTD-compressed I/O streams |
| dev | pytest, pytest-cov, pytest-xdist | Test suite |
Language detection
Without the language extra, language gating is disabled and all documents pass the language check.
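The fallback described above follows the usual optional-import pattern. The sketch below illustrates that pattern only; it is not INDW's implementation. The function name passes_language_check and its parameters are hypothetical, and the only real library call assumed is langid.classify, which returns a (language, score) pair.

```python
from typing import Callable, Iterable, Optional, Tuple

Classifier = Callable[[str], Tuple[str, float]]

# langid is installed by the "language" extra; tolerate its absence.
try:
    import langid

    _classify: Optional[Classifier] = langid.classify
except ImportError:
    _classify = None


def passes_language_check(
    text: str,
    allowed: Iterable[str],
    classify: Optional[Classifier] = _classify,
) -> bool:
    """Hypothetical gate: every document passes when no detector is available."""
    if classify is None:
        # No language extra installed: gating is disabled.
        return True
    lang, _score = classify(text)
    return lang in set(allowed)
```

Passing classify explicitly makes the gate easy to exercise in tests without installing langid.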
Fuzzy dedup
Semantic dedup
Distributed execution
Enables INSTANT_PIPELINE_BACKEND=dask. Verify with:
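As one possible preflight check (a sketch, not INDW's own verification command), the snippet below reports which of the packages from the distributed extra are not importable when INSTANT_PIPELINE_BACKEND is set to dask. The function name check_dask_backend is hypothetical, and how INDW itself reads the variable is not assumed here.

```python
import importlib.util
import os
from typing import List, Mapping


def check_dask_backend(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the modules missing for the dask backend (empty if none)."""
    if env.get("INSTANT_PIPELINE_BACKEND") != "dask":
        # Another backend is selected, so there is nothing to check.
        return []
    required = ("dask", "distributed")
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in required if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = check_dask_backend()
    print("missing:", missing if missing else "none")
```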
Monitoring
Enables Prometheus metrics export and OpenTelemetry tracing hooks in the scheduler and stage pools.
Verifying extras
Doctor reports ok or missing for orjson, trafilatura, and dask. For other extras, attempt the relevant operation and check for ImportError.
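For extras that Doctor does not cover, a small script can probe the underlying packages before a pipeline run. This is a minimal sketch under stated assumptions: extras_status and EXTRA_MODULES are hypothetical names, and the mapping lists import names (for example, faiss-cpu is imported as faiss and sentence-transformers as sentence_transformers), not PyPI names.

```python
import importlib.util
from typing import Dict, List

# Extra name -> top-level modules to probe (import names, not PyPI names).
EXTRA_MODULES: Dict[str, List[str]] = {
    "language": ["langid"],
    "dedup": ["datasketch"],
    "ann": ["faiss"],  # provided by faiss-cpu
    "embedding": ["sentence_transformers"],
    "distributed": ["dask", "distributed"],
    "monitor": ["prometheus_client", "opentelemetry"],
    "compress": ["zstandard"],
}


def _available(module: str) -> bool:
    """True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(module) is not None
    except (ImportError, ValueError):
        return False


def extras_status(extra_modules: Dict[str, List[str]] = EXTRA_MODULES) -> Dict[str, str]:
    """Map each extra to "ok" when all of its modules are found, else "missing"."""
    return {
        extra: "ok" if all(_available(m) for m in modules) else "missing"
        for extra, modules in extra_modules.items()
    }


if __name__ == "__main__":
    for extra, status in extras_status().items():
        print(f"{extra}: {status}")
```

Locating a module does not guarantee that it imports cleanly, so attempting the relevant operation and watching for ImportError remains the definitive check.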
For most production pipelines, install with pip install "indw[language,dedup,distributed]". Add the ann and embedding extras only when semantic dedup is enabled in your quality profile.
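The recommendation above can be captured in a small helper that builds the requirement string for pip, for example in a provisioning script. This is an illustrative sketch; indw_requirement and its semantic_dedup flag are hypothetical names, not part of INDW.

```python
def indw_requirement(semantic_dedup: bool = False) -> str:
    """Build the pip requirement string for a typical production install."""
    extras = ["language", "dedup", "distributed"]
    if semantic_dedup:
        # Semantic dedup needs the ANN index and an embedding provider.
        extras += ["ann", "embedding"]
    return f"indw[{','.join(extras)}]"


if __name__ == "__main__":
    print(indw_requirement())
    print(indw_requirement(semantic_dedup=True))
```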