Deduplication
Exact hash, MinHash fuzzy, and semantic embedding deduplication at petabyte scale.
INDW deduplicates documents at three levels: exact content hash, fuzzy MinHash near-duplicate detection, and semantic embedding similarity. Dedup runs during the merge pipeline, and its state persists in the work directory so that an interrupted run can resume.
Dedup stack
| Layer | Algorithm | Threshold | Extra required |
|---|---|---|---|
| Exact | SHA-256 content hash | Exact match | None |
| Fuzzy | MinHash LSH | Jaccard ≥ 0.90 (default) | dedup |
| Semantic | Embedding + FAISS ANN | Hamming ≤ 4 (default) | ann, embedding |
Configuration
Exact dedup
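Exact dedup compares each document's content_hash, a SHA-256 digest computed over normalized text. The hash index is global across all sources in a merge run, so a document that appears in two sources is kept only once. Stage0 also runs a fast per-batch exact document dedup before heavier processing.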
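A minimal sketch of hash-based exact dedup follows. The normalization steps (NFKC, lowercasing, whitespace collapsing) and the function names are assumptions for illustration; INDW's actual normalization is not described here and may differ.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization: Unicode NFKC, lowercase, collapse whitespace.
    return " ".join(unicodedata.normalize("NFKC", text).lower().split())


def content_hash(text: str) -> str:
    # SHA-256 over the normalized text, as a hex string.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def exact_dedup(docs, seen=None):
    """Return the docs whose content hash is not already in `seen`.

    `seen` is updated in place, so passing the same set across calls
    behaves like a global index shared by several sources.
    """
    seen = set() if seen is None else seen
    kept = []
    for doc in docs:
        digest = content_hash(doc)
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Sharing one `seen` set across calls mirrors the global index described above: the second source only contributes documents the first source did not already supply.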
Fuzzy dedup (MinHash)
MinHash LSH detects near-duplicates with high Jaccard similarity. Install the dedup extra:
fuzzy_quality_margin keeps the higher-quality document when duplicates are found: the survivor is the one with the better quality score, not necessarily the one seen first.
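The idea can be sketched in pure Python: build MinHash signatures over word shingles, estimate Jaccard similarity from the fraction of matching signature slots, and pick a survivor by quality. Everything here is illustrative. The shingle size, the number of permutations, and especially the way `margin` is applied (the challenger must beat the incumbent by more than the margin) are assumptions, not INDW's documented behavior, and a real LSH index would bucket signatures rather than compare them pairwise.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1


def shingles(text, k=3):
    # Word k-shingles; short texts fall back to a single shingle.
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def make_permutations(num_perm=128, seed=0):
    # Random (a, b) pairs for universal hashing: h(x) = (a*x + b) mod p.
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash(text, perms):
    # One minimum per hash function over all shingle hashes.
    base = [
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text)
    ]
    return [min((a * x + b) % MERSENNE_PRIME for x in base) for a, b in perms]


def estimated_jaccard(sig_a, sig_b):
    # Fraction of signature slots that agree.
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def pick_survivor(incumbent, challenger, margin=0.0):
    # Each doc is (text, quality). Assumed rule: replace the incumbent only
    # if the challenger's quality exceeds it by more than `margin`.
    return challenger if challenger[1] > incumbent[1] + margin else incumbent
```

With a 0.90 threshold, two documents would be treated as near-duplicates when `estimated_jaccard` returns at least 0.90, and `pick_survivor` would then decide which one stays in the output.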
Semantic dedup
Semantic dedup embeds documents and queries a FAISS ANN index for paraphrase-level duplicates:
semantic_recent_jaccard_threshold applies a stricter threshold to recently seen documents, catching burst duplicates common in web crawls.
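Since the default semantic threshold is expressed as a Hamming distance, one plausible reading is that embeddings are compared as binary codes. The sketch below works under that assumption: it binarizes an embedding by the sign of each dimension and uses a brute-force scan in place of a FAISS ANN index. The class and function names are hypothetical, and how INDW actually derives its codes is not stated here.

```python
def binarize(embedding):
    """Pack the sign of each dimension into the bits of one integer."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    # Number of differing bits between two integer codes.
    return bin(a ^ b).count("1")


class BruteForceBinaryIndex:
    """Stand-in for an ANN index: linear scan over stored binary codes."""

    def __init__(self, max_hamming=4):
        self.max_hamming = max_hamming
        self.codes = []

    def is_duplicate(self, code):
        return any(hamming(code, c) <= self.max_hamming for c in self.codes)

    def add_if_new(self, embedding):
        # Returns True when the document is kept, False when it is a duplicate.
        code = binarize(embedding)
        if self.is_duplicate(code):
            return False
        self.codes.append(code)
        return True
```

The linear scan costs O(n) per query, which is the cost an ANN index is meant to avoid; it is used here only to keep the example dependency-free.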
Within-document dedup
When enabled, dedup skips comparing chunks within the same document. Disable for corpora where intra-document repetition is a quality signal.
Resume and replay
Dedup indexes persist in the work directory. On resume, restore_merge_dedup_from_output rebuilds indexes from prior output. Use --fresh to reset all dedup state.
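Rebuilding an index from prior output can be pictured as re-reading the documents already written and re-adding their hashes. The sketch below is not restore_merge_dedup_from_output; it is a hypothetical helper that assumes the output is a directory of JSONL files with a `text` field, which may not match INDW's real output layout.

```python
import hashlib
import json
from pathlib import Path


def rebuild_exact_index(output_dir):
    """Recreate a set of SHA-256 content hashes from prior JSONL output.

    Assumes one JSON object per line with a "text" field (hypothetical).
    """
    seen = set()
    for path in sorted(Path(output_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    text = json.loads(line)["text"]
                    seen.add(hashlib.sha256(text.encode("utf-8")).hexdigest())
    return seen
```

A resumed run that starts from such a set rejects any document already present in the earlier output, which is the property resume needs.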
FAISS indexes grow with corpus size. For corpora above 100M documents, consider running fuzzy-only dedup, or sharding semantic dedup across multiple merge runs and following up with post-merge exact dedup.
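A rough back-of-the-envelope calculation shows why index size becomes a concern at this scale. The 768-dimension embedding is an assumption chosen for illustration, and the figures cover raw vector storage only, not any index overhead.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_component=4):
    # Raw storage for uncompressed float32 vectors.
    return num_vectors * dim * bytes_per_component


def binary_index_bytes(num_vectors, bits):
    # Raw storage for binary codes, one bit per dimension.
    return num_vectors * bits // 8


n = 100_000_000
print(f"float32: {flat_index_bytes(n, 768) / 1e9:.1f} GB")   # float32: 307.2 GB
print(f"binary:  {binary_index_bytes(n, 768) / 1e9:.1f} GB")  # binary:  9.6 GB
```

Sharding the corpus into k merge runs divides the per-run figure by roughly k, at the cost of missing semantic duplicates that fall in different shards.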