Deduplication

INDW deduplicates documents at three levels: exact content hash, fuzzy MinHash near-duplicate detection, and semantic embedding similarity. Dedup runs during the merge pipeline, and its state persists in the work directory so that an interrupted run can resume.

Dedup stack

Layer	Algorithm	Threshold	Required pip extra
Exact	SHA-256 content hash	Exact match	None
Fuzzy	MinHash LSH	Jaccard ≥ 0.90 (default)	`dedup`
Semantic	Embedding + FAISS ANN	Hamming ≤ 4 (default)	`ann`, `embedding`

Configuration

yaml

dedup:
  exact: true
  fuzzy: true
  fuzzy_threshold: 0.90
  fuzzy_num_perm: 128
  fuzzy_quality_margin: 0.05
  semantic: true
  semantic_hamming_threshold: 4
  semantic_jaccard_threshold: 0.72
  semantic_recent_jaccard_threshold: 0.85
  skip_within_document_chunks: true
  embedding:
    model: "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: 64
    device: "cpu"
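
The keys above can be mirrored in a typed structure when the configuration is loaded from Python. The following is a minimal sketch, not INDW's own loader: the class names `EmbeddingConfig` and `DedupConfig`, the helper `from_mapping`, and the range checks in `validate` are all hypothetical. Only the field names and default values come from the YAML above.

```python
from dataclasses import dataclass, field


@dataclass
class EmbeddingConfig:
    model: str = "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: int = 64
    device: str = "cpu"


@dataclass
class DedupConfig:
    exact: bool = True
    fuzzy: bool = True
    fuzzy_threshold: float = 0.90
    fuzzy_num_perm: int = 128
    fuzzy_quality_margin: float = 0.05
    semantic: bool = True
    semantic_hamming_threshold: int = 4
    semantic_jaccard_threshold: float = 0.72
    semantic_recent_jaccard_threshold: float = 0.85
    skip_within_document_chunks: bool = True
    embedding: EmbeddingConfig = field(default_factory=EmbeddingConfig)

    def validate(self) -> None:
        # Assumed sanity checks; the real validation rules may differ.
        for name in (
            "fuzzy_threshold",
            "semantic_jaccard_threshold",
            "semantic_recent_jaccard_threshold",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        if self.fuzzy_num_perm <= 0:
            raise ValueError("fuzzy_num_perm must be positive")
        if self.semantic_hamming_threshold < 0:
            raise ValueError("semantic_hamming_threshold must be >= 0")


def from_mapping(raw: dict) -> DedupConfig:
    """Build a DedupConfig from the parsed `dedup:` mapping."""
    raw = dict(raw)  # copy so the caller's mapping is not mutated
    embedding = EmbeddingConfig(**raw.pop("embedding", {}))
    return DedupConfig(embedding=embedding, **raw)


config = from_mapping({"fuzzy_threshold": 0.85, "embedding": {"device": "cuda"}})
config.validate()
```

Keys that are omitted fall back to the defaults listed above, and an unknown key raises a `TypeError` at construction time, which surfaces typos early.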

Exact dedup

Exact dedup compares content_hash values computed from normalized text. The hash index is global across all sources in a merge run. Stage0 also runs a fast per-batch exact document dedup before heavy processing.

python

from indw.dedup.normalize import content_hash

# Documents whose hashes match are treated as exact duplicates.
h = content_hash("Normalized document text")
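
To illustrate how a global hash index behaves, here is a small stand-alone sketch that uses only the standard library. It does not call INDW: the `normalize` rules (NFKC, lowercasing, whitespace collapsing), the helper `sha256_of`, and the `ExactIndex` class are assumptions made for the example, and the normalization INDW applies may differ.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Assumed normalization: NFKC, lowercase, collapse runs of whitespace.
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())


def sha256_of(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class ExactIndex:
    """One set of digests shared by every source in a run."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_duplicate(self, text: str) -> bool:
        digest = sha256_of(text)
        if digest in self._seen:
            return True
        self._seen.add(digest)  # first sighting: remember it
        return False


index = ExactIndex()
first = index.is_duplicate("Hello   World")   # first sighting
second = index.is_duplicate("hello world")    # same text after normalization
```

Because the index is a single set, a document from one source is flagged even when its earlier copy came from a different source.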

Fuzzy dedup (MinHash)

MinHash LSH detects near-duplicates with high Jaccard similarity. Install the dedup extra:

bash

pip install "indw[dedup]"
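
The idea behind MinHash can be sketched with the standard library alone. The example below estimates Jaccard similarity from two signatures; it is a simplified illustration, not INDW's implementation. It compares signatures pairwise and omits the LSH banding step that makes lookups scale, and the 5-word shingling and the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) are assumptions.

```python
import hashlib


def shingles(text: str, k: int = 5) -> set[str]:
    """Overlapping k-word shingles (assumed tokenization: whitespace split)."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; the salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        item.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
similarity = estimated_jaccard(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
is_near_duplicate = similarity >= 0.90  # fuzzy_threshold default
```

A larger `num_perm` reduces the variance of the estimate at the cost of more hashing per document, which is the trade-off `fuzzy_num_perm` exposes.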

When near-duplicates are found, fuzzy_quality_margin keeps the higher-quality document: the survivor is the one with the better quality score, not necessarily the first one seen.
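
One plausible reading of a quality margin is sketched below. The exact rule INDW applies is not described here, so treat this as an assumption: the newcomer replaces the incumbent only when its quality score is higher by more than the margin, which avoids churn between documents of nearly equal quality. The `Doc` class and `pick_survivor` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    quality: float  # assumed to be a score where higher is better


def pick_survivor(incumbent: Doc, newcomer: Doc, margin: float = 0.05) -> Doc:
    # Assumed rule: the newcomer wins only when it is better by more than margin.
    if newcomer.quality - incumbent.quality > margin:
        return newcomer
    return incumbent


kept = pick_survivor(Doc("first-seen", 0.70), Doc("later-copy", 0.80))
# kept.doc_id is "later-copy": 0.80 beats 0.70 by more than the 0.05 margin
```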

Semantic dedup

Semantic dedup embeds documents and queries a FAISS ANN index for paraphrase-level duplicates. Install the ann and embedding extras:

bash

pip install "indw[ann,embedding]"

yaml

dedup:
  semantic: true
  embedding:
    model: "sentence-transformers/all-MiniLM-L6-v2"
    batch_size: 64

semantic_recent_jaccard_threshold applies a stricter threshold to recently seen documents, catching burst duplicates common in web crawls.
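
A Hamming threshold implies that embeddings are compared as binary signatures. How INDW derives those signatures is not specified here, so the sketch below assumes the simplest scheme: one bit per dimension, set when the component is non-negative. The toy 8-dimensional vectors stand in for real model output, and all function names are hypothetical.

```python
def sign_signature(vector: list[float]) -> int:
    """Pack one bit per dimension into an int (1 when the component is >= 0)."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value >= 0.0 else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the two signatures differ."""
    return bin(a ^ b).count("1")


def is_semantic_duplicate(
    vec_a: list[float], vec_b: list[float], threshold: int = 4
) -> bool:
    # threshold mirrors semantic_hamming_threshold from the configuration
    return hamming(sign_signature(vec_a), sign_signature(vec_b)) <= threshold


base = [0.3, -0.1, 0.8, -0.5, 0.2, 0.9, -0.7, 0.4]
paraphrase = [0.3, 0.1, 0.8, -0.5, -0.2, 0.9, -0.7, 0.4]  # two signs flipped
opposite = [-x for x in base]                               # every sign flipped

close = is_semantic_duplicate(base, paraphrase)  # distance 2
far = is_semantic_duplicate(base, opposite)      # distance 8
```

With real embeddings the signatures are hundreds of bits long, so a threshold of 4 admits only very close matches.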

Within-document dedup

yaml

dedup:
  skip_within_document_chunks: true

When enabled, dedup does not compare chunks that belong to the same document. Disable it for corpora where intra-document repetition is a quality signal.
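
The effect of the flag on which chunk pairs are compared can be sketched as follows. This is an illustration only: the `Chunk` class and `candidate_pairs` generator are hypothetical, and a real pipeline would retrieve candidates from an index rather than enumerate all pairs.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    chunk_id: int
    text: str


def candidate_pairs(
    chunks: Iterable[Chunk], skip_within_document_chunks: bool = True
) -> Iterator[tuple[Chunk, Chunk]]:
    """Yield the chunk pairs that would be compared for duplication."""
    for a, b in combinations(chunks, 2):
        if skip_within_document_chunks and a.doc_id == b.doc_id:
            continue  # same parent document: never compared
        yield a, b


chunks = [
    Chunk("doc-1", 0, "intro"),
    Chunk("doc-1", 1, "intro"),
    Chunk("doc-2", 0, "intro"),
]
cross_document_only = list(candidate_pairs(chunks))
all_pairs = list(candidate_pairs(chunks, skip_within_document_chunks=False))
```

With the flag on, the two `doc-1` chunks are never compared with each other, so a repeated passage inside one document cannot cause one of its own chunks to be dropped.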

Resume and replay

Dedup indexes persist in the work directory. On resume, restore_merge_dedup_from_output rebuilds indexes from prior output. Use --fresh to reset all dedup state.
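
Rebuilding an index from prior output can be pictured as a replay of everything already written. The sketch below is not `restore_merge_dedup_from_output`; it assumes, purely for illustration, that prior output is a JSONL file with a `text` field per record, and it rebuilds only an exact-hash set. The function name `rebuild_exact_index` is hypothetical.

```python
import hashlib
import json
from pathlib import Path


def rebuild_exact_index(output_path: Path) -> set[str]:
    """Re-hash every record already written so a resumed run skips repeats."""
    seen: set[str] = set()
    if not output_path.exists():
        return seen  # nothing written yet: start with an empty index
    with output_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            digest = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
            seen.add(digest)
    return seen
```

A fresh run corresponds to starting from the empty set instead of calling a rebuild step.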

Semantic dedup memory

FAISS indexes grow with corpus size. For corpora above 100M documents, consider using fuzzy-only dedup, or sharding semantic dedup across multiple merge runs and following up with a post-merge exact dedup.
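
A back-of-the-envelope estimate shows why 100M documents is a reasonable point of concern. The sketch assumes an uncompressed flat index that stores each vector as float32, and one vector per document; the index type INDW actually builds is not stated here, and compressed indexes need far less. The 384 dimensions are the output size of all-MiniLM-L6-v2. The function name is hypothetical.

```python
def flat_index_bytes(
    num_vectors: int, dim: int = 384, bytes_per_value: int = 4
) -> int:
    """Raw vector storage for an uncompressed float32 index (IDs and overhead excluded)."""
    return num_vectors * dim * bytes_per_value


total_bytes = flat_index_bytes(100_000_000)
total_gb = total_bytes / 1e9        # decimal gigabytes
total_gib = total_bytes / 2**30     # binary gibibytes
print(f"{total_gb:.1f} GB ({total_gib:.1f} GiB)")
```

Under these assumptions the vectors alone take about 153.6 GB (roughly 143 GiB), before any index overhead, which exceeds the memory of most single machines.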