Duplicate and Near-Duplicate Detection

1

Nomic Embed (Repository, 61/100)

via “duplicate detection and deduplication across embeddings”

Open-source embedding models with full transparency.

Unique: Implements semantic deduplication using embedding similarity rather than string matching, enabling detection of paraphrased or reformatted duplicates. Integrates with Atlas visualization to show duplicate clusters interactively.

vs others: Detects semantic duplicates that string-based tools (fuzzy matching, exact hashing) would miss, and provides interactive exploration of duplicate groups rather than just lists.
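
The idea of grouping items by embedding similarity can be sketched in a few lines. This is a minimal illustration, not Nomic's implementation: the vectors are toy 3-dimensional stand-ins for real model output, and the function names and the 0.9 threshold are assumptions chosen for the example.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def duplicate_clusters(embeddings, threshold=0.9):
    """Group indices linked (transitively) by cosine similarity >= threshold."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = defaultdict(list)
    for i in range(len(embeddings)):
        groups[find(i)].append(i)
    # Only groups with more than one member are duplicate clusters.
    return sorted(g for g in groups.values() if len(g) > 1)

# Toy vectors (assumed data): items 0/1 and 2/3 point in nearly the same direction.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.97, 0.1],
    [0.0, 0.0, 1.0],
]
clusters = duplicate_clusters(toy_embeddings, threshold=0.9)
print(clusters)
```

The all-pairs loop is quadratic, so a sketch like this only suits small collections; larger ones would normally use an approximate nearest-neighbor index.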

2

Elicit (Agent, 59/100)

via “paper-similarity-and-duplicate-detection”

AI agent for automated systematic literature reviews.

Unique: Combines metadata-based exact matching with embedding-based semantic similarity for duplicate detection, rather than relying on a single approach, so it can detect both exact duplicates and near-duplicates.

vs others: More robust than metadata-only matching because it catches semantic duplicates, and more efficient than manual deduplication because it automates the process.
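
A combined check of this kind might look like the sketch below. It is an illustration under stated assumptions, not Elicit's code: the record fields (`doi`, `title`, `embedding`), the sample records, the toy vectors, and the 0.9 threshold are all made up for the example.

```python
import math
import re

def normalize_title(title):
    """Lowercase and collapse punctuation so trivially different titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def find_duplicates(records, threshold=0.9):
    """Return (i, j, reason) for each pair judged to be a duplicate."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            # Stage 1: cheap exact checks on metadata.
            if a.get("doi") and a.get("doi") == b.get("doi"):
                pairs.append((i, j, "exact:doi"))
            elif normalize_title(a["title"]) == normalize_title(b["title"]):
                pairs.append((i, j, "exact:title"))
            # Stage 2: fall back to embedding similarity.
            elif cosine(a["embedding"], b["embedding"]) >= threshold:
                pairs.append((i, j, "semantic"))
    return pairs

# Hypothetical records with toy embeddings.
records = [
    {"doi": "10.1000/a1", "title": "Deep Learning for Code Search",
     "embedding": [0.9, 0.1, 0.0]},
    {"doi": "10.1000/a1", "title": "Deep learning for code search.",
     "embedding": [0.9, 0.1, 0.0]},
    {"doi": None, "title": "Neural Methods for Searching Source Code",
     "embedding": [0.88, 0.15, 0.02]},
    {"doi": "10.1000/b2", "title": "Protein Folding Survey",
     "embedding": [0.0, 0.2, 0.95]},
]
pairs = find_duplicates(records)
for pair in pairs:
    print(pair)
```

Running the exact checks first keeps the more expensive similarity comparison for pairs that metadata alone cannot resolve, such as the third record, which has no DOI and a reworded title.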

3

StarCoder Data (Dataset, 57/100)

via “near-deduplication and exact deduplication with semantic similarity detection”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Two-stage deduplication (exact + near) with MinHash-based similarity detection tuned for code rather than generic text. Preserves code-specific patterns like function signatures while removing boilerplate.

vs others: More aggressive deduplication than CodeSearchNet (which uses only exact matching) and more code-aware than generic text dedup, reducing training data size by roughly 30-40% while maintaining diversity.
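
The two-stage approach (exact hash, then MinHash) can be sketched with the standard library alone. This is a generic illustration, not the StarCoder pipeline: the tokenizer, the shingle size, the 64-hash signature, the 0.7 threshold, and the sample files are all assumptions.

```python
import hashlib
import re

NUM_HASHES = 64  # assumed signature length; real pipelines tune this

def shingles(code, k=3):
    """Set of k-token shingles from a crude, whitespace-insensitive tokenization."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def _hash(seed, text):
    digest = hashlib.sha1(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(shingle_set, num_hashes=NUM_HASHES):
    """For each seeded hash function, keep the minimum value over all shingles."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(files, threshold=0.7):
    """Stage 1 drops byte-identical files; stage 2 drops MinHash near-duplicates."""
    seen_hashes, kept, kept_sigs = set(), [], []
    for name, code in files:
        exact = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if exact in seen_hashes:
            continue
        seen_hashes.add(exact)
        sig = minhash(shingles(code))
        if any(estimated_jaccard(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(name)
        kept_sigs.append(sig)
    return kept

# Hypothetical files: b is a byte-identical copy of a, c is a reformatted copy.
files = [
    ("a.py", "def add(a, b):\n    return a + b\n"),
    ("b.py", "def add(a, b):\n    return a + b\n"),
    ("c.py", "def add(a,b):\n\treturn a+b\n"),
    ("d.py", "class Stack:\n    items = []\n"),
]
kept = dedup(files)
print(kept)
```

Here `b.py` is removed by the exact stage and `c.py` by the near-duplicate stage, since reformatting leaves its token shingles unchanged. Large-scale pipelines typically replace the pairwise signature comparison with locality-sensitive hashing over signature bands.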

4

all-MiniLM-L6-v2 (Model, 51/100)

via “semantic-duplicate-detection”

Feature-extraction model. 3,239,437 downloads.

Unique: Detects semantic duplicates (paraphrases, rewording) rather than exact or fuzzy matches. It relies on a BERT-style sentence encoder's notion of semantic equivalence to catch duplicates that keyword-based approaches miss, with configurable similarity thresholds for domain-specific tuning.

vs others: More accurate than Levenshtein distance or fuzzy string matching for paraphrased content; faster than cross-encoder reranking because it uses pre-computed embeddings; simpler than training custom duplicate detection models because it requires no labeled data

5

Nomic Embed Text (137M) (Model, 25/100)

via “semantic deduplication and near-duplicate detection”

Nomic's embedding model for semantic search and similarity.

Unique: Performs semantic deduplication without lexical matching, capturing paraphrases and translations that string-based methods miss. Local execution enables processing sensitive documents without external API calls.

vs others: More robust than hash-based or string-similarity deduplication for handling paraphrasing and translation; faster than manual review while maintaining semantic understanding unlike simple string matching.

6

fineweb (Dataset, 25/100)

via “deduplication at document and near-duplicate levels”

Dataset by HuggingFaceFW. 643,166 downloads.

Unique: Applies both exact and near-duplicate deduplication at Common Crawl scale with explicit benchmark contamination prevention, ensuring evaluation integrity. Many web corpora lack deduplication or benchmark-aware filtering.

vs others: Prevents benchmark leakage that affects model evaluation fairness, whereas raw Common Crawl and many other corpora do not address this issue
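
One common way to screen for benchmark leakage is word n-gram overlap. The listing does not say how fineweb does it, so the sketch below is only a generic illustration: the function names, the n-gram length, and the sample texts are assumptions.

```python
import re

def ngrams(text, n):
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_benchmark_index(benchmark_texts, n):
    """Union of all n-grams that appear in any benchmark item."""
    index = set()
    for text in benchmark_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(document, benchmark_index, n):
    """True if the document shares at least one n-gram with the benchmark."""
    return not ngrams(document, n).isdisjoint(benchmark_index)

# Hypothetical benchmark item and crawl documents.
N = 5  # assumed n-gram length; longer n-grams give fewer false positives
benchmark = ["What is the capital of France? The capital of France is Paris."]
documents = [
    "Trivia night: what is the capital of France? Nobody knew.",
    "A recipe for sourdough bread with a long cold ferment overnight.",
]
index = build_benchmark_index(benchmark, N)
clean = [d for d in documents if not is_contaminated(d, index, N)]
print(clean)
```

The first document is dropped because it repeats a five-word run from the benchmark question; the second shares no n-gram and is kept.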

7

Ximilar (Product)

via “duplicate-product-detection”

8

AfterShoot (Product)

via “duplicate and near-duplicate detection”

9

Supermemory (Product)

via “duplicate-content-detection”

10

Receiptor.ai (Product)

via “duplicate-receipt-detection”
