Document Clustering And Deduplication

1

RedPajama v2 | Dataset | 61/100

via “document-level deduplication with hash-based matching”

30 trillion token web dataset with 40+ quality signals per document.

Unique: Uses document-level hash-based deduplication (preserving document boundaries) rather than token-level or fuzzy matching, enabling reproducible filtering and transparent deduplication hashes that users can inspect and verify. Processes 84 CommonCrawl dumps with consistent deduplication methodology.

vs others: Document-level deduplication is more interpretable and reproducible than token-level approaches, and the published deduplication hashes enable users to understand and verify which documents were removed, unlike proprietary datasets that hide deduplication decisions.
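As an illustration of document-level hash-based deduplication, here is a minimal sketch. It is not RedPajama's actual pipeline: the whitespace normalization, the choice of SHA-256, and the function names `doc_hash` and `dedup_documents` are all assumptions made for the example. It does show why the approach is inspectable: every removal is tied to a hash that anyone can recompute.

```python
import hashlib


def doc_hash(text: str) -> str:
    """Hash a whole document after collapsing whitespace (normalization is an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_documents(docs):
    """Keep the first document seen for each hash.

    Returns the kept documents and the hashes of the removed ones,
    so the removal decisions can be published and verified.
    """
    seen = set()
    kept, removed_hashes = [], []
    for doc in docs:
        h = doc_hash(doc)
        if h in seen:
            removed_hashes.append(h)
        else:
            seen.add(h)
            kept.append(doc)
    return kept, removed_hashes
```

Because whole documents are either kept or dropped, document boundaries are never altered, which is the property the entry above emphasizes.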

2

The Stack v2 | Dataset | 59/100

via “content-based deduplication at file and repository levels”

67 TB permissively licensed code dataset across 600+ languages.

Unique: Two-stage deduplication combining exact hash matching with fuzzy similarity matching (likely MinHash or Jaccard) to catch both identical and near-identical code — more thorough than single-stage approaches but computationally expensive

vs others: More aggressive deduplication than CodeSearchNet (which uses simple hash matching) because it catches near-duplicates, but less semantic than clone detection tools (which understand code structure) because it's content-based
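A two-stage scheme of this kind can be sketched as follows. This is a hedged illustration, not The Stack v2's implementation: it uses exact SHA-256 matching for stage one and plain Jaccard similarity over whitespace-token shingles for stage two, and the threshold of 0.7, the shingle size, and all function names are assumptions.

```python
import hashlib


def shingles(code: str, k: int = 3) -> set:
    """Overlapping k-token windows over whitespace-separated tokens."""
    tokens = code.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def two_stage_dedup(files, threshold=0.7):
    # Stage 1: drop byte-identical files using an exact content hash.
    seen_hashes = set()
    unique = []
    for content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique.append(content)

    # Stage 2: drop files that are too similar to one already kept.
    kept, kept_shingles = [], []
    for content in unique:
        s = shingles(content)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(content)
            kept_shingles.append(s)
    return kept
```

Stage two compares every file with every kept file, which is quadratic. That cost is the reason large pipelines usually approximate Jaccard similarity with MinHash rather than computing it directly.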

3

Nomic Embed | Repository | 59/100

via “duplicate detection and deduplication across embeddings”

Open-source embedding models with full transparency.

Unique: Implements semantic deduplication using embedding similarity rather than string matching, enabling detection of paraphrased or reformatted duplicates. Integrates with Atlas visualization to show duplicate clusters interactively.

vs others: Detects semantic duplicates that string-based tools (fuzzy matching, exact hashing) would miss, and provides interactive exploration of duplicate groups rather than just lists.
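The idea of grouping duplicates by embedding similarity can be sketched in a few lines. This is only an illustration under stated assumptions: the vectors are hand-written stand-ins for real model embeddings, the 0.95 threshold is arbitrary, and `duplicate_clusters` is a hypothetical helper, not part of Nomic's or Atlas's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def duplicate_clusters(embeddings, threshold=0.95):
    """Group indices linked by cosine similarity >= threshold (single-link, via union-find)."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Returning clusters rather than a flat list of removed items is what makes interactive exploration possible: each group can be displayed and reviewed before anything is deleted.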

4

Elicit | Agent | 59/100

via “paper-similarity-and-duplicate-detection”

AI agent for automated systematic literature reviews.

Unique: Combines metadata-based exact matching with embedding-based semantic similarity for duplicate detection, rather than relying on single approach, enabling detection of both exact duplicates and near-duplicates

vs others: More robust than metadata-only matching because it catches semantic duplicates, and more efficient than manual deduplication because it automates the process

5

FineWeb | Dataset | 58/100

via “minhash-based deduplication at petabyte scale”

Hugging Face's 15T token dataset, new standard for LLM training.

Unique: Uses MinHash locality-sensitive hashing for memory-efficient duplicate detection across 15 trillion tokens, avoiding the need to store full document hashes or maintain a global hash table. This enables processing at petabyte scale where naive approaches would exhaust available memory.

vs others: More memory-efficient than exact deduplication (which requires storing full hashes) and faster than string-similarity-based approaches (which require pairwise comparisons), making it practical for web-scale datasets where C4 and similar datasets use simpler or less effective deduplication strategies.
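To make the MinHash and locality-sensitive hashing idea concrete, here is a small sketch. It is not FineWeb's code and does not use its parameters: the 3-word shingles, 64 hash functions, 16 bands, and all function names are assumptions chosen for readability. Seeded BLAKE2b hashes stand in for random permutations.

```python
import hashlib
from collections import defaultdict


def word_shingles(text, k=3):
    """Overlapping k-word windows, lowercased."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become candidate pairs."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Each document is reduced to a fixed-size signature, and only documents that collide in at least one band are compared. That is how the technique avoids the all-pairs comparison that string-similarity methods require.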

6

C4 (Colossal Clean Crawled Corpus) | Dataset | 57/100

via “sentence-level deduplication at scale”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Applies sentence-level deduplication at scale across 750GB using deterministic techniques, removing redundant training examples while maintaining document structure; enables cleaner training data without requiring learned quality models

vs others: More thorough than document-level deduplication; simpler and more reproducible than semantic deduplication approaches; reduces training data size but may miss near-duplicates that learned methods would catch
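A deterministic sentence-level pass of the kind described above might look like the following sketch. It is an illustration only, not the C4 pipeline: the regex sentence splitter, the case-insensitive matching, and the function names are assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def dedup_sentences(documents):
    """Drop any sentence already seen earlier in the corpus; keep document order and boundaries."""
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for sentence in split_sentences(doc):
            key = sentence.lower()
            if key not in seen:
                seen.add(key)
                kept.append(sentence)
        cleaned.append(" ".join(kept))
    return cleaned
```

For example, a boilerplate sentence such as "Welcome to our site." survives only in the first document that contains it. Exact matching like this is reproducible, but it cannot catch reworded repeats, which is the near-duplicate limitation noted in the entry.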

7

StarCoder Data | Dataset | 57/100

via “near-deduplication and exact deduplication with semantic similarity detection”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Two-stage deduplication (exact + near) with MinHash-based similarity detection tuned for code semantics, rather than generic text deduplication — preserves code-specific patterns like function signatures while removing boilerplate

vs others: More aggressive deduplication than CodeSearchNet (which uses only exact matching) and more code-aware than generic text dedup, reducing training data size by ~30-40% while maintaining diversity

8

MemOS | MCP Server | 54/100

via “memory quality assurance and deduplication”

AI memory OS for LLM and Agent systems (moltbot, clawdbot, openclaw), enabling persistent Skill memory for cross-task skill reuse and evolution.

Unique: Implements asynchronous deduplication with configurable merge strategies and embedding-based similarity detection, running as a background scheduler task — unlike manual deduplication, MemOS automates duplicate detection and merging.

vs others: Prevents memory bloat through automatic deduplication; requires careful threshold tuning to avoid false positives (merging distinct memories) or false negatives (missing duplicates).

9

multilingual-e5-small | Model | 53/100

via “language-agnostic semantic clustering and deduplication”

sentence-similarity model. 7,032,108 downloads.

Unique: Leverages multilingual-e5-small's shared embedding space to cluster texts across 94 languages without language-specific preprocessing or translation. The model's contrastive training ensures semantically equivalent texts cluster together regardless of language, enabling language-agnostic deduplication and grouping.

vs others: More accurate than lexical deduplication (string matching, fuzzy matching) for semantic equivalence; faster than translation-based approaches; supports 94 languages in a single model vs. language-specific clustering pipelines.

10

multilingual-e5-base | Model | 51/100

sentence-similarity model. 3,660,082 downloads.

Unique: Operates on multilingual embeddings in a unified space, enabling clustering that respects semantic similarity across languages rather than creating separate clusters for each language — a Spanish document about 'cars' clusters with an English document about 'automobiles' rather than with other Spanish documents

vs others: More accurate than TF-IDF or BM25-based clustering for semantic grouping, and requires no language-specific preprocessing unlike traditional NLP clustering pipelines

11

all-MiniLM-L6-v2 | Model | 51/100

via “semantic-clustering-and-deduplication”

feature-extraction model. 3,239,437 downloads.

Unique: Leverages distilled BERT's semantic embedding space to enable clustering without domain-specific feature engineering — the 384-dimensional space is optimized for semantic similarity, making clustering more effective than generic embeddings or TF-IDF vectors

vs others: More accurate than keyword-based deduplication (fuzzy matching, Levenshtein distance) because it captures semantic meaning; faster than cross-encoder reranking because it uses pre-computed embeddings; simpler than topic modeling (LDA) because it requires no hyperparameter tuning for vocabulary

12

strix | Repository | 50/100

via “centralized vulnerability deduplication and correlation”

Open-source AI hackers to find and fix your app’s vulnerabilities.

Unique: Uses LLM-powered semantic comparison for vulnerability deduplication rather than exact string matching, enabling correlation of related findings with different descriptions or exploitation paths. Implements centralized aggregation across all agents and tools.

vs others: Reduces false positives and noise in reports compared to simple string-based deduplication, and provides better correlation than manual review, though less explainable than rule-based systems.

13

OSS AI agent that indexes and searches the Epstein files | Agent | 43/100

via “document similarity and clustering for pattern discovery”

Hi HN, I built an open-source AI agent that has already indexed and can search the entire Epstein files, roughly 100M words of publicly released documents. The goal was simple: make a large, messy corpus of PDFs and text files immediately searchable in a precise way, without relying on keyword search

Unique: Applies clustering to investigative document corpora to surface hidden patterns and document relationships without requiring explicit queries, likely using approximate nearest neighbor search for scalability

vs others: Discovers patterns that keyword search would miss because it operates on semantic similarity rather than explicit terms, enabling exploration of unknown document collections

14

Claude-File-Recovery, recover files from your ~/.claude sessions | CLI Tool | 41/100

via “file deduplication and conflict resolution”

Claude Code deleted my research and plan markdown files and informed me: “I accidentally rm -rf'd real directories in my Obsidian vault through a symlink it didn't realize was there: I made a mistake.” Unfortunately the backup of my documentation accidentally hadn’t run for a month. So I b

Unique: Implements intelligent deduplication at recovery time rather than requiring manual cleanup afterward, using content hashing to identify true duplicates vs. files with the same name but different content

vs others: Prevents data loss from overwriting files during recovery — generic file recovery tools often blindly overwrite or fail on conflicts, while this tool preserves all versions with clear naming
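The distinction between true duplicates and same-name conflicts can be sketched with content hashing. This is a hypothetical illustration, not the tool's code: `plan_recovery`, the in-memory `(filename, bytes)` input, and the `.v2`-style suffix are all assumptions.

```python
import hashlib


def plan_recovery(candidates):
    """Decide what to write for a list of (filename, content_bytes) recovered versions.

    Identical content under the same name is written once (a true duplicate).
    Different content under the same name is preserved under a versioned name.
    """
    output = {}
    seen = set()           # (filename, content digest) pairs already planned
    version_count = {}     # filename -> number of distinct versions so far
    for name, content in candidates:
        digest = hashlib.sha256(content).hexdigest()
        if (name, digest) in seen:
            continue  # same name and same bytes: skip the true duplicate
        seen.add((name, digest))
        version_count[name] = version_count.get(name, 0) + 1
        n = version_count[name]
        # Hypothetical naming scheme: first version keeps its name, later ones get a suffix.
        out_name = name if n == 1 else f"{name}.v{n}"
        output[out_name] = content
    return output
```

Keying on both the name and the digest means nothing is overwritten: two files with the same bytes but different names are both kept, and two files with the same name but different bytes both survive under distinct names.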

15

q1-crafter-mcp | MCP Server | 38/100

via “intelligent deduplication”

Unique: Combines exact DOI matching with fuzzy title matching to ensure high accuracy in deduplication, which is often not available in simpler tools.

vs others: More robust than basic deduplication tools that rely solely on exact matches, reducing the risk of overlooking duplicates.
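Combining exact DOI matching with fuzzy title matching can be sketched with the standard library alone. This is an illustrative assumption about how such a check could work, not q1-crafter-mcp's implementation: the record layout (`doi` and `title` keys), the 0.9 threshold, and the rule that two differing DOIs always mean distinct papers are all choices made for the example.

```python
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join("".join(c.lower() if c.isalnum() else " " for c in title).split())


def is_duplicate(a, b, title_threshold=0.9):
    """Exact DOI comparison when both records have one, fuzzy title match otherwise."""
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b:
        # DOIs are case-insensitive, so compare them lowercased.
        return doi_a.strip().lower() == doi_b.strip().lower()
    ratio = SequenceMatcher(
        None, normalize_title(a["title"]), normalize_title(b["title"])
    ).ratio()
    return ratio >= title_threshold


def dedup_records(records, title_threshold=0.9):
    """Keep the first record of each duplicate group."""
    kept = []
    for rec in records:
        if not any(is_duplicate(rec, k, title_threshold) for k in kept):
            kept.append(rec)
    return kept
```

The fuzzy fallback matters for records that lack a DOI, such as preprints or entries from sources with incomplete metadata, where an exact-match-only tool would miss the duplicate.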

16

@contractspec/lib.support-bot | Framework | 37/100

via “semantic ticket deduplication and linking”

AI support bot framework with RAG and ticket management

Unique: Applies semantic clustering to support tickets rather than keyword matching, enabling detection of duplicate issues phrased differently by different customers

vs others: Catches semantic duplicates that keyword-based deduplication misses, but requires embedding infrastructure and threshold tuning vs simple string matching

17

vectoriadb | Repository | 33/100

via “similarity-based document clustering and grouping”

VectoriaDB - A lightweight, production-ready in-memory vector database for semantic search

Unique: Provides unsupervised document grouping based purely on embedding similarity without requiring labeled training data or pre-defined categories; integrates clustering directly into vector store API rather than requiring external ML libraries

vs others: More convenient than calling scikit-learn separately, but less sophisticated than dedicated clustering libraries with advanced algorithms (DBSCAN, Gaussian mixtures) and visualization tools

18

@membank/core | Repository | 29/100

via “similarity-based memory deduplication with configurable thresholds”

Core library for membank — handles storage, embeddings, deduplication, and semantic search.

Unique: Performs deduplication at insertion time using embedding similarity rather than exact matching, catching semantic duplicates that keyword-based deduplication would miss. Threshold configuration allows tuning sensitivity without code changes.

vs others: More effective than hash-based deduplication because it catches semantically similar memories even with different wording, whereas exact matching only catches identical text.

19

Claygent | Agent | 26/100

via “multi-page data aggregation and deduplication”

Agent that scrapes and summarizes data from the web

Unique: Combines vision-based page understanding with semantic deduplication logic that recognizes duplicate records across formatting variations and source inconsistencies, rather than relying on exact field matching or manual merge rules

vs others: More intelligent than traditional ETL deduplication because it understands semantic equivalence (e.g., 'John Smith' and 'J. Smith' as the same person) rather than requiring exact string matches or regex patterns

20

@kuindji/memory-domain | Repository | 26/100

via “memory deduplication and conflict resolution”

Domain-driven memory engine with graph storage, embeddings, and semantic search

Unique: Implements deduplication at the domain level with custom conflict resolution rules, rather than as a generic data cleaning step, allowing domain-specific logic (e.g., 'contradicting memories are valuable, don't merge them')

vs others: More flexible than database-level deduplication (unique constraints) because it supports fuzzy matching and custom merge logic; more sophisticated than simple hash-based deduplication because it understands semantic similarity
