Collaborative Dataset Sharing And Version Control

1

The Stack v2Dataset59/100

via “dataset versioning and reproducibility tracking”

67 TB permissively licensed code dataset across 600+ languages.

Unique: Maintains semantic versioning and detailed changelogs for dataset releases, enabling researchers to cite specific versions and understand dataset evolution — more rigorous than one-off dataset releases without versioning

vs others: More reproducible than academic datasets that are released once without versioning, and more transparent than commercial datasets (Codex) that don't disclose version history or changes

2

StarCoder DataDataset57/100

via “dataset versioning and reproducibility tracking”

783 GB curated code dataset from 86 languages with PII redaction.

Unique: Maintains versioned snapshots with full provenance tracking (processing parameters, deduplication thresholds, opt-outs) enabling reproducible model training and dataset auditing. Treats dataset composition as a first-class artifact requiring version control and documentation.

vs others: More reproducible than static dataset releases because it documents exact processing parameters and enables version-specific citations, allowing researchers to understand how dataset changes affect model behavior and supporting scientific reproducibility.

3

ClearMLRepository56/100

via “dataset versioning and artifact management with content-addressable storage”

Open-source MLOps — experiment tracking, pipelines, data management, auto-logging, self-hosted.

Unique: Implements content-addressable storage with SHA256-based deduplication across datasets, automatically tracking dataset lineage and associating versions with experiments via the Task context, supporting multi-cloud backends (S3, GCS, Azure) with unified API

vs others: Provides tighter integration with experiment tracking than DVC (which is primarily a Git-based versioning tool) and lower operational overhead than Pachyderm (which requires Kubernetes), though lacks DVC's Git-native workflow

4

deeplakeMCP Server55/100

via “version control for datasets with branching and tagging”

Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training.

Unique: Applies Git-like version control semantics to datasets rather than code, with commits, branches, and tags stored as delta snapshots rather than full copies. Enables collaborative dataset curation workflows where teams branch independently and merge changes, with conflict detection on overlapping tensor modifications.

vs others: More sophisticated than simple dataset snapshots (like DVC) because it supports branching and merging; more efficient than full-copy versioning because it stores only deltas between versions, reducing storage by 70-90% for typical workflows.

5

Powerdrill AIAgent29/100

via “collaborative data job development with version control”

AI agent that completes your data job 10x faster

Unique: Applies Git-like version control to data job specifications and results, enabling collaborative development with full audit trails and conflict resolution for non-technical users

vs others: More accessible than Git-based workflows because it abstracts version control for non-engineers; more comprehensive than simple job sharing because it includes audit trails and conflict resolution

6

Hugging face datasetsDataset27/100

via “dataset versioning and reproducibility with commit-based tracking”

[Slack](https://camel-kwr1314.slack.com/join/shared_invite/zt-1vy8u9lbo-ZQmhIAyWSEfSwLCl2r2eKA#/shared-invite/email)

Unique: Uses content-addressed storage with commit hashes derived from dataset contents and transformation DAGs, enabling automatic deduplication of identical datasets across versions. Integrates with Hugging Face Hub's Git-based infrastructure for seamless version management without separate tooling.

vs others: More integrated with ML workflows than DVC (Data Version Control) because it's built into the Hugging Face ecosystem and doesn't require separate Git LFS setup, while providing stronger reproducibility guarantees than manual versioning.

7

datasetsDataset26/100

via “dataset versioning and hub repository management with git-based tracking”

HuggingFace community-driven open-source library of datasets

Unique: Integrates Git-based version control with Hugging Face Hub for dataset versioning, using Git LFS for efficient large file storage. The system automatically manages dataset cards and metadata, providing a unified interface for dataset publication and collaboration.

vs others: More integrated than manual Git workflows; provides automatic dataset card generation unlike raw Git repositories; Hub integration enables discoverability unlike private Git repos.

8

documentation-imagesDataset25/100

via “version-control-and-reproducibility”

Dataset by huggingface. 25,31,937 downloads.

Unique: Leverages HuggingFace's git-based versioning infrastructure to provide dataset version control as a first-class feature, eliminating the need for manual snapshot management or external version control systems

vs others: More integrated than external version control (DVC, Pachyderm) because versioning is built into the dataset platform itself, and more transparent than snapshot-based systems because full git history is queryable

9

DataLineRepository25/100

via “collaborative query sharing and version control”

An AI-driven data analysis and visualization tool. [#opensource](https://github.com/RamiAwar/dataline)

Unique: Implements query-level version control and sharing within the data analysis tool, avoiding the need for external Git repositories. Likely uses a fork/branch model similar to GitHub for query variants.

vs others: More integrated than storing queries in Git or shared drives, though less powerful than full Git workflows with merge conflict resolution

10

medical-qa-shared-task-v1-toyDataset25/100

via “dataset versioning and reproducible snapshot loading”

Dataset by lavita. 5,55,826 downloads.

Unique: Leverages HuggingFace Hub's Git-based versioning infrastructure to provide immutable dataset snapshots with full history tracking. Enables citation-grade reproducibility through semantic versioning and automatic version pinning in code.

vs others: More reproducible than ad-hoc dataset downloads because versions are immutable and citable; better than manual versioning because Git history is automatically maintained and queryable

11

Open NotebookRepository25/100

via “collaborative-notebook-sharing-and-versioning”

An open source implementation of NotebookLM with more flexibility and features. [#opensource](https://github.com/lfnovo/open-notebook)

Unique: Open-source implementation enables custom version control backends and collaboration protocols, whereas NotebookLM likely uses proprietary sharing. Supports self-hosted deployment for privacy-sensitive team collaboration.

vs others: Provides transparent version control and collaboration infrastructure that can be audited and customized, compared to NotebookLM's likely proprietary sharing mechanism.

12

hellaswagDataset25/100

via “dataset-versioning-and-reproducible-snapshot-management”

Dataset by Rowan. 3,02,991 downloads.

Unique: Leverages HuggingFace Hub's Git-based versioning to provide immutable dataset snapshots with automatic caching and rollback support, without requiring separate version control infrastructure

vs others: More convenient than manual dataset versioning (Git, DVC) and simpler than data warehouse versioning, with tight integration to HuggingFace's ecosystem and automatic caching

13

documentation-imagesDataset25/100

via “huggingface-hub-dataset-versioning-and-updates”

Dataset by huggingface-course. 2,84,036 downloads.

Unique: Leverages Hugging Face Hub's Git-LFS backed versioning system to provide immutable dataset snapshots with full commit history, enabling reproducible research and automated tracking of dataset evolution. This approach integrates dataset versioning with model versioning in the same Hub infrastructure.

vs others: More reproducible than datasets hosted on generic cloud storage (S3, GCS) because version history is tracked automatically and linked to model/paper artifacts in the Hub ecosystem, reducing friction for researchers reproducing published results.

14

upload2Dataset24/100

via “dataset versioning and reproducibility tracking”

Dataset by Maynor996. 6,62,770 downloads.

Unique: Integrates with HuggingFace Hub's Git-based version control system, storing dataset snapshots as immutable commits with full lineage tracking; revision hashes are cryptographically bound to exact image binaries and metadata, preventing silent data mutations

vs others: Provides stronger reproducibility guarantees than manual dataset versioning or cloud storage buckets because version pinning is enforced at the Hub API level, not just in documentation or configuration files

15

wikitextDataset24/100

via “dataset versioning and reproducibility tracking via huggingface hub”

Dataset by Salesforce. 12,88,015 downloads.

Unique: Integrates Git-based version control with HuggingFace Hub's immutable dataset storage, enabling semantic versioning and reproducible pinning without custom version management infrastructure; dataset cards provide transparent documentation of preprocessing and licensing

vs others: More reproducible than raw Wikipedia snapshots or ad-hoc dataset distributions; more transparent than proprietary datasets with opaque versioning; enables direct reproducibility of published results via version pinning

16

CADS-datasetDataset24/100

via “reproducible dataset versioning and citation tracking”

Dataset by mrmrx. 11,96,921 downloads.

Unique: Integrates HuggingFace Hub versioning with arXiv paper reference (2507.22953), enabling immutable dataset snapshots tied to published research — critical for medical imaging where reproducibility and regulatory compliance require auditable data lineage

vs others: More robust than manual version control (e.g., git-lfs) because HuggingFace Hub provides built-in deduplication and CDN distribution; more discoverable than private dataset repositories because Hub integration enables automatic citation tracking and community access

17

OpenThoughts-1k-sampleDataset24/100

via “reasoning dataset versioning and reproducibility tracking”

Dataset by ryanmarten. 5,99,055 downloads.

Unique: Leverages HuggingFace Hub's git-based versioning system combined with arxiv paper reference to provide both technical reproducibility (exact data version) and academic provenance (citable paper), a pattern uncommon in dataset distributions

vs others: More reproducible than static dataset snapshots because versions are tracked in git; more academically rigorous than datasets without paper references because arxiv link enables citation and methodology verification

18

finephraseDataset24/100

via “reproducible-dataset-versioning-and-caching”

Dataset by HuggingFaceFW. 4,74,259 downloads.

Unique: Uses HuggingFace Hub's Git-based versioning infrastructure to provide content-addressed dataset snapshots, enabling reproducible access without manual version management. Integrates with HuggingFace's distributed caching system, allowing teams to share cached datasets across machines.

vs others: More reproducible than manually hosted datasets because versioning is automatic and immutable; more efficient than re-downloading because local caching with integrity verification prevents data corruption.

19

img_uploadDataset23/100

via “dataset versioning and reproducibility tracking via huggingface hub”

Dataset by Maynor996. 6,17,655 downloads.

Unique: Uses HuggingFace Hub's Git-based versioning with LFS support for large files, enabling immutable dataset snapshots with commit-level granularity — differentiates from snapshot-based versioning (e.g., S3 versioning) by providing semantic version control with commit messages and author tracking

vs others: More reproducible than datasets without versioning because specific revisions are resolvable and immutable; simpler than maintaining local dataset copies because versioning is managed centrally on Hub with automatic deduplication

20

ps2_hf2Dataset23/100

via “dataset versioning and tracking”

Dataset by HennyPr. 5,41,353 downloads.

Unique: Incorporates a detailed version control mechanism that logs every change, providing a comprehensive history of dataset evolution.

vs others: More robust than typical dataset management systems, which often lack detailed version tracking.

Top Matches

Also Known As

Company