Dataset Subset Creation And Curation

1

LAION-5B (Dataset, 60/100)

5.85 billion image-text pairs; a foundational dataset for image generation.

Unique: Enables reproducible subset creation by combining pre-computed metadata filters (CLIP scores, NSFW flags, watermark flags, language tags, aesthetic scores) without reprocessing images. Subsets can be created at dataset creation time or dynamically at training time.

vs others: Enables reproducible curation rather than ad-hoc filtering; combines multiple quality signals (CLIP, NSFW, watermark, aesthetic) rather than relying on a single signal; supports language-aware subsetting, unlike monolingual alternatives.
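The combination of metadata filters described above can be sketched in a few lines. This is a minimal illustration, not LAION's own tooling: the field names (`clip_score`, `nsfw_prob`, `watermark_prob`, `aesthetic`, `language`) and every threshold are assumptions chosen for the example, and the real metadata columns may be named differently. The subset is defined entirely by a frozen `SubsetSpec`, so the same spec applied to the same metadata always yields the same subset, and no image is ever opened.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Mapping, Optional


@dataclass(frozen=True)
class SubsetSpec:
    """Thresholds that define one subset. All defaults are illustrative."""
    min_clip_score: float = 0.28
    max_nsfw_prob: float = 0.5
    max_watermark_prob: float = 0.8
    min_aesthetic: float = 5.0
    language: Optional[str] = "en"  # None keeps every language


def keep(row: Mapping, spec: SubsetSpec) -> bool:
    """Return True only if the row passes every metadata filter in the spec."""
    return (
        row["clip_score"] >= spec.min_clip_score
        and row["nsfw_prob"] <= spec.max_nsfw_prob
        and row["watermark_prob"] <= spec.max_watermark_prob
        and row["aesthetic"] >= spec.min_aesthetic
        and (spec.language is None or row["language"] == spec.language)
    )


def select_subset(rows: Iterable[Mapping], spec: SubsetSpec) -> Iterator[Mapping]:
    """Lazily yield matching rows; works on metadata only, never on pixels."""
    return (row for row in rows if keep(row, spec))


# Hypothetical metadata rows, standing in for pre-computed metadata records.
ROWS = [
    {"url": "a", "clip_score": 0.31, "nsfw_prob": 0.01,
     "watermark_prob": 0.10, "aesthetic": 6.2, "language": "en"},
    {"url": "b", "clip_score": 0.20, "nsfw_prob": 0.01,
     "watermark_prob": 0.10, "aesthetic": 6.5, "language": "en"},
    {"url": "c", "clip_score": 0.35, "nsfw_prob": 0.02,
     "watermark_prob": 0.05, "aesthetic": 5.5, "language": "de"},
    {"url": "d", "clip_score": 0.33, "nsfw_prob": 0.01,
     "watermark_prob": 0.95, "aesthetic": 6.0, "language": "en"},
]

english = [r["url"] for r in select_subset(ROWS, SubsetSpec())]
any_language = [r["url"] for r in select_subset(ROWS, SubsetSpec(language=None))]
```

Because `select_subset` returns a generator, the same function could serve both modes the listing mentions: written out once at dataset creation time, or iterated on the fly at training time.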

2

ShareGPT4V (Dataset, 60/100)

via “domain-specific dataset curation and subset extraction”

1.2M image-text pairs with GPT-4V captions.

Unique: Enables systematic curation of domain-specific subsets from 1.2M images using GPT-4V captions as semantic filters, allowing extraction of specialized datasets without manual domain annotation or external labeling services

vs others: More flexible than fixed domain-specific datasets (e.g., medical imaging datasets), which are typically small and expensive to create; uses rich caption semantics for more accurate domain filtering than keyword-based approaches.
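A rough sketch of caption-driven subset extraction follows. The listing does not say how captions are matched against a domain, so this example substitutes plain whole-word term matching as a stand-in; it is the simplest possible scorer and is exactly the kind of keyword approach that a semantic method (for instance, embedding similarity) would be expected to improve on. The record layout (`image`, `caption`), the term set, and the `min_score` threshold are all hypothetical.

```python
import re
from typing import Iterable, Iterator, Mapping, Set

WORD = re.compile(r"[a-z0-9]+")


def domain_score(caption: str, domain_terms: Set[str]) -> float:
    """Fraction of the domain terms that occur as whole words in the caption."""
    if not domain_terms:
        raise ValueError("domain_terms must not be empty")
    words = set(WORD.findall(caption.lower()))
    return len(words & domain_terms) / len(domain_terms)


def extract_domain(
    records: Iterable[Mapping],
    domain_terms: Set[str],
    min_score: float = 0.5,
) -> Iterator[Mapping]:
    """Yield records whose caption scores at least min_score for the domain."""
    for record in records:
        if domain_score(record["caption"], domain_terms) >= min_score:
            yield record


# Hypothetical records; real captions would be much longer.
RECORDS = [
    {"image": "0001.jpg", "caption": "A chest radiograph showing clear lungs."},
    {"image": "0002.jpg", "caption": "A dog running along a sandy beach."},
    {"image": "0003.jpg", "caption": "Close-up of ribs on a barbecue grill."},
]
MEDICAL_TERMS = {"radiograph", "chest", "lungs", "ribs"}

medical = [r["image"] for r in extract_domain(RECORDS, MEDICAL_TERMS)]
```

The third record shows the weakness of term matching: "ribs" appears in a cooking caption, and only the threshold keeps it out. Swapping `domain_score` for a semantic scorer would leave `extract_domain` unchanged.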

3

mdm_depth (Dataset, 25/100)

via “depth dataset filtering and subset selection by scene attributes”

Dataset by robbyant. 388,267 downloads.

Unique: Uses Hugging Face Datasets' lazy filtering to avoid materializing the full dataset, so subsets can be built without downloading unused samples, which is critical for large-scale datasets.

vs others: More efficient than downloading the full dataset and filtering locally; more flexible than pre-split dataset versions that lock users into fixed train/val/test divisions.
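The lazy filtering described above might look like the following sketch, which uses the Hugging Face `datasets` streaming mode (`load_dataset(..., streaming=True)` followed by the lazy `filter` and `take` methods of the resulting iterable dataset). The scene attributes `scene_type` and `max_depth_m` are hypothetical; the actual schema of this dataset is not given in the listing and should be inspected before writing a predicate.

```python
from typing import Any, Callable, Collection, Mapping


def scene_filter(
    allowed_scenes: Collection[str],
    max_depth_m: float,
) -> Callable[[Mapping[str, Any]], bool]:
    """Build a predicate over per-sample scene attributes (hypothetical fields)."""
    allowed = frozenset(allowed_scenes)

    def predicate(example: Mapping[str, Any]) -> bool:
        return (
            example.get("scene_type") in allowed
            and example.get("max_depth_m", float("inf")) <= max_depth_m
        )

    return predicate


def stream_subset(repo_id: str, predicate, limit: int, split: str = "train"):
    """Return a lazy, filtered view of a Hub dataset capped at `limit` samples.

    Nothing is downloaded until the result is iterated, and iteration stops
    once `limit` matching samples have been produced.
    """
    # Imported here so the predicate helpers work without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return stream.filter(predicate).take(limit)


indoor_close_range = scene_filter({"indoor"}, max_depth_m=10.0)
```

A call would then be `stream_subset(repo_id, indoor_close_range, limit=1000)`, where `repo_id` is the dataset's identifier on the Hub. Only the shards needed to produce the first 1,000 matches are fetched.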

4

MINT-1T-PDF-CC-2023-40 (Dataset, 24/100)

via “document-domain dataset sampling and filtering”

Dataset by mlfoundations. 857,357 downloads.

Unique: Provides streaming access with metadata-based filtering on a trillion-token dataset without requiring a full download, using Hugging Face Datasets infrastructure for efficient subset construction. Enables on-demand creation of domain-specific corpora from a larger collection.

vs others: More flexible than fixed-size domain datasets (e.g., ArXiv papers, legal documents) because it allows dynamic filtering from a larger corpus; more efficient than downloading the full dataset to access a subset.
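One way to draw a fixed-size sample from a filtered stream without holding the whole corpus in memory is reservoir sampling (Algorithm R). The listing does not name a sampling method, so this is an illustrative choice rather than how the dataset is actually sampled; the `pages` field and the predicate are hypothetical. The sketch reads each document once, keeps at most `k` of them, and gives every matching document the same chance of being selected.

```python
import random
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T],
    k: int,
    predicate: Callable[[T], bool] = lambda _item: True,
    seed: int = 0,
) -> List[T]:
    """Uniformly sample up to k items that satisfy predicate, in one pass."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    reservoir: List[T] = []
    seen = 0  # number of items that passed the filter so far
    for item in stream:
        if not predicate(item):
            continue
        seen += 1
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / seen.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = item
    return reservoir


def fake_documents(n: int):
    """Stand-in for a streamed corpus; yields hypothetical document records."""
    for i in range(n):
        yield {"id": i, "pages": i % 7}


sample = reservoir_sample(
    fake_documents(1000), k=10, predicate=lambda d: d["pages"] >= 3, seed=42
)
```

Memory use depends only on `k`, so the same function would work whether the stream holds a thousand documents or far more than fit on disk.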

5

Encord (Product)

via “data-curation-and-filtering”

6

ActiveLoop.ai (Product)

via “efficient data sampling and subset creation”

7

Laion (Product)

via “filtered dataset subset creation”

8

V7 (Product)

via “dataset-filtering-and-sampling”

9

Dataset Marketplace (Product)

via “dataset customization and filtering”
