Large Scale Image Text Pair Dataset Curation And Organization

1

LAION-5B (Dataset, 59/100)

via “large-scale image-text pair dataset with clip-based quality filtering”

5.85 billion image-text pairs foundational for image generation.

Unique: Largest openly available image-text dataset (5.85B pairs) with pre-computed CLIP similarity scores for every pair, enabling quality-aware filtering without re-embedding; organized into language-specific clusters and distributed across multiple providers for redundancy and accessibility

vs others: About 14x larger than LAION-400M and several times larger than the reported training sets of proprietary systems such as DALL-E and Imagen. The metadata is openly released, though the linked images remain under their original owners' copyright, and the dataset has become the de facto foundation for open-source image generation models
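
Because every pair ships with a precomputed CLIP similarity score, quality filtering reduces to a threshold check on metadata. The following is a minimal sketch, not LAION's own tooling: the field names (`url`, `text`, `language`, `similarity`) and the cutoffs (0.28 for English, 0.26 otherwise) are assumptions chosen for illustration, and the sample rows are made up.

```python
from typing import Iterable, Iterator

# Assumed cutoffs for this sketch: 0.28 for English pairs, 0.26 for the rest.
DEFAULT_THRESHOLDS = {"en": 0.28}
FALLBACK_THRESHOLD = 0.26


def filter_by_similarity(
    rows: Iterable[dict],
    thresholds: dict = DEFAULT_THRESHOLDS,
    fallback: float = FALLBACK_THRESHOLD,
) -> Iterator[dict]:
    """Yield rows whose precomputed CLIP similarity meets the language cutoff."""
    for row in rows:
        cutoff = thresholds.get(row.get("language"), fallback)
        if row["similarity"] >= cutoff:
            yield row


# Hypothetical metadata rows; real metadata would be read from parquet files.
sample_rows = [
    {"url": "https://example.com/a.jpg", "text": "a red bicycle",
     "language": "en", "similarity": 0.31},
    {"url": "https://example.com/b.jpg", "text": "click here",
     "language": "en", "similarity": 0.12},
    {"url": "https://example.com/c.jpg", "text": "ein roter Hut",
     "language": "de", "similarity": 0.27},
]

kept = list(filter_by_similarity(sample_rows))
print([row["url"] for row in kept])
```

No image is downloaded or re-embedded here; raising the cutoff trades dataset size for tighter image-text agreement.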

2

MS COCO (Common Objects in Context) (Dataset, 59/100)

via “image-to-text caption generation dataset with 5 natural language descriptions per image”

330K images with object detection, segmentation, and captions.

Unique: 5 captions per image (vs 1 in most datasets) captures linguistic diversity and enables robust evaluation of caption generation variability; 1.65M caption-image pairs provide scale for training large vision-language models

vs others: Same five human-written captions per image as Flickr30K, but with roughly ten times as many images (330K vs 31K), enabling better linguistic diversity modeling; more images than Visual Genome (108K) while keeping human-written natural language rather than automated alt-text
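
To use the multiple captions per image for training or evaluation, the flat annotation list is usually regrouped by image. A minimal sketch follows, assuming the COCO captions JSON layout with an `images` list (`id`, `file_name`) and an `annotations` list (`image_id`, `caption`); the `toy` dictionary is a hand-written stand-in, not real COCO data.

```python
from collections import defaultdict


def group_captions(coco: dict) -> dict:
    """Map each image file name to the list of captions written for it."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann["caption"].strip())
    return dict(grouped)


# Hand-written stand-in for a captions annotation file.
toy = {
    "images": [
        {"id": 1, "file_name": "000001.jpg"},
        {"id": 2, "file_name": "000002.jpg"},
    ],
    "annotations": [
        {"image_id": 1, "id": 10, "caption": "A dog runs on the beach. "},
        {"image_id": 2, "id": 11, "caption": "Two bikes lean on a wall."},
        {"image_id": 1, "id": 12, "caption": "A brown dog near the sea."},
    ],
}

captions_by_image = group_captions(toy)
print(captions_by_image["000001.jpg"])
```

Keeping all references together per image is what allows multi-reference caption metrics to be computed later.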

3

ShareGPT4V (Dataset, 57/100)

via “large-scale image-text pair dataset curation and organization”

1.2M image-text pairs with detailed captions in the style of GPT-4V.

Unique: Provides a pre-curated 1.2M image-caption dataset, seeded with about 100K captions written by GPT-4V and extended by a captioning model trained on them, so users do not need to run expensive GPT-4V API calls themselves. The dataset is versioned and publicly available, enabling reproducible research and lowering the barrier to entry for vision-language model training.

vs others: Larger and more detailed than COCO Captions (123K images) or Flickr30K (31K images) while providing GPT-4V-quality descriptions; more accessible than building a custom dataset via API calls, which could cost thousands of dollars.

4

LLaVA-Instruct 150K (Dataset, 56/100)

via “detailed image description dataset generation”

150K visual instruction examples for multimodal model training.

Unique: Generates descriptions at semantic depth beyond typical captions, including spatial relationships, object attributes, and scene composition. The examples were produced by prompting language-only GPT-4 with COCO captions and bounding boxes, so the responses capture visual nuance rather than surface-level object lists.

vs others: Produces a richer training signal than short human-written caption datasets (COCO, Flickr30K) because each example carries multi-sentence descriptions, conversations, or reasoning; more consistent in style and coverage than crowdsourced annotation at this scale, though potentially less diverse.

5

Infinity (Repository, 44/100)

via “dataset preparation and image-text pair loading with flexible format support”

[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Unique: Implements dataset loading with automatic image tokenization using the Infinity VAE, eliminating separate preprocessing steps. Supports multiple metadata formats without requiring format conversion.

vs others: Integrated tokenization reduces preprocessing overhead compared to separate tokenization pipelines, and support for multiple formats eliminates format conversion steps.
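
The idea of accepting several metadata formats without a conversion step can be sketched as a loader that normalizes each format into the same (image path, caption) records. This is a hypothetical illustration and not Infinity's actual loader: the function `load_pairs`, the field names `image_path` and `caption`, and the two supported formats are all assumptions, and the VAE tokenization step is left out.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_pairs(path: Path) -> list[tuple[str, str]]:
    """Return (image_path, caption) pairs from a .jsonl or .csv metadata file."""
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            records = [json.loads(line) for line in fh if line.strip()]
    elif path.suffix == ".csv":
        with path.open(encoding="utf-8", newline="") as fh:
            records = list(csv.DictReader(fh))
    else:
        raise ValueError(f"unsupported metadata format: {path.suffix}")
    return [(rec["image_path"], rec["caption"]) for rec in records]


# Write the same record in both formats and load each one.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_file = Path(tmp) / "meta.jsonl"
    jsonl_file.write_text(
        '{"image_path": "img/0.jpg", "caption": "a lighthouse at dusk"}\n',
        encoding="utf-8",
    )
    csv_file = Path(tmp) / "meta.csv"
    csv_file.write_text(
        "image_path,caption\nimg/0.jpg,a lighthouse at dusk\n",
        encoding="utf-8",
    )
    pairs_from_jsonl = load_pairs(jsonl_file)
    pairs_from_csv = load_pairs(csv_file)

print(pairs_from_jsonl == pairs_from_csv)
```

Downstream code then only sees one record shape, which is where an integrated tokenizer would plug in.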

6

Awesome-Text-to-Image (Repository, 37/100)

via “dataset-resource-aggregation-and-metadata-indexing”

(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.

Unique: Centralizes dataset discovery in a single curated markdown file rather than scattered across individual papers, with explicit cross-references to papers that use each dataset. This enables practitioners to understand dataset provenance and see how datasets were used in published research, rather than discovering datasets only through paper reading.

vs others: More discoverable than searching individual papers for dataset citations, and more curated than generic dataset repositories (Hugging Face, Kaggle) because it focuses specifically on text-to-image datasets and includes research context for each dataset

7

MINT-1T-PDF-CC-2023-23 (Dataset, 24/100)

via “multimodal image-text pair extraction from pdf documents at scale”

Dataset by mlfoundations. 633,111 downloads.

Unique: Combines 1T+ tokens of PDF-native multimodal data with WebDataset streaming architecture and MLCroissant metadata standards, enabling efficient distributed training without full dataset materialization — unlike image-text datasets that require pre-downloaded image files or separate text corpora

vs others: Larger scale and document-native structure than LAION or similar web-scraped image-text datasets, with preserved layout context that benefits document-specific tasks; more efficient streaming than datasets requiring separate image downloads
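
The streaming claim rests on the WebDataset convention: samples are stored as consecutive files in a tar shard that share a key and differ by extension, so a reader can consume a shard front to back without unpacking it. The sketch below uses only the standard library rather than the `webdataset` package; the shard contents and the `png`/`txt` extensions are made up for illustration and do not describe MINT-1T's actual file layout.

```python
import io
import tarfile
from typing import BinaryIO, Iterator


def iter_samples(stream: BinaryIO) -> Iterator[dict]:
    """Yield one dict per sample from a tar shard, reading sequentially."""
    current_key, sample = None, {}
    # Mode "r|" treats the archive as a forward-only stream (no seeking).
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and sample:
                yield sample  # key changed, so the previous sample is complete
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


def build_shard(files: dict) -> bytes:
    """Pack name -> bytes into an in-memory tar, standing in for a real shard."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


shard = build_shard({
    "doc0001.png": b"<image bytes>",
    "doc0001.txt": b"page one text",
    "doc0002.png": b"<image bytes>",
    "doc0002.txt": b"page two text",
})
samples = list(iter_samples(io.BytesIO(shard)))
print([s["__key__"] for s in samples])
```

In practice the stream would be an HTTP response or a file handle for one shard, and different workers would read different shards.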

8

documentation-images (Dataset, 24/100)

via “curated-documentation-image-dataset-loading”

Dataset by huggingface. 2,531,937 downloads.

Unique: Provides a pre-curated, versioned dataset of 24.4M documentation images integrated directly into HuggingFace's ecosystem with automatic caching and streaming, eliminating manual collection and organization overhead that competitors require

vs others: Larger and more specialized than generic image datasets (ImageNet, COCO) for documentation-specific tasks, and requires no custom scraping infrastructure unlike building a documentation image corpus from scratch

9

MINT-1T-PDF-CC-2024-18 (Dataset, 23/100)

via “large-scale multimodal document-image dataset curation and indexing”

Dataset by mlfoundations. 1,034,415 downloads.

Unique: Combines PDF-level document structure preservation with extracted image-text pairs at 1T token scale, using Common Crawl's distributed crawl infrastructure and HuggingFace's streaming dataset format to avoid centralized storage bottlenecks — most competitors (e.g., LAION) focus on web images or require full downloads

vs others: Larger and more document-focused than LAION-5B or Conceptual Captions, with native PDF structure metadata enabling document-aware training; more accessible than proprietary datasets like Google's internal document corpora due to CC-BY-4.0 licensing and HuggingFace Hub distribution

10

MINT-1T-PDF-CC-2023-06 (Dataset, 23/100)

via “large-scale multimodal document-image-text dataset curation and indexing”

Dataset by mlfoundations. 539,406 downloads.

Unique: Combines 1 trillion tokens of document text with aligned page-level images from a single Common Crawl snapshot, providing temporally-consistent multimodal pairs at unprecedented scale — most competing datasets either use synthetic image-text pairs or lack document-level coherence across modalities

vs others: Larger and more document-focused than LAION-5B (which emphasizes web images) and more naturally paired than synthetic datasets like Synthetic DocVQA, with real-world OCR challenges that improve model robustness

11

MINT-1T-PDF-CC-2023-40 (Dataset, 23/100)

via “paired image-text dataset construction for vision-language training”

Dataset by mlfoundations. 857,357 downloads.

Unique: Leverages natural document structure to create implicit image-text alignment without manual annotation, using page-level visual-semantic correspondence from PDFs. Unlike manually-annotated datasets (Flickr30K, COCO), derives pairs automatically from document layout, enabling trillion-token scale.

vs others: Provides orders of magnitude more image-text pairs than manually curated datasets while maintaining document-specific semantic alignment that generic web image-text pairs (LAION) lack.

12

MINT-1T-PDF-CC-2023-14 (Dataset, 23/100)

via “ocr-aligned image-text pair extraction from pdfs”

Dataset by mlfoundations. 572,108 downloads.

Unique: Provides 1T-token scale OCR-image pairs with automatic deduplication across Common Crawl snapshots, using content hashing to eliminate redundant pages — most document datasets (DocVQA, RVL-CDIP) manually curate smaller, domain-specific collections without cross-crawl deduplication

vs others: Scales to 5.7M documents with automated deduplication, whereas DocVQA (12K docs) and IIT-CDIP (6M pages) require manual curation or are domain-specific; offers broader diversity than academic paper datasets (arXiv, S2-ORC)
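
Content-hash deduplication of the kind described here can be illustrated with a few lines: hash a normalized form of each page and keep only the first page seen per hash. This is a generic sketch, not the dataset's actual pipeline; the normalization rule (collapse whitespace, lowercase), the choice of SHA-256, and the `snapshot`/`text` fields are assumptions.

```python
import hashlib


def content_key(text: str) -> str:
    """Hash a whitespace- and case-normalized copy of the page text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each content hash, preserving order."""
    seen = set()
    unique = []
    for page in pages:
        key = content_key(page["text"])
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


# Hypothetical pages from two crawl snapshots; the second repeats the first
# apart from spacing and letter case.
pages = [
    {"snapshot": "2023-06", "text": "Annual Report 2022\nRevenue grew."},
    {"snapshot": "2023-14", "text": "annual report 2022   revenue grew."},
    {"snapshot": "2023-14", "text": "Installation guide, chapter 1."},
]

unique_pages = deduplicate(pages)
print([page["snapshot"] for page in unique_pages])
```

Exact hashing only catches pages that are identical after normalization; near-duplicates with small edits would need a fuzzy method such as MinHash.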

13

MINT-1T-PDF-CC-2023-50 (Dataset, 23/100)

via “multimodal pdf-to-text extraction at scale”

Dataset by mlfoundations. 796,577 downloads.

Unique: Uses WebDataset tar-based streaming architecture instead of row-based formats, enabling efficient distributed training without downloading entire dataset; preserves PDF document structure and image-text spatial relationships rather than flattening to generic image-caption pairs

vs others: Larger and more diverse than LAION-5B for document-specific tasks, and preserves layout context that generic image-text datasets discard, making it superior for document intelligence vs. general vision-language training

14

banned-historical-archives (Dataset, 23/100)

via “historical-document-image-dataset-loading”

Dataset by banned-historical-archives. 1,846,708 downloads.

Unique: Combines authentic historical archival materials (not synthetic or modern document scans) with MLCroissant metadata standards, enabling reproducible dataset versioning and automated schema discovery — most document datasets lack this dual focus on authenticity and machine-readable provenance

vs others: Larger and more historically diverse than small digit-recognition benchmarks (MNIST, SVHN), which are not document collections in the archival sense, while maintaining open-source accessibility and MLCroissant compliance for automated pipeline integration

15

LAION (Product)

via “large-scale image-text dataset access”
