{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40","slug":"mlfoundations--mint-1t-pdf-cc-2023-40","name":"MINT-1T-PDF-CC-2023-40","type":"dataset","url":"https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-40","page_url":"https://unfragile.ai/mlfoundations--mint-1t-pdf-cc-2023-40","categories":["model-training"],"tags":["task_categories:image-to-text","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:100B<n<1T","arxiv:2406.11271","region:us","multimodal"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_0","uri":"capability://data.processing.analysis.multimodal.document.to.text.extraction.at.scale","name":"multimodal document-to-text extraction at scale","description":"Extracts text content from 1 trillion tokens of PDF documents using OCR and layout-aware parsing, preserving document structure and spatial relationships. The dataset combines Common Crawl PDF snapshots with machine-readable text extraction, enabling training of models that understand both visual layout and semantic content. Architecture uses distributed PDF processing pipelines to handle heterogeneous document formats (scanned PDFs, native PDFs, mixed content) across 857K+ document samples.","intents":["Train vision-language models that understand document structure and layout","Build OCR systems that preserve formatting and spatial relationships","Create datasets for document understanding and information extraction tasks","Develop models that can reason about both text content and visual presentation"],"best_for":["ML researchers training document understanding models","Teams building enterprise document processing pipelines","Researchers working on multimodal vision-language models","Organizations needing large-scale OCR training data"],"limitations":["Dataset is 100B-1T tokens in size — requires significant storage (terabyte-scale infrastructure) and computational resources for full training","PDF quality varies across Common Crawl sources — some documents may have poor OCR quality or corrupted metadata","English-language focused — limited multilingual coverage despite global web crawl","Static snapshot from 2023 — does not include real-time or continuously updated documents","No built-in quality filtering for document relevance — requires downstream curation for domain-specific applications"],"requires":["Hugging Face Datasets library (datasets>=2.0)","Minimum 500GB storage for partial dataset access","Python 3.8+","For full training: distributed computing infrastructure (GPU/TPU clusters with 100GB+ VRAM)"],"input_types":["PDF documents (native and scanned)","Document metadata (source URLs, timestamps)","Layout annotations (bounding boxes, page structure)"],"output_types":["Extracted text with structure preservation","Image representations of document pages","Paired text-image samples for multimodal training","Metadata including document source and processing metadata"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_1","uri":"capability://data.processing.analysis.paired.image.text.dataset.construction.for.vision.language.training","name":"paired image-text dataset construction for vision-language training","description":"Provides structured image-text pairs extracted from PDF documents where images are document pages and text is extracted content, enabling direct training of vision-language models without manual annotation. The dataset architecture preserves the natural alignment between visual document layout and corresponding text, creating implicit supervision signals. Processing pipeline handles page segmentation, text-image alignment, and quality filtering across millions of document samples.","intents":["Train vision-language models on document understanding without manual annotation","Create aligned image-text datasets for contrastive learning (CLIP-style training)","Build models that can answer questions about document content based on visual input","Develop document classification and retrieval systems using multimodal embeddings"],"best_for":["ML teams training CLIP-style vision-language models","Researchers building document question-answering systems","Organizations developing multimodal retrieval systems","Teams working on document-based RAG (retrieval-augmented generation) systems"],"limitations":["Implicit alignment between images and text may be noisy — some documents have complex layouts where text-image correspondence is ambiguous","Page-level granularity may be too coarse for fine-grained visual reasoning tasks requiring sub-document element understanding","No explicit quality scores for image-text pairs — requires downstream filtering for high-quality training data","Scanned PDFs may have variable OCR quality affecting text reliability for training"],"requires":["Hugging Face Datasets library with streaming support","Image processing libraries (PIL, OpenCV)","Python 3.8+","For training: PyTorch or TensorFlow with multimodal model support"],"input_types":["PDF pages (as images)","Extracted text content","Document metadata and source information"],"output_types":["Image tensors (document page renderings)","Text strings (extracted content)","Paired samples for contrastive learning","Metadata linking images to source documents"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_2","uri":"capability://data.processing.analysis.large.scale.text.corpus.for.language.model.pretraining","name":"large-scale text corpus for language model pretraining","description":"Supplies 1 trillion tokens of English text extracted from PDF documents, suitable for pretraining or continued training of large language models. The corpus is derived from diverse document sources across Common Crawl, providing varied writing styles, domains, and content types. Processing pipeline includes tokenization, deduplication, and quality filtering to ensure training data suitability while maintaining scale.","intents":["Pretrain or continue-train large language models with document-sourced text","Create domain-specific language models by fine-tuning on document corpora","Build specialized models for document understanding and analysis tasks","Augment existing pretraining datasets with document-specific content"],"best_for":["ML researchers training foundation models with document-heavy content","Teams building domain-specific language models (legal, scientific, technical)","Organizations needing large-scale English text corpora for model training","Researchers studying how document structure affects language model behavior"],"limitations":["1 trillion tokens is substantial but smaller than largest modern pretraining corpora (e.g., Llama 2 used 2 trillion tokens) — may require supplementation for state-of-the-art models","English-only focus limits multilingual model development","Document-sourced text may have different statistical properties than web text (e.g., higher formality, different domain distribution) — requires careful mixing with other corpora","No explicit quality tiers — all tokens treated equally despite potential variation in source document quality","Static snapshot from 2023 — does not reflect evolving language use or recent events"],"requires":["Hugging Face Datasets library","Tokenizer compatible with target model (e.g., GPT-2, LLaMA tokenizers)","Python 3.8+","For training: distributed training infrastructure (Ray, DeepSpeed, or similar) with multi-GPU/TPU support"],"input_types":["Raw extracted text from PDFs","Document metadata (source, timestamp)"],"output_types":["Tokenized sequences","Text chunks at various granularities","Metadata-annotated text samples"],"categories":["data-processing-analysis","text-generation-language"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_3","uri":"capability://data.processing.analysis.document.domain.dataset.sampling.and.filtering","name":"document-domain dataset sampling and filtering","description":"Enables selective access to dataset subsets filtered by document characteristics (source domain, document type, quality metrics) without downloading the full 1 trillion token corpus. The dataset infrastructure supports streaming access with client-side filtering, allowing researchers to construct domain-specific training sets from the larger collection. Filtering operates on document metadata including source URLs, extraction quality scores, and document type classifications.","intents":["Create domain-specific training datasets (e.g., scientific papers, legal documents, technical manuals)","Sample balanced datasets across document types for targeted model training","Filter out low-quality documents or specific sources for quality-focused training","Explore dataset composition and statistics without full download"],"best_for":["Researchers building domain-specific models without full dataset download","Teams with limited storage requiring selective dataset access","Organizations needing quality-filtered subsets for production training","Exploratory researchers analyzing dataset composition and statistics"],"limitations":["Filtering operates on available metadata — fine-grained content-based filtering (e.g., by topic or writing style) requires downloading and processing samples","Streaming access adds latency compared to local dataset copies — not suitable for repeated training iterations without caching","No pre-computed quality scores — filtering by quality requires custom evaluation","Limited filtering dimensions — metadata may not capture all relevant document characteristics"],"requires":["Hugging Face Datasets library with streaming support","Network connectivity for streaming access","Python 3.8+","Hugging Face account for dataset access"],"input_types":["Filter criteria (domain, document type, quality thresholds)","Metadata queries"],"output_types":["Filtered dataset subsets","Dataset statistics and composition information","Sampled documents matching filter criteria"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_4","uri":"capability://data.processing.analysis.document.structure.and.layout.preservation.in.extraction","name":"document structure and layout preservation in extraction","description":"Maintains document layout information (page structure, text positioning, formatting) during PDF-to-text conversion, enabling models to learn relationships between visual layout and semantic content. The extraction pipeline preserves spatial coordinates, text ordering, and structural hierarchy (headings, sections, lists) rather than flattening documents to linear text. This architectural choice enables training of layout-aware models that can reason about document organization.","intents":["Train models that understand document structure and layout significance","Build systems that can extract information based on visual document organization","Create models that preserve formatting when processing documents","Develop layout-aware document understanding and retrieval systems"],"best_for":["Researchers building layout-aware document understanding models","Teams developing document structure analysis systems","Organizations building document-to-structured-data extraction pipelines","Researchers studying how document layout affects information extraction"],"limitations":["Layout preservation adds complexity to data representation — requires specialized handling in model architectures","Scanned PDFs may have inconsistent or degraded layout information affecting structure preservation","Layout-aware training requires models with spatial reasoning capabilities — not compatible with simple text-only architectures","No standardized format for layout representation — requires custom parsing for different model frameworks"],"requires":["PDF parsing libraries with layout support (e.g., pdfplumber, PyPDF2)","Model architectures supporting spatial/layout information (vision transformers, layout-aware LLMs)","Python 3.8+"],"input_types":["PDF documents with layout information","Page structure metadata (coordinates, text boxes)"],"output_types":["Structured text with layout annotations","Spatial coordinate information","Hierarchical document structure representations","Image-text pairs preserving layout context"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-40__cap_5","uri":"capability://data.processing.analysis.common.crawl.pdf.snapshot.integration.and.versioning","name":"common crawl pdf snapshot integration and versioning","description":"Provides access to a specific snapshot of PDF documents from Common Crawl (2023-40 version), with consistent versioning and reproducibility guarantees. The dataset is built from a fixed Common Crawl snapshot, enabling reproducible research and consistent data across training runs. Infrastructure includes metadata linking documents to their Common Crawl source, enabling traceability and potential re-extraction with updated pipelines.","intents":["Access reproducible, versioned document corpora for research","Build models with traceable data provenance for publication and reproducibility","Compare model performance across different dataset versions","Understand document source distribution and Common Crawl composition"],"best_for":["Researchers requiring reproducible datasets for published work","Teams building models with strict data provenance requirements","Organizations comparing model performance across dataset versions","Researchers studying Common Crawl composition and quality"],"limitations":["Static snapshot from 2023 — does not include documents added to Common Crawl after snapshot date","Common Crawl PDF quality varies — includes spam, corrupted files, and low-quality documents without filtering","No automatic updates — requires manual re-processing for new Common Crawl snapshots","Versioning is dataset-specific — may not align with other Common Crawl-based datasets"],"requires":["Hugging Face Datasets library","Knowledge of Common Crawl structure and metadata","Python 3.8+"],"input_types":["Common Crawl snapshot identifiers","Document source URLs"],"output_types":["Extracted documents with source metadata","Common Crawl provenance information","Version identifiers and timestamps"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":23,"verified":false,"data_access_risk":"low","permissions":["Hugging Face Datasets library (datasets>=2.0)","Minimum 500GB storage for partial dataset access","Python 3.8+","For full training: distributed computing infrastructure (GPU/TPU clusters with 100GB+ VRAM)","Hugging Face Datasets library with streaming support","Image processing libraries (PIL, OpenCV)","For training: PyTorch or TensorFlow with multimodal model support","Hugging Face Datasets library","Tokenizer compatible with target model (e.g., GPT-2, LLaMA tokenizers)","For training: distributed training infrastructure (Ray, DeepSpeed, or similar) with multi-GPU/TPU support"],"failure_modes":["Dataset is 100B-1T tokens in size — requires significant storage (terabyte-scale infrastructure) and computational resources for full training","PDF quality varies across Common Crawl sources — some documents may have poor OCR quality or corrupted metadata","English-language focused — limited multilingual coverage despite global web crawl","Static snapshot from 2023 — does not include real-time or continuously updated documents","No built-in quality filtering for document relevance — requires downstream curation for domain-specific applications","Implicit alignment between images and text may be noisy — some documents have complex layouts where text-image correspondence is ambiguous","Page-level granularity may be too coarse for fine-grained visual reasoning tasks requiring sub-document element understanding","No explicit quality scores for image-text pairs — requires downstream filtering for high-quality training data","Scanned PDFs may have variable OCR quality affecting text reliability for training","1 trillion tokens is substantial but smaller than largest modern pretraining corpora (e.g., Llama 2 used 2 trillion tokens) — may require supplementation for state-of-the-art models","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.22,"ecosystem":0.5000000000000001,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:22.764Z","last_scraped_at":"2026-04-22T08:08:14.361Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mlfoundations--mint-1t-pdf-cc-2023-40","compare_url":"https://unfragile.ai/compare?artifact=mlfoundations--mint-1t-pdf-cc-2023-40"}},"signature":"j/8Ism+Hzo2t4WKK6Aa59f6VhDpru9yRajzN/4fYUr2qkvayZ7XsIGiWVL2VMMxv4zMDhUo6kYWc1SjdvoRfBA==","signedAt":"2026-06-21T23:01:02.465Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mlfoundations--mint-1t-pdf-cc-2023-40","artifact":"https://unfragile.ai/mlfoundations--mint-1t-pdf-cc-2023-40","verify":"https://unfragile.ai/api/v1/verify?slug=mlfoundations--mint-1t-pdf-cc-2023-40","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}