{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14","slug":"mlfoundations--mint-1t-pdf-cc-2023-14","name":"MINT-1T-PDF-CC-2023-14","type":"dataset","url":"https://huggingface.co/datasets/mlfoundations/MINT-1T-PDF-CC-2023-14","page_url":"https://unfragile.ai/mlfoundations--mint-1t-pdf-cc-2023-14","categories":["model-training"],"tags":["task_categories:image-to-text","task_categories:text-generation","language:en","license:cc-by-4.0","size_categories:1M<n<10M","format:webdataset","modality:image","modality:text","library:datasets","library:webdataset","library:mlcroissant","arxiv:2406.11271","region:us","multimodal"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_0","uri":"capability://data.processing.analysis.large.scale.multimodal.document.image.text.dataset.loading","name":"large-scale multimodal document-image-text dataset loading","description":"Provides access to 1 trillion tokens of PDF-derived multimodal data (images + OCR text) from Common Crawl 2023-14, organized in WebDataset format for distributed streaming. Uses tar-based sharding architecture enabling efficient parallel loading across GPUs without requiring full dataset materialization on disk. Integrates with HuggingFace datasets library and MLCroissant metadata standard for reproducible, versioned access to 5.7M+ document samples.","intents":["train vision-language models on real-world document understanding tasks at scale","build multimodal retrieval systems using paired image-text document data","evaluate OCR and document layout understanding on diverse PDF sources","create synthetic training data pipelines for document classification and extraction"],"best_for":["ML researchers training large vision-language models (CLIP, LLaVA scale)","teams building document AI systems requiring diverse real-world PDF samples","organizations needing pre-processed, deduplicated multimodal training corpora"],"limitations":["5.7M samples may be insufficient for training models >10B parameters without augmentation","OCR quality varies by source document; no per-sample quality scores provided","WebDataset format requires sequential access patterns; random sampling requires full enumeration","CC-BY-4.0 license requires attribution in derivative works; commercial use requires compliance verification","No built-in filtering for sensitive document types (medical, financial, PII); requires downstream curation"],"requires":["HuggingFace datasets library (>=2.14.0)","WebDataset library (>=0.2.0) for efficient tar-based streaming","Python 3.8+","Minimum 100GB free disk space for partial caching; full dataset requires ~2TB","Network bandwidth for streaming from HuggingFace Hub or local mirror"],"input_types":["dataset identifier string (mlfoundations/MINT-1T-PDF-CC-2023-14)","configuration parameters (split, streaming mode, batch size)"],"output_types":["image tensors (variable resolution, typically 72-300 DPI)","OCR text strings (UTF-8 encoded)","metadata dictionaries (document source, page count, language tags)"],"categories":["data-processing-analysis","multimodal-datasets"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_1","uri":"capability://data.processing.analysis.ocr.aligned.image.text.pair.extraction.from.pdfs","name":"ocr-aligned image-text pair extraction from pdfs","description":"Automatically extracts and aligns image renderings of PDF pages with their corresponding OCR text output, preserving spatial relationships and document structure. Uses PDF parsing to generate page images at consistent DPI (72-300) and applies OCR engines (likely Tesseract or similar) to produce character-level text with bounding box metadata. Deduplication via content hashing removes near-duplicate pages across Common Crawl crawls.","intents":["train models to understand document layout and spatial text positioning","build systems that link visual regions in documents to extracted text","create datasets for document layout analysis and reading order prediction","evaluate OCR quality and correction models on real-world PDF diversity"],"best_for":["document layout analysis researchers","teams building document understanding pipelines (form extraction, table recognition)","OCR model developers needing diverse, real-world training examples"],"limitations":["OCR accuracy varies significantly by document quality, font, and language; no per-sample confidence scores","Spatial alignment between image and text may drift for complex multi-column layouts","Scanned PDFs with poor image quality produce degraded OCR; no quality filtering applied","No bounding box coordinates provided at scale; spatial metadata may be lossy","Language coverage limited to predominantly English documents from Common Crawl"],"requires":["PDF rendering library (PyPDF2, pdfplumber, or similar) for local inspection","Understanding of OCR output format and limitations","Compute for PDF-to-image conversion if processing locally (~0.5-2s per page)"],"input_types":["PDF documents from Common Crawl 2023-14 snapshot"],"output_types":["PNG/JPEG page images (variable resolution)","UTF-8 OCR text strings","Metadata: document source URL, page number, language tag"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_2","uri":"capability://automation.workflow.streaming.based.distributed.dataset.loading.for.multi.gpu.training","name":"streaming-based distributed dataset loading for multi-gpu training","description":"Implements WebDataset-compatible tar-based sharding that enables efficient parallel loading across distributed training clusters without materializing the full dataset on local storage. Each shard contains ~1000 samples; workers fetch shards on-demand and decompress in-memory, with built-in support for HuggingFace Datasets streaming mode and PyTorch DataLoader integration. Supports deterministic shuffling via seed-based shard ordering for reproducible training runs.","intents":["train large models on multi-GPU/multi-node clusters without requiring centralized NAS","reduce training startup time by streaming data on-demand rather than pre-downloading","enable fault-tolerant training with automatic shard re-fetching on worker failure","scale training to datasets larger than any single machine's storage capacity"],"best_for":["ML teams with distributed training infrastructure (Ray, PyTorch DDP, DeepSpeed)","organizations training models on cloud infrastructure with limited persistent storage","researchers requiring reproducible, version-controlled dataset access across runs"],"limitations":["Streaming adds ~50-200ms latency per shard fetch depending on network bandwidth","Deterministic shuffling requires knowing total shard count upfront; dynamic dataset growth not supported","WebDataset format requires sequential access within shards; random access requires full enumeration","No built-in caching strategy; repeated epochs re-fetch identical shards unless local cache configured","Requires stable network connectivity; transient failures may stall training without retry logic"],"requires":["PyTorch 1.9+ with DataLoader support","WebDataset library (>=0.2.0)","HuggingFace datasets (>=2.14.0)","Network bandwidth >=100 Mbps for efficient streaming","Distributed training framework (PyTorch DDP, DeepSpeed, or Ray)"],"input_types":["dataset configuration (split, batch_size, num_workers)","seed for deterministic shuffling"],"output_types":["batched tensors (images, text) compatible with PyTorch DataLoader","metadata dictionaries per sample"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_3","uri":"capability://memory.knowledge.mlcroissant.metadata.standard.compliance.and.reproducibility","name":"mlcroissant metadata standard compliance and reproducibility","description":"Publishes dataset metadata in MLCroissant format (W3C standard for machine learning datasets), enabling automated discovery, versioning, and reproducible access through standardized schema. Includes structured descriptions of splits, features, licenses, and data provenance (Common Crawl 2023-14 snapshot). Enables tools like HuggingFace Hub and Croissant parsers to automatically validate dataset integrity and generate data cards.","intents":["ensure reproducible dataset access across research teams and time","enable automated dataset discovery and filtering by metadata (license, modality, size)","generate standardized data documentation for model cards and research papers","validate dataset integrity and track provenance through Common Crawl versions"],"best_for":["research teams publishing models requiring reproducible dataset specifications","organizations building dataset catalogs and discovery systems","ML practitioners needing standardized metadata for compliance and auditing"],"limitations":["MLCroissant standard is still evolving; not all dataset properties map cleanly to schema","Metadata does not include per-sample quality scores or filtering recommendations","Provenance tracking limited to Common Crawl snapshot version; no fine-grained source attribution per document","No automated validation of OCR quality or image resolution consistency in metadata"],"requires":["MLCroissant parser library (croissant-py or similar)","Understanding of W3C Croissant schema","HuggingFace Datasets library for automated metadata loading"],"input_types":["MLCroissant JSON-LD metadata file"],"output_types":["structured metadata dictionary (splits, features, license, provenance)","data card HTML/markdown for documentation"],"categories":["memory-knowledge","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_4","uri":"capability://data.processing.analysis.common.crawl.2023.14.snapshot.filtering.and.deduplication","name":"common crawl 2023-14 snapshot filtering and deduplication","description":"Curates and deduplicates content from Common Crawl's 2023-14 snapshot using content hashing (likely SHA-256 or similar) to remove near-duplicate PDF pages across multiple crawl cycles. Applies language detection to filter predominantly English documents and removes known low-quality sources. Preserves document source URLs and metadata for traceability.","intents":["obtain diverse, real-world document samples without redundancy from web crawls","train models on authentic document distributions as they appear on the web","evaluate model robustness on varied document quality and formatting","trace document provenance back to original source URLs for validation"],"best_for":["researchers building models that must generalize to real-world web documents","teams needing authentic document diversity without synthetic augmentation","organizations requiring source attribution and URL traceability"],"limitations":["Deduplication may remove legitimate variations of similar documents (e.g., different versions of same form)","Language filtering is imperfect; non-English documents may remain, and English-heavy bias is introduced","Low-quality source filtering is heuristic-based; no manual review of excluded content","Common Crawl snapshot is static (2023-14); no continuous updates or newer content","No per-document quality scores; users must apply downstream filtering for specific use cases"],"requires":["Understanding of Common Crawl structure and WARC format","Familiarity with content hashing and deduplication techniques","Access to Common Crawl 2023-14 snapshot metadata"],"input_types":["Common Crawl 2023-14 WARC records and metadata"],"output_types":["deduplicated PDF documents with source URLs","metadata: document source, crawl timestamp, language tag"],"categories":["data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-mlfoundations--mint-1t-pdf-cc-2023-14__cap_5","uri":"capability://image.visual.variable.resolution.image.rendering.with.dpi.consistency","name":"variable-resolution image rendering with dpi consistency","description":"Renders PDF pages to images at configurable DPI (72-300 range) to balance visual fidelity with storage efficiency. Uses PDF rendering engines (likely poppler or similar) to convert vector-based PDF content to raster images while preserving text and layout information. Applies consistent DPI across dataset to enable batch processing without resolution normalization.","intents":["create training data with consistent visual quality across diverse PDF sources","enable models to learn document understanding at realistic screen/print resolutions","balance storage efficiency with visual fidelity for different downstream tasks","support OCR training on images with consistent rendering quality"],"best_for":["vision-language model developers requiring consistent image quality","document understanding researchers needing realistic rendering fidelity","teams optimizing storage-to-quality tradeoffs for large-scale training"],"limitations":["Fixed DPI may be suboptimal for documents designed for specific resolutions (e.g., 600 DPI scans)","Vector PDF content may render differently across rendering engines; no standardization guarantee","Rendering quality depends on embedded fonts; missing fonts may degrade output","No adaptive DPI selection based on document complexity or content type","Rendering adds computational overhead (~0.5-2s per page); no pre-computed resolution variants provided"],"requires":["PDF rendering library (poppler, pdfplumber, PyMuPDF)","Sufficient compute for PDF-to-image conversion","Understanding of DPI tradeoffs for specific use cases"],"input_types":["PDF documents at variable resolutions"],"output_types":["PNG/JPEG images at consistent DPI (72-300 range)","variable dimensions depending on page size and DPI"],"categories":["image-visual","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":23,"verified":false,"data_access_risk":"low","permissions":["HuggingFace datasets library (>=2.14.0)","WebDataset library (>=0.2.0) for efficient tar-based streaming","Python 3.8+","Minimum 100GB free disk space for partial caching; full dataset requires ~2TB","Network bandwidth for streaming from HuggingFace Hub or local mirror","PDF rendering library (PyPDF2, pdfplumber, or similar) for local inspection","Understanding of OCR output format and limitations","Compute for PDF-to-image conversion if processing locally (~0.5-2s per page)","PyTorch 1.9+ with DataLoader support","WebDataset library (>=0.2.0)"],"failure_modes":["5.7M samples may be insufficient for training models >10B parameters without augmentation","OCR quality varies by source document; no per-sample quality scores provided","WebDataset format requires sequential access patterns; random sampling requires full enumeration","CC-BY-4.0 license requires attribution in derivative works; commercial use requires compliance verification","No built-in filtering for sensitive document types (medical, financial, PII); requires downstream curation","OCR accuracy varies significantly by document quality, font, and language; no per-sample confidence scores","Spatial alignment between image and text may drift for complex multi-column layouts","Scanned PDFs with poor image quality produce degraded OCR; no quality filtering applied","No bounding box coordinates provided at scale; spatial metadata may be lossy","Language coverage limited to predominantly English documents from Common Crawl","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.22,"ecosystem":0.5000000000000001,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:22.764Z","last_scraped_at":"2026-04-22T08:08:14.361Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mlfoundations--mint-1t-pdf-cc-2023-14","compare_url":"https://unfragile.ai/compare?artifact=mlfoundations--mint-1t-pdf-cc-2023-14"}},"signature":"EjMjRSKfPfjkFLA234ZuuVCkWXzb0yaEStYisfBzXO6V6IwQdxaSwI2ubdTtnOfq6qhgbuWHw0Sh2RL5mTazCw==","signedAt":"2026-06-21T09:04:56.747Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mlfoundations--mint-1t-pdf-cc-2023-14","artifact":"https://unfragile.ai/mlfoundations--mint-1t-pdf-cc-2023-14","verify":"https://unfragile.ai/api/v1/verify?slug=mlfoundations--mint-1t-pdf-cc-2023-14","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}