{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"hf-dataset-hf-doc-build--doc-build","slug":"hf-doc-build--doc-build","name":"doc-build","type":"dataset","url":"https://huggingface.co/datasets/hf-doc-build/doc-build","page_url":"https://unfragile.ai/hf-doc-build--doc-build","categories":["model-training"],"tags":["license:mit","region:us"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"hf-dataset-hf-doc-build--doc-build__cap_0","uri":"capability://data.processing.analysis.documentation.source.code.pair.extraction.and.indexing","name":"documentation-source-code-pair extraction and indexing","description":"Extracts aligned pairs of documentation text and source code from HuggingFace repositories and related projects, organizing them into a structured dataset with 282,022 examples. The dataset uses a collection pipeline that crawls public repositories, parses documentation files (Markdown, RST, HTML), correlates them with corresponding source code files through AST analysis and file path heuristics, and stores the pairs in a standardized format (typically Parquet or JSON Lines) with metadata including source repository, file paths, and documentation type. This enables downstream models to learn the relationship between natural language documentation and code implementation.","intents":["Train code-to-documentation generation models that can automatically write docstrings and API documentation from source code","Build documentation-to-code retrieval systems that find relevant code implementations given natural language queries","Develop code summarization models that learn to explain code behavior in natural language","Create documentation quality assessment models by learning patterns of well-documented code"],"best_for":["ML researchers training neural models for code documentation tasks","Teams building IDE plugins that auto-generate docstrings from code","Organizations developing code-to-documentation search engines","Academic groups studying the relationship between code and natural language"],"limitations":["Dataset is static snapshot — does not automatically update as source repositories evolve; requires periodic re-crawling to capture new documentation patterns","Documentation-code alignment is heuristic-based (path matching, AST correlation) and may have false positives/negatives, especially for complex multi-file documentation","Heavily skewed toward Python and JavaScript projects due to HuggingFace ecosystem composition; limited coverage of Java, C++, Rust documentation patterns","No built-in deduplication — may contain near-duplicate pairs from forked repositories or similar projects","Metadata is minimal — lacks information about documentation quality, code complexity metrics, or temporal relationships"],"requires":["HuggingFace datasets library (transformers>=4.0)","Python 3.7+","Disk space for full dataset (~2-5GB depending on format)","Internet connection to download from HuggingFace Hub"],"input_types":["HuggingFace dataset identifier string","Optional filtering parameters (repository name, language, documentation type)"],"output_types":["Structured records with fields: documentation_text (string), source_code (string), repository (string), file_path (string), language (string)","Batch exports in Parquet, JSON Lines, or CSV format","PyArrow Table for in-memory processing"],"categories":["data-processing-analysis","dataset-curation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-hf-doc-build--doc-build__cap_1","uri":"capability://data.processing.analysis.multi.language.code.documentation.corpus.filtering.and.sampling","name":"multi-language code-documentation corpus filtering and sampling","description":"Provides mechanisms to filter and sample the documentation-code pairs by programming language, documentation format (docstring, API docs, README), and repository characteristics. The dataset supports stratified sampling to create balanced subsets across languages and documentation types, and includes metadata fields that enable downstream filtering without re-downloading the full dataset. Filtering is performed at the HuggingFace dataset level using the library's built-in map() and filter() operations, which are optimized for lazy evaluation and streaming to avoid loading the entire dataset into memory.","intents":["Create language-specific training subsets (e.g., Python-only documentation corpus for fine-tuning Python-focused models)","Build balanced datasets across multiple languages to train multilingual code-documentation models","Sample representative subsets for rapid prototyping and validation before full-scale training","Analyze documentation patterns by language and repository type to understand best practices"],"best_for":["ML engineers fine-tuning models on specific programming languages","Researchers studying cross-language documentation patterns","Teams with limited compute budgets needing representative subsets","Data scientists performing exploratory analysis on code-documentation relationships"],"limitations":["Filtering operations require loading metadata for all 282k examples into memory; full-dataset filtering may use 1-2GB RAM on machines with limited resources","No built-in stratified sampling API — requires manual implementation using HuggingFace dataset utilities or external libraries like scikit-learn","Language detection relies on file extension heuristics; may misclassify polyglot repositories or files with non-standard extensions","Documentation type classification (docstring vs API docs vs README) is imperfect and based on file path patterns rather than content analysis"],"requires":["HuggingFace datasets library with filter() and map() support","Python 3.7+","Optional: scikit-learn or pandas for advanced sampling strategies"],"input_types":["Filter predicates (lambda functions or column-based conditions)","Sampling parameters (fraction, seed, stratification column)"],"output_types":["Filtered HuggingFace Dataset object","Sampled subset as Parquet, JSON Lines, or in-memory table","Statistics on filtered subset (count by language, documentation type)"],"categories":["data-processing-analysis","dataset-curation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-hf-doc-build--doc-build__cap_2","uri":"capability://data.processing.analysis.documentation.code.pair.validation.and.quality.assessment","name":"documentation-code pair validation and quality assessment","description":"Enables assessment of alignment quality between documentation and code pairs through structural validation and heuristic scoring. The dataset includes metadata that can be used to compute alignment metrics: code-to-documentation length ratios, presence of code examples in documentation, consistency of function/class names between documentation and implementation, and documentation coverage (percentage of public APIs documented). These metrics are computed via post-processing scripts that parse code ASTs and documentation text, comparing extracted identifiers and structure to measure alignment strength.","intents":["Filter out low-quality or misaligned documentation-code pairs before training to improve model quality","Identify and flag documentation that is outdated or inconsistent with code implementation","Measure documentation coverage across codebases to identify undocumented APIs","Create quality-weighted datasets where high-alignment pairs receive higher training weight"],"best_for":["ML teams building production code-documentation models that require high-quality training data","Code quality auditors assessing documentation completeness in large codebases","Researchers studying the relationship between documentation quality and code maintainability","Data engineers building data pipelines that require quality gates"],"limitations":["Validation metrics are heuristic-based and may not capture semantic misalignment; a pair can have high structural alignment but document the wrong behavior","AST parsing is language-specific; validation is limited to languages with mature parser support (Python, JavaScript); limited validation for Java, C++, Rust","No human-in-the-loop validation — all quality scores are automated and may not reflect actual documentation usefulness","Validation adds computational overhead (~50-100ms per pair for AST parsing); full validation of 282k pairs requires significant compute time","Metrics are sensitive to code style and documentation conventions; may penalize valid but unconventional documentation patterns"],"requires":["Language-specific AST parsers (tree-sitter, ast module for Python, etc.)","Python 3.7+","Optional: spaCy or NLTK for NLP-based documentation analysis"],"input_types":["Documentation-code pair records from the dataset","Quality threshold parameters (minimum alignment score, coverage percentage)"],"output_types":["Quality scores (0-1 alignment score, coverage percentage, metric breakdown)","Filtered dataset containing only high-quality pairs","Quality report with statistics on alignment distribution"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-hf-doc-build--doc-build__cap_3","uri":"capability://data.processing.analysis.dataset.versioning.and.reproducible.data.splits","name":"dataset versioning and reproducible data splits","description":"Supports reproducible train/validation/test splits through deterministic seeding and version-pinned dataset snapshots on HuggingFace Hub. The dataset is versioned with Git-based revision tracking, allowing researchers to specify exact dataset versions in their experiments (e.g., 'revision=main' or 'revision=v1.0'). Splits are created using seeded random sampling, ensuring that the same split configuration produces identical results across different machines and time periods. This enables reproducibility in research and allows teams to compare models trained on identical data subsets.","intents":["Create reproducible train/validation/test splits for machine learning experiments that can be exactly replicated by other researchers","Version datasets alongside model checkpoints to ensure full experiment reproducibility","Compare model performance across teams using identical data splits","Track dataset evolution and understand how dataset changes impact model performance"],"best_for":["Academic researchers publishing papers with code-documentation models","ML teams requiring reproducible experiments for regulatory compliance or audit trails","Open-source projects maintaining consistent benchmarks across contributors","Organizations tracking model performance over time as datasets evolve"],"limitations":["Version pinning requires explicit revision specification; default behavior loads the latest version, which may differ from original training data","Deterministic splits require fixed random seeds; changes to HuggingFace dataset library versions may affect split reproducibility if internal random number generation changes","No built-in support for stratified splits across multiple dimensions (language + documentation type); requires custom implementation","Dataset versioning is tied to HuggingFace Hub infrastructure; offline or disconnected environments cannot access version history"],"requires":["HuggingFace datasets library with revision support","Python 3.7+","Internet connection to access HuggingFace Hub for version information"],"input_types":["Dataset identifier with optional revision (e.g., 'hf-doc-build/doc-build@v1.0')","Split configuration (train fraction, validation fraction, test fraction, random seed)"],"output_types":["Train/validation/test Dataset objects with identical composition across runs","Split metadata (number of examples per split, random seed used)","Version information (revision hash, creation date)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"hf-dataset-hf-doc-build--doc-build__cap_4","uri":"capability://data.processing.analysis.batch.dataset.export.and.format.conversion","name":"batch dataset export and format conversion","description":"Enables efficient export of the documentation-code dataset to multiple formats (Parquet, JSON Lines, CSV, Arrow) for integration with different ML frameworks and data pipelines. Exports are performed using HuggingFace's built-in save_to_disk() and to_csv()/to_json() methods, which support streaming and batching to avoid memory overflow on large datasets. The export process preserves all metadata fields and supports optional compression (gzip, snappy) to reduce storage footprint. Exported datasets can be directly loaded into PyTorch DataLoaders, TensorFlow tf.data pipelines, or processed with pandas/Polars for analysis.","intents":["Export filtered dataset subsets to local disk for training with PyTorch or TensorFlow without keeping full dataset in memory","Convert dataset to CSV or JSON for analysis in Jupyter notebooks or data exploration tools","Create compressed dataset archives for distribution to team members or publication with papers","Integrate dataset with existing data pipelines that expect specific formats (Parquet for Spark, JSON Lines for streaming systems)"],"best_for":["ML engineers integrating the dataset into PyTorch/TensorFlow training pipelines","Data analysts exploring the dataset in pandas or Polars","Teams distributing dataset subsets to collaborators with size constraints","Data engineers building ETL pipelines that consume the dataset"],"limitations":["Full dataset export to single CSV file is impractical (282k rows × multiple text columns = multi-GB file); requires partitioning or streaming export","JSON Lines format preserves all data but produces large files without compression; Parquet is more efficient but requires additional libraries","Export performance depends on disk I/O speed; exporting full dataset to local SSD takes 5-15 minutes depending on format and compression","Exported files lose the lazy-loading benefits of HuggingFace streaming; full exports must be loaded into memory for processing","Format conversion may lose metadata or type information if target format has limited schema support (e.g., CSV flattening nested fields)"],"requires":["HuggingFace datasets library","Python 3.7+","Disk space for exported format (2-5GB for uncompressed, 500MB-1GB compressed)","Optional: pyarrow for Parquet export, pandas for CSV export"],"input_types":["HuggingFace Dataset object (filtered or full)","Export format specification (parquet, json, csv, arrow)","Optional: compression algorithm (gzip, snappy), output path"],"output_types":["Parquet files (columnar, efficient for analytics)","JSON Lines files (one record per line, streaming-friendly)","CSV files (spreadsheet-compatible)","Arrow files (zero-copy in-memory format)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":21,"verified":false,"data_access_risk":"low","permissions":["HuggingFace datasets library (transformers>=4.0)","Python 3.7+","Disk space for full dataset (~2-5GB depending on format)","Internet connection to download from HuggingFace Hub","HuggingFace datasets library with filter() and map() support","Optional: scikit-learn or pandas for advanced sampling strategies","Language-specific AST parsers (tree-sitter, ast module for Python, etc.)","Optional: spaCy or NLTK for NLP-based documentation analysis","HuggingFace datasets library with revision support","Internet connection to access HuggingFace Hub for version information"],"failure_modes":["Dataset is static snapshot — does not automatically update as source repositories evolve; requires periodic re-crawling to capture new documentation patterns","Documentation-code alignment is heuristic-based (path matching, AST correlation) and may have false positives/negatives, especially for complex multi-file documentation","Heavily skewed toward Python and JavaScript projects due to HuggingFace ecosystem composition; limited coverage of Java, C++, Rust documentation patterns","No built-in deduplication — may contain near-duplicate pairs from forked repositories or similar projects","Metadata is minimal — lacks information about documentation quality, code complexity metrics, or temporal relationships","Filtering operations require loading metadata for all 282k examples into memory; full-dataset filtering may use 1-2GB RAM on machines with limited resources","No built-in stratified sampling API — requires manual implementation using HuggingFace dataset utilities or external libraries like scikit-learn","Language detection relies on file extension heuristics; may misclassify polyglot repositories or files with non-standard extensions","Documentation type classification (docstring vs API docs vs README) is imperfect and based on file path patterns rather than content analysis","Validation metrics are heuristic-based and may not capture semantic misalignment; a pair can have high structural alignment but document the wrong behavior","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.2,"ecosystem":0.36,"match_graph":0.25,"freshness":0.75,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-05-24T12:16:22.764Z","last_scraped_at":"2026-05-03T14:22:48.064Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=hf-doc-build--doc-build","compare_url":"https://unfragile.ai/compare?artifact=hf-doc-build--doc-build"}},"signature":"X39fhUp/6SOSQ2fSDWoauX71qfWIt9ohmrWweN12XWCh6JCiSyBB7hHvSqwT67MaQkX1LqSdbDrLrxW0Lzt5AA==","signedAt":"2026-06-20T08:24:01.571Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/hf-doc-build--doc-build","artifact":"https://unfragile.ai/hf-doc-build--doc-build","verify":"https://unfragile.ai/api/v1/verify?slug=hf-doc-build--doc-build","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}