{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"pypi_pypi-datasets","slug":"pypi-datasets","name":"datasets","type":"dataset","url":"https://github.com/huggingface/datasets","page_url":"https://unfragile.ai/pypi-datasets","categories":["model-training"],"tags":["datasets","machine","learning","datasets"],"pricing":{"model":"open_source","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"pypi_pypi-datasets__cap_0","uri":"capability://data.processing.analysis.arrow.backed.in.memory.dataset.loading.and.manipulation","name":"arrow-backed in-memory dataset loading and manipulation","description":"Loads datasets into memory as PyArrow Table objects via the Dataset class, enabling columnar storage with zero-copy access patterns. The ArrowDataset abstraction wraps PyArrow's Table API, providing lazy evaluation for transformations (map, filter, select) that are compiled into Arrow compute expressions rather than executed immediately. This approach enables efficient memory usage and fast iteration over structured data with native support for nested types, media features (images, audio), and distributed processing.","intents":["Load a CSV or Parquet file into memory and perform column-wise transformations without materializing intermediate results","Access individual rows or slices of a dataset with O(1) lookup time using Arrow's columnar indexing","Apply map/filter operations to datasets with automatic caching and fingerprinting to avoid recomputation"],"best_for":["ML engineers building training pipelines with datasets that fit in memory (< 100GB)","Data scientists prototyping transformations on structured tabular data","Teams using PyTorch/TensorFlow that need efficient data loading with framework integration"],"limitations":["Entire dataset must fit in available RAM; no built-in out-of-core processing for datasets > system memory","Transformations are lazy but still require full dataset scan on first execution, adding latency for large datasets","Arrow Table schema is immutable after creation; schema changes require full dataset rewrite"],"requires":["Python 3.8+","PyArrow 1.0+","Sufficient RAM to hold dataset in memory"],"input_types":["CSV files","Parquet files","JSON/JSONL","Python dictionaries/lists","Pandas DataFrames"],"output_types":["PyArrow Table","Pandas DataFrame","NumPy arrays","PyTorch DataLoader","TensorFlow tf.data.Dataset"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_1","uri":"capability://data.processing.analysis.streaming.dataset.iteration.with.memory.bounded.buffering","name":"streaming dataset iteration with memory-bounded buffering","description":"The IterableDataset class enables streaming data loading without materializing the full dataset in memory, using a buffer-based approach that fetches data in configurable chunks. Implements a generator-based iteration pattern where data is downloaded and processed on-the-fly, with optional local caching of streamed batches. This architecture supports infinite datasets and enables training on datasets larger than available RAM by trading off random access for sequential streaming efficiency.","intents":["Train models on datasets larger than available GPU/CPU memory by streaming batches sequentially","Process web-scale datasets (e.g., Common Crawl) without downloading the entire corpus first","Implement distributed training where each worker streams a different shard of the dataset"],"best_for":["Researchers training on massive datasets (ImageNet-scale or larger) with limited local storage","Production ML pipelines that need to handle unbounded data streams","Distributed training setups where data sharding across workers is critical"],"limitations":["No random access to dataset elements; iteration is strictly sequential, preventing shuffling without buffering entire dataset","Streaming introduces network latency; effective throughput depends on download speed and buffer size tuning","Reproducibility requires explicit seed management; default behavior may not guarantee deterministic ordering across runs"],"requires":["Python 3.8+","Network connectivity for remote datasets","Configurable buffer size based on available RAM (default 1000 examples)"],"input_types":["Remote Parquet files","JSONL streams","Hugging Face Hub datasets","Custom generator functions"],"output_types":["Generator yielding individual examples","Batched iterators","PyTorch IterableDataset wrapper"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_10","uri":"capability://data.processing.analysis.data.file.discovery.and.pattern.matching.for.multi.file.datasets","name":"data file discovery and pattern matching for multi-file datasets","description":"The data_files module automatically discovers and matches data files based on glob patterns and file extensions, enabling loading of datasets split across multiple files (e.g., train_*.parquet, test_*.csv). The system supports hierarchical directory structures, multiple file formats in a single dataset, and custom pattern matching logic. It handles file listing, format detection, and split assignment automatically, abstracting away file system complexity.","intents":["Load a dataset split across 100 Parquet files (train_0.parquet, train_1.parquet, ...) by specifying a glob pattern","Automatically discover train/test/validation splits from a directory structure without manual file listing","Load a dataset with mixed formats (some splits in Parquet, others in CSV) with automatic format detection"],"best_for":["Data engineers managing large datasets split across multiple files","Teams with datasets organized in hierarchical directory structures","Researchers working with public datasets that use glob-based file organization"],"limitations":["Pattern matching is glob-based; complex file naming schemes may require custom matching logic","File discovery is eager; listing thousands of files can be slow on network file systems","No built-in support for compressed archives (tar.gz, zip); files must be extracted first"],"requires":["Python 3.8+","Files accessible via local filesystem or remote URLs","Glob pattern matching support"],"input_types":["Directory path","Glob pattern string","File extension list"],"output_types":["List of file paths","Split-to-files mapping","Discovered file formats"],"categories":["data-processing-analysis","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_11","uri":"capability://data.processing.analysis.dataset.splitting.and.train.test.validation.partitioning","name":"dataset splitting and train/test/validation partitioning","description":"The train_test_split() method partitions a dataset into multiple splits (train, test, validation) with configurable ratios and optional stratification. The system supports deterministic splitting via seed-based shuffling, stratified splitting to maintain class distributions, and custom split functions. The implementation returns a DatasetDict with named splits, enabling easy access to each partition throughout the training pipeline.","intents":["Split a dataset into 80% train and 20% test with deterministic shuffling using a fixed seed","Create stratified splits that maintain the original class distribution across train/test sets","Partition a dataset into train/validation/test with custom ratios (e.g., 60/20/20)"],"best_for":["ML practitioners building training pipelines with standard train/test splits","Data scientists working with imbalanced datasets that need stratified splitting","Researchers ensuring reproducible dataset partitioning across experiments"],"limitations":["Stratification requires a label column; no automatic stratification for multi-label or regression tasks","Splitting is deterministic but requires explicit seed; default behavior may vary across library versions","No support for time-based splitting (e.g., temporal train/test split for time series)"],"requires":["Python 3.8+","Dataset with sufficient examples for desired split ratios","Optional: label column for stratified splitting"],"input_types":["Dataset","Train/test ratio (float or dict of ratios)","Seed (integer)","Stratify column name (optional)"],"output_types":["DatasetDict with 'train' and 'test' keys","DatasetDict with custom split names"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_12","uri":"capability://automation.workflow.metadata.and.dataset.card.generation.with.standardized.documentation","name":"metadata and dataset card generation with standardized documentation","description":"The DatasetCard class provides a structured format for dataset documentation following Hugging Face standards, including description, license, citations, and usage instructions. The system generates cards from templates and metadata, validates card structure, and publishes cards to the Hub alongside datasets. The architecture supports both manual card creation and automatic generation from dataset properties.","intents":["Create a dataset card documenting the dataset's purpose, license, and how to cite it for publication","Automatically generate a basic dataset card from dataset metadata and publish it to Hugging Face Hub","Validate that a dataset card follows Hugging Face standards before publishing"],"best_for":["Dataset maintainers publishing datasets with proper documentation and attribution","Researchers sharing datasets alongside papers with standardized metadata","Teams ensuring dataset governance with documented licenses and usage terms"],"limitations":["Card generation is template-based; complex documentation requires manual editing","No automatic extraction of metadata from data files; manual specification required","Validation is basic; no enforcement of completeness or quality standards"],"requires":["Python 3.8+","Dataset object or metadata dictionary","Optional: Hugging Face Hub account for publishing"],"input_types":["Dataset metadata (dict)","Card template (Markdown or YAML)","Dataset object"],"output_types":["DatasetCard object","Markdown card content","Validated card structure"],"categories":["automation-workflow","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_2","uri":"capability://data.processing.analysis.unified.dataset.loading.from.multiple.sources.via.load.dataset.api","name":"unified dataset loading from multiple sources via load_dataset api","description":"The load_dataset() function provides a single entry point for loading datasets from diverse sources (local files, Hugging Face Hub, remote URLs, custom scripts) by routing to appropriate DatasetBuilder implementations. The system uses a plugin architecture where each dataset is defined by a builder module (Python script or packaged module) that specifies download logic, data file patterns, and feature schemas. The API handles caching, version management, and automatic format detection, abstracting away source-specific complexity.","intents":["Load a public dataset from Hugging Face Hub with a single line of code without knowing its storage format or location","Create a custom dataset loader for proprietary data by writing a DatasetBuilder subclass","Automatically cache downloaded datasets locally and reuse cached versions across runs"],"best_for":["ML practitioners who want quick access to standard benchmarks without format conversion","Dataset maintainers publishing datasets to Hugging Face Hub for community use","Teams building internal dataset catalogs with standardized loading interfaces"],"limitations":["Custom builders require Python knowledge; no low-code UI for defining data loading logic","Hub-hosted datasets depend on Hugging Face infrastructure availability; no built-in fallback mirrors","Large datasets may have slow initial download; caching is local-only without distributed cache support"],"requires":["Python 3.8+","Internet connectivity for Hub datasets","Hugging Face account (optional, for private datasets)","Sufficient disk space for caching (configurable via HF_DATASETS_CACHE)"],"input_types":["Dataset identifier string (e.g., 'wikitext', 'mnist')","Local file paths","Remote URLs","Custom Python builder scripts"],"output_types":["Dataset (for single split)","DatasetDict (for multiple splits like train/test)","IterableDataset (for streaming mode)"],"categories":["data-processing-analysis","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_3","uri":"capability://data.processing.analysis.lazy.transformation.compilation.with.fingerprinting.and.caching","name":"lazy transformation compilation with fingerprinting and caching","description":"The map(), filter(), and select() operations compile transformations into a computation graph that is executed lazily, with each operation assigned a deterministic fingerprint based on the function code and input dataset state. This fingerprinting system enables automatic caching of intermediate results; if the same transformation is applied twice, the cached result is reused. The architecture stores transformation metadata (function hash, parameters) alongside cached data, enabling reproducibility and avoiding redundant computation across runs.","intents":["Apply a preprocessing function to a dataset and automatically cache the result to avoid recomputation on subsequent runs","Chain multiple transformations (tokenization → padding → batching) and only recompute the parts that changed","Verify that two datasets are equivalent by comparing their fingerprints without materializing both"],"best_for":["Data scientists iterating on preprocessing pipelines who want to avoid recomputing expensive transformations","ML engineers building reproducible training pipelines with deterministic data processing","Teams sharing preprocessed datasets where fingerprints serve as integrity checksums"],"limitations":["Fingerprinting requires function code to be serializable; lambda functions and closures may not fingerprint correctly, requiring explicit function definitions","Cache invalidation is automatic but can be surprising; changing function behavior without changing code (e.g., external state) won't invalidate cache","Fingerprinting adds overhead (~10-50ms per operation) for small datasets where caching benefit is minimal"],"requires":["Python 3.8+","Writable cache directory (default ~/.cache/huggingface/datasets)","Deterministic transformation functions (no randomness or external state)"],"input_types":["Dataset or DatasetDict","Python callable (function or lambda)","Column names (for select)"],"output_types":["Dataset with cached transformation results","Fingerprint hash string","Metadata JSON with transformation history"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_4","uri":"capability://data.processing.analysis.feature.type.system.with.schema.validation.and.media.encoding.decoding","name":"feature type system with schema validation and media encoding/decoding","description":"The Features class defines a schema for dataset columns with support for primitive types (int, string, float), nested structures (sequences, dicts), and media types (Image, Audio, Video). Each feature type includes encoding logic (serialization to Arrow format) and decoding logic (deserialization to Python objects or framework-specific formats). The system validates data against the schema during loading and provides automatic type conversion, ensuring type safety across the data pipeline.","intents":["Define a dataset schema with image and audio columns that automatically decode media files to PIL Images or librosa arrays","Validate that loaded data matches the expected schema and catch type mismatches early","Convert between storage formats (e.g., JPEG bytes in Parquet) and in-memory formats (PIL Image) transparently"],"best_for":["Computer vision teams working with image datasets that need automatic format conversion","Audio processing pipelines that need to handle multiple audio codecs and sample rates","Data engineers building data validation pipelines with strict schema enforcement"],"limitations":["Media decoding is eager by default; large images/audio files are decoded into memory, potentially causing OOM for high-resolution datasets","Custom feature types require subclassing Feature base class; no declarative schema language like Avro or Protobuf","Type conversion is automatic but may fail silently for edge cases (e.g., corrupted image files); error handling is minimal"],"requires":["Python 3.8+","Pillow (for Image features)","librosa or soundfile (for Audio features)","ffmpeg (for Video features, optional)"],"input_types":["Python type annotations","Feature class instances","JSON schema-like dictionaries"],"output_types":["PyArrow schema","Validated Python objects","PIL Images, NumPy arrays, librosa audio arrays"],"categories":["data-processing-analysis","image-visual"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_5","uri":"capability://automation.workflow.distributed.dataset.processing.with.worker.sharding.and.synchronization","name":"distributed dataset processing with worker sharding and synchronization","description":"The distributed module enables parallel processing of datasets across multiple workers (processes or machines) by automatically sharding data and coordinating transformations. Each worker receives a subset of the dataset based on its rank and world size, with built-in synchronization primitives to ensure consistent state across workers. The system handles distributed map operations, aggregations, and shuffle operations while managing communication overhead and load balancing.","intents":["Apply a slow preprocessing function to a large dataset by distributing the work across 8 GPUs, with each GPU processing a different shard","Aggregate statistics across a distributed dataset (e.g., compute mean/std) without materializing the full dataset on a single machine","Perform distributed shuffling of a dataset across multiple workers while maintaining reproducibility"],"best_for":["ML teams training on multi-GPU setups who need efficient data loading across workers","Data processing pipelines that need to scale beyond single-machine memory limits","Distributed training frameworks (PyTorch DDP, TensorFlow distributed) that need coordinated data loading"],"limitations":["Requires explicit worker coordination; no automatic discovery of available workers (must be configured via environment variables or config)","Distributed operations add communication overhead; effective speedup depends on computation-to-communication ratio","Shuffle operations require inter-worker communication; no built-in optimization for shuffle-heavy workloads"],"requires":["Python 3.8+","Distributed training framework (PyTorch DDP, Horovod, or manual process spawning)","Environment variables: RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT"],"input_types":["Dataset or DatasetDict","Transformation functions","Aggregation functions"],"output_types":["Sharded Dataset (one per worker)","Aggregated results (gathered on rank 0)","Distributed statistics"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_6","uri":"capability://search.retrieval.semantic.search.and.vector.indexing.with.faiss.and.elasticsearch.backends","name":"semantic search and vector indexing with faiss and elasticsearch backends","description":"The search module enables semantic search over datasets by building vector indices using Faiss (for in-memory similarity search) or Elasticsearch (for distributed search). The system computes embeddings for specified columns (via a user-provided embedding function), stores them in the index, and provides efficient nearest-neighbor retrieval. The architecture abstracts the underlying index backend, allowing seamless switching between Faiss (fast, single-machine) and Elasticsearch (distributed, persistent).","intents":["Build a semantic search index over a dataset of documents by embedding them with a pre-trained model and retrieve top-k similar documents","Find duplicate or near-duplicate examples in a dataset by computing embeddings and searching for neighbors with high similarity","Implement a retrieval-augmented generation (RAG) system where documents are indexed and retrieved based on query embeddings"],"best_for":["NLP teams building semantic search systems over document collections","Data quality engineers detecting duplicates or near-duplicates in datasets","ML engineers implementing RAG pipelines that need efficient document retrieval"],"limitations":["Embedding computation is a bottleneck; indexing large datasets requires significant time and memory for embedding storage","Faiss backend is single-machine only; Elasticsearch requires separate infrastructure setup and maintenance","Index updates require full recomputation; incremental index updates are not supported"],"requires":["Python 3.8+","Faiss (for in-memory search) or Elasticsearch (for distributed search)","Embedding model (e.g., sentence-transformers, OpenAI embeddings)","Sufficient RAM for Faiss indices (~4 bytes per dimension per example)"],"input_types":["Dataset with text or other columns to embed","Embedding function (callable returning vector)","Query vector or text"],"output_types":["Faiss Index or Elasticsearch index","Top-k similar examples with scores","Similarity scores"],"categories":["search-retrieval","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_7","uri":"capability://tool.use.integration.framework.specific.formatters.for.pytorch.tensorflow.jax.and.numpy","name":"framework-specific formatters for pytorch, tensorflow, jax, and numpy","description":"The set_format() method configures how dataset examples are returned when iterating, with specialized formatters for PyTorch (returns torch.Tensor), TensorFlow (returns tf.Tensor), JAX (returns jax.numpy arrays), and NumPy (returns NumPy arrays). Each formatter handles type conversion, batching, and padding automatically, enabling seamless integration with framework-specific training loops. The system maintains the underlying Arrow storage while providing framework-specific views on demand.","intents":["Configure a dataset to return PyTorch tensors automatically when iterating, without manual conversion in the training loop","Use the same dataset object with both PyTorch and TensorFlow models by switching formatters","Automatically batch and pad variable-length sequences (e.g., tokenized text) to framework-specific tensor formats"],"best_for":["ML engineers building training loops that need framework-specific tensor formats without boilerplate conversion","Teams using multiple frameworks (PyTorch + TensorFlow) that want a unified dataset interface","Researchers prototyping models across frameworks with minimal code changes"],"limitations":["Formatter conversion adds latency (~1-5ms per batch) for type conversion and padding; not suitable for extremely high-throughput scenarios","Padding logic is basic (zero-padding); complex padding strategies require custom formatters","Formatter state is global per dataset; switching formatters mid-iteration can cause unexpected behavior"],"requires":["Python 3.8+","PyTorch, TensorFlow, JAX, or NumPy (depending on desired formatter)","Columns must be compatible with target framework types"],"input_types":["Dataset with numeric or sequence columns","Formatter type string ('torch', 'tensorflow', 'jax', 'numpy')"],"output_types":["torch.Tensor","tf.Tensor","jax.numpy.ndarray","NumPy ndarray"],"categories":["tool-use-integration","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_8","uri":"capability://automation.workflow.dataset.versioning.and.hub.repository.management.with.git.based.tracking","name":"dataset versioning and hub repository management with git-based tracking","description":"The push_to_hub() method uploads datasets to Hugging Face Hub repositories with automatic Git-based version control, enabling dataset versioning, branching, and collaboration. The system manages dataset files (Parquet, metadata) as Git LFS objects, tracks changes across versions, and provides dataset cards (documentation) with standardized metadata. The architecture integrates with the Hub API for authentication, access control, and dataset discoverability.","intents":["Upload a preprocessed dataset to Hugging Face Hub so team members can load it with a single line of code","Version a dataset across multiple iterations (v1.0, v1.1, v2.0) with Git-based change tracking and rollback capability","Create a dataset card with metadata (description, license, citations) that appears on the Hub for discoverability"],"best_for":["Dataset maintainers publishing datasets for community use on Hugging Face Hub","Teams collaborating on dataset curation with version control and change tracking","Researchers sharing preprocessed datasets alongside papers for reproducibility"],"limitations":["Requires Hugging Face Hub account and authentication; no support for other dataset repositories","Large datasets (> 5GB) may have slow upload times; no built-in resumable upload or parallel transfer","Git LFS has storage limits on free Hub accounts; large datasets may require paid storage"],"requires":["Python 3.8+","Hugging Face account with Hub write access","Git and Git LFS installed locally","HF_TOKEN environment variable or huggingface-cli login"],"input_types":["Dataset or DatasetDict","Dataset card metadata (YAML or Markdown)","Private/public visibility flag"],"output_types":["Hub repository URL","Dataset card page","Git commit hash"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"pypi_pypi-datasets__cap_9","uri":"capability://data.processing.analysis.batch.processing.with.configurable.batch.sizes.and.dynamic.padding","name":"batch processing with configurable batch sizes and dynamic padding","description":"The batch() method groups dataset examples into fixed-size batches with optional dynamic padding to handle variable-length sequences. The system supports both static batching (fixed batch size) and dynamic batching (variable batch size based on token count), with automatic padding to the maximum length in each batch. The implementation integrates with the formatter system to return batches in framework-specific formats (PyTorch, TensorFlow, etc.).","intents":["Group dataset examples into batches of size 32 for training, with automatic padding of variable-length sequences to the max length in each batch","Implement dynamic batching where batch size varies based on total token count, maximizing GPU utilization for variable-length sequences","Create batches with custom collation logic (e.g., sorting by length before batching) for efficient processing"],"best_for":["NLP teams training on variable-length sequences (text, tokens) that need efficient batching with padding","ML engineers optimizing GPU utilization through dynamic batching strategies","Data scientists building data loaders with custom collation logic"],"limitations":["Padding is applied per-batch, not globally; different batches may have different padding amounts, affecting reproducibility","Dynamic batching requires pre-computed sequence lengths; no automatic length inference for complex data types","Batch size is fixed after configuration; no adaptive batching based on available memory"],"requires":["Python 3.8+","Dataset with numeric or sequence columns","Optional: custom collate function for advanced batching logic"],"input_types":["Dataset","Batch size (integer)","Drop last flag (boolean)","Custom collate function (optional)"],"output_types":["Batched Dataset","Batches as dictionaries or framework tensors","Batch metadata (batch size, padding info)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":26,"verified":false,"data_access_risk":"high","permissions":["Python 3.8+","PyArrow 1.0+","Sufficient RAM to hold dataset in memory","Network connectivity for remote datasets","Configurable buffer size based on available RAM (default 1000 examples)","Files accessible via local filesystem or remote URLs","Glob pattern matching support","Dataset with sufficient examples for desired split ratios","Optional: label column for stratified splitting","Dataset object or metadata dictionary"],"failure_modes":["Entire dataset must fit in available RAM; no built-in out-of-core processing for datasets > system memory","Transformations are lazy but still require full dataset scan on first execution, adding latency for large datasets","Arrow Table schema is immutable after creation; schema changes require full dataset rewrite","No random access to dataset elements; iteration is strictly sequential, preventing shuffling without buffering entire dataset","Streaming introduces network latency; effective throughput depends on download speed and buffer size tuning","Reproducibility requires explicit seed management; default behavior may not guarantee deterministic ordering across runs","Pattern matching is glob-based; complex file naming schemes may require custom matching logic","File discovery is eager; listing thousands of files can be slow on network file systems","No built-in support for compressed archives (tar.gz, zip); files must be extracted first","Stratification requires a label column; no automatic stratification for multi-label or regression tasks","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.05,"quality":0.35,"ecosystem":0.52,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.3,"quality":0.25,"ecosystem":0.1,"match_graph":0.3,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:05.295Z","last_scraped_at":"2026-05-03T15:20:15.343Z","last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=pypi-datasets","compare_url":"https://unfragile.ai/compare?artifact=pypi-datasets"}},"signature":"UFtYUtGheia+PJywG4tqW5KFKdkCg9Aol7FyUrh9krEdGuKdS7mshiGCkhr6wN23lqHrXuub0GrPFCjqxTgvBA==","signedAt":"2026-06-21T15:07:19.884Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/pypi-datasets","artifact":"https://unfragile.ai/pypi-datasets","verify":"https://unfragile.ai/api/v1/verify?slug=pypi-datasets","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}