{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"mbpp","slug":"mbpp","name":"MBPP+","type":"benchmark","url":"https://github.com/evalplus/evalplus","page_url":"https://unfragile.ai/mbpp","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"mbpp__cap_0","uri":"capability://data.processing.analysis.extended.test.case.generation.with.35x.multiplier.for.python.code.evaluation","name":"extended test case generation with 35x multiplier for python code evaluation","description":"Generates augmented test suites for MBPP problems by creating 35x more test cases than the original benchmark through systematic edge-case and boundary-condition generation. The system maintains structured metadata for each problem including base_input (original tests), plus_input (extended tests), contract (input validation constraints), atol (floating-point tolerance), canonical_solution (ground truth), and entry_point (function name). This architectural separation enables rigorous detection of fragile solutions that pass shallow tests but fail on edge cases, addressing the fundamental limitation that original MBPP's ~3 tests per task miss correctness issues.","intents":["Evaluate code generation models with higher rigor to catch solutions that pass original tests but are fundamentally incorrect","Identify fragile implementations that work on common cases but fail on edge cases and boundary conditions","Benchmark LLM code quality against a more comprehensive test surface than existing datasets","Create reproducible evaluation datasets that expose model weaknesses in handling corner cases"],"best_for":["ML researchers evaluating code generation models (Codex, GPT-4, Claude, etc.)","Teams building code synthesis systems who need rigorous correctness metrics","Benchmark maintainers seeking to improve evaluation signal beyond shallow test coverage"],"limitations":["Test generation is Python-specific; no support for other languages in MBPP+","Extended tests may have higher variance in execution time, requiring dynamic timeout calculation","Test case generation quality depends on the canonical_solution correctness; bugs in ground truth propagate","Floating-point comparison requires manual atol specification per problem, adding maintenance overhead"],"requires":["Python 3.7+","Access to MBPP+ dataset (378 problems with extended test suites)","Canonical solutions for each problem to establish ground truth behavior"],"input_types":["Python function implementations (as strings or AST)","Problem specifications with entry_point function names","Input validation contracts and tolerance parameters"],"output_types":["Test execution results (pass/fail per test case)","Structured test metadata (base_input, plus_input, contract, atol)","Pass@k metrics aggregated across test suites"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_1","uri":"capability://safety.moderation.safe.isolated.execution.of.untrusted.llm.generated.code.with.multi.layer.resource.guards","name":"safe isolated execution of untrusted llm-generated code with multi-layer resource guards","description":"Executes arbitrary Python code generated by LLMs in isolated processes with enforced resource limits and system call restrictions to prevent malicious or buggy code from crashing the evaluation framework. The untrusted_check function spawns separate processes via multiprocessing with shared memory IPC, applies memory limits (default 4GB via EVALPLUS_MAX_MEMORY_BYTES environment variable), dynamically calculated time limits based on ground truth execution time, I/O suppression via swallow_io to prevent output pollution, and reliability_guard to disable dangerous system calls. This architecture prevents code injection, infinite loops, memory exhaustion, and filesystem access while maintaining execution fidelity for correctness evaluation.","intents":["Safely execute untrusted code from LLM models without risking framework stability or security","Isolate test execution failures so one buggy solution doesn't crash the entire evaluation run","Measure code correctness without interference from stdout/stderr pollution or side effects","Enforce resource constraints to prevent denial-of-service attacks via infinite loops or memory bombs"],"best_for":["Evaluation frameworks processing code from untrusted sources (LLM outputs, user submissions)","CI/CD pipelines that need to safely execute generated code without manual review","Researchers benchmarking multiple models where some outputs may be adversarial or buggy"],"limitations":["Process isolation adds ~50-200ms overhead per execution due to IPC and process spawning","Memory limits are coarse-grained (per-process, not per-function); cannot track memory per code block","Time limits are dynamically calculated from canonical_solution, which may be suboptimal for slow reference implementations","I/O suppression prevents legitimate logging/debugging output; no fine-grained output filtering","Reliability_guard disables system calls but may be overly restrictive for code that needs file I/O or networking"],"requires":["Python 3.7+ with multiprocessing support","Linux/Unix OS (resource guards and per-process memory limits are Unix-specific)","EVALPLUS_MAX_MEMORY_BYTES environment variable (optional; defaults to 4GB)","Ground truth execution times for dynamic timeout calculation"],"input_types":["Python function code as strings","Test inputs (arguments to pass to the function)","Expected outputs for comparison","Timeout and memory limit parameters"],"output_types":["Execution result (pass/fail/timeout/memory-exceeded/error)","Actual output from the function","Execution time and resource usage metrics","Error messages and stack traces (if execution failed)"],"categories":["safety-moderation","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_2","uri":"capability://code.generation.editing.code.sanitization.and.normalization.for.consistent.evaluation.across.llm.outputs","name":"code sanitization and normalization for consistent evaluation across llm outputs","description":"Preprocesses LLM-generated code to normalize formatting, remove extraneous content, and extract the target function before execution. The sanitize module (evalplus/sanitize.py) handles variable formatting inconsistencies, removes comments and docstrings that may interfere with parsing, extracts the function matching the entry_point name, and validates syntax before execution. This ensures that evaluation results reflect code correctness rather than formatting quirks or LLM hallucinations like extra imports or wrapper code. The sanitization pipeline is essential because different LLMs produce code with different indentation, naming conventions, and structural patterns that would otherwise cause false negatives.","intents":["Normalize code from different LLM models so evaluation is fair and consistent across providers","Extract the target function from LLM outputs that may include imports, helper functions, or explanatory text","Remove formatting variations (indentation, whitespace, comments) that don't affect correctness but break parsing","Validate code syntax before execution to provide clear error messages for malformed outputs"],"best_for":["Evaluation pipelines comparing multiple LLM models with different output formatting conventions","Benchmarks that need to isolate correctness evaluation from code style or presentation","Researchers studying how LLM output quality varies across models and prompts"],"limitations":["Sanitization may remove legitimate code (e.g., necessary imports or helper functions) if they're not recognized as part of the target function","Cannot handle code with syntax errors; requires valid Python syntax after sanitization","Docstring removal may lose important context about function behavior, though this doesn't affect execution","Entry point matching is name-based; cannot disambiguate if multiple functions have the same name","No support for non-Python languages; sanitization is Python-specific"],"requires":["Python 3.7+","AST parsing library (built-in ast module)","Valid Python syntax in the input code (after sanitization)","Correct entry_point function name specification"],"input_types":["Raw LLM-generated code (as strings)","Entry point function name to extract","Optional: expected function signature for validation"],"output_types":["Sanitized, executable Python code (as string)","Extracted function definition","Validation status (syntax-valid, entry-point-found, etc.)","Error messages if sanitization fails"],"categories":["code-generation-editing","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_3","uri":"capability://tool.use.integration.multi.backend.llm.integration.for.code.generation.with.8.provider.support","name":"multi-backend llm integration for code generation with 8+ provider support","description":"Provides unified interface to generate code from 8+ LLM backends including vLLM, HuggingFace, OpenAI, Anthropic, Google Gemini, AWS Bedrock, and Ollama. The provider architecture (evalplus/provider/) abstracts backend-specific API details behind a common interface, handling authentication, request formatting, response parsing, and error handling for each provider. This enables researchers to benchmark code generation across different models and providers without rewriting evaluation code. The codegen module (evalplus/codegen.py) orchestrates the generation pipeline: problem specification → prompt formatting → LLM call → response extraction → sanitization → evaluation.","intents":["Benchmark code generation quality across multiple LLM models and providers in a single evaluation run","Switch between local models (vLLM, Ollama) and cloud APIs (OpenAI, Anthropic, Gemini) without code changes","Generate multiple code samples per problem (pass@k evaluation) by calling the same model multiple times","Integrate custom LLM providers by implementing the provider interface"],"best_for":["Researchers comparing code generation capabilities across OpenAI, Anthropic, Google, and open-source models","Teams evaluating whether to use cloud APIs or self-hosted models for code generation","Benchmark maintainers who need to support multiple LLM backends without duplicating integration code"],"limitations":["Provider implementations are tightly coupled to each API's authentication and request format; adding new providers requires code changes","Rate limiting and quota management are provider-specific; no unified rate limiting across backends","Response parsing assumes specific output formats; LLM hallucinations or unexpected formats may break parsing","No built-in retry logic or fallback mechanisms if a provider is unavailable","Cost tracking is not integrated; users must manually track API spending across providers"],"requires":["Python 3.7+","Provider-specific credentials: OpenAI API key, Anthropic API key, Google Cloud credentials, AWS credentials, HuggingFace token, or local vLLM/Ollama server","Network access to cloud APIs or local server for self-hosted models","Problem specifications in MBPP+ format (with entry_point and canonical_solution)"],"input_types":["Problem specification (description, entry_point, canonical_solution, test cases)","Model name/identifier (e.g., 'gpt-4', 'claude-3-opus', 'meta-llama/Llama-2-7b')","Generation parameters (temperature, max_tokens, top_p, etc.)","Number of samples to generate (for pass@k evaluation)"],"output_types":["Generated code (as string)","Model metadata (name, provider, generation parameters)","Generation metrics (tokens used, latency, cost if tracked)","Multiple samples per problem (for pass@k calculation)"],"categories":["tool-use-integration","code-generation-editing"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_4","uri":"capability://data.processing.analysis.pass.k.metric.calculation.with.configurable.sample.aggregation","name":"pass@k metric calculation with configurable sample aggregation","description":"Computes pass@k metrics by generating multiple code samples per problem and calculating the probability that at least one sample passes all tests. The metric is calculated as: pass@k = 1 - (C(n-c, k) / C(n, k)) where n is total samples, c is passing samples, and k is the sample count. This enables evaluation of model reliability: pass@1 measures single-shot accuracy, while pass@10 or pass@100 measures whether the model can eventually generate correct code. The framework aggregates results across all problems to produce dataset-level pass@k scores, enabling comparison of models' code generation reliability.","intents":["Measure code generation model reliability by evaluating multiple samples per problem","Compare models fairly using pass@k metrics that account for sampling variance","Determine whether a model can eventually generate correct code even if single-shot accuracy is low","Track improvement in model capability as sample count increases (pass@1 → pass@10 → pass@100)"],"best_for":["Researchers benchmarking code generation models where single-shot accuracy is insufficient","Teams evaluating whether to use best-of-n sampling or iterative refinement strategies","Model developers tracking how model improvements affect pass@k across different sample counts"],"limitations":["Pass@k assumes independence between samples, which may not hold if the model produces correlated errors","Requires generating k samples per problem, multiplying evaluation cost by k (e.g., 10x cost for pass@10)","Pass@k is less interpretable than pass@1 for users who need single-shot accuracy guarantees","Metric is sensitive to k value; pass@100 may be artificially high if k exceeds the number of distinct solutions","No confidence intervals or statistical significance testing built-in; requires external analysis"],"requires":["Multiple code samples per problem (typically 1-100 samples)","Test execution results for each sample (pass/fail per test case)","Sample count k (typically 1, 10, 25, 100)","Problem count n for binomial coefficient calculation"],"input_types":["Execution results for multiple samples per problem (pass/fail boolean)","Sample count k (integer)","Problem specifications (for aggregation across dataset)"],"output_types":["Pass@k score (float 0-1) per problem","Aggregated pass@k score across dataset","Per-model pass@k curves (pass@1, pass@10, pass@25, pass@100, etc.)","Comparison tables across models"],"categories":["data-processing-analysis","planning-reasoning"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_5","uri":"capability://data.processing.analysis.performance.evaluation.via.cpu.instruction.counting.with.evalperf.dataset","name":"performance evaluation via cpu instruction counting with evalperf dataset","description":"Measures code efficiency using CPU instruction counting rather than wall-clock time, enabling reproducible performance evaluation across different hardware. The EvalPerf dataset generates performance-exercising inputs with exponential scaling (2^1 to 2^26 elements) to stress-test algorithmic complexity. The profiling pipeline uses Linux perf counters to measure CPU instructions, filters tasks based on profile size, compute cost, coefficient of variation, and performance clustering to select representative benchmarks. This approach isolates algorithmic efficiency from hardware variance, enabling rigorous comparison of code quality across models and implementations.","intents":["Evaluate code efficiency beyond correctness, measuring algorithmic quality and implementation optimization","Compare models on performance metrics that are reproducible across different hardware","Identify inefficient implementations that pass correctness tests but have poor algorithmic complexity","Benchmark code generation models on both correctness (pass@k) and efficiency (instruction count)"],"best_for":["Researchers evaluating code generation models on both correctness and efficiency","Teams building systems where code performance matters (e.g., competitive programming, embedded systems)","Benchmark maintainers seeking to measure algorithmic quality beyond functional correctness"],"limitations":["Requires Linux with perf counter support; not available on Windows or macOS","CPU instruction counting is hardware-specific; results may vary across CPU architectures","Exponential scaling (2^1 to 2^26) may cause timeout or memory issues for inefficient algorithms","Performance evaluation requires ground truth execution time for timeout calculation, adding overhead","EvalPerf dataset is smaller than MBPP+ (subset of tasks); not all problems have performance benchmarks","Instruction counting overhead (~5-10%) may affect results for very fast code"],"requires":["Linux OS with perf counter support (Linux 2.6.31+)","Python 3.7+","EvalPerf dataset (subset of MBPP+ with performance benchmarks)","Ground truth implementations for timeout calibration","Sufficient memory to handle exponential input scaling (up to 2^26 elements)"],"input_types":["Python function implementations","Performance-exercising inputs (generated with exponential scaling)","Problem specifications with entry_point and canonical_solution","Timeout and memory limit parameters"],"output_types":["CPU instruction count (via perf counters)","Execution time (wall-clock)","Memory usage","Performance metrics (instructions per operation, complexity classification)","Efficiency comparison across models"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_6","uri":"capability://data.processing.analysis.structured.dataset.management.with.problem.metadata.and.test.case.organization","name":"structured dataset management with problem metadata and test case organization","description":"Organizes MBPP+ problems as structured JSON with metadata fields: base_input (original test cases), plus_input (extended test cases), contract (input validation constraints), atol (floating-point tolerance), canonical_solution (ground truth implementation), and entry_point (function name). The dataset management system (evalplus/data/) loads problems from JSON, validates metadata consistency, and provides programmatic access to test cases and solutions. This structured approach enables systematic evaluation: problems can be filtered by category, difficulty, or test coverage; test cases can be aggregated across base and plus inputs; and metadata enables reproducible evaluation across different tools and frameworks.","intents":["Load and access MBPP+ problems programmatically with consistent metadata structure","Filter problems by category, difficulty, or test coverage for targeted evaluation","Aggregate test results across base_input and plus_input for comprehensive correctness assessment","Export evaluation results in standardized formats for comparison and publication"],"best_for":["Researchers building custom evaluation pipelines that need programmatic access to MBPP+ problems","Benchmark maintainers maintaining and versioning the MBPP+ dataset","Teams integrating MBPP+ into CI/CD pipelines or automated testing systems"],"limitations":["Dataset is Python-specific; no support for other languages","Metadata is static; cannot dynamically generate new test cases or update canonical solutions","JSON format may be inefficient for very large datasets; no built-in compression or indexing","No versioning system for dataset updates; breaking changes could affect reproducibility","Contract validation is optional; some problems may have incomplete or incorrect constraints"],"requires":["Python 3.7+","MBPP+ dataset files (JSON format)","JSON parsing library (built-in json module)","Disk space for dataset (~50-100MB for MBPP+ with extended tests)"],"input_types":["MBPP+ dataset files (JSON)","Problem IDs or names for filtering","Optional: category or difficulty filters"],"output_types":["Problem objects with metadata (base_input, plus_input, contract, atol, canonical_solution, entry_point)","Test case lists (aggregated or separated)","Problem statistics (count, coverage, difficulty distribution)","Exported results in standardized formats (JSON, CSV, etc.)"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_7","uri":"capability://automation.workflow.command.line.evaluation.pipeline.with.end.to.end.orchestration","name":"command-line evaluation pipeline with end-to-end orchestration","description":"Provides CLI tools (evalplus.evaluate, evalplus.codegen, evalplus.evalperf, evalplus.sanitize) that orchestrate the complete evaluation workflow: code generation → sanitization → correctness evaluation → optional performance evaluation. The evaluate command executes generated code against MBPP+ test suites with configurable timeouts and memory limits, producing pass@k metrics and detailed result logs. The codegen command generates code from specified LLM providers. The evalperf command measures performance via instruction counting. The sanitize command preprocesses code before evaluation. This modular CLI design enables researchers to run evaluation pipelines without writing custom code, supporting reproducible benchmarking and result sharing.","intents":["Run complete evaluation pipelines from command line without writing custom Python code","Generate code from multiple LLM providers and evaluate in a single command","Reproduce evaluation results across different machines and environments","Share evaluation configurations and results with other researchers"],"best_for":["Researchers who prefer CLI tools over programmatic APIs","Teams running evaluation pipelines in CI/CD systems or batch processing","Benchmark maintainers publishing standardized evaluation commands"],"limitations":["CLI interface is less flexible than programmatic API; complex workflows require shell scripting","Configuration is command-line arguments or config files; no interactive configuration","Error messages may be cryptic for users unfamiliar with the framework","No built-in progress reporting or logging; users must parse output or redirect to files","CLI is Python-specific; cannot be easily integrated into non-Python workflows"],"requires":["Python 3.7+ with evalplus package installed","Provider credentials (API keys, local servers) for code generation","MBPP+ dataset files","Bash or shell for running commands"],"input_types":["Command-line arguments (model name, provider, problem IDs, etc.)","Configuration files (JSON or YAML with evaluation parameters)","Code files or directories to evaluate","Problem specifications (from MBPP+ dataset)"],"output_types":["Evaluation results (pass/fail per problem, pass@k metrics)","Detailed logs (execution times, error messages, resource usage)","Result files (JSON, CSV) for further analysis","Performance metrics (if evalperf is enabled)"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_8","uri":"capability://data.processing.analysis.comprehensive.result.logging.and.visualization.for.evaluation.analysis","name":"comprehensive result logging and visualization for evaluation analysis","description":"Captures detailed execution logs including per-problem pass/fail status, execution times, error messages, resource usage (memory, CPU), and pass@k metrics. Results are exported in structured formats (JSON, CSV) enabling downstream analysis, visualization, and comparison. The logging system tracks execution metadata (model name, provider, generation parameters, timestamp) alongside correctness and performance metrics, enabling reproducible result tracking and publication. Visualization utilities generate comparison tables, pass@k curves, and per-category breakdowns, supporting research communication and model comparison.","intents":["Track detailed evaluation results for reproducibility and publication","Compare models across multiple metrics (pass@k, performance, resource usage)","Visualize evaluation results for research papers and presentations","Debug evaluation failures by examining detailed execution logs"],"best_for":["Researchers publishing evaluation results and needing reproducible logging","Teams comparing multiple models and needing detailed comparison tables","Benchmark maintainers tracking evaluation results over time"],"limitations":["Log files can be large for large-scale evaluations (100s of models × 1000s of problems); requires disk space management","Visualization utilities are basic; complex analysis requires external tools (matplotlib, pandas)","No built-in statistical significance testing; requires external analysis","Result formats are fixed (JSON, CSV); custom formats require post-processing","No built-in result versioning or change tracking; users must manually manage result history"],"requires":["Python 3.7+","Disk space for result logs (~1-10MB per evaluation run)","Optional: pandas, matplotlib for advanced analysis and visualization"],"input_types":["Execution results (pass/fail per problem, execution times, error messages)","Model metadata (name, provider, generation parameters)","Evaluation configuration (timeouts, memory limits, sample counts)"],"output_types":["Structured result files (JSON, CSV)","Comparison tables (models × metrics)","Pass@k curves (sample count vs pass rate)","Per-category breakdowns (problem category × metric)","Visualization plots (if matplotlib is available)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__cap_9","uri":"capability://data.processing.analysis.comprehensive.test.result.aggregation.and.reporting","name":"comprehensive-test-result-aggregation-and-reporting","description":"Aggregates execution results across all 378 problems and k samples to produce comprehensive benchmark metrics: pass@k scores, per-problem pass/fail results, sample-level execution details (timeout, memory exceeded, exception), and statistical summaries (mean, std dev, confidence intervals). Results are organized hierarchically (benchmark → problem → sample) and exported as structured JSON for further analysis and visualization.","intents":["I need to aggregate test results across hundreds of problems and samples","I want to produce pass@k metrics and statistical summaries for benchmark reporting","I need to identify which problems are hardest and which samples failed"],"best_for":["benchmark maintainers producing leaderboard results","researchers analyzing model performance across problem categories","teams generating benchmark reports and visualizations"],"limitations":["Aggregation assumes all problems are equally weighted; no support for weighted metrics","Statistical summaries are basic (mean, std dev); no advanced statistical analysis","No built-in visualization; results are JSON only; requires external tools for charts/graphs","Aggregation is post-hoc; no streaming/incremental result processing","No support for comparing results across different evaluation runs or models"],"requires":["Python 3.8+","Complete execution results for all problems and samples","Problem metadata (for organizing results)"],"input_types":["per-sample execution results (pass/fail, error type, execution time)","problem identifiers"],"output_types":["pass@k scores (float 0.0-1.0)","per-problem results (pass/fail)","statistical summaries (mean, std dev)","JSON report with hierarchical structure"],"categories":["data-processing-analysis","testing-quality"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"mbpp__headline","uri":"capability://testing.quality.rigorous.evaluation.framework.for.llm.generated.code","name":"rigorous evaluation framework for llm-generated code","description":"MBPP+ is an enhanced benchmark for evaluating code generated by large language models, providing 35x more test cases to ensure thorough correctness and performance assessment.","intents":["best code evaluation framework","LLM code testing for accuracy","how to evaluate AI-generated code","most comprehensive benchmarks for code generation","top tools for assessing code quality"],"best_for":["developers testing LLM outputs","researchers validating AI code generation"],"limitations":[],"requires":[],"input_types":[],"output_types":[],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":63,"verified":false,"data_access_risk":"high","permissions":["Python 3.7+","Access to MBPP+ dataset (378 problems with extended test suites)","Canonical solutions for each problem to establish ground truth behavior","Python 3.7+ with multiprocessing support","Linux/Unix OS (resource guards and per-process memory limits are Unix-specific)","EVALPLUS_MAX_MEMORY_BYTES environment variable (optional; defaults to 4GB)","Ground truth execution times for dynamic timeout calculation","AST parsing library (built-in ast module)","Valid Python syntax in the input code (after sanitization)","Correct entry_point function name specification"],"failure_modes":["Test generation is Python-specific; no support for other languages in MBPP+","Extended tests may have higher variance in execution time, requiring dynamic timeout calculation","Test case generation quality depends on the canonical_solution correctness; bugs in ground truth propagate","Floating-point comparison requires manual atol specification per problem, adding maintenance overhead","Process isolation adds ~50-200ms overhead per execution due to IPC and process spawning","Memory limits are coarse-grained (per-process, not per-function); cannot track memory per code block","Time limits are dynamically calculated from canonical_solution, which may be suboptimal for slow reference implementations","I/O suppression prevents legitimate logging/debugging output; no fine-grained output filtering","Reliability_guard disables system calls but may be overly restrictive for code that needs file I/O or networking","Sanitization may remove legitimate code (e.g., necessary imports or helper functions) if they're not recognized as part of the target function","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:04.693Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=mbpp","compare_url":"https://unfragile.ai/compare?artifact=mbpp"}},"signature":"giu9FXJ7A0dr+BY/8Nncp2NAs9aBki2Y6TwW21WziOZI4sQh7d2Hu8AkuRr76O9Dm149NxXBBH3UdLsjXec4Aw==","signedAt":"2026-06-20T04:51:58.736Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/mbpp","artifact":"https://unfragile.ai/mbpp","verify":"https://unfragile.ai/api/v1/verify?slug=mbpp","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}