{"passport":{"unfragile":{"@version":"1.0","version":"2026-05","artifact":{"id":"alpacaeval","slug":"alpacaeval","name":"AlpacaEval","type":"benchmark","url":"https://github.com/tatsu-lab/alpaca_eval","page_url":"https://unfragile.ai/alpacaeval","categories":["testing-quality"],"tags":[],"pricing":{"model":"free","free":true,"starting_price":null},"status":"active","verified":false},"capabilities":[{"id":"alpacaeval__cap_0","uri":"capability://data.processing.analysis.llm.as.judge.pairwise.comparison.with.length.controlled.win.rate","name":"llm-as-judge pairwise comparison with length-controlled win rate","description":"Automatically evaluates instruction-following model outputs by using a judge LLM (GPT-4, Claude, etc.) to perform pairwise comparisons between two model responses on the same instruction. Implements length-controlled win rate calculation that normalizes for output length bias by penalizing verbosity, preventing longer but lower-quality outputs from unfairly winning comparisons. The system uses configurable judge prompts and completion parsers to extract structured win/loss decisions from judge LLM outputs.","intents":["Compare two LLM models on instruction-following ability without human annotation","Rank multiple models against each other using automated pairwise tournaments","Evaluate model quality while controlling for the confound that judges prefer longer outputs","Get reproducible, quantitative scores for model performance across instruction sets"],"best_for":["ML researchers benchmarking instruction-tuned models","Teams evaluating proprietary LLMs without access to human raters","Organizations needing fast (<5 minute) evaluation cycles during model development"],"limitations":["Judge LLM quality directly impacts evaluation validity — weak judges (e.g., smaller open models) show lower correlation with human judgments","Pairwise comparison scales quadratically with model count; evaluating 20 models requires ~190 comparisons","Length-controlled win rate assumes length penalty is uniform across instruction types; some tasks may legitimately require longer responses","Requires API access to a capable judge model (GPT-4, Claude) or local inference infrastructure; cannot use weak local models as judges"],"requires":["Python 3.8+","API key for OpenAI (gpt-4, gpt-3.5-turbo) OR Anthropic (claude-3) OR local model server (vLLM, Ollama)","Model outputs in JSON format with 'instruction', 'output' fields","Instruction dataset (AlpacaEval provides 805 instruction-following examples)"],"input_types":["JSON with model outputs and instructions","CSV/JSONL with instruction-output pairs","Reference outputs (optional, for comparison)"],"output_types":["Win rate scores (0-100)","Pairwise comparison results (JSON with judge decisions)","Leaderboard rankings (CSV/JSON)","Detailed annotations with judge reasoning"],"categories":["data-processing-analysis","evaluation-benchmarking"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_1","uri":"capability://tool.use.integration.multi.provider.judge.model.integration.with.decoder.registry","name":"multi-provider judge model integration with decoder registry","description":"Abstracts interactions with different LLM providers (OpenAI, Anthropic, Hugging Face, vLLM) through a unified Decoder interface and registry system. Each provider has a dedicated decoder class that handles authentication, API calls, response parsing, and caching. The system supports both API-based models (GPT-4, Claude) and local inference engines (vLLM, Ollama), with automatic fallback and retry logic for failed requests.","intents":["Use GPT-4 or Claude as judge without writing provider-specific code","Switch between judge models (e.g., GPT-4 to Claude) by changing a config parameter","Run evaluation locally using open-source models without cloud API costs","Cache judge responses to avoid re-evaluating identical instruction pairs"],"best_for":["Teams with multi-cloud or hybrid infrastructure (some models on OpenAI, others local)","Cost-sensitive organizations wanting to use cheaper open models as judges","Researchers comparing judge quality across different model families"],"limitations":["Local model decoders (vLLM, Ollama) require GPU infrastructure and model weights; adds deployment complexity vs. API-only approach","Cache is in-memory or file-based; no distributed cache support for multi-machine evaluation","Decoder registry is hardcoded in constants.py; adding new providers requires code changes, not configuration","API rate limits from OpenAI/Anthropic can throttle evaluation speed; no built-in queue management"],"requires":["For OpenAI: OPENAI_API_KEY environment variable","For Anthropic: ANTHROPIC_API_KEY environment variable","For Hugging Face: HF_TOKEN environment variable","For vLLM: vLLM server running on localhost:8000 (configurable)","For Ollama: Ollama service running locally"],"input_types":["Model name string (e.g., 'gpt-4', 'claude-3-opus', 'meta-llama/Llama-2-70b')","Decoder configuration YAML with provider-specific parameters"],"output_types":["Judge LLM responses (text)","Parsed completion objects with structured fields","Cached response metadata (timestamp, tokens used, cost)"],"categories":["tool-use-integration","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_10","uri":"capability://data.processing.analysis.model.output.preprocessing.and.validation","name":"model output preprocessing and validation","description":"Validates and preprocesses model outputs before evaluation, including format checking (JSON structure), field validation (required 'instruction' and 'output' fields), and optional cleaning (whitespace normalization, encoding fixes). Detects and reports malformed outputs that would cause evaluation to fail. Supports multiple input formats (JSON, JSONL, CSV) with automatic format detection and conversion to internal representation.","intents":["Validate model outputs before evaluation to catch format errors early","Convert model outputs from various formats (JSON, JSONL, CSV) to evaluation format","Clean up common issues (encoding errors, extra whitespace) without manual intervention","Generate detailed error reports for malformed outputs"],"best_for":["Teams integrating evaluation into model training pipelines with heterogeneous output formats","Organizations with strict data quality requirements","Researchers debugging evaluation failures caused by malformed outputs"],"limitations":["Validation is schema-based; cannot detect semantic errors (e.g., instruction-output mismatch)","Cleaning operations are lossy; aggressive normalization may remove intentional formatting","No automatic format conversion for complex nested structures; only flat JSON/JSONL supported","Validation errors are reported but not automatically fixed; requires manual correction"],"requires":["Model outputs in JSON, JSONL, or CSV format","Schema specification (required fields, data types)"],"input_types":["JSON/JSONL/CSV files with model outputs","Schema specification (optional, uses defaults)"],"output_types":["Validated and cleaned model outputs (JSON)","Validation report with errors and warnings","Conversion metadata (original format, transformations applied)"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_11","uri":"capability://automation.workflow.evaluation.reproducibility.through.configuration.versioning","name":"evaluation reproducibility through configuration versioning","description":"Enables reproducible evaluations by capturing all evaluation parameters (judge model, prompt template, length penalty, random seed) in YAML configuration files that can be version-controlled and shared. Evaluation results include metadata (configuration hash, evaluation date, judge model version) allowing tracing back to exact evaluation setup. Supports loading prior configurations to reproduce historical evaluation runs.","intents":["Reproduce evaluation results from published papers by loading shared configuration files","Track how evaluation methodology changes affect rankings over time","Share evaluation setup with collaborators for consistent benchmarking","Audit evaluation methodology by inspecting configuration files"],"best_for":["Research teams publishing benchmarks and wanting to enable reproducibility","Organizations maintaining internal evaluation standards across teams","Researchers studying how evaluation methodology affects model rankings"],"limitations":["Configuration captures parameters but not judge model weights; different model versions produce different results","Random seed controls sampling but not judge LLM stochasticity; same seed with different judge models produces different results","Configuration files are human-readable but not automatically validated; typos can silently change evaluation behavior","No built-in diff tool; comparing configurations requires manual inspection"],"requires":["YAML configuration file with all evaluation parameters","Version control system (Git) for tracking configuration changes"],"input_types":["YAML configuration file","Configuration version/hash"],"output_types":["Evaluation results with configuration metadata","Configuration diff (if comparing versions)","Reproducibility report"],"categories":["automation-workflow","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_2","uri":"capability://text.generation.language.configurable.judge.prompts.with.completion.parsing","name":"configurable judge prompts with completion parsing","description":"Allows customization of the prompt template used to instruct the judge LLM on how to compare two model outputs. Supports multiple evaluation methodologies (pairwise comparison, ranking, scoring) through different prompt templates stored as YAML configurations. Includes a completion parser system that extracts structured decisions (win/loss/tie) from free-form judge LLM outputs using regex patterns and heuristics, handling cases where the judge outputs ambiguous or malformed responses.","intents":["Customize judge instructions to emphasize specific evaluation criteria (e.g., safety, factuality, helpfulness)","Evaluate models on domain-specific tasks by providing task-specific judge prompts","Handle judge outputs that don't follow a strict format by using flexible parsing rules","Reproduce evaluation results from prior work by loading published judge prompt templates"],"best_for":["Researchers studying how judge prompt wording affects evaluation outcomes","Teams evaluating models on specialized domains (medical, legal, code) with custom criteria","Organizations wanting to audit judge behavior by inspecting and modifying prompts"],"limitations":["Judge prompt quality is not validated; poorly written prompts can introduce systematic bias without detection","Completion parser uses regex and heuristics; ambiguous judge outputs (e.g., 'both are good') may be misparsed as ties when the judge intended a preference","No built-in prompt optimization or A/B testing framework; comparing prompt variants requires manual re-runs","Prompt templates are stored in YAML files; no version control or audit trail for prompt changes"],"requires":["YAML configuration files with 'prompt_template' and 'completion_parser_fn' fields","Judge LLM that can follow instruction-following prompts (GPT-3.5+ or equivalent)","Model outputs and reference outputs in JSON format"],"input_types":["YAML configuration with prompt template (Jinja2 format)","Instruction and two model outputs (strings)","Optional: reference output for comparison"],"output_types":["Parsed judge decision (win/loss/tie)","Raw judge response (unparsed text)","Confidence score (if parser extracts it)"],"categories":["text-generation-language","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_3","uri":"capability://automation.workflow.batch.pairwise.evaluation.with.sampling.and.tournament.modes","name":"batch pairwise evaluation with sampling and tournament modes","description":"Orchestrates evaluation of multiple model pairs through three modes: (1) annotate_pairs() for evaluating pre-specified pairs, (2) annotate_head2head() for comparing two models across all instructions, and (3) annotate_samples() for randomly sampling pairs from a larger set of models. Implements efficient batching of judge requests to reduce API calls, with optional parallel execution across multiple judge instances. Supports tournament-style evaluation where models are ranked through transitive comparisons.","intents":["Evaluate 5+ models against each other without manually specifying all pairs","Run head-to-head comparison between two specific models on a full instruction set","Sample a subset of model pairs to estimate relative rankings without evaluating all combinations","Parallelize evaluation across multiple judge instances to reduce wall-clock time"],"best_for":["ML teams with 5-20 models to rank (pairwise comparison becomes expensive beyond 20)","Researchers studying model comparison transitivity (does A > B and B > C imply A > C?)","Organizations with budget constraints wanting to sample rather than exhaustively compare"],"limitations":["Sampling mode introduces variance in rankings; results are not deterministic without fixing random seed","Pairwise comparison is not transitive in practice; A > B and B > C does not guarantee A > C due to judge inconsistency","Batching reduces API calls but increases latency per batch; optimal batch size depends on judge model and network","No built-in statistical significance testing; cannot determine if ranking differences are meaningful"],"requires":["List of model outputs (one per model, per instruction)","Judge model configured and authenticated","Instruction dataset (typically 100-1000 instructions)"],"input_types":["JSON with model outputs: {instruction, model_a_output, model_b_output}","List of model names to compare","Sampling parameters (number of pairs, random seed)"],"output_types":["Win rate matrix (model_a vs model_b)","Ranked leaderboard with win rates","Detailed comparison results with judge reasoning"],"categories":["automation-workflow","data-processing-analysis"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_4","uri":"capability://data.processing.analysis.length.controlled.win.rate.metric.calculation","name":"length-controlled win rate metric calculation","description":"Computes a length-adjusted win rate that penalizes longer outputs to control for length bias. The metric applies a configurable length penalty function (e.g., exponential decay) to the raw win rate based on the difference in output lengths between the two models being compared. Implemented in the metrics calculation pipeline, this allows fair comparison between verbose and concise models by normalizing for the confound that judges tend to prefer longer responses.","intents":["Compare a verbose model against a concise model fairly without length bias","Measure model quality independent of verbosity","Identify models that achieve high quality with shorter outputs (efficiency metric)","Reproduce AlpacaEval leaderboard rankings which use length-controlled win rate"],"best_for":["Researchers studying the relationship between output length and quality","Teams optimizing for inference speed and want to reward concise models","Organizations wanting fair comparison across models with different verbosity tendencies"],"limitations":["Length penalty is uniform across all instructions; some tasks legitimately require longer responses (e.g., detailed explanations) and are penalized unfairly","Penalty function is configurable but not adaptive; no automatic tuning based on instruction type","Length is measured in tokens; different tokenizers (GPT-3.5 vs Claude) produce different token counts, affecting comparisons across judge models","No confidence intervals or significance testing; cannot determine if length-adjusted difference is statistically meaningful"],"requires":["Raw pairwise comparison results (win/loss/tie for each pair)","Output lengths for both models (in tokens or characters)","Length penalty function configuration (e.g., exponential decay rate)"],"input_types":["Pairwise comparison results JSON","Model output lengths (tokens or characters)","Length penalty parameters (decay rate, max penalty)"],"output_types":["Length-controlled win rate (0-100)","Length penalty applied (percentage)","Adjusted vs raw win rate comparison"],"categories":["data-processing-analysis","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_5","uri":"capability://data.processing.analysis.leaderboard.generation.and.export.with.ranking.statistics","name":"leaderboard generation and export with ranking statistics","description":"Aggregates pairwise comparison results into ranked leaderboards showing each model's win rate, number of comparisons, and ranking position. Supports multiple export formats (CSV, JSON, HTML) and includes statistical summaries (mean win rate, standard deviation, confidence intervals). The leaderboard system handles ties and incomplete comparisons, and can generate both overall rankings and per-category breakdowns (e.g., by instruction type or difficulty).","intents":["Generate a public leaderboard showing model rankings and win rates","Export evaluation results in formats compatible with papers and reports","Compare model performance across different instruction categories","Track model rankings over time as new models are evaluated"],"best_for":["Research teams publishing benchmarks and wanting to share results publicly","Organizations maintaining internal model leaderboards for stakeholder communication","Researchers studying how leaderboard design affects model selection decisions"],"limitations":["Leaderboard rankings are not transitive; model A can rank higher than B, B higher than C, but C higher than A due to judge inconsistency","No built-in handling of evaluation date/version; cannot easily track how rankings change over time","Export formats are static snapshots; no built-in versioning or diff tracking for leaderboard changes","Per-category breakdowns require manual annotation of instructions; no automatic categorization"],"requires":["Pairwise comparison results for all model pairs","Model metadata (name, organization, date evaluated)","Optional: instruction categories for per-category breakdowns"],"input_types":["Pairwise comparison results JSON","Model metadata (name, version, date)","Instruction categories (optional)"],"output_types":["CSV leaderboard (model, win_rate, num_comparisons, rank)","JSON leaderboard with full metadata","HTML leaderboard for web display","Per-category breakdowns (CSV/JSON)"],"categories":["data-processing-analysis","automation-workflow"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_6","uri":"capability://automation.workflow.cli.interface.for.end.to.end.evaluation.pipeline","name":"cli interface for end-to-end evaluation pipeline","description":"Provides command-line interface for running complete evaluation workflows from model outputs to leaderboard generation. The CLI accepts configuration files (YAML) specifying model paths, judge settings, evaluation mode, and output options. Implements a main.py entry point that orchestrates the full pipeline: loading model outputs, running pairwise comparisons, calculating metrics, and exporting results. Supports both interactive and batch modes for integration into CI/CD workflows.","intents":["Run evaluation from command line without writing Python code","Integrate evaluation into CI/CD pipelines to automatically benchmark new model versions","Reproduce evaluation results by sharing configuration files","Parallelize evaluation across multiple machines using configuration-driven setup"],"best_for":["ML engineers integrating evaluation into model training pipelines","Teams wanting reproducible evaluation without custom scripting","Organizations running scheduled evaluations (e.g., nightly benchmarks)"],"limitations":["CLI is configuration-file-driven; complex custom evaluation logic requires Python scripting","No built-in distributed evaluation; parallelization requires external orchestration (e.g., Kubernetes)","Error handling is basic; failures in one comparison can halt the entire pipeline","No interactive progress monitoring; long evaluations provide minimal feedback until completion"],"requires":["Python 3.8+","AlpacaEval installed (pip install alpaca-eval)","YAML configuration file with model paths and evaluation settings","Model outputs in JSON format"],"input_types":["YAML configuration file","JSON files with model outputs","Model names and paths"],"output_types":["Leaderboard CSV/JSON","Detailed comparison results","Evaluation logs and statistics"],"categories":["automation-workflow","tool-use-integration"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_7","uri":"capability://data.processing.analysis.instruction.dataset.management.with.built.in.alpacaeval.benchmark","name":"instruction dataset management with built-in alpacaeval benchmark","description":"Provides a curated dataset of 805 instruction-following examples designed to evaluate general-purpose LLM instruction-following ability. The dataset is included with the package and can be loaded programmatically or via CLI. Includes instructions across diverse categories (writing, math, coding, reasoning) with varying difficulty levels. Supports custom instruction datasets by accepting JSON/JSONL files with 'instruction' and optional 'reference_output' fields.","intents":["Evaluate models on a standard benchmark for reproducible comparisons","Use custom instruction datasets for domain-specific evaluation","Analyze model performance across instruction categories","Benchmark against published AlpacaEval leaderboard results"],"best_for":["Researchers wanting to compare results against published AlpacaEval leaderboards","Teams evaluating general-purpose instruction-following ability","Organizations with domain-specific instructions wanting to extend AlpacaEval"],"limitations":["Built-in dataset is English-only; no multilingual evaluation support","Dataset is fixed at 805 examples; no automatic expansion or dynamic dataset generation","Instructions are general-purpose; specialized domains (medical, legal) may need custom datasets","No built-in instruction quality validation; custom datasets may contain low-quality or ambiguous instructions"],"requires":["AlpacaEval package installed","For custom datasets: JSON/JSONL file with 'instruction' field"],"input_types":["Built-in dataset (automatic)","Custom JSON/JSONL with instruction field","Optional: reference outputs for comparison"],"output_types":["Instruction list (JSON)","Per-instruction evaluation results","Category-wise performance breakdown"],"categories":["data-processing-analysis","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_8","uri":"capability://automation.workflow.caching.system.for.judge.responses.with.deduplication","name":"caching system for judge responses with deduplication","description":"Implements a file-based cache that stores judge LLM responses to avoid re-evaluating identical instruction pairs. The cache uses instruction and model output hashes as keys, enabling deduplication across multiple evaluation runs. When a cached result is found, the system returns the cached judgment without calling the judge LLM, reducing API costs and latency. Cache can be cleared or inspected via CLI commands.","intents":["Avoid re-evaluating the same model pair when running evaluation multiple times","Reduce API costs by caching expensive judge LLM calls","Speed up evaluation by reusing cached judgments from prior runs","Inspect cached results to debug judge behavior"],"best_for":["Teams running frequent evaluations on overlapping model sets","Cost-conscious organizations wanting to minimize API spend","Researchers iterating on evaluation methodology and reusing judge responses"],"limitations":["Cache is file-based and local; no distributed cache support for multi-machine evaluation","Cache keys are based on instruction and output hashes; any change to instruction text invalidates cache","No automatic cache invalidation; stale cached results can persist if judge model is updated","Cache size grows unbounded; no built-in cleanup or eviction policy"],"requires":["Writable filesystem for cache storage","Cache directory path (default: ~/.cache/alpaca_eval)"],"input_types":["Instruction text","Model outputs (both models being compared)"],"output_types":["Cached judge response (if hit)","Cache metadata (timestamp, judge model, cost)"],"categories":["automation-workflow","memory-knowledge"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__cap_9","uri":"capability://automation.workflow.retry.logic.and.error.handling.for.judge.api.calls","name":"retry logic and error handling for judge api calls","description":"Implements exponential backoff retry logic for failed judge API calls, with configurable retry counts and backoff parameters. Handles common failure modes: rate limiting (429), temporary service unavailability (5xx), and network timeouts. Failed requests are logged with context (instruction, models, error details) for debugging. Supports graceful degradation where partial evaluation results are returned if some comparisons fail.","intents":["Automatically recover from transient API failures without manual intervention","Handle rate limiting from judge LLM providers gracefully","Debug evaluation failures by inspecting error logs","Continue evaluation even if some comparisons fail"],"best_for":["Teams running long-running evaluations vulnerable to transient failures","Organizations with strict API rate limits requiring backoff strategies","Researchers needing robust evaluation pipelines that don't fail on single errors"],"limitations":["Exponential backoff can significantly increase evaluation time if rate limits are hit frequently","Retry logic is applied per-request; no global rate limiting across multiple evaluation runs","Failed comparisons are logged but not automatically retried in subsequent runs; requires manual re-run","No circuit breaker pattern; continuous failures will exhaust retry budget and halt evaluation"],"requires":["Judge API credentials (OpenAI, Anthropic, etc.)","Network connectivity to judge API"],"input_types":["Judge API request (instruction, model outputs)","Retry configuration (max retries, backoff factor)"],"output_types":["Judge response (on success)","Error log with retry details (on failure)","Partial results (if graceful degradation enabled)"],"categories":["automation-workflow","safety-moderation"],"confidence":0.5,"matches":0,"success_rate":0},{"id":"alpacaeval__headline","uri":"capability://testing.quality.automated.evaluation.framework.for.instruction.following.llms","name":"automated evaluation framework for instruction-following llms","description":"AlpacaEval is an automated evaluation framework specifically designed for instruction-following language models, utilizing LLMs as judges to provide scalable and cost-effective evaluations while controlling for length bias.","intents":["best automated evaluation framework","LLM evaluation for instruction-following tasks","how to evaluate language models automatically","cost-effective LLM evaluation solutions","scalable evaluation frameworks for AI models"],"best_for":["researchers evaluating LLMs","developers needing quick model assessments"],"limitations":["requires LLMs for evaluation","may not suit non-instruction tasks"],"requires":["access to instruction-following LLMs"],"input_types":["model outputs","reference outputs"],"output_types":["evaluation scores","rankings"],"categories":["testing-quality"],"confidence":0.5,"matches":0,"success_rate":0}],"trust":{"score":63,"verified":false,"data_access_risk":"high","permissions":["Python 3.8+","API key for OpenAI (gpt-4, gpt-3.5-turbo) OR Anthropic (claude-3) OR local model server (vLLM, Ollama)","Model outputs in JSON format with 'instruction', 'output' fields","Instruction dataset (AlpacaEval provides 805 instruction-following examples)","For OpenAI: OPENAI_API_KEY environment variable","For Anthropic: ANTHROPIC_API_KEY environment variable","For Hugging Face: HF_TOKEN environment variable","For vLLM: vLLM server running on localhost:8000 (configurable)","For Ollama: Ollama service running locally","Model outputs in JSON, JSONL, or CSV format"],"failure_modes":["Judge LLM quality directly impacts evaluation validity — weak judges (e.g., smaller open models) show lower correlation with human judgments","Pairwise comparison scales quadratically with model count; evaluating 20 models requires ~190 comparisons","Length-controlled win rate assumes length penalty is uniform across instruction types; some tasks may legitimately require longer responses","Requires API access to a capable judge model (GPT-4, Claude) or local inference infrastructure; cannot use weak local models as judges","Local model decoders (vLLM, Ollama) require GPU infrastructure and model weights; adds deployment complexity vs. API-only approach","Cache is in-memory or file-based; no distributed cache support for multi-machine evaluation","Decoder registry is hardcoded in constants.py; adding new providers requires code changes, not configuration","API rate limits from OpenAI/Anthropic can throttle evaluation speed; no built-in queue management","Validation is schema-based; cannot detect semantic errors (e.g., instruction-output mismatch)","Cleaning operations are lossy; aggressive normalization may remove intentional formatting","builder identity is not verified yet","no observed match outcomes yet"],"rank_breakdown":{"adoption":0.7,"quality":0.9,"ecosystem":0.39999999999999997,"match_graph":0.25,"freshness":0.52,"weights":{"adoption":0.25,"quality":0.35,"ecosystem":0.15,"match_graph":0.2,"freshness":0.05}},"observed_outcomes":{"matches":0,"success_rate":0,"avg_confidence":0,"top_intents":[],"last_matched_at":null},"maintenance":{"status":"active","updated_at":"2026-06-17T09:51:02.370Z","last_scraped_at":null,"last_commit":null},"community":{"stars":null,"forks":null,"weekly_downloads":null,"model_downloads":null,"model_likes":null}},"distribution":{"claim_url":"https://unfragile.ai/submit?claim=alpacaeval","compare_url":"https://unfragile.ai/compare?artifact=alpacaeval"}},"signature":"t1c2yZoVqN8WDyaYGSzLwtVJgv7ygRqE0KZodxGmEcRsqd4g3Hs92Qye2XNVe0Pe4hjWvLOx+D6B53eY3oQCAw==","signedAt":"2026-06-22T09:15:10.093Z","signedBy":"unfragile.ai","version":1},"_links":{"self":"https://unfragile.ai/api/v1/passport/alpacaeval","artifact":"https://unfragile.ai/alpacaeval","verify":"https://unfragile.ai/api/v1/verify?slug=alpacaeval","publicKey":"https://unfragile.ai/api/v1/trust-passport-public-key","spec":"https://unfragile.ai/trust","schema":"https://unfragile.ai/schema.json","docs":"https://unfragile.ai/docs"}}