lm-evaluation-harness
Repository · Free
EleutherAI's evaluation framework: 200+ benchmarks; powers the Open LLM Leaderboard.
Capabilities (15 decomposed)
multi-backend language model instantiation with unified interface
Medium confidence: Provides a registry-based abstraction layer that instantiates language models from 25+ backends (HuggingFace, vLLM, OpenAI, Anthropic, local Ollama, etc.) through a single Python API. The registry pattern decouples task definitions from model implementations, allowing users to swap backends without changing evaluation code. Each backend implements a common interface supporting loglikelihood scoring and text generation with automatic tokenization, BOS token handling, and context window management.
Uses a pluggable registry system (lm_eval/api/registry.py) where each backend implements a common LM interface with automatic BOS token handling, tokenizer management, and context window validation. Unlike frameworks that require separate evaluation scripts per backend, this centralizes backend logic while preserving backend-specific optimizations (e.g., vLLM's paged attention).
Supports more backends (25+) than alternatives like LM-Eval-Lite or custom evaluation scripts, and provides a unified loglikelihood + generation interface that alternatives often split across separate tools.
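A minimal sketch of the unified API, assuming a recent release where lm_eval.simple_evaluate and the "hf" and "vllm" registry names are available; the model and task names are only illustrative:

```python
# Swapping backends without touching the evaluation code.
import lm_eval

# Local HuggingFace backend
results_hf = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,dtype=float16",
    tasks=["hellaswag"],
    batch_size=8,
)

# Same tasks, vLLM backend: only the model specification changes.
results_vllm = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    batch_size=8,
)

print(results_hf["results"]["hellaswag"])
```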
yaml-based task definition with inheritance and templating
Medium confidence: Enables users to define evaluation tasks declaratively via YAML configuration files with support for Jinja2 templating, task inheritance, and document processing. Tasks specify prompts, few-shot examples, metrics, and answer extraction logic without writing Python code. The TaskManager loads YAML configs, resolves inheritance chains, and instantiates Task objects that generate evaluation requests. This approach separates task logic from evaluation infrastructure, allowing non-engineers to create benchmarks.
Implements a hierarchical task configuration system where YAML tasks can inherit from parent tasks, override specific fields, and use Jinja2 templating for dynamic prompt generation. The TaskManager resolves inheritance chains and merges configurations, enabling task reuse across 200+ benchmarks. Document processing pipeline (lm_eval/api/task.py) handles dataset loading, few-shot sampling, and prompt rendering in a single pass.
More declarative and maintainable than hardcoded Python task classes; supports inheritance and templating that alternatives like HELM or LM-Eval-Lite lack, reducing duplication across similar tasks
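A hedged sketch of a declarative task definition, written as the YAML a task file would carry and registered through TaskManager's include_path; field names follow the documented ConfigurableTask schema, but verify them (and the task_manager argument to simple_evaluate) against the installed version:

```python
# Writing a custom YAML task and loading it via TaskManager.
from pathlib import Path

import lm_eval
from lm_eval.tasks import TaskManager

task_yaml = """\
task: my_sentiment_task
dataset_path: glue
dataset_name: sst2
output_type: multiple_choice
training_split: train
test_split: validation
doc_to_text: "Sentence: {{sentence}}\\nSentiment:"
doc_to_choice: ["negative", "positive"]
doc_to_target: label
num_fewshot: 2
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
"""

task_dir = Path("./my_tasks")
task_dir.mkdir(exist_ok=True)
(task_dir / "my_sentiment_task.yaml").write_text(task_yaml)

# include_path points TaskManager at the custom YAML directory.
tm = TaskManager(include_path=str(task_dir))
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["my_sentiment_task"],
    task_manager=tm,
)
```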
benchmark suite composition and aggregation
Medium confidence: Enables grouping of related tasks into benchmark suites (e.g., MMLU, BigBench, HELM) with aggregated metrics and reporting. Suites can be defined in YAML or Python, with support for task groups, weighted aggregation, and suite-level metrics. The system computes both per-task and suite-level results, with confidence intervals propagated through aggregation. Supports standard NLP benchmarks, multilingual benchmarks, robustness frameworks (SCORE), and custom suites.
Provides a declarative suite definition system where tasks can be grouped with optional weights and aggregation methods. The system automatically computes per-task and suite-level metrics, with confidence intervals propagated through aggregation. Supports both standard benchmarks (MMLU, BigBench) and custom suites defined in YAML or Python.
Supports weighted aggregation and custom suite composition, whereas alternatives typically report only per-task results; integrates suite definition into the evaluation framework rather than requiring external aggregation scripts
custom task definition via python classes with metric registration
Medium confidence: Allows advanced users to define evaluation tasks as Python classes extending the Task base class, with custom metric functions and request generation logic. Custom tasks can implement arbitrary evaluation logic beyond YAML capabilities, including complex metrics, multi-stage evaluation, and dynamic request generation. Metrics are registered in a global registry and can be reused across tasks. This provides maximum flexibility for researchers designing novel evaluation approaches.
Provides a Task base class that users can extend to implement custom evaluation logic, with automatic registration in the global task registry. Custom tasks can override request generation, metric computation, and result aggregation. Metrics are registered separately and can be reused across tasks, enabling modular metric development.
Enables arbitrary Python logic for task definition and metrics, whereas YAML-based tasks are limited to built-in capabilities; integrates custom tasks into the evaluation pipeline with automatic batching and caching support
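For the metric-registration half, a hedged sketch using the register_metric decorator from lm_eval.api.registry; the decorator exists in recent versions, but the exact keyword arguments and the references/predictions call convention are assumptions based on how built-in generation metrics are invoked:

```python
# Registering a reusable custom metric in the global registry.
from lm_eval.api.registry import register_metric


@register_metric(
    metric="exact_prefix_match",   # name a task's metric_list would reference
    higher_is_better=True,
    output_type="generate_until",
    aggregation="mean",
)
def exact_prefix_match(references, predictions, **kwargs):
    # Score 1.0 if the prediction starts with the reference answer.
    # Call convention (keyword args, single-element lists) is an assumption.
    ref = references[0].strip()
    pred = predictions[0].strip()
    return 1.0 if pred.startswith(ref) else 0.0

# A YAML task would then reference it, e.g.:
#   metric_list:
#     - metric: exact_prefix_match
```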
model-agnostic evaluation with tokenizer abstraction
Medium confidence: Abstracts tokenizer differences across models by providing a unified tokenization interface that handles special tokens, padding, and attention masks consistently. The system automatically selects the correct tokenizer for each model backend and applies model-specific token handling (e.g., BOS token prepending for certain models). This enables fair comparison across models with different tokenization schemes, which would otherwise produce different loglikelihood scores for identical prompts.
Implements a tokenizer abstraction layer that automatically selects and applies the correct tokenizer for each model backend, with special handling for BOS tokens and model-specific quirks. The system tests BOS token handling empirically (lm_eval/models/test_bos_handling.py) to detect and correct for model-specific behavior, ensuring fair loglikelihood comparison across models.
Provides automatic BOS token handling and tokenizer selection, whereas alternatives require manual configuration; includes empirical BOS testing to detect model-specific behavior
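A small sketch of the BOS convention surfaced as a backend option; HFLM and its add_bos_token argument exist in recent releases, while tok_encode is an internal helper shown here only to make the difference visible:

```python
# Forcing BOS prepending for models whose scores depend on it. Whether a BOS
# token is added by default depends on the underlying tokenizer.
from lm_eval.models.huggingface import HFLM

lm_default = HFLM(pretrained="EleutherAI/pythia-160m")
lm_bos = HFLM(pretrained="EleutherAI/pythia-160m", add_bos_token=True)

# The harness applies the chosen convention to every request, so loglikelihood
# scores stay comparable across prompts.
print(lm_default.tok_encode("The capital of France is")[:5])
print(lm_bos.tok_encode("The capital of France is")[:5])
```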
command-line interface with flexible task and model specification
Medium confidence: Provides a comprehensive CLI (lm_eval/__main__.py) that accepts task names, model names, and evaluation parameters as command-line arguments. Supports task filtering (e.g., 'mmlu_*' to run all MMLU variants), model specification with backend selection, and output format configuration. The CLI integrates all framework capabilities (batching, caching, distributed evaluation, logging) without requiring Python code, making the framework accessible to non-programmers.
Provides a full-featured CLI that exposes all framework capabilities without requiring Python code. Supports task filtering with glob patterns (e.g., 'mmlu_*'), model specification with backend selection, and flexible output configuration. The CLI integrates batching, caching, distributed evaluation, and multi-sink logging.
More comprehensive CLI than alternatives like simple evaluation scripts; supports task filtering, model selection, and output configuration in a single command
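A sketch of a typical invocation, driven from Python for consistency with the other examples; the flags shown are long-standing lm_eval CLI options, though glob-style task filtering such as 'mmlu_*' depends on the installed version:

```python
# Running an evaluation through the CLI entry point.
import subprocess

subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=EleutherAI/pythia-160m",
        "--tasks", "hellaswag,arc_easy",
        "--num_fewshot", "5",
        "--batch_size", "8",
        "--output_path", "results/pythia-160m",
    ],
    check=True,
)
```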
benchmark suite composition and leaderboard aggregation
Medium confidence: Enables creation of custom benchmark suites by composing multiple tasks and aggregating their metrics into a single leaderboard score. The system supports weighted aggregation (e.g., MMLU counting more than HellaSwag), per-task metric selection, and hierarchical grouping (e.g., a 'reasoning' group containing multiple reasoning tasks). Leaderboard scores are computed with optional normalization and ranking.
Supports weighted aggregation of metrics across multiple tasks with hierarchical grouping. Leaderboard scores are computed with optional normalization, enabling fair comparison across models with different evaluation configurations.
Compared to manual leaderboard computation, the framework automates aggregation and ranking. Weighted aggregation enables custom benchmark suites tailored to specific evaluation goals.
few-shot example sampling with stratification and caching
Medium confidence: Implements configurable few-shot sampling strategies that select examples from the training set to include in prompts. Supports random sampling, stratified sampling (balanced across classes), and deterministic seeding for reproducibility. The system caches sampled examples to avoid recomputation and integrates with the request generation pipeline to prepend examples to each evaluation instance. Sampling respects task-specific constraints (e.g., max tokens, example diversity).
Integrates few-shot sampling directly into the request generation pipeline with built-in caching and stratification support. The system computes sampling once per task, caches results, and reuses them across all evaluation instances. Stratified sampling uses class labels to ensure balanced representation, which is critical for imbalanced datasets where random sampling might miss minority classes.
Provides stratified sampling (not just random) and automatic caching that alternatives like simple prompt engineering lack; integrates sampling into the evaluation pipeline rather than requiring manual example selection
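A minimal sketch of reproducible few-shot sampling; num_fewshot is a stable simple_evaluate argument, while fewshot_random_seed is assumed to be available only in newer releases:

```python
# Fixed-seed few-shot sampling so repeated runs see the same examples.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["arc_challenge"],
    num_fewshot=25,
    fewshot_random_seed=1234,  # assumption: present in recent versions
)
```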
loglikelihood and text generation request generation with batching
Medium confidence: Generates two types of evaluation requests from task definitions: (1) loglikelihood scoring, which computes the model's probability of correct answers given prompts, and (2) text generation, which samples model outputs and compares them to references. The request generator creates batches of requests optimized for the target model backend, handling tokenization, padding, and attention mask generation. Requests are cached to avoid recomputation across multiple evaluation runs.
Implements a two-stage request generation pipeline: (1) logical request creation from task instances, and (2) physical batching optimized for the target backend. The system automatically groups requests into batches, handles variable-length sequences with padding, and caches results. Loglikelihood requests support both continuation scoring (P(answer|prompt)) and prefix scoring (P(prompt+answer)).
Unified handling of both loglikelihood and generation requests in a single pipeline, with automatic batching and caching that alternatives require manual implementation for; supports backend-specific optimizations (e.g., vLLM's token reuse)
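A hedged sketch of the request interface a backend has to serve, following the project's custom-model pattern of subclassing lm_eval.api.model.LM; the dummy return values and the exact shape of each request object are assumptions to keep the example self-contained:

```python
from typing import List, Tuple

from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


@register_model("my_dummy_backend")
class DummyLM(LM):
    def loglikelihood(self, requests) -> List[Tuple[float, bool]]:
        # One (logprob of continuation, was-it-the-greedy-decode) pair per request.
        return [(-1.0, False) for _ in requests]

    def loglikelihood_rolling(self, requests) -> List[float]:
        # Full-sequence loglikelihood, used for perplexity-style tasks.
        return [-1.0 for _ in requests]

    def generate_until(self, requests) -> List[str]:
        # Free-form generation, stopping on task-specified "until" strings.
        return ["" for _ in requests]
```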
metric computation with bootstrapped confidence intervals
Medium confidence: Computes evaluation metrics (accuracy, F1, BLEU, ROUGE, etc.) from model predictions and references, with automatic bootstrapped confidence interval calculation. The metrics system supports both built-in metrics (via lm_eval/api/metrics.py) and custom metric functions. Bootstrapping resamples predictions with replacement to estimate metric variance and generate 95% confidence intervals, providing statistical rigor beyond point estimates. Metrics are aggregated at task and suite levels.
Integrates bootstrapped confidence interval computation directly into the metrics pipeline, automatically resampling predictions to estimate metric variance. The system supports both built-in metrics (accuracy, F1, BLEU, ROUGE) and custom metric functions, with aggregation at task and suite levels. Bootstrapping is configurable (default 100k iterations) and cached to avoid recomputation.
Provides confidence intervals by default (not optional), which alternatives like simple accuracy reporting lack; bootstrapping approach is more robust than analytical CI formulas for non-normal distributions
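An illustration of the bootstrapping idea in plain Python (not the library's own implementation): resample per-example scores with replacement and read the confidence interval off the percentiles of the resampled means:

```python
import random


def bootstrap_ci(scores, iters=10_000, alpha=0.05, seed=0):
    # Resample with replacement, compute the mean of each resample,
    # and take the alpha/2 and 1 - alpha/2 percentiles.
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(iters))
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi


per_example_acc = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
print(bootstrap_ci(per_example_acc))  # roughly (0.4, 1.0) for this toy data
```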
distributed and multi-gpu evaluation with automatic load balancing
Medium confidence: Enables evaluation across multiple GPUs and distributed systems using PyTorch's DistributedDataParallel or manual batching strategies. The evaluator automatically partitions tasks across devices, balances load based on task complexity, and aggregates results. Supports both data parallelism (same model on multiple GPUs) and model parallelism (model sharded across GPUs). Caching prevents redundant computation across devices, and results are synchronized before final aggregation.
Implements automatic load balancing across GPUs by partitioning tasks based on estimated complexity (dataset size, model size). The system uses PyTorch's DistributedDataParallel for data parallelism and supports manual device assignment for model parallelism. Caching is synchronized across devices using file locks to prevent redundant computation while avoiding race conditions.
Provides automatic load balancing and device management that alternatives require manual configuration for; integrates with vLLM and other backends that natively support tensor parallelism
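A hedged sketch of the two usual multi-GPU modes, both documented upstream: data parallelism via accelerate launch and model parallelism via the HFLM parallelize flag; process counts, model names, and batch sizes are illustrative:

```python
import subprocess

# Data parallel: replicate the model across GPUs and shard the requests.
subprocess.run(
    [
        "accelerate", "launch", "--num_processes", "4",
        "-m", "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=EleutherAI/pythia-1.4b",
        "--tasks", "mmlu",
        "--batch_size", "16",
    ],
    check=True,
)

# Model parallel: shard one large model across the visible GPUs.
subprocess.run(
    [
        "lm_eval",
        "--model", "hf",
        "--model_args", "pretrained=EleutherAI/gpt-neox-20b,parallelize=True",
        "--tasks", "mmlu",
        "--batch_size", "4",
    ],
    check=True,
)
```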
chat template and multi-turn prompt formatting
Medium confidence: Handles formatting of multi-turn conversations using model-specific chat templates (e.g., ChatML, Llama 2 chat format). The system applies templates to convert task prompts into properly formatted chat messages, handling role assignment (user/assistant), special tokens, and message ordering. Templates are loaded from HuggingFace model configs or defined in task YAML. This enables evaluation of instruction-tuned and chat models with their native prompt formats.
Integrates chat template application directly into the request generation pipeline, automatically detecting and applying model-specific formats from HuggingFace configs. The system handles role assignment, special token insertion, and message ordering according to each model's template. Supports both built-in templates and custom definitions in task YAML.
Automatically detects and applies model-specific chat templates from HuggingFace configs, whereas alternatives require manual template specification; supports multi-turn conversations natively
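A minimal sketch of evaluating an instruction-tuned model with its native template; apply_chat_template and fewshot_as_multiturn exist as CLI flags and, in recent versions, as simple_evaluate arguments, so treat their availability as version-dependent:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HuggingFaceH4/zephyr-7b-beta",
    tasks=["gsm8k"],
    apply_chat_template=True,   # use the tokenizer's own chat template
    fewshot_as_multiturn=True,  # render few-shot examples as prior turns
)
```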
response filtering and answer extraction with regex and parsing
Medium confidence: Extracts model answers from generated text using task-specific filters and parsers. Supports regex-based extraction, JSON parsing, multiple-choice option selection, and custom Python functions. Filters handle common cases like extracting the first line, finding the last number, or parsing structured outputs. This decouples answer extraction logic from metric computation, allowing flexible handling of different model output formats.
Provides a pluggable filter system where each task can define custom extraction logic via regex, JSON parsing, or Python functions. Filters are applied in sequence with fallback strategies, allowing graceful degradation if primary extraction fails. The system logs extraction failures for debugging and supports multiple valid answer formats.
Supports multiple extraction strategies with fallbacks, whereas alternatives typically use single-strategy extraction; integrates extraction into the evaluation pipeline rather than requiring post-processing
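A hedged sketch of a filter pipeline, shown as the YAML fragment a generate_until task would carry; the regex and take_first filter functions and the filter_list structure follow the documented schema, while the pattern itself (grab the last integer in the generation) is illustrative:

```python
# Fragment to merge into a task config; each generation is run through the
# regex, the first match is kept, and only then is the result scored.
filter_yaml = """\
filter_list:
  - name: extract_number
    filter:
      - function: regex
        regex_pattern: "(-?[0-9,]+)(?=[^0-9]*$)"
      - function: take_first
"""
```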
result logging and persistence with multi-sink support
Medium confidence: Logs evaluation results to multiple destinations simultaneously: JSON files, Weights & Biases, HuggingFace Hub, Zeno visualization platform, and custom sinks. The EvaluationTracker (lm_eval/loggers/evaluation_tracker.py) manages result aggregation and formatting for each sink. Results include metrics, confidence intervals, task metadata, model info, and evaluation parameters. Logging is asynchronous to avoid blocking evaluation, and results are cached to enable resumable evaluations.
Implements a multi-sink logging architecture where results are formatted and sent to multiple destinations (JSON, W&B, HuggingFace Hub, Zeno) simultaneously. The EvaluationTracker aggregates results and handles sink-specific formatting. Logging is asynchronous and decoupled from evaluation, allowing evaluation to proceed while results are uploaded.
Supports simultaneous logging to multiple platforms (W&B, HuggingFace Hub, Zeno) in a single pipeline, whereas alternatives typically support one platform; integrates with HuggingFace Hub for Open LLM Leaderboard submission
caching system with request deduplication and result reuse
Medium confidence: Implements a multi-level caching system that stores model outputs, loglikelihoods, and metrics to avoid redundant computation. Caches are keyed by model name, task name, and request hash, enabling result reuse across evaluation runs. The system supports both in-memory and disk-based caches, with automatic cache invalidation when task definitions change. Caching is transparent to users and significantly reduces evaluation time for repeated benchmarks.
Implements transparent, multi-level caching keyed by model name, task name, and request hash. The system automatically deduplicates requests and reuses results across evaluation runs. Caches are stored on disk with optional in-memory layer, and cache invalidation is triggered by task definition changes (detected via hash comparison).
Provides transparent caching without user intervention, whereas alternatives require manual result management; supports both in-memory and disk-based caches with automatic deduplication
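A minimal sketch of persistent caching; use_cache (a path prefix for the SQLite response cache) is a documented simple_evaluate argument, and cache_requests is assumed to exist only in newer releases:

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    use_cache="./lm_cache/pythia-160m",  # reused on the next identical run
    cache_requests=True,                 # assumption: available in newer releases
)
```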
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with lm-evaluation-harness, ranked by overlap. Discovered automatically through the match graph.
- MAP-Neo: Fully open bilingual model with transparent training.
- Local GPT: Chat with documents without compromising privacy.
- Wordware: Build better language model apps, fast.
- gpt4all: A chatbot trained on a massive collection of clean assistant data including code, stories and dialogue.
- WMDP: Benchmark for dangerous knowledge in LLMs.
Best For
- ✓ researchers comparing models across different inference frameworks
- ✓ teams evaluating both proprietary and open-source models in a single pipeline
- ✓ organizations migrating from one inference backend to another
- ✓ researchers designing new benchmarks without deep Python expertise
- ✓ teams maintaining large task suites (200+ tasks) with shared configuration patterns
- ✓ organizations needing version-controlled, human-readable task definitions
- ✓ researchers publishing results on standard benchmarks (MMLU, BigBench, etc.)
- ✓ teams creating custom benchmark suites for domain-specific evaluation
Known Limitations
- ⚠ Backend-specific features (e.g., vLLM's tensor parallelism) require explicit configuration; abstraction doesn't auto-optimize
- ⚠ API-based models (OpenAI, Anthropic) incur per-token costs; local models don't, creating cost asymmetry in benchmarks
- ⚠ Tokenizer differences between backends can cause slight loglikelihood score variations even for identical models
- ⚠ Complex metric computation (e.g., custom statistical tests) still requires Python; YAML is limited to built-in metrics
- ⚠ Jinja2 templating adds parsing overhead (~5-10ms per task instantiation); not suitable for real-time prompt generation
- ⚠ Task inheritance can create deep dependency chains that are hard to debug if misconfigured
About
EleutherAI's framework for evaluating language models. Supports 200+ benchmarks. The backend for Hugging Face's Open LLM Leaderboard. Features custom task definitions, few-shot evaluation, and batch processing.