MTEB
Benchmark · Free
Embedding model benchmark — 8 tasks, 112 languages, the standard for comparing embeddings.
Capabilities (12 decomposed)
multi-task embedding model evaluation across 8+ task types
Medium confidence. Evaluates embedding models against a standardized task hierarchy (AbsTask) that implements BitextMining, Classification, Clustering, PairClassification, Reranking, Retrieval, STS, and Summarization tasks. Each task defines its own dataset, evaluation metrics, and task-specific logic, enabling consistent benchmarking across heterogeneous evaluation scenarios. The evaluation pipeline orchestrates model inference, metric computation, and result aggregation in a reproducible manner.
Implements a polymorphic task system where each task type (Retrieval, Classification, etc.) inherits from AbsTask and defines its own evaluation logic, metrics, and dataset handling. This allows MTEB to support 1000+ evaluation tasks across 10+ task types without duplicating evaluation code. Task metadata (language, domain, license) is standardized, enabling filtering and cross-cutting analysis.
Broader task coverage (8+ task types vs. single-task benchmarks like STS or BEIR) and standardized task interface enable fair comparison across heterogeneous evaluation scenarios, whereas most embedding benchmarks focus on retrieval-only evaluation.
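The flow below is a minimal sketch of that pipeline using the public mteb Python API; the model and task names are illustrative, and result attributes may differ slightly between library versions.

```python
# Minimal sketch: evaluate one model on two task types via the mteb API.
# Model and task names are illustrative; result attributes may vary by version.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")

# Mix task types (Classification + STS) to exercise the polymorphic AbsTask system.
tasks = mteb.get_tasks(tasks=["Banking77Classification", "STS22"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/e5-small")

for task_result in results:
    print(task_result.task_name, task_result.scores)  # per-task metric breakdowns
```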
multilingual and cross-lingual evaluation across 112+ languages
Medium confidence. Supports evaluation of embedding models across 112+ languages through language-aware task metadata and multilingual dataset variants. The task system stores language codes and domain information, enabling filtering of tasks by language and cross-lingual evaluation scenarios. Dataset loading automatically handles language-specific variants, and the evaluation pipeline preserves language context through metadata propagation.
Task metadata system stores language codes and domain information as first-class properties, enabling programmatic filtering and cross-lingual task selection. Datasets are loaded with language-aware variants, and the evaluation pipeline preserves language context through metadata propagation. This is distinct from benchmarks that treat language as a post-hoc filtering mechanism.
Covers 112+ languages with standardized task metadata vs. most embedding benchmarks (e.g., BEIR, STS) which are English-only or have limited multilingual coverage.
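As a sketch of language-aware task selection, assuming the languages and task_types filters of mteb.get_tasks (ISO 639-3 codes):

```python
# Sketch: select tasks that cover German or Swahili, restricted to two task types.
# Assumes get_tasks supports languages= and task_types= filters (ISO 639-3 codes).
import mteb

tasks = mteb.get_tasks(
    languages=["deu", "swa"],
    task_types=["Retrieval", "Classification"],
)

for task in tasks:
    print(task.metadata.name)  # language and domain details live on task.metadata
```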
results storage, loading, and format standardization
Medium confidence. Implements a standardized results format (JSON with per-task metrics, model metadata, and evaluation metadata) that enables reproducible result storage and leaderboard integration. Results are stored locally or in a centralized repository (HuggingFace Hub). The results system handles versioning, caching, and format validation. Results can be loaded and compared programmatically, enabling post-hoc analysis and leaderboard generation.
Results are stored in a standardized JSON format with per-task metrics, model metadata, and evaluation metadata. Results can be stored locally or in a centralized repository (HuggingFace Hub). The results system handles versioning and format validation, enabling reproducible result storage and leaderboard integration. Results can be loaded and compared programmatically.
Standardized results format vs. ad-hoc result files, enabling reproducible storage and leaderboard integration. Centralized repository (HuggingFace Hub) vs. scattered result files, enabling easy discovery and comparison.
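An illustrative way to read one stored result file; the key names shown (task_name, mteb_version, scores) are assumptions about the standardized JSON layout and may vary by version.

```python
# Illustrative: inspect a stored per-task result file.
# Key names are assumptions about the standardized JSON format described above.
import json
from pathlib import Path

result_path = Path("results/e5-small/Banking77Classification.json")
result = json.loads(result_path.read_text())

print(result.get("task_name"), result.get("mteb_version"))
for split, split_scores in result.get("scores", {}).items():
    print(split, split_scores)  # per-split, per-language metric entries
```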
contribution system with point-based incentives for task and model additions
Medium confidence. Implements a contribution tracking system that awards points for adding new tasks, models, and datasets to MTEB. Contributors earn points based on the scope and quality of their contribution (e.g., new task type, multilingual task, large dataset). The system tracks contributions and displays them on contributor profiles. Points are used to recognize and incentivize community contributions, enabling MTEB to scale beyond core maintainers.
Contribution system awards points based on contribution type and scope (e.g., new task type, multilingual task, large dataset). Points are tracked and displayed on contributor profiles, providing recognition and incentivizing community contributions. This design enables MTEB to scale beyond core maintainers by leveraging community contributions.
Point-based incentive system vs. purely volunteer contributions, providing recognition and motivation for community contributors. Contribution tracking enables transparency and recognition of community impact.
standardized benchmark suite composition and execution
Medium confidence. Provides pre-defined benchmark suites (e.g., MTEB, RTEB) that group related tasks into coherent evaluation scenarios. The Benchmark class orchestrates task selection, model evaluation, and result aggregation. Benchmarks are composable — users can select specific task subsets, languages, or domains. The execution pipeline handles model loading, caching, and result serialization in a standardized format compatible with the leaderboard.
Benchmark class (in mteb/benchmarks/benchmark.py) provides composable task selection and standardized result formatting. Benchmarks are defined declaratively (e.g., MTEB includes specific task names and languages), and the execution pipeline handles model loading, caching, and result serialization. This enables reproducible benchmarking and leaderboard submission without custom scripting.
Standardized benchmark suites with pre-defined task composition vs. ad-hoc evaluation scripts, enabling reproducibility and leaderboard integration. Pre-defined benchmarks (MTEB, RTEB) reduce configuration burden compared to manually selecting tasks.
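A sketch of running a named suite end to end; the get_benchmark helper and the benchmark name string are assumptions that may differ across versions.

```python
# Sketch: run a pre-defined benchmark suite rather than hand-picking tasks.
# The benchmark name string is an assumption and may differ by mteb version.
import mteb
from sentence_transformers import SentenceTransformer

benchmark = mteb.get_benchmark("MTEB(eng, v1)")   # declaratively defined task suite
evaluation = mteb.MTEB(tasks=benchmark)
evaluation.run(
    SentenceTransformer("intfloat/multilingual-e5-small"),
    output_folder="results/e5-small",
)
```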
encoder protocol abstraction with multi-framework support
Medium confidence. Defines a unified encoder protocol that abstracts over different embedding model implementations (SentenceTransformers, instruction-based models, custom implementations). The protocol specifies encode() method signatures and handles batching, device management, and output normalization. Wrappers for SentenceTransformer and instruction-based models implement the protocol, enabling seamless integration of diverse model architectures without modifying evaluation code.
Encoder protocol (defined in mteb/models/encoder_interface.py) specifies a minimal encode() interface that abstracts over SentenceTransformer, instruction-based, and custom models. Wrappers (SentenceTransformerEmbedding, InstructionEmbedding) implement the protocol without modifying evaluation code. This enables pluggable model support and reduces coupling between model implementations and evaluation logic.
Unified encoder protocol vs. model-specific evaluation code, enabling new model architectures to be added without modifying the evaluation pipeline. Supports instruction-based models natively, whereas most benchmarks assume fixed model behavior.
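A toy encoder satisfying the minimal protocol is sketched below; the class is hypothetical, and depending on the mteb version extra keyword arguments (such as the task name) are absorbed via **kwargs.

```python
# Hypothetical toy encoder: anything exposing encode(sentences, **kwargs) -> np.ndarray
# can be plugged into the evaluation pipeline without changing evaluation code.
import numpy as np


class RandomEncoder:
    """Baseline encoder returning fixed-size random vectors (illustration only)."""

    def encode(self, sentences: list[str], **kwargs) -> np.ndarray:
        rng = np.random.default_rng(seed=0)
        return rng.normal(size=(len(sentences), 384)).astype(np.float32)


# Usage (sketch): evaluation.run(RandomEncoder(), output_folder="results/random-baseline")
```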
task-specific metric computation and result aggregation
Medium confidence. Implements task-specific evaluators that compute metrics appropriate to each task type (e.g., NDCG for retrieval, F1 for classification, V-measure for clustering). Metrics are computed per-task and aggregated into benchmark-level scores. The evaluation system supports custom metrics and handles edge cases (e.g., missing labels, ties in ranking). Results are serialized in a standardized format with per-task breakdowns and aggregate scores.
Task-specific evaluators inherit from a base evaluator class and implement compute() methods that handle metric calculation for each task type. Metrics are computed in-memory with caching to avoid redundant computation. Results are aggregated using a standardized format (JSON) that preserves per-task breakdowns and enables post-hoc analysis. This design separates metric logic from evaluation orchestration.
Task-specific evaluators vs. generic metric libraries (e.g., scikit-learn) ensure metrics are computed correctly for each task type. Standardized result format enables leaderboard integration and reproducible comparisons.
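The pattern can be illustrated schematically: a shared evaluator interface with one metric implementation per task type. The class and method names below are illustrative, not MTEB's actual ones.

```python
# Schematic illustration of per-task-type evaluators behind a common interface.
# Class and method names here are illustrative, not MTEB's actual ones.
from abc import ABC, abstractmethod

from sklearn.metrics import accuracy_score, f1_score


class BaseEvaluator(ABC):
    @abstractmethod
    def compute(self, predictions, references) -> dict[str, float]:
        """Return task-appropriate metrics as a flat name -> score mapping."""


class ClassificationEvaluator(BaseEvaluator):
    def compute(self, predictions, references) -> dict[str, float]:
        return {
            "accuracy": float(accuracy_score(references, predictions)),
            "f1_macro": float(f1_score(references, predictions, average="macro")),
        }
```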
caching and performance optimization for large-scale evaluation
Medium confidence. Implements multi-level caching to reduce redundant computation: dataset caching (avoid re-downloading), embedding caching (avoid re-encoding), and result caching (avoid re-evaluating). The caching system uses local disk storage (configurable path) and checks cache validity based on model/task/dataset versions. Batching and device management optimize memory usage and inference speed. Progress tracking and logging enable monitoring of long-running evaluations.
Multi-level caching system (dataset, embedding, result caches) with version-based invalidation. Caching is transparent to evaluation code — users enable caching via configuration flags. Batching and device management are integrated into the encoder protocol, enabling efficient inference without explicit optimization code. Progress tracking uses tqdm for real-time monitoring.
Transparent caching vs. manual result management, reducing redundant computation and bandwidth usage. Multi-level caching (dataset, embedding, result) provides flexibility for different optimization scenarios.
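A sketch of the result-level cache in practice: re-running against the same output folder skips already-completed tasks. The overwrite_results flag name is an assumption and may differ by version.

```python
# Sketch: result caching via a shared output folder.
# The overwrite_results flag name is an assumption and may differ by version.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")
evaluation = mteb.MTEB(tasks=mteb.get_tasks(tasks=["Banking77Classification"]))

# First call computes and stores results; the second reuses them from disk.
evaluation.run(model, output_folder="results/e5-small", overwrite_results=False)
evaluation.run(model, output_folder="results/e5-small", overwrite_results=False)
```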
interactive leaderboard with dynamic table generation and filtering
Medium confidence. Provides a web-based leaderboard (app in mteb/leaderboard/app.py) that visualizes benchmark results with interactive filtering and sorting. The leaderboard loads results from a centralized repository, generates dynamic tables with configurable columns (metrics, languages, domains), and supports filtering by model, task, language, and benchmark. Table styling and figure generation enable publication-quality visualizations. The leaderboard is automatically updated when new results are submitted.
Streamlit-based leaderboard with dynamic table generation (mteb/leaderboard/table.py) that supports multi-level filtering (model, task, language, benchmark) and configurable column selection. Figures are generated on-the-fly using matplotlib/plotly. Leaderboard is automatically updated when new results are submitted to the results repository. This enables real-time result visualization without manual updates.
Interactive web-based leaderboard vs. static result tables or spreadsheets, enabling dynamic filtering and exploration. Supports multi-dimensional filtering (task, language, benchmark) vs. single-dimension leaderboards.
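Schematically, the multi-level filtering boils down to selecting rows of a results table by model, task, language, and benchmark; the column names and values below are hypothetical, not the leaderboard's actual schema.

```python
# Schematic only: leaderboard-style filtering of a results table.
# Column names and values are hypothetical illustrations.
import pandas as pd

results = pd.DataFrame(
    [
        {"model": "multilingual-e5-small", "task": "STS22", "language": "deu", "main_score": 0.61},
        {"model": "multilingual-e5-small", "task": "Banking77Classification", "language": "eng", "main_score": 0.72},
    ]
)

german_sts = results[(results["language"] == "deu") & (results["task"].str.startswith("STS"))]
print(german_sts)
```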
model metadata and model card generation
Medium confidence. Maintains standardized metadata for each model (architecture, training data, languages supported, license) and generates model cards compatible with Hugging Face. Model metadata is stored alongside results and used for filtering and comparison. The system supports custom metadata fields and enables model developers to provide context (e.g., training approach, known limitations). Model cards are generated from metadata and results, providing a comprehensive overview of model capabilities.
Model metadata system stores standardized fields (architecture, training data, languages, license) alongside results. Model cards are generated from metadata and results using templates, enabling Hugging Face Hub integration. Metadata is used for filtering and comparison in the leaderboard, providing context for interpreting results.
Standardized model metadata vs. ad-hoc documentation, enabling programmatic filtering and comparison. Model card generation reduces manual documentation burden.
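The kind of information such metadata carries can be sketched as a plain dictionary; the field names below are hypothetical, not mteb's exact schema.

```python
# Hypothetical sketch of standardized model metadata fields (not the exact schema).
model_meta = {
    "name": "intfloat/multilingual-e5-small",
    "languages": ["eng", "deu", "fra", "zho"],
    "license": "mit",
    "n_parameters": 118_000_000,
    "embedding_dim": 384,
    "training_data": "see the model card for details",
}
```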
command-line interface for benchmark execution and result submission
Medium confidence. Provides a CLI (mteb/cli.py) for running benchmarks, evaluating models, and submitting results without writing Python code. The CLI supports task selection, language filtering, model specification, and result output formatting. Commands include 'run' (execute benchmark), 'eval' (evaluate single task), 'submit' (submit results to leaderboard), and 'list' (show available tasks/models). The CLI integrates with the evaluation pipeline and handles result serialization.
CLI provides a thin wrapper over the Python API, enabling benchmark execution without custom scripts. Commands map to core operations (run, eval, submit, list) with configurable arguments and flags. Result output is formatted for leaderboard submission, reducing manual formatting burden. CLI integrates with the evaluation pipeline, inheriting caching and optimization benefits.
CLI-based execution vs. Python API, enabling non-programmers to run benchmarks. Standardized output format vs. ad-hoc result files, reducing submission errors.
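A sketch of the kind of invocation described above; the subcommand and flag names follow that description and may differ between mteb versions.

```bash
# Sketch: run one task for one model from the command line.
mteb run -m intfloat/multilingual-e5-small -t Banking77Classification --output_folder results/e5-small
```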
extensible task system for adding new evaluation scenarios
Medium confidence. Provides a task framework (AbsTask base class) that enables developers to add new evaluation tasks without modifying core evaluation code. Tasks define dataset loading, metric computation, and task-specific logic through method overrides. The task registry enables dynamic task discovery and selection. Task metadata (language, domain, license) is standardized, enabling filtering and cross-cutting analysis. Documentation and examples guide task creation.
AbsTask base class defines a minimal interface (load_data, evaluate) that subclasses override to implement task-specific logic. Task registry enables dynamic task discovery and selection. Task metadata (language, domain, license) is standardized and used for filtering. This design separates task logic from evaluation orchestration, enabling new tasks to be added without modifying core code.
Extensible task framework vs. monolithic evaluation code, enabling new tasks to be added without modifying core logic. Task registry enables dynamic task discovery vs. static task lists.
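A sketch of a new task definition follows; the import paths track the library layout, while the metadata fields shown are illustrative and recent versions may require additional ones.

```python
# Sketch: add a new classification task by subclassing an AbsTask variant.
# The dataset path is a placeholder; the metadata fields shown are a subset.
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification
from mteb.abstasks.TaskMetadata import TaskMetadata


class MyReviewClassification(AbsTaskClassification):
    metadata = TaskMetadata(
        name="MyReviewClassification",
        description="Sentiment classification over a hypothetical review corpus.",
        dataset={"path": "my-org/my-reviews", "revision": "main"},
        type="Classification",
        eval_splits=["test"],
        eval_langs=["eng-Latn"],
        main_score="accuracy",
        # further fields (domains, license, citation, ...) omitted for brevity
    )
```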
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with MTEB, ranked by overlap. Discovered automatically through the match graph.
results
Dataset by mteb. 1,326,253 downloads.
multilingual-e5-small
sentence-similarity model by intfloat. 7,032,108 downloads.
MAP-Neo
Fully open bilingual model with transparent training.
gte-multilingual-base
sentence-similarity model by Alibaba-NLP. 2,453,432 downloads.
multilingual-e5-base
sentence-similarity model by intfloat. 3,660,082 downloads.
sentence-transformers
Embeddings, Retrieval, and Reranking
Best For
- ✓ Embedding model developers validating generalization
- ✓ ML researchers comparing embedding approaches
- ✓ Teams selecting production embedding models
- ✓ Teams building multilingual search or RAG systems
- ✓ Researchers studying cross-lingual transfer in embeddings
- ✓ Global companies evaluating models for non-English markets
- ✓ Researchers publishing benchmark results
- ✓ Teams tracking model performance over time
Known Limitations
- ⚠ Task execution is sequential by default — no built-in parallelization across tasks
- ⚠ Memory overhead scales with dataset size; large retrieval tasks may require batching
- ⚠ Task-specific evaluators are tightly coupled to metric implementations — extending metrics requires subclassing
- ⚠ Not all 1000+ tasks have multilingual variants — coverage varies by language (English has the most, low-resource languages have fewer)
- ⚠ Cross-lingual evaluation requires paired datasets (query language ≠ document language), which are limited to specific task types
- ⚠ No language-specific metrics (e.g., BERTScore for morphologically rich languages); evaluation relies on generic, language-agnostic metrics
About
Massive Text Embedding Benchmark. Evaluates embedding models across 8 tasks (retrieval, classification, clustering, reranking, etc.) in 112 languages. The standard for comparing embedding models. Leaderboard on Hugging Face.