MTEB
Benchmark · Free
Embedding model benchmark — 8 task types, 112 languages; the standard for comparing embeddings.
Capabilities: 12 decomposed
multi-task embedding model evaluation across 8+ task types
Medium confidence: Evaluates embedding models against a standardized task hierarchy (AbsTask base class) spanning retrieval, classification, clustering, reranking, pair classification, and semantic textual similarity. Each task type implements task-specific evaluation logic with custom metrics, enabling models to be benchmarked across diverse embedding use cases in a single evaluation run. The framework abstracts task-specific scorer implementations while maintaining consistent metadata and result serialization.
Implements a polymorphic task system with AbsTask base class supporting 8+ task types, each with task-specific evaluators and metrics, rather than a single monolithic evaluation pipeline. This enables extensibility — new task types inherit from AbsTask and override evaluate() method while reusing metadata and result serialization infrastructure.
More comprehensive than single-task benchmarks (e.g., BEIR for retrieval only) by evaluating models across retrieval, classification, clustering, and reranking in one framework, reducing need for multiple separate evaluation tools.
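A minimal sketch of what a single multi-task run looks like through the Python API, assuming the documented mteb.MTEB entry point and a SentenceTransformers model; the task names are illustrative picks from the task catalogue, one per task type.

```python
# Hedged sketch: one evaluation object spanning several task types, each
# handled behind the scenes by its own AbsTask subclass and evaluator.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

evaluation = mteb.MTEB(
    tasks=[
        "Banking77Classification",  # classification task
        "STSBenchmark",             # semantic textual similarity task
        "SciFact",                  # retrieval task
    ]
)
results = evaluation.run(model, output_folder="results")
```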
multilingual and cross-lingual embedding evaluation across 112+ languages
Medium confidence: Provides language-aware task metadata and dataset selection enabling evaluation of embedding models across 112+ languages and cross-lingual scenarios. Tasks are tagged with language codes and domain information, allowing filtering and evaluation of multilingual models on language-specific or cross-lingual retrieval/classification tasks. The framework handles language-specific dataset loading and metric computation without requiring model-level language handling.
Embeds language metadata directly into task definitions (via task.languages property) and filters datasets by language code, enabling language-aware evaluation without requiring separate language-specific benchmark suites. Supports both monolingual and cross-lingual task variants within the same framework.
Covers 112+ languages across 8 task types, whereas most embedding benchmarks (BEIR, STS, etc.) focus on English-only evaluation or require separate multilingual variants.
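As a sketch of how language filtering is exposed, recent releases provide mteb.get_tasks() with language and task-type filters; the ISO 639-3 codes and filter arguments below are assumptions to verify against the installed version.

```python
# Hedged sketch: select only German/French classification tasks, then
# evaluate as usual. No language handling is needed inside the model.
import mteb

tasks = mteb.get_tasks(languages=["deu", "fra"], task_types=["Classification"])
evaluation = mteb.MTEB(tasks=tasks)
# results = evaluation.run(model, output_folder="results")
```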
result serialization and leaderboard-compatible output formatting
Medium confidence: Serializes evaluation results to standardized JSON format compatible with leaderboard ingestion, including model metadata, task results, metrics, and evaluation metadata (date, MTEB version). Results are stored in a hierarchical structure with per-task and aggregated metrics. The framework supports result loading from JSON files or Hugging Face Hub, enabling result sharing and leaderboard submission. Model cards can be automatically generated from results.
Implements standardized JSON result format with hierarchical structure (model metadata, per-task results, aggregated metrics) compatible with leaderboard ingestion. Results include evaluation metadata (date, MTEB version) enabling reproducibility and version tracking.
Provides standardized result format for leaderboard submission, whereas ad-hoc evaluation requires manual result formatting and validation.
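The snippet below is an illustrative reader for one serialized result file; the folder layout and field names (scores, main_score, mteb_version) follow recent output formats but may differ between MTEB versions.

```python
# Illustrative only: inspect a per-task result file produced by a run.
import json
from pathlib import Path

# Path components (model folder, revision folder, task file) are assumptions.
result_file = Path("results") / "my-model" / "no_revision_available" / "Banking77Classification.json"
result = json.loads(result_file.read_text())

print(result.get("mteb_version"))
for split, entries in result.get("scores", {}).items():
    for entry in entries:
        print(split, entry.get("languages"), entry.get("main_score"))
```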
command-line interface for batch evaluation and result management
Medium confidence: Provides a CLI (via Click or argparse) enabling batch evaluation of models on benchmarks without writing Python code. Supports commands for running benchmarks, submitting results, and viewing leaderboard results. CLI handles model loading, benchmark selection, result serialization, and optional leaderboard submission. Enables integration with CI/CD pipelines and automated evaluation workflows. Supports configuration files for reproducible evaluation setups.
Implements a CLI with subcommands for benchmark execution, result submission, and leaderboard viewing, enabling batch evaluation without writing Python code. Supports configuration files for reproducible setups and CI/CD integration.
Enables non-Python users and CI/CD systems to run MTEB evaluations via command line, whereas Python-only API requires custom scripts for each evaluation.
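For CI/CD use, the CLI can be driven from a pipeline step; the `mteb run -m ... -t ...` subcommand and flags below reflect recent releases and should be confirmed with `mteb --help` for the installed version.

```python
# Hedged sketch: invoke the CLI from a CI job and fail the build on error.
import subprocess

subprocess.run(
    [
        "mteb", "run",
        "-m", "sentence-transformers/all-MiniLM-L6-v2",
        "-t", "Banking77Classification",
        "--output_folder", "results",
    ],
    check=True,  # raise CalledProcessError if the evaluation fails
)
```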
standardized benchmark suite composition and execution
Medium confidence: Defines pre-curated benchmark suites (e.g., MTEB, MTEB-Lite, RTEB) as collections of specific tasks with fixed configurations, enabling reproducible model comparisons across the community. Benchmarks are defined in mteb/benchmarks/benchmarks.py and can be retrieved via get_benchmark() API, which returns a Benchmark object containing task instances, metadata, and execution parameters. This abstraction decouples benchmark definition from evaluation logic.
Implements benchmark suites as first-class objects (Benchmark class) with metadata, task lists, and execution parameters, rather than ad-hoc task collections. Enables version-controlled benchmark definitions and leaderboard-compatible result formats through standardized Benchmark.run() interface.
Provides pre-defined, community-agreed benchmark suites (MTEB, MTEB-Lite, RTEB) with fixed task configurations, enabling fair model comparison on leaderboard, whereas ad-hoc benchmarking requires manual task selection and configuration.
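A sketch of running a pre-defined suite, assuming the get_benchmark() API described above; the benchmark name string is an assumption, and available names should be checked in the installed release.

```python
# Hedged sketch: fetch a named suite and run it as a task collection.
import mteb

benchmark = mteb.get_benchmark("MTEB(eng, v1)")  # name string is an assumption
evaluation = mteb.MTEB(tasks=benchmark)
# results = evaluation.run(model, output_folder="results")
```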
encoder abstraction layer with model-agnostic evaluation
Medium confidence: Defines an encoder protocol (encode() method signature) that abstracts model-specific implementation details, enabling evaluation of any embedding model (SentenceTransformers, instruction-tuned models, custom implementations) through a unified interface. Models are wrapped in encoder classes (e.g., SentenceTransformerEncoder, InstructionBasedEncoder) that implement the protocol, handle batching, and manage model loading. This decouples task evaluation logic from model-specific code paths.
Implements a minimal encoder protocol (encode() method) rather than requiring model-specific adapters, so any model that can map text to embeddings can be evaluated. Supports both standard and instruction-based models through separate encoder wrappers (SentenceTransformerEncoder vs. InstructionBasedEncoder) that handle task-specific prompting.
More flexible than framework-specific benchmarks (e.g., Hugging Face model evaluation) by supporting any model with an encode() method, including custom implementations, proprietary models, and non-standard architectures.
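The contract is small enough to show with a toy model: any object exposing encode(sentences, **kwargs) that returns one vector per input can be passed to the evaluation. The class below is a stand-in, not a real embedding model.

```python
# Toy encoder illustrating the minimal protocol; it returns meaningless
# fixed random vectors and exists only to show the required interface.
import numpy as np
import mteb

class ToyEncoder:
    def encode(self, sentences, **kwargs):
        rng = np.random.default_rng(0)
        return rng.normal(size=(len(sentences), 256)).astype(np.float32)

evaluation = mteb.MTEB(tasks=["Banking77Classification"])
# results = evaluation.run(ToyEncoder(), output_folder="results")
```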
task-specific metric computation and result aggregation
Medium confidence: Implements task-specific evaluators (e.g., RetrievalEvaluator, ClassificationEvaluator, ClusteringEvaluator) that compute metrics appropriate to each task type using embeddings and ground truth labels. Metrics include NDCG, MAP, F1, NMI, and others depending on task. Results are aggregated per-task and across benchmarks, with support for weighted averaging and stratified analysis by language or domain. Results are serialized to standardized JSON format for leaderboard submission.
Implements polymorphic evaluators (RetrievalEvaluator, ClassificationEvaluator, etc.) that inherit from AbsEvaluator and override compute_metrics() with task-specific logic, enabling metric computation without duplicating evaluation code. Results are serialized to standardized JSON format compatible with leaderboard ingestion.
Provides task-specific metric implementations (NDCG for retrieval, F1 for classification, NMI for clustering) in a single framework, whereas generic evaluation libraries require manual metric selection and implementation per task type.
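The metric families themselves come from standard libraries; the sketch below uses scikit-learn directly to show the kind of computation the classification and clustering evaluators perform. It is illustrative, not MTEB's internal evaluator code.

```python
# Illustrative metric computations (not MTEB internals).
from sklearn.metrics import f1_score, normalized_mutual_info_score

# Classification-style scoring: F1 between true and predicted labels.
y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]
print("F1:", f1_score(y_true, y_pred))

# Clustering-style scoring: NMI between reference labels and cluster ids.
labels, clusters = [0, 0, 1, 1], [1, 1, 0, 0]
print("NMI:", normalized_mutual_info_score(labels, clusters))
```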
caching and performance optimization for embedding computation
Medium confidence: Implements caching mechanisms to avoid recomputing embeddings across multiple evaluation runs, storing embeddings in local cache (typically .cache/mteb_embeddings/) keyed by model name and dataset. Supports batch processing with configurable batch sizes to manage memory usage during encoding. Lazy loading of datasets from Hugging Face Hub with optional local caching reduces network overhead. These optimizations enable faster iteration during model development and reduce API calls for remote models.
Implements transparent embedding caching keyed by model name and dataset, with lazy dataset loading from Hugging Face Hub. Cache is automatically checked before encoding, reducing redundant computation across evaluation runs without requiring explicit cache management.
Reduces evaluation time for iterative model development by caching embeddings, whereas running MTEB without caching requires recomputing embeddings for every evaluation run.
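Because the built-in cache location and behaviour are rated medium confidence above, a user-side wrapper is one way to guarantee memoisation regardless of version; the sketch below is such a wrapper, not MTEB's internal cache.

```python
# User-side caching wrapper (sketch): memoises embeddings per input text so
# repeated runs over the same data skip re-encoding.
import numpy as np

class CachingEncoder:
    def __init__(self, model):
        self.model = model
        self._cache = {}

    def encode(self, sentences, **kwargs):
        missing = [s for s in sentences if s not in self._cache]
        if missing:
            new_embeddings = self.model.encode(missing, **kwargs)
            for text, embedding in zip(missing, new_embeddings):
                self._cache[text] = np.asarray(embedding)
        return np.stack([self._cache[s] for s in sentences])
```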
interactive leaderboard with result visualization and filtering
Medium confidence: Provides a Streamlit-based web application (mteb/leaderboard/app.py) that visualizes benchmark results, enabling filtering by model, task, language, and domain. The leaderboard loads results from JSON files or Hugging Face Hub, renders interactive tables with sortable columns, and generates comparison plots. Benchmark selector UI allows users to switch between MTEB, MTEB-Lite, and RTEB suites. Results can be submitted via model cards and are automatically ingested into the leaderboard.
Implements a Streamlit-based leaderboard with dynamic filtering by model, task, language, and domain, rather than static HTML tables. Supports multiple benchmark suites (MTEB, MTEB-Lite, RTEB) with benchmark selector UI and automatic result ingestion from model cards.
Provides interactive filtering and comparison of embedding models across 112+ languages and 8+ task types in a single interface, whereas most model hubs (Hugging Face) lack task-specific filtering and multilingual evaluation visibility.
model registry and metadata management
Medium confidence: Maintains a centralized model registry (mteb/models.py) with metadata for 100+ embedding models, including model name, organization, parameters, and encoder wrapper type. Models are registered with their encoder class (SentenceTransformerEncoder, InstructionBasedEncoder, etc.) and can be retrieved via get_model() API. Model metadata includes embedding dimension, max sequence length, and instruction support, enabling automatic model selection and configuration.
Implements a centralized model registry with metadata (embedding dimension, max length, instruction support) and encoder wrapper types, enabling automatic model instantiation and configuration without manual setup. Registry is version-controlled in MTEB repository.
Provides a curated registry of 100+ embedding models with standardized metadata and encoder wrappers, whereas Hugging Face Hub requires manual model selection and custom encoder implementation.
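As a sketch, recent releases expose get_model() and get_model_meta() for pulling a registered model and its metadata; the model name and the exact metadata fields are assumptions to verify locally.

```python
# Hedged sketch: look up registry metadata, then instantiate the wrapped model.
import mteb

meta = mteb.get_model_meta("intfloat/multilingual-e5-small")
print(meta)  # registry entry: revision, languages, dimensions, etc.

model = mteb.get_model("intfloat/multilingual-e5-small")
# results = mteb.MTEB(tasks=["Banking77Classification"]).run(model)
```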
dataset loading and preprocessing with hugging face hub integration
Medium confidence: Loads evaluation datasets from Hugging Face Hub (via datasets library) with automatic caching and preprocessing. Tasks define dataset identifiers and splits (train/test), and the framework handles downloading, caching, and formatting data into task-specific structures (queries, documents, labels). Supports lazy loading to minimize memory overhead. Datasets are cached locally to avoid repeated downloads, with optional offline mode for environments without internet access.
Integrates with Hugging Face Hub datasets library for automatic downloading and caching, with task-specific preprocessing logic defined in task classes. Supports lazy loading and offline mode, reducing memory overhead and enabling evaluation in air-gapped environments.
Provides automatic dataset loading and caching from Hugging Face Hub, whereas manual benchmarking requires downloading and preprocessing datasets separately for each task.
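Under the hood this relies on the Hugging Face datasets library; the sketch below shows the equivalent manual steps (download, cache, pick a split). The dataset id and cache path are illustrative, and MTEB tasks pin their own dataset path and revision in task metadata.

```python
# Illustrative only: what the task loaders do via the `datasets` library.
from datasets import load_dataset

ds = load_dataset("mteb/banking77", split="test", cache_dir=".hf_cache")
print(ds[0])

# For air-gapped runs, setting HF_DATASETS_OFFLINE=1 makes `datasets`
# serve everything from the local cache instead of the network.
```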
extensible task creation framework with AbsTask base class
Medium confidence: Provides an AbsTask base class that defines the task interface (load_data(), run(), evaluate() methods) and metadata structure (task name, description, languages, domains). New tasks inherit from AbsTask and override load_data() to define dataset loading and evaluate() to implement task-specific evaluation logic. Task metadata is automatically extracted and used for filtering, result aggregation, and leaderboard organization. This design enables community contributions of new tasks without modifying core evaluation logic.
Implements an extensible task framework with AbsTask base class that defines task interface and metadata structure, enabling community contributions without modifying core evaluation logic. Task metadata is automatically extracted and used for filtering, aggregation, and leaderboard organization.
Enables community-contributed tasks through a standardized interface, whereas monolithic benchmarks require core maintainer involvement for new task additions.
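A simplified illustration of the pattern described above follows; class and method names mirror the idea of an AbsTask-style base class but are not mteb's actual code.

```python
# Simplified illustration of the AbsTask pattern (not mteb's real classes):
# the base class fixes the interface and metadata shape, subclasses supply
# data loading and task-specific evaluation.
from abc import ABC, abstractmethod

class Task(ABC):
    metadata = {}  # name, languages, domains, main score, ...

    @abstractmethod
    def load_data(self):
        """Fetch and format the dataset for this task."""

    @abstractmethod
    def evaluate(self, model):
        """Embed the data with `model` and return task-specific scores."""

class MyClusteringTask(Task):
    metadata = {"name": "MyClusteringTask", "languages": ["eng"], "main_score": "v_measure"}

    def load_data(self):
        self.texts, self.labels = ["a", "b"], [0, 1]  # placeholder data

    def evaluate(self, model):
        embeddings = model.encode(self.texts)
        # ...cluster `embeddings`, compare against self.labels, compute the score...
        return {"v_measure": 0.0}
```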
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts: sharing capabilities
Artifacts that share capabilities with MTEB, ranked by overlap. Discovered automatically through the match graph.
results
Dataset by mteb. 1,039,913 downloads.
multilingual-e5-base
sentence-similarity model by intfloat. 2,931,013 downloads.
multilingual-e5-small
sentence-similarity model by intfloat. 4,995,567 downloads.
gte-multilingual-base
sentence-similarity model by Alibaba-NLP. 2,436,647 downloads.
jina-embeddings-v3
feature-extraction model by jinaai. 2,451,907 downloads.
leaderboard
AI demo on Hugging Face.
Best For
- ✓Embedding model researchers comparing architectural approaches
- ✓ML practitioners selecting production embedding models
- ✓Teams building RAG systems needing multi-task performance validation
- ✓Teams building multilingual search or RAG systems
- ✓Researchers studying cross-lingual transfer in embeddings
- ✓Global companies selecting embedding models for multi-language support
- ✓Model developers submitting results to MTEB leaderboard
- ✓Researchers sharing evaluation results with the community
Known Limitations
- ⚠Task-specific metrics vary by task type — no single unified score across all 8 tasks
- ⚠Evaluation latency scales with model size and dataset size; large models may require GPU acceleration
- ⚠Custom task implementations require understanding AbsTask interface and metric definitions
- ⚠Language coverage varies by task — not all 112 languages available for all 8 task types
- ⚠Cross-lingual evaluation requires paired datasets in multiple languages, limiting task availability
- ⚠No built-in language detection — language must be specified at evaluation time
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Massive Text Embedding Benchmark. Evaluates embedding models across 8 task types (retrieval, classification, clustering, reranking, etc.) in 112 languages. The standard for comparing embedding models. Leaderboard on Hugging Face.
Categories
Alternatives to MTEB
Data Sources