OPUS
Dataset · Free
Massive parallel corpus for machine translation.
Capabilities (11 decomposed)
multilingual parallel corpus discovery via searchable index
Medium confidence: Provides a web-based search interface that queries a database index across 1,214 distinct parallel corpora spanning 1,005 languages, allowing users to filter by language pair and corpus type to identify relevant training data. The discovery system aggregates metadata (sentence pair counts, corpus source, release dates) from heterogeneous sources including subtitles, institutional documents, and web crawls, presenting results ranked by corpus size and relevance.
Aggregates and indexes 1,214 distinct corpora from heterogeneous sources (subtitles, EU documents, web crawls, academic sources) into a unified searchable interface, rather than requiring users to visit individual corpus repositories. Maintains version tracking across releases (e.g., OpenSubtitles v2024 vs historical versions) and exposes corpus composition percentages relative to the full 102.9B sentence pair collection.
Broader corpus coverage (1,214 corpora, 1,005 languages) than single-source alternatives like OpenSubtitles alone, but lacks the quality filtering, alignment confidence scores, and API-based programmatic access that commercial MT platforms provide.
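Since OPUS itself exposes no programmatic search API, pair-based discovery of the kind described above can be approximated locally once corpus metadata is cached. The sketch below is illustrative only: the record fields and the Tatoeba size are assumptions, not OPUS's actual schema.

```python
# Hypothetical corpus metadata records, loosely modeled on the fields the
# OPUS index displays (name, covered languages, sentence-pair count).
CORPORA = [
    {"name": "OpenSubtitles", "langs": {"en", "fr", "de"}, "pairs": 27_200_000_000},
    {"name": "EMEA",          "langs": {"en", "de"},       "pairs": 282_500_000},
    {"name": "Tatoeba",       "langs": {"en", "fr", "ja"}, "pairs": 10_000_000},
]

def find_corpora(src: str, tgt: str, min_pairs: int = 0):
    """Return corpora covering the (src, tgt) pair, largest first."""
    hits = [c for c in CORPORA
            if {src, tgt} <= c["langs"] and c["pairs"] >= min_pairs]
    return sorted(hits, key=lambda c: c["pairs"], reverse=True)

print([c["name"] for c in find_corpora("en", "fr")])
# → ['OpenSubtitles', 'Tatoeba']
```

Ranking by pair count mirrors how the web interface orders results; a `min_pairs` threshold stands in for the missing quality filters.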
bulk parallel corpus download with source-specific formatting
Medium confidence: Enables download of aligned sentence pairs from selected corpora in their native format, aggregating data from 102.9 billion total sentence pairs across sources like OpenSubtitles (27.2B), NLLB (22.7B), CCMatrix (17.1B), and 1,209 additional corpora. Downloads are organized hierarchically by corpus and language pair, with file formats and encoding specifications determined by the source corpus (format specifications not explicitly documented in available materials).
Aggregates downloads from 1,214 distinct corpora with heterogeneous sources and formats into a unified interface, allowing single-point access to subtitle data (OpenSubtitles 27.2B pairs), institutional documents (EU Europarl 217.4M, DGT 1.2B), web-crawled data (CCMatrix 17.1B, ParaCrawl 4.6B), and domain-specific corpora (medical EMEA 282.5M, patents EuroPat 252.2M). Maintains version history with release tracking (e.g., OpenSubtitles v2024 released 2025-02-14).
Provides access to 102.9B sentence pairs across 1,005 languages in a single interface, whereas alternatives like individual corpus repositories require visiting multiple sites; however, lacks programmatic API access, quality filtering, and explicit licensing documentation that commercial MT data providers offer.
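Because downloads are organized hierarchically by corpus, release, and language pair, per-pair archive URLs can typically be assembled from those three components. The template below is an assumption about OPUS's file layout, not documented behavior; verify the actual path on the corpus's download page before relying on it.

```python
# Sketch of assembling a per-pair download URL from corpus, release, and
# language pair. BASE and the path template are assumptions — check the
# corpus's own download page for the real layout.
BASE = "https://object.pouta.csc.fi"  # assumed OPUS storage host

def moses_url(corpus: str, version: str, src: str, tgt: str) -> str:
    """Build a Moses-format (plain-text pair) archive URL for one language pair."""
    pair = "-".join(sorted([src, tgt]))  # pair codes are typically alphabetical
    return f"{BASE}/OPUS-{corpus}/{version}/moses/{pair}.txt.zip"

print(moses_url("EMEA", "v3", "fr", "en"))
# → https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-fr.txt.zip
```

The community-maintained `opustools` package (its `opus_read` tool) automates fetching and reading OPUS releases; treat specific version strings there as assumptions to verify as well.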
specialized domain corpus access (medical, patents, bible)
Medium confidence: Provides access to specialized domain-specific parallel corpora including EMEA (medical, 282.5M pairs), EuroPat (patents, 252.2M), and Bible translations (88.3M), enabling translation systems tuned to the terminology and language patterns of those domains. These corpora are sourced from authoritative domain documents and support building translation systems for vertical markets.
Aggregates specialized corpora including EMEA (medical, 282.5M pairs), EuroPat (patents, 252.2M), and Bible translations (88.3M), providing domain-specific parallel data for vertical markets. While small relative to general-domain corpora, these sources enable training translation systems on the terminology and language patterns of their domains.
Provides centralized access to specialized domain corpora in a single interface, whereas accessing these sources individually requires visiting domain-specific repositories; however, limited domain coverage (only medical, patents, Bible) and small corpus sizes mean specialized MT platforms with broader domain coverage and larger domain-specific datasets are more suitable for most vertical markets.
domain-specific parallel corpus selection and filtering
Medium confidence: Enables users to identify and download parallel corpora organized by domain and source type, including subtitle-based data (OpenSubtitles, TED talks), institutional/legal documents (EU Europarl, JRC-Acquis, DGT), web-crawled general-domain data (CCMatrix, ParaCrawl, WikiMatrix), and specialized corpora (medical EMEA, patents EuroPat, Bible translations). The collection exposes corpus composition metadata allowing users to understand source characteristics and select data matching their domain requirements.
Curates domain-specific corpora including medical (EMEA 282.5M pairs), patents (EuroPat 252.2M), legal/institutional (Europarl 217.4M, JRC-Acquis 215.9M, DGT 1.2B), and specialized sources (Bible translations 88.3M, Ubuntu documentation) alongside general-domain subtitle and web-crawled data, enabling users to select data by source type and implied domain rather than explicit domain labels.
Provides access to specialized domain corpora (medical, legal, patents) in a single interface, whereas generic parallel corpus repositories focus on general-domain data; however, lacks explicit domain tagging, quality metrics per domain, and domain-specific preprocessing that specialized MT data providers offer.
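Since OPUS attaches no explicit domain labels, selection "by implied domain" in practice means maintaining your own mapping from corpus name to domain. A minimal sketch, where the tag assignments are assumptions inferred from the corpus sources named above:

```python
# Illustrative domain tags (OPUS itself exposes no domain labels; these
# assignments are assumptions inferred from each corpus's source).
DOMAIN_TAGS = {
    "EMEA": "medical",
    "EuroPat": "patents",
    "Europarl": "legal",
    "JRC-Acquis": "legal",
    "DGT": "legal",
    "OpenSubtitles": "conversational",
    "CCMatrix": "general-web",
}

def by_domain(domain: str):
    """Corpus names tagged with the given domain, alphabetically."""
    return sorted(name for name, d in DOMAIN_TAGS.items() if d == domain)

print(by_domain("legal"))
# → ['DGT', 'Europarl', 'JRC-Acquis']
```

A hand-maintained table like this is brittle, but it is the only route to domain filtering until OPUS tags corpora itself.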
multilingual corpus composition analysis and statistics
Medium confidence: Exposes corpus-level metadata including total sentence pair counts, percentage of collection, source type, and release dates, enabling users to understand the composition and scale of available parallel data. Provides aggregate statistics showing that top 10 corpora account for ~93.5% of total data, with detailed breakdowns for major sources (OpenSubtitles 27.2B/26.47%, NLLB 22.7B/22.09%, CCMatrix 17.1B/16.61%, ParaCrawl 4.6B/4.50%).
Aggregates and exposes composition statistics across 1,214 corpora totaling 102.9B sentence pairs, showing that top 10 corpora represent ~93.5% of data and identifying the long tail of 1,200+ corpora with minimal coverage. Provides per-corpus metadata (sentence pair counts, percentages, release dates) enabling data-driven selection, rather than requiring users to assess corpus sizes individually.
Offers transparent composition statistics across a large aggregated collection, whereas individual corpus repositories provide only their own metrics; however, lacks per-language-pair breakdowns, quality-weighted statistics, and temporal trend analysis that research-focused data platforms provide.
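The composition percentages above can be recomputed directly from the quoted corpus sizes. The sketch below does so; small differences versus the quoted percentages (e.g. 26.43 vs 26.47 for OpenSubtitles) arise because the sizes themselves are rounded to one decimal of a billion.

```python
# Recompute composition shares from the rounded sentence-pair counts
# quoted in this listing; TOTAL is the full-collection figure of 102.9B.
TOTAL = 102.9e9
SIZES = {
    "OpenSubtitles": 27.2e9,
    "NLLB": 22.7e9,
    "CCMatrix": 17.1e9,
    "ParaCrawl": 4.6e9,
}

shares = {name: 100 * n / TOTAL for name, n in SIZES.items()}
top3 = sum(sorted(shares.values(), reverse=True)[:3])

print({k: round(v, 2) for k, v in shares.items()})
print(round(top3, 2))  # cumulative share of the top-3 corpora (~65%)
```

The top-3 cumulative share lands within rounding error of the 65.17% figure cited elsewhere in this listing for corpus skew.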
version-tracked corpus releases with historical access
Medium confidence: Maintains version history for major corpora with explicit release dates, enabling users to access specific versions for reproducibility and comparative analysis. Tracks releases including OpenSubtitles v2024 (released 2025-02-14), HPLT and MultiHPLT v2 (released 2025-01-25), and historical versions back to 2017, allowing researchers to reproduce results with the same data version used in prior work.
Explicitly tracks and maintains version history for major corpora with release dates (e.g., OpenSubtitles v2024 released 2025-02-14, HPLT v2 released 2025-01-25), enabling reproducible research and comparative analysis across versions. Provides historical access to corpus versions dating back to 2017, rather than only offering the latest version.
Enables version-based reproducibility for major corpora, whereas many corpus repositories only provide the latest version; however, lacks detailed changelogs, automated version management, and integration with ML experiment tracking tools that research platforms like Hugging Face Datasets provide.
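In the absence of automated version management, reproducibility in practice comes down to pinning release strings in your own data manifest. A minimal sketch; the manifest convention is illustrative (not something OPUS defines), and the pinned releases are the ones quoted above.

```python
# A minimal version-pinning manifest for reproducible data builds.
# The releases listed are those quoted in this listing; the manifest
# format itself is an illustrative convention, not an OPUS feature.
PINNED = {
    "OpenSubtitles": "v2024",  # released 2025-02-14
    "HPLT": "v2",              # released 2025-01-25
}

def resolve(corpus: str, default: str = "latest") -> str:
    """Return the pinned release for a corpus, falling back to `default`."""
    return PINNED.get(corpus, default)

print(resolve("OpenSubtitles"))  # → v2024
print(resolve("ParaCrawl"))      # → latest
```

Checking the manifest into the same repository as training code keeps the data version alongside the experiment that used it.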
low-resource and rare language pair data aggregation
Medium confidence: Aggregates parallel data for 1,005 languages including low-resource and endangered languages, though with highly uneven coverage. Provides access to specialized multilingual corpora (MultiHPLT 2.7B pairs, MultiParaCrawl 2.8B, MultiCCAligned 2.4B) designed to cover broader language sets, alongside language-specific corpora for rare pairs. However, the long tail of 1,200+ corpora with minimal coverage means many language pairs have severely limited data.
Aggregates data for 1,005 languages including low-resource and endangered languages, with specialized multilingual corpora (MultiHPLT 2.7B, MultiParaCrawl 2.8B, MultiCCAligned 2.4B) designed to provide broader language coverage. However, coverage is highly uneven with top 3 corpora representing 65.17% of data, meaning most rare language pairs have minimal or zero coverage.
Provides access to 1,005 languages in a single interface, whereas most MT platforms focus on high-resource pairs; however, the uneven distribution and lack of explicit language pair availability matrix make it difficult to assess coverage for specific rare pairs, and data quality for low-resource languages is undocumented.
institutional and legal document parallel corpus access
Medium confidence: Provides access to large-scale institutional and legal parallel corpora sourced from EU documents and similar official sources, including Europarl (217.4M pairs), JRC-Acquis (215.9M), and DGT (1.2B). These corpora contain formal, high-quality aligned sentence pairs from official multilingual documents, suitable for training translation systems on institutional and legal language.
Aggregates large-scale institutional and legal parallel corpora from EU sources (Europarl 217.4M, JRC-Acquis 215.9M, DGT 1.2B) providing high-quality formal language data from official multilingual documents. DGT corpus alone (1.2B pairs) represents 1.17% of total OPUS collection, making institutional data a significant component of the aggregation.
Provides centralized access to EU institutional corpora in a single interface, whereas accessing these sources individually requires navigating multiple government and institutional repositories; however, lacks domain-specific filtering, quality metrics, and documentation of preprocessing applied to institutional documents.
web-crawled general-domain parallel corpus aggregation
Medium confidence: Aggregates large-scale web-crawled general-domain parallel corpora including CCMatrix (17.1B pairs, 16.61% of collection), ParaCrawl (4.6B pairs, 4.50%), and WikiMatrix (933.6M pairs), providing broad-coverage training data sourced from web documents and Wikipedia. These corpora enable training of general-purpose translation systems covering diverse topics and language styles extracted from web sources.
Aggregates CCMatrix (17.1B pairs, 16.61% of collection), ParaCrawl (4.6B pairs, 4.50%), and WikiMatrix (933.6M pairs) providing 22.6B+ web-crawled and Wikipedia-based parallel sentences. CCMatrix alone is the third-largest corpus in OPUS, making web-crawled data a dominant component of the aggregation alongside subtitles and institutional sources.
Provides centralized access to multiple large-scale web-crawled corpora in a single interface, whereas accessing these sources individually requires visiting separate repositories; however, lacks quality filtering, deduplication across sources, and documentation of alignment confidence that specialized MT data providers offer.
subtitle-based conversational language parallel corpus access
Medium confidence: Provides access to large-scale subtitle-based parallel corpora including OpenSubtitles (27.2B pairs, 26.47% of collection), TED2020 (153.1M), and NeuLab-TedTalks (79.7M), sourced from movie and TV subtitles and transcribed talks. These corpora contain conversational, informal language suitable for training translation systems on spoken language, dialogue, and informal registers.
Aggregates OpenSubtitles (27.2B pairs, 26.47% of collection — the single largest corpus in OPUS), TED2020 (153.1M), and NeuLab-TedTalks (79.7M), providing 27.4B+ subtitle and talk-based parallel sentences. OpenSubtitles alone represents over one-quarter of the entire OPUS collection, making subtitle-based data the dominant component.
Provides centralized access to the world's largest subtitle corpus (OpenSubtitles) alongside other talk-based data in a single interface, whereas accessing OpenSubtitles individually requires visiting its dedicated repository; however, lacks quality filtering, OCR error detection, and formal language alternatives that specialized MT platforms offer.
nllb multilingual machine translation training data access
Medium confidence: Provides access to the NLLB (No Language Left Behind) corpus containing 22.7B aligned sentence pairs (22.09% of OPUS collection), a large-scale multilingual dataset created by Meta for training translation models covering 200+ languages. The NLLB corpus represents a significant component of OPUS and enables training of multilingual translation systems with broad language coverage.
Aggregates the NLLB (No Language Left Behind) corpus containing 22.7B aligned sentence pairs (22.09% of OPUS collection), a large-scale multilingual dataset created by Meta for training translation models covering 200+ languages. NLLB is the second-largest corpus in OPUS after OpenSubtitles, making it a primary source for multilingual training data.
Provides centralized access to Meta's NLLB multilingual dataset in a single interface, whereas accessing NLLB directly requires navigating Meta's repositories; however, lacks documentation of language coverage, preprocessing methodology, and integration with other multilingual corpora that Meta's official NLLB documentation provides.
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with OPUS, ranked by overlap. Discovered automatically through the match graph.
mC4
Multilingual web corpus covering 101 languages.
paraphrase-multilingual-MiniLM-L12-v2
sentence-similarity model. 43,947,771 downloads.
Meilisearch
Lightning-fast search engine with vector search.
gte-multilingual-base
sentence-similarity model. 2,453,432 downloads.
Chat with Docs
Transform documents into interactive, conversational...
multilingual-e5-base
sentence-similarity model. 3,660,082 downloads.
Best For
- ✓machine translation researchers evaluating data availability
- ✓NLP practitioners building translation systems for specific language pairs
- ✓linguists conducting multilingual corpus analysis
- ✓organizations assessing data coverage before committing to MT projects
- ✓machine translation researchers training custom models
- ✓organizations building translation systems with specific domain requirements
- ✓NLP practitioners needing large-scale parallel data for fine-tuning
- ✓academic teams conducting multilingual NLP research
Known Limitations
- ⚠Language pair availability is sparse — 1,005 languages but only 1,214 corpora total means many language pairs have zero or minimal coverage
- ⚠No explicit language pair availability matrix provided — users must search individually to determine if a specific pair exists
- ⚠Search interface does not expose alignment confidence scores, quality metrics, or preprocessing applied to sentence pairs
- ⚠Cannot filter by data quality, domain, or temporal characteristics — only by corpus name and language pair
- ⚠File format specifications not documented — unclear whether downloads are provided as parallel files, TMX, Moses format, or plain text
- ⚠No API or programmatic download interface documented — appears to require manual web interface interaction
Requirements
Input / Output
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
About
Open parallel corpus collection containing billions of aligned sentences across hundreds of language pairs sourced from subtitles, EU documents, and web crawls, serving as the foundation for machine translation research.
Categories
Alternatives to OPUS