via “bulk parallel corpus download with source-specific formatting”
Massive parallel corpus for machine translation.
Unique: Aggregates 1,214 distinct corpora, each with its own source and format, behind a single download interface. Coverage spans subtitle data (OpenSubtitles, 27.2B pairs), institutional documents (EU Europarl, 217.4M; DGT, 1.2B), web-crawled data (CCMatrix, 17.1B; ParaCrawl, 4.6B), and domain-specific corpora (medical EMEA, 282.5M; patents EuroPat, 252.2M). Version history is maintained with release tracking (e.g., OpenSubtitles v2024, released 2025-02-14).
vs others: Provides access to 102.9B sentence pairs across 1,005 languages through a single interface, whereas individual corpus repositories require visiting multiple sites. However, it lacks the programmatic API access, quality filtering, and explicit licensing documentation that commercial MT data providers offer.
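Because each corpus keeps its source-specific format, downloaded files usually need a small normalization step before use. The sketch below is illustrative only: it assumes two common layouts for parallel text (a pair of line-aligned plain-text files, and a single tab-separated file) that may not match what a given corpus actually ships, and the function names `read_line_aligned` and `read_tsv` are hypothetical helpers, not part of any published tool.

```python
import csv
import io
from typing import Iterator, TextIO, Tuple

Pair = Tuple[str, str]


def read_line_aligned(src: TextIO, tgt: TextIO) -> Iterator[Pair]:
    """Yield (source, target) pairs from two files aligned line by line.

    Assumption: line i of `src` translates line i of `tgt`.
    Pairs with an empty side are skipped.
    """
    for s, t in zip(src, tgt):
        s, t = s.strip(), t.strip()
        if s and t:
            yield s, t


def read_tsv(handle: TextIO) -> Iterator[Pair]:
    """Yield (source, target) pairs from a tab-separated file.

    Assumption: column 0 is the source sentence, column 1 the target.
    Quoting is disabled so quote characters in sentences are kept verbatim.
    """
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 2 and row[0].strip() and row[1].strip():
            yield row[0].strip(), row[1].strip()


# In-memory stand-ins for downloaded files (made-up sample sentences).
aligned_pairs = list(
    read_line_aligned(
        io.StringIO("Hello\nGood morning\n"),
        io.StringIO("Hallo\nGuten Morgen\n"),
    )
)
tsv_pairs = list(read_tsv(io.StringIO("Hello\tHallo\n\tempty source\n")))
```

Both readers return the same `(source, target)` tuple shape, so downstream code does not need to know which layout a corpus used.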