Multilingual parallel corpus discovery via searchable index
Massive parallel corpus for machine translation.
Unique: Aggregates and indexes 1,214 distinct corpora from heterogeneous sources (subtitles, EU documents, web crawls, academic sources) into a single searchable interface, so users do not have to visit individual corpus repositories. Tracks versions across releases (e.g., OpenSubtitles v2024 vs. historical versions) and reports each corpus's share, as a percentage, of the full collection of 102.9B sentence pairs.
vs others: Broader coverage (1,214 corpora, 1,005 languages) than single-source alternatives such as OpenSubtitles alone, but lacks the quality filtering, alignment confidence scores, and programmatic API access that commercial MT platforms provide.
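Example: the composition percentage is a corpus's sentence-pair count divided by the collection total. A minimal sketch, assuming a hypothetical helper name `composition_share` and a made-up corpus size; only the 102.9B total comes from the description above.

```python
# Total sentence pairs in the full collection (102.9B, as stated above).
TOTAL_SENTENCE_PAIRS = 102_900_000_000


def composition_share(corpus_pairs: int, total_pairs: int = TOTAL_SENTENCE_PAIRS) -> float:
    """Return a corpus's share of the collection as a percentage."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    if corpus_pairs < 0:
        raise ValueError("corpus_pairs must be non-negative")
    return 100.0 * corpus_pairs / total_pairs


# Hypothetical corpus size chosen for illustration, not a real figure.
example_pairs = 5_145_000_000
print(f"{composition_share(example_pairs):.1f}%")  # prints "5.0%"
```

The helper is not part of any published tool; it only illustrates the arithmetic behind the percentages.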