via “domain-specific parallel corpus selection and filtering”
Massive parallel corpus for machine translation.
Unique: Curates domain-specific corpora including medical (EMEA 282.5M pairs), patents (EuroPat 252.2M), legal/institutional (Europarl 217.4M, JRC-Acquis 215.9M, DGT 1.2B), and specialized sources (Bible translations 88.3M, Ubuntu documentation) alongside general-domain subtitle and web-crawled data, enabling users to select data by source type and implied domain rather than explicit domain labels.
vs others: Provides access to specialized domain corpora (medical, legal, patents) in a single interface, whereas generic parallel corpus repositories focus on general-domain data; however, lacks explicit domain tagging, quality metrics per domain, and domain-specific preprocessing that specialized MT data providers offer.