Domain Specific Parallel Corpus Selection And Filtering

1. OPUS | Dataset | 59/100

via “domain-specific parallel corpus selection and filtering”

Massive parallel corpus for machine translation.

Unique: Curates domain-specific corpora, including medical (EMEA, 282.5M pairs), patents (EuroPat, 252.2M), legal/institutional (Europarl, 217.4M; JRC-Acquis, 215.9M; DGT, 1.2B), and specialized sources (Bible translations, 88.3M; Ubuntu documentation), alongside general-domain subtitle and web-crawled data. Users select data by source type and implied domain rather than by explicit domain labels.

vs others: Provides access to specialized domain corpora (medical, legal, patents) through a single interface, whereas generic parallel corpus repositories focus on general-domain data. However, it lacks the explicit domain tagging, per-domain quality metrics, and domain-specific preprocessing that specialized MT data providers offer.
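
Because OPUS carries no explicit domain labels, selection by domain in practice means keeping a hand-maintained mapping from corpus name to implied domain. The following is a minimal sketch of that idea, not an OPUS API: the `Corpus` class, the `CATALOG` list, the domain strings, and `select_by_domain` are all hypothetical names, and the sizes are the figures quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Corpus:
    name: str
    domain: str            # implied domain, assigned by hand (not an OPUS label)
    pairs_millions: float  # size as quoted in the listing above


# Hypothetical catalog built from the corpora named in this entry.
CATALOG = [
    Corpus("EMEA", "medical", 282.5),
    Corpus("EuroPat", "patents", 252.2),
    Corpus("Europarl", "legal", 217.4),
    Corpus("JRC-Acquis", "legal", 215.9),
    Corpus("DGT", "legal", 1200.0),
    Corpus("Bible translations", "religious", 88.3),
]


def select_by_domain(catalog: List[Corpus], domain: str,
                     min_pairs_millions: float = 0.0) -> List[str]:
    """Return names of corpora whose implied domain matches, in catalog order."""
    return [
        c.name
        for c in catalog
        if c.domain == domain and c.pairs_millions >= min_pairs_millions
    ]


if __name__ == "__main__":
    print(select_by_domain(CATALOG, "legal"))
    print(select_by_domain(CATALOG, "legal", min_pairs_millions=1000))
```

The mapping is the part that needs maintenance: a source such as Europarl is "legal/institutional" only by convention, so the domain column is an editorial judgment rather than metadata shipped with the data.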

2. mC4 | Dataset | 58/100

via “language-specific-corpus-filtering-and-subset-selection”

Multilingual web corpus covering 101 languages.

Unique: Provides language-partitioned Parquet files, so a single language can be read with columnar filtering without downloading the full corpus. Supports both batch download and streaming APIs, allowing researchers to work with language subsets at different scales (100 MB to 300 GB) without infrastructure overhead.

vs others: More flexible language selection than OSCAR (which requires manual filtering) and more scalable than downloading Wikipedia dumps per language, with built-in streaming for memory-constrained environments.
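
The streaming workflow described above amounts to iterating over one language's records lazily and stopping at a size budget. Here is a rough sketch of that pattern using only the standard library. The `take_until_budget` helper is hypothetical, and the code assumes each record is a dict with a `text` field; any lazy iterable of such records would work as input, for example one obtained from a streaming dataset loader.

```python
from typing import Dict, Iterable, Iterator


def take_until_budget(records: Iterable[Dict[str, str]],
                      max_bytes: int) -> Iterator[Dict[str, str]]:
    """Lazily yield records until adding the next one would exceed max_bytes.

    Size is measured as the UTF-8 length of the assumed "text" field.
    """
    used = 0
    for record in records:
        size = len(record["text"].encode("utf-8"))
        if used + size > max_bytes:
            break
        used += size
        yield record


if __name__ == "__main__":
    # Stand-in for a streamed language subset; real records would come
    # from the dataset's streaming interface.
    fake_stream = ({"text": f"Beispielsatz Nummer {i}."} for i in range(1_000_000))
    sample = list(take_until_budget(fake_stream, max_bytes=200))
    print(len(sample), "records kept")
```

Since the function is a generator, nothing beyond the records actually consumed is held in memory, which is the property that matters in memory-constrained environments.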

3. C4 (Colossal Clean Crawled Corpus) | Dataset | 57/100

via “large-scale english text corpus filtering and deduplication”

Google's cleaned Common Crawl corpus used to train T5.

Unique: Uses deterministic, heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to Common Crawl, yielding roughly 750 GB of text and enabling reproducible dataset creation without learned classifiers. Includes sentence-level deduplication to remove redundant training examples.

vs others: More transparent and reproducible than learned filtering approaches; cleaner and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets such as FineWeb that use neural classifiers for quality scoring.
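
To make the idea of deterministic heuristic filtering concrete, here is a simplified sketch in the spirit of the rules listed above. It is not the actual C4 pipeline: the thresholds, the `BLOCKLIST` contents, and the function names `clean_page` and `dedup_lines` are illustrative assumptions, language detection is omitted, and deduplication is done per line rather than with whatever span the original pipeline uses.

```python
import hashlib
from typing import Iterable, Iterator, Optional

TERMINAL = (".", "!", "?", '"')             # a line must end like a sentence
BLOCKLIST = ("lorem ipsum", "javascript")   # illustrative keyword filter


def clean_page(text: str, min_words: int = 3,
               min_lines: int = 3) -> Optional[str]:
    """Keep sentence-like lines; return None if too little text survives."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL):
            continue                         # menus, headings, fragments
        if len(line.split()) < min_words:
            continue                         # length threshold
        if any(term in line.lower() for term in BLOCKLIST):
            continue                         # keyword matching
        kept.append(line)
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


def dedup_lines(pages: Iterable[str]) -> Iterator[str]:
    """Drop any line already seen in this or an earlier page (case-insensitive)."""
    seen = set()
    for page in pages:
        unique = []
        for line in page.splitlines():
            digest = hashlib.sha1(line.lower().encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(line)
        if unique:
            yield "\n".join(unique)


if __name__ == "__main__":
    raw = ("Menu\nThis is a full sentence.\nAnother complete sentence here.\n"
           "Enable javascript to continue reading.\nA third good sentence follows.")
    print(clean_page(raw))
```

Every rule here is a fixed predicate, so running the same code on the same crawl snapshot gives the same output. That determinism is the reproducibility advantage the comparison refers to, and it is also why such filters cannot judge quality beyond surface features.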
