Capability
3 artifacts provide this capability.
Want a personalized recommendation?
Find the best match →via “domain-specific parallel corpus selection and filtering”
Massive parallel corpus for machine translation.
Unique: Curates domain-specific corpora including medical (EMEA 282.5M pairs), patents (EuroPat 252.2M), legal/institutional (Europarl 217.4M, JRC-Acquis 215.9M, DGT 1.2B), and specialized sources (Bible translations 88.3M, Ubuntu documentation) alongside general-domain subtitle and web-crawled data, enabling users to select data by source type and implied domain rather than explicit domain labels.
vs others: Provides access to specialized domain corpora (medical, legal, patents) in a single interface, whereas generic parallel corpus repositories focus on general-domain data; however, lacks explicit domain tagging, quality metrics per domain, and domain-specific preprocessing that specialized MT data providers offer.
via “language-specific-corpus-filtering-and-subset-selection”
Multilingual web corpus covering 101 languages.
Unique: Provides language-partitioned Parquet files enabling efficient columnar filtering without full corpus download. Supports both batch download and streaming APIs, allowing researchers to work with language subsets at different scales (100MB to 300GB) without infrastructure overhead.
vs others: More flexible language selection than OSCAR (which requires manual filtering) and more scalable than downloading Wikipedia dumps per language, with built-in streaming for memory-constrained environments
via “large-scale english text corpus filtering and deduplication”
Google's cleaned Common Crawl corpus used to train T5.
Unique: Uses deterministic heuristic-based filtering (length thresholds, keyword matching, language detection) applied at scale to 750GB of Common Crawl, enabling reproducible dataset creation without learned classifiers; includes sentence-level deduplication to remove redundant training examples
vs others: More transparent and reproducible than learned filtering approaches; larger and more thoroughly deduplicated than raw Common Crawl, but less sophisticated than newer datasets like Fineweb that use neural classifiers for quality scoring
Building an AI tool with “Domain Specific Parallel Corpus Selection And Filtering”?
Submit your artifact →curl unfragile.ai/agents.md | sh© 2026 Unfragile. The platform for software for agents.