Text Classification Dataset Sampling And Filtering

1

FineFineWebDataset23/100

Dataset by m-a-p. 4,59,057 downloads.

Unique: Leverages HuggingFace's native filtering and sampling APIs (via .filter() and .select()) to enable in-memory or streaming-based subset extraction without full corpus download; supports seed-based reproducibility for deterministic splits across experiments

vs others: More flexible than static benchmark datasets (ImageNet, MNIST) because filtering is dynamic and user-defined; faster iteration than manual annotation while maintaining reproducibility through versioned dataset snapshots

2

MINT-1T-PDF-CC-2023-40Dataset23/100

via “document-domain dataset sampling and filtering”

Dataset by mlfoundations. 8,57,357 downloads.

Unique: Provides streaming access with metadata-based filtering on trillion-token dataset without requiring full download, using Hugging Face Datasets infrastructure for efficient subset construction. Enables on-demand domain-specific corpus creation from larger collection.

vs others: More flexible than fixed-size domain datasets (e.g., ArXiv papers, legal documents) by allowing dynamic filtering from larger corpus; more efficient than downloading full dataset for subset access.

3

hd_tmpDataset22/100

via “dataset filtering and sampling for model training and evaluation”

Dataset by ayuo. 14,99,354 downloads.

Unique: Implements lazy filter evaluation using Apache Arrow's predicate pushdown, avoiding full dataset materialization; combines with stratified sampling for balanced subset creation without requiring pre-computed group labels

vs others: More memory-efficient than pandas-style filtering for large datasets, but less expressive than SQL queries for complex multi-condition filtering

4

V7Product

via “dataset-filtering-and-sampling”

Top Matches

Also Known As

Company