Expert Verified Question Dataset With Contamination Detection

1

LiveCodeBench (Benchmark, 63/100)

via “contamination-evidence-analysis-and-reporting”

Continuously updated coding benchmark that adds new competitive programming problems over time to limit contamination.

Unique: Provides concrete, evidence-based contamination detection by analyzing performance degradation at model training cutoffs, rather than relying on external audits or data provenance tracking. For example, the 'stark drop in performance on LeetCode problems released since September 2023' observed for DeepSeek models is evidence of contamination that a static benchmark would miss.

vs others: More practical and automated than manual data audits because it uses temporal analysis to detect contamination automatically; more reliable than relying on model developers' claims about training data because it provides empirical evidence.
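The temporal analysis described above can be sketched roughly as follows. This is a minimal illustration, not LiveCodeBench's own code: the function name `pass_rate_by_cutoff` and the sample results are hypothetical, and the September 2023 cutoff is taken from the example in the entry.

```python
from datetime import date
from statistics import mean

def pass_rate_by_cutoff(results, cutoff):
    """Split (release_date, solved) pairs at `cutoff`; return the pass rate on each side."""
    before = [solved for released, solved in results if released < cutoff]
    after = [solved for released, solved in results if released >= cutoff]
    rate = lambda xs: mean(xs) if xs else None
    return rate(before), rate(after)

# Hypothetical per-problem outcomes: (problem release date, 1 if solved else 0)
results = [
    (date(2023, 6, 1), 1),
    (date(2023, 7, 15), 1),
    (date(2023, 8, 20), 1),
    (date(2023, 8, 30), 0),
    (date(2023, 10, 5), 0),
    (date(2023, 11, 12), 1),
    (date(2023, 12, 1), 0),
    (date(2024, 1, 9), 0),
]

before, after = pass_rate_by_cutoff(results, date(2023, 9, 1))
drop = before - after  # a large positive drop suggests older problems were seen in training
```

A sharp drop only suggests contamination; newer problems may also simply be harder, so difficulty should be controlled for before drawing conclusions.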

2

Humanity's Last Exam (Benchmark, 61/100)

via “expert-curated multidisciplinary exam question compilation”

Hardest exam questions from thousands of experts.

Unique: Implements post-hoc contamination mitigation through a formal bug bounty program (03/21/2025) that identified and replaced searchable questions before finalization, addressing a critical gap in benchmark validity that most static benchmarks ignore. The collaborative curation model involves 100+ named contributors from diverse institutions rather than a single lab, creating distributed expertise validation.

vs others: Differs from static benchmarks (MMLU, ARC) by actively removing known contamination via bug bounty rather than assuming training data isolation; differs from rolling benchmarks (HELM) by providing a fixed 2,500-question snapshot with explicit Nature publication (01/28/2026) rather than continuous updates.

3

LiveBench (Benchmark, 61/100)

via “temporal metadata tracking and contamination risk reporting”

Continuously updated contamination-free LLM benchmark.

Unique: Implements comprehensive temporal metadata tracking with automated contamination risk reporting that flags model-question pairs where publication dates precede training cutoffs, providing transparent data leakage assessment.

vs others: Provides explicit contamination risk visibility that static benchmarks lack, enabling researchers to filter results by contamination status and make evidence-based decisions about model comparisons.
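The flagging rule described above (a question published before a model's training cutoff is at risk) might look like the sketch below. The function `flag_contamination_risk`, the model names, and the dates are all hypothetical and are not drawn from LiveBench itself.

```python
from datetime import date

def flag_contamination_risk(questions, model_cutoffs):
    """Return (model, question_id) pairs where the question predates the model's cutoff."""
    flagged = []
    for model, cutoff in model_cutoffs.items():
        for qid, published in questions.items():
            if published < cutoff:
                flagged.append((model, qid))
    return flagged

# Hypothetical metadata: question publication dates and model training cutoffs
questions = {"q1": date(2024, 3, 1), "q2": date(2024, 9, 1)}
model_cutoffs = {"model-a": date(2024, 6, 1), "model-b": date(2023, 12, 1)}

flagged = flag_contamination_risk(questions, model_cutoffs)
```

Results for flagged pairs can then be filtered out or reported separately when comparing models.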

4

GPQA (Repository, 58/100)

via “expert-verified question dataset with contamination detection”

Graduate-level expert QA: unsearchable questions in biology, physics, and chemistry that require deep reasoning.

Unique: Includes a canary string (unique identifier) embedded in each question for detecting data contamination in model training, enabling researchers to identify whether models have memorized benchmark questions. Questions are explicitly verified to be unsearchable via web search, ensuring that high performance requires genuine reasoning rather than information retrieval.

vs others: More rigorous than generic QA benchmarks because questions are expert-written and verified to be unsearchable, whereas many benchmarks (e.g., SQuAD) can be answered by simple web search or pattern matching, making them less useful for evaluating true reasoning ability.
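As a rough illustration of how a canary string is used, the sketch below scans a set of documents for a marker. The `CANARY` value is a placeholder, not GPQA's actual string, and `scan_corpus` is a hypothetical helper.

```python
# Placeholder marker; a real benchmark publishes its own unique canary string.
CANARY = "BENCHMARK-CANARY-00000000-0000-0000-0000-000000000000"

def scan_corpus(documents, canary=CANARY):
    """Return indices of documents that contain the canary string verbatim."""
    return [i for i, doc in enumerate(documents) if canary in doc]

# Hypothetical training documents
documents = [
    "An ordinary web page about chemistry.",
    f"Leaked benchmark dump. {CANARY} Question: ...",
    "Another unrelated document.",
]

hits = scan_corpus(documents)
```

Any hit indicates that benchmark material entered the corpus; the absence of hits does not prove the data is clean, since the canary may have been stripped.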

5

OLMo (Model, 57/100)

via “test set contamination detection via decon”

Allen AI's fully open and transparent language model.

Unique: Dedicated tool (Decon) for detecting test set contamination released as part of training infrastructure, addressing a critical reproducibility issue in language model research. Enables transparent auditing of training data for benchmark overlap, supporting research integrity. Fully reproducible methodology allows verification of contamination detection.

vs others: More transparent than proprietary models (the contamination detection methodology is fully released), but lacks a published analysis of contamination in OLMo training data and offers no comparison to alternative contamination detection approaches.
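One common way to check training data for benchmark overlap is n-gram matching, sketched below. This is a generic illustration and is not a description of how Decon works; the functions `ngrams` and `overlap_ratio` and the choice of n are assumptions.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(train_doc, test_item, n=8):
    """Fraction of the test item's n-grams that also appear in the training document."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_doc, n)) / len(test_grams)

# Hypothetical test item and training documents
test_item = "what is the capital of france"
leaked_doc = "quiz: what is the capital of france answer paris"
clean_doc = "a recipe for sourdough bread with rye flour"

leaked_score = overlap_ratio(leaked_doc, test_item, n=3)
clean_score = overlap_ratio(clean_doc, test_item, n=3)
```

A ratio near 1.0 suggests the test item appears nearly verbatim in the training document; paraphrased leaks would need fuzzier matching.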

6

DS-1000 (Dataset, 57/100)

via “data contamination avoidance through surface-level problem perturbation”

1,000 data science problems across 7 Python libraries.

Unique: Explicitly addresses data contamination risk through controlled perturbations rather than ignoring the problem or using completely synthetic data. Preserves authentic problem semantics and solution logic while modifying surface text, enabling safe evaluation of models trained on web-scale data.

vs others: More practical than synthetic benchmarks because it maintains real-world problem characteristics, while being more rigorous than unperturbed StackOverflow data because it mitigates contamination risks for models trained on web-scale corpora.
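A very simple form of surface perturbation is consistent identifier renaming, sketched below. This is only an assumed illustration of the idea; DS-1000's actual perturbation procedure is not described in this entry, and `perturb_surface` and the sample problem are hypothetical.

```python
import re

def perturb_surface(text, renames):
    """Rename whole-word identifiers in one pass, leaving the problem's logic unchanged."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], text)

# Hypothetical problem statement and rename map
original = "Given DataFrame df with column price, compute df['price'].mean()"
perturbed = perturb_surface(original, {"df": "frame", "price": "cost"})
```

Because the substitution happens in a single pass, a renamed identifier is never rewritten a second time by another rule in the map.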
