Capability
2 artifacts provide this capability.
via “standardized multiple-choice evaluation harness”
7.8K science questions testing genuine reasoning, not just recall.
Unique: Provides a clean, standardized multiple-choice format with unique question identifiers and consistent answer-choice ordering, so it can be loaded directly by evaluation frameworks such as lm-eval, vLLM's evaluation suite, and Hugging Face's evaluation harness without custom parsing or normalization.
vs others: More standardized than ad-hoc science QA datasets because it enforces consistent formatting, and more reproducible than datasets whose question structure or number of answer choices varies.
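As a rough illustration of why a fixed format with stable identifiers helps, the sketch below scores a predictor against multiple-choice items by exact label match. The `MCQuestion` schema, its field names, and the `accuracy` helper are hypothetical and chosen for this example; they are not the dataset's actual fields or any framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class MCQuestion:
    """One multiple-choice item (hypothetical schema for illustration)."""
    question_id: str            # stable identifier, useful for per-item comparison
    question: str
    choices: Tuple[str, ...]    # answer texts in a fixed order
    labels: Tuple[str, ...]     # labels aligned with choices, e.g. ("A", "B")
    answer: str                 # gold label


def accuracy(items: Sequence[MCQuestion],
             predict: Callable[[MCQuestion], str]) -> float:
    """Fraction of items where the predicted label equals the gold label."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer)
    return correct / len(items)


# Illustrative items written for this sketch, not taken from any dataset.
EXAMPLE_ITEMS = [
    MCQuestion("demo-1", "Which state of matter has a fixed shape?",
               ("solid", "gas"), ("A", "B"), "A"),
    MCQuestion("demo-2", "Which of these is a source of light?",
               ("a rock", "the Sun"), ("A", "B"), "B"),
]
```

Because every item carries the same fields, a trivial baseline such as `accuracy(EXAMPLE_ITEMS, lambda item: "A")` runs without any per-item parsing.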
via “adversarial-filtered multiple-choice evaluation”
70K commonsense reasoning questions with adversarial distractors.
Unique: Uses adversarial filtering: distractors are selected by measured model confusion rather than by human-judged plausibility, so the dataset targets machine weaknesses while staying interpretable to humans. The two-stage approach (LLM generation followed by human validation) scales better than purely human-written distractors and yields higher-quality negatives than random sampling.
vs others: Harder than its predecessor SWAG because distractors are adversarially selected for model confusion; more human-aligned than synthetic reasoning datasets because the 95.6% human accuracy confirms that questions that are hard for models remain easy for humans.
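The selection step behind adversarial filtering can be sketched as a single round that keeps the candidate endings a model finds most plausible. This is a simplified illustration only: the published procedure is described as iterative, and the function names, the `k` parameter, and the word-overlap scorer below are assumptions standing in for a real discriminator.

```python
from typing import Callable, List, Sequence


def select_adversarial_distractors(
    context: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    k: int = 3,
) -> List[str]:
    """Keep the k wrong endings that the scoring model rates most plausible.

    `score(context, ending)` is a hypothetical stand-in for a model's
    plausibility estimate; higher means the model is more easily fooled.
    """
    ranked = sorted(candidates, key=lambda c: score(context, c), reverse=True)
    return ranked[:k]


def toy_overlap_score(context: str, ending: str) -> float:
    """Toy scorer (assumption): count distinct words shared with the context."""
    context_words = set(context.lower().split())
    ending_words = set(ending.lower().split())
    return float(len(context_words & ending_words))
```

With a real model in place of `toy_overlap_score`, the endings that survive are those the model confuses with the correct answer, which is what makes the resulting items hard for machines.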
© 2026 Unfragile. The platform for software for agents.