BIG-Bench Hard (BBH)
Dataset (Free): The 23 hardest BIG-Bench tasks, where models initially failed.
Capabilities (12 decomposed)
chain-of-thought reasoning evaluation with few-shot examples
Medium confidence: Provides curated few-shot chain-of-thought (CoT) exemplars for 23 hard reasoning tasks, enabling models to learn structured step-by-step problem decomposition through in-context learning. Each task includes 3 hand-crafted examples showing intermediate reasoning steps, allowing models to adopt explicit reasoning patterns without fine-tuning. The dataset leverages prompt engineering patterns where models observe reasoning trajectories before solving novel instances.
Curated subset specifically filtered to tasks where prior model performance fell below the average human-rater score, creating a hard-mode benchmark rather than a balanced difficulty distribution. This selection strategy focuses evaluation on frontier model improvements rather than general capability assessment.
Harder and more reasoning-focused than general benchmarks like MMLU or HellaSwag; includes explicit CoT examples unlike raw BIG-Bench, making it more suitable for prompt engineering evaluation than raw task suites.
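A minimal sketch of how the exemplars are used in practice: prepend the task's worked examples to each new question and let the model continue the reasoning. The exemplar text below is illustrative and paraphrased, not a verbatim BBH prompt; the real hand-crafted exemplars ship with the benchmark.

```python
# Assemble a chain-of-thought prompt for one BBH-style task.
COT_EXEMPLARS = """Q: not ( True ) and ( True ) is
A: Let's think step by step.
not ( True ) is False. False and ( True ) is False. So the answer is False."""

def build_cot_prompt(question: str, exemplars: str = COT_EXEMPLARS) -> str:
    """Prepend the task's worked examples, then pose the new question."""
    return f"{exemplars.strip()}\n\nQ: {question}\nA: Let's think step by step.\n"

prompt = build_cot_prompt("True and not not ( not False ) is")
print(prompt)  # feed this string to any completion-style model
```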
multi-domain reasoning task stratification
Medium confidence: Organizes 23 tasks across distinct reasoning domains (algorithmic, arithmetic, logical, causal, spatial) with consistent evaluation structure, enabling fine-grained analysis of model strengths and weaknesses by reasoning type. Each task is independently evaluable with its own test set and metrics, allowing researchers to identify which reasoning modalities their models excel or fail at. The stratification enables targeted model development and capability analysis.
Explicitly stratifies tasks by reasoning modality (algorithmic, arithmetic, logical, causal, spatial) rather than treating all hard tasks as monolithic, enabling domain-specific capability assessment. This structure allows researchers to correlate model architecture choices with specific reasoning strengths.
More analytically useful than generic hard task collections because stratification enables root-cause analysis of reasoning failures; more focused than full BIG-Bench which lacks explicit domain organization.
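A sketch of what this stratified analysis can look like, assuming per-task accuracies have already been computed. The task names are real BBH task identifiers, but the task-to-domain mapping is an illustrative grouping of my own, not an official taxonomy.

```python
# Group per-task accuracies into reasoning-domain buckets.
from collections import defaultdict
from statistics import mean

TASK_DOMAIN = {                      # illustrative mapping, not official
    "boolean_expressions": "logical",
    "logical_deduction_five_objects": "logical",
    "multistep_arithmetic_two": "arithmetic",
    "word_sorting": "algorithmic",
    "dyck_languages": "algorithmic",
    "causal_judgement": "causal",
    "navigate": "spatial",
    "geometric_shapes": "spatial",
}

def accuracy_by_domain(task_accuracy: dict) -> dict:
    buckets = defaultdict(list)
    for task, acc in task_accuracy.items():
        buckets[TASK_DOMAIN.get(task, "other")].append(acc)
    return {domain: mean(accs) for domain, accs in buckets.items()}

# Example with made-up numbers, purely to show the output shape:
print(accuracy_by_domain({"word_sorting": 0.71, "navigate": 0.58, "causal_judgement": 0.62}))
```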
frontier model capability benchmarking
Medium confidence: Designed specifically to evaluate frontier language models (GPT-4, Claude, Llama 2+, etc.) on hard reasoning tasks where initial model performance was below human level, enabling measurement of model improvement over time and comparison of frontier model capabilities. The dataset enables researchers to track whether new model releases improve on hard reasoning and to identify reasoning capabilities that remain unsolved. Results are directly comparable across models because of standardized evaluation infrastructure.
Explicitly designed for frontier model evaluation by selecting tasks where initial models underperformed humans, creating a benchmark that remains challenging as models improve. This selection strategy ensures the benchmark is useful for measuring frontier model progress rather than becoming trivial.
More suitable for frontier model evaluation than general benchmarks because it focuses on hard reasoning tasks; more challenging than benchmarks where models already exceed human performance, which may not drive model improvement.
reproducible model evaluation and result comparison
Medium confidence: Enables reproducible evaluation across different models and research groups by providing standardized task definitions, test sets, evaluation metrics, and result aggregation. The dataset structure ensures that different teams can run identical evaluations and compare results directly, reducing evaluation variance and enabling fair model comparison. Standardized evaluation infrastructure supports publishing reproducible results and enables meta-analysis across multiple model evaluations.
Provides standardized evaluation infrastructure that enables reproducible results across different models and research groups, reducing evaluation variance and enabling fair model comparison. The dataset structure enforces consistent task definitions and metrics.
More reproducible than ad-hoc evaluation because it enforces standardized task definitions and metrics; more comparable than benchmarks without standardized infrastructure because it enables direct result comparison across models.
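One concrete ingredient of that reproducibility is a fixed answer-extraction and scoring rule. A minimal sketch, assuming completions follow the usual "So the answer is ..." CoT convention; the regex is my own convenience, not part of the benchmark.

```python
import re

def extract_answer(completion: str) -> str:
    """Take the text after the last 'the answer is' marker, else the last line."""
    matches = re.findall(r"the answer is\s*(.+)", completion, flags=re.IGNORECASE)
    candidate = matches[-1] if matches else completion.strip().splitlines()[-1]
    return candidate.strip().strip(".").strip()

def exact_match(completion: str, target: str) -> bool:
    return extract_answer(completion).lower() == target.strip().lower()

assert exact_match("not True is False. So the answer is False.", "False")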
human-baseline performance anchoring
Medium confidence: Includes human-rater performance data for all 23 tasks, establishing ground-truth difficulty calibration and enabling measurement of model-vs-human performance gaps. Tasks were specifically selected where initial model performance fell below the average human-rater score, creating a calibrated hard benchmark. Human baselines enable researchers to quantify progress toward human-level reasoning and identify tasks where models have surpassed human performance.
Explicitly selected tasks where models underperformed humans at time of curation, creating a self-calibrated hard benchmark where human performance is the reference point rather than an afterthought. This selection strategy ensures the benchmark remains challenging as models improve.
More rigorous than benchmarks without human baselines because it enables quantitative model-vs-human comparison; more meaningful than benchmarks where humans outperform models by large margins, which may indicate task misalignment rather than genuine reasoning difficulty.
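A sketch of the kind of gap report this anchoring enables. The numbers below are placeholders only; the real per-task human-rater scores are published alongside the benchmark and would be substituted in.

```python
def human_gap(model_acc: dict, human_acc: dict) -> dict:
    """Positive values mean the model is still below the human baseline."""
    return {task: human_acc[task] - model_acc[task]
            for task in model_acc if task in human_acc}

gaps = human_gap(
    model_acc={"causal_judgement": 0.60, "word_sorting": 0.75},  # placeholder values
    human_acc={"causal_judgement": 0.70, "word_sorting": 0.63},  # placeholder values
)
for task, gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{task}: {'behind' if gap > 0 else 'ahead of'} humans by {abs(gap):.2f}")
```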
standardized multi-task evaluation harness
Medium confidence: Provides consistent evaluation infrastructure across 23 heterogeneous reasoning tasks with unified input/output schemas, metrics computation, and result aggregation. Each task includes standardized test sets, answer formats, and evaluation functions, enabling researchers to run comprehensive benchmarks with a single evaluation script. The harness abstracts task-specific complexity and enables reproducible, comparable results across models and research groups.
Provides unified evaluation infrastructure across heterogeneous task types (arithmetic, logic, spatial, causal) with consistent metrics and result aggregation, rather than requiring task-specific evaluation code. This standardization enables reproducible cross-model comparison and reduces evaluation implementation burden.
More reproducible than ad-hoc evaluation because it enforces consistent metrics and input/output handling; more comprehensive than single-task benchmarks because it enables multi-domain capability assessment in one evaluation run.
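A sketch of a single-script harness loop. It assumes the community Hugging Face mirror `lukaemon/bbh` (one config per task, with `input`/`target` columns) and a user-supplied `generate(prompt) -> str` function; both of those names are assumptions of this sketch, not part of the official release.

```python
from datasets import load_dataset

BBH_TASKS = ["boolean_expressions", "causal_judgement", "date_understanding",
             "word_sorting"]  # ...extend to all 23 task configs

def run_harness(generate, tasks=BBH_TASKS, limit=50):
    """Return per-task exact-match accuracy for a model exposed as generate(prompt) -> str."""
    results = {}
    for task in tasks:
        ds = load_dataset("lukaemon/bbh", task, split="test")
        subset = ds.select(range(min(limit, len(ds))))
        correct = sum(generate(ex["input"]).strip() == ex["target"].strip() for ex in subset)
        results[task] = correct / len(subset)
    return results
```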
algorithmic reasoning task evaluation
Medium confidence: Includes algorithmic reasoning tasks (e.g., word sorting, Dyck-language completion, tracking shuffled objects) that test whether models can learn and apply computational procedures through few-shot examples. Tasks present problem descriptions and expect models to reason through algorithmic steps, testing whether models can generalize algorithmic patterns beyond memorized examples. This capability isolates algorithmic reasoning from knowledge retrieval or common-sense reasoning.
Isolates algorithmic reasoning as a distinct capability by presenting algorithm problems in natural language with few-shot examples, testing whether models can learn algorithmic patterns without explicit training. This approach measures algorithmic reasoning generalization rather than memorization.
More focused on algorithmic reasoning than general reasoning benchmarks; more accessible than formal algorithm verification tasks because it uses natural language rather than pseudocode or formal logic.
arithmetic and mathematical reasoning evaluation
Medium confidence: Includes multi-step arithmetic and mathematical reasoning tasks (e.g., multi-step arithmetic expressions, date arithmetic, object counting) that test whether models can perform accurate calculations and apply mathematical reasoning through few-shot examples. Tasks range from basic arithmetic to more complex mathematical inference, isolating numerical reasoning from language understanding. Evaluation scores the final answer via exact match, while the CoT exemplars expose the intermediate calculation steps.
Focuses specifically on multi-step arithmetic and mathematical reasoning through few-shot examples, isolating numerical reasoning capability from general language understanding. Tasks test both calculation accuracy and mathematical inference patterns.
More focused on mathematical reasoning than general reasoning benchmarks; more accessible than formal mathematics verification because it uses natural language problem statements rather than symbolic notation.
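For a flavor of the format, the multi-step arithmetic items are plain expressions whose gold answers can be checked programmatically; the expression below is written in that style rather than copied from the dataset.

```python
# Illustrative multi-step arithmetic item in the BBH style; eval() is used
# here only to sanity-check the reference answer for a trusted expression.
expression = "((-9 * 7 * 7 * -9) + (4 * -9 - 8 - -4))"
print(expression, "=", eval(expression))  # -> 3929
```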
logical deduction and inference evaluation
Medium confidence: Includes logical reasoning tasks (e.g., syllogisms, logical deduction, constraint satisfaction) that test whether models can perform formal logical inference through few-shot examples. Tasks present logical premises and expect models to derive correct conclusions, testing whether models can apply logical rules consistently. This capability isolates formal logical reasoning from common-sense reasoning or knowledge retrieval.
Isolates formal logical reasoning as a distinct capability by presenting logic problems in natural language with few-shot examples, testing whether models can apply logical rules consistently without explicit training. This approach measures logical inference generalization.
More focused on formal logical reasoning than general reasoning benchmarks; more accessible than formal logic verification because it uses natural language rather than symbolic logic notation.
causal reasoning and judgment evaluation
Medium confidence: Includes causal reasoning tasks that test whether models can identify causal relationships, make causal inferences, and reason about cause-and-effect through few-shot examples. Tasks present scenarios and expect models to identify causal mechanisms or predict causal outcomes, testing whether models can reason about causality beyond correlation. This capability isolates causal reasoning from statistical reasoning or common-sense knowledge.
Focuses specifically on causal reasoning and causal judgment through few-shot examples, isolating causal inference capability from statistical reasoning or common-sense knowledge. Tasks test whether models can identify causal mechanisms rather than just correlations.
More focused on causal reasoning than general reasoning benchmarks; more accessible than formal causal inference because it uses natural language scenarios rather than formal causal models or graphical notation.
spatial reasoning and visualization evaluation
Medium confidence: Includes spatial reasoning tasks (e.g., navigation, geometric shape identification, reasoning about object arrangements) that test whether models can reason about spatial relationships and visualize spatial configurations through few-shot examples. Tasks present spatial descriptions in text and expect models to reason about spatial transformations or configurations, testing whether models can build and manipulate mental spatial models. This capability isolates spatial reasoning from visual perception or geometric knowledge.
Isolates spatial reasoning as a distinct capability by presenting spatial problems in text form with few-shot examples, testing whether models can build and manipulate mental spatial models without visual input. This approach measures pure spatial reasoning capability.
More focused on spatial reasoning than general reasoning benchmarks; more challenging than visual spatial reasoning because it requires models to construct spatial models from text descriptions rather than perceiving visual images.
few-shot prompt engineering and optimization
Medium confidence: Enables researchers to experiment with few-shot prompt engineering by providing curated exemplars for each task that can be modified, reordered, or augmented to test prompt sensitivity and optimization strategies. The dataset structure supports prompt template variation, exemplar selection strategies, and in-context learning optimization without requiring task re-annotation. Researchers can measure how prompt engineering choices affect model performance on hard reasoning tasks.
Provides structured few-shot exemplars that are explicitly designed for prompt engineering experimentation, enabling researchers to test prompt sensitivity and optimization strategies without task re-annotation. The dataset structure supports exemplar variation and prompt template modification.
More suitable for prompt engineering research than generic task collections because it includes curated exemplars; more flexible than fixed-prompt benchmarks because exemplars can be modified and optimized.
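A sketch of the sort of sensitivity sweep this enables, varying shot count and exemplar order. Here `exemplars` is assumed to be a list of (question, worked_answer) pairs and `evaluate` a user-supplied function that scores a given prompt prefix; both are assumptions of this sketch.

```python
import random

def make_prefix(exemplars):
    """Join (question, worked_answer) pairs into a few-shot CoT prompt prefix."""
    return "\n\n".join(f"Q: {q}\nA: Let's think step by step.\n{a}" for q, a in exemplars)

def sweep(exemplars, evaluate, seed=0):
    rng = random.Random(seed)
    results = {}
    for k in range(1, len(exemplars) + 1):      # vary the number of shots
        results[f"{k}-shot"] = evaluate(make_prefix(exemplars[:k]))
    shuffled = list(exemplars)
    rng.shuffle(shuffled)                        # vary exemplar order
    results["shuffled-order"] = evaluate(make_prefix(shuffled))
    return results
```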
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with BIG-Bench Hard (BBH), ranked by overlap. Discovered automatically through the match graph.
ARC (AI2 Reasoning Challenge)
7.8K science questions testing genuine reasoning, not just recall.
Falcon 180B
TII's 180B model trained on curated RefinedWeb data.
Mistral: Mistral Large 3 2512
Mistral Large 3 2512 is Mistral’s most capable model to date, featuring a sparse mixture-of-experts architecture with 41B active parameters (675B total), and released under the Apache 2.0 license.
chinese-llm-benchmark
ReLE evaluation: a capability evaluation of Chinese AI large models (continuously updated), currently covering 374 models, including commercial models such as chatgpt, gpt-5.4, Google gemini-3.1-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3.6-max, qwen3.6-plus, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as step3.5-flash, kimi-k2.6, ernie4.5, MiniMax-M2.7, deepseek-v4, Qwen3.6, llama4, Zhipu GLM-5.1, MiMo-V2, LongCat, gemma4, and mistral. Provides not only a leaderboard but also more than 2 million...
Arcee AI: Trinity Large Preview (free)
Trinity-Large-Preview is a frontier-scale open-weight language model from Arcee, built as a 400B-parameter sparse Mixture-of-Experts with 13B active parameters per token using 4-of-256 expert routing. It excels in creative writing,...
Meta: Llama 3.3 70B Instruct
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model...
Best For
- ✓ ML researchers evaluating frontier model capabilities on reasoning
- ✓ Teams developing reasoning-focused LLMs and wanting standardized hard benchmarks
- ✓ Practitioners testing whether prompt engineering with CoT improves their model's performance
- ✓ Model developers doing capability analysis and debugging reasoning failures
- ✓ Researchers studying which reasoning types are hardest for LLMs
- ✓ Teams building specialized reasoning modules and needing domain-specific evaluation
- ✓ Frontier model developers evaluating new model releases on hard reasoning
- ✓ Researchers publishing model results and needing standardized hard benchmarks
Known Limitations
- ⚠ Few-shot examples are static and hand-crafted — no automatic generation or adaptation to model-specific weaknesses
- ⚠ CoT format assumes models can follow structured reasoning; doesn't test implicit reasoning or intuition-based problem solving
- ⚠ Limited to 23 tasks — may not cover all reasoning domains (e.g., creative reasoning, social reasoning)
- ⚠ Examples are English-only; no multilingual reasoning evaluation
- ⚠ Task domains are predefined and fixed — no ability to add custom reasoning categories
- ⚠ No cross-domain transfer analysis built-in; requires manual correlation analysis
About
Curated subset of 23 challenging tasks from Google's Beyond the Imitation Game (BIG-Bench) benchmark where language models initially performed below average human raters. Tasks include algorithmic reasoning, multi-step arithmetic, logical deduction, causal judgment, and spatial reasoning. Each task includes few-shot chain-of-thought examples. Specifically selected to test the limits of current models on hard reasoning rather than knowledge retrieval. Used to evaluate frontier model improvements.
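A minimal loading sketch, assuming the community Hugging Face mirror `lukaemon/bbh` (the dataset name and the `input`/`target` column names are assumptions; the official release distributes the same tasks as JSON files).

```python
from datasets import load_dataset

ds = load_dataset("lukaemon/bbh", "logical_deduction_five_objects", split="test")
print(len(ds), "examples")
print(ds[0]["input"][:200])   # natural-language problem statement
print(ds[0]["target"])        # gold answer, e.g. "(A)"
```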
Alternatives to BIG-Bench Hard (BBH)
Open-source image generation — SD3, SDXL, massive ecosystem of LoRAs, ControlNets, runs locally.