FinQA
Dataset | Free
8.3K financial reasoning questions over real S&P 500 earnings reports.
Capabilities (7 decomposed)
multi-step numerical reasoning over financial documents
Medium confidence
Enables evaluation of AI systems' ability to perform chained mathematical operations (addition, subtraction, multiplication, division, comparisons) across both structured tables and unstructured text extracted from SEC filings. The dataset provides ground-truth question-answer pairs where answers require synthesizing data from multiple locations within earnings reports and applying sequential arithmetic operations, testing whether models can decompose complex financial queries into discrete computational steps.
Combines real SEC filing documents (not synthetic) with annotator-written questions requiring multi-step arithmetic, creating a hybrid dataset that tests both domain knowledge extraction and quantitative reasoning in a single evaluation task. Unlike generic math word problems, answers require locating the relevant figures in the report's text and tables first.
More challenging than DROP or SVAMP because it requires financial domain knowledge and document retrieval before arithmetic, whereas generic math benchmarks assume the figures are already extracted.
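The decomposition into discrete computational steps described above can be sketched as a tiny step interpreter. This is an illustrative sketch, not the dataset's official tooling: the step syntax (`op(a, b)`, with `#k` referring to the result of step k), the `run_program` helper, and the revenue figures are all assumptions made for the example.

```python
import re

# Supported binary operations (an assumed, minimal set for illustration).
OPS = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
    "greater": lambda a, b: a > b,
}

# One step looks like: name(arg, arg)
STEP = re.compile(r"(\w+)\(([^,()]+),([^,()]+)\)")


def run_program(program: str):
    """Execute steps left to right; '#k' refers to the result of step k."""
    results = []

    def value(token: str):
        token = token.strip()
        if token.startswith("#"):
            return results[int(token[1:])]
        return float(token)

    for name, left, right in STEP.findall(program):
        results.append(OPS[name](value(left), value(right)))
    return results[-1]


# Hypothetical figures: revenue of 5735 in one year and 5829 in the next.
growth = run_program("subtract(5829, 5735), divide(#0, 5735)")
print(round(growth * 100, 2))  # year-over-year change in percent
```

Keeping intermediate results addressable by index makes each step inspectable, which helps when checking where a model's chain of operations went wrong.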
financial domain knowledge evaluation through earnings report comprehension
Medium confidence
Assesses whether AI systems understand financial terminology, accounting concepts, and domain-specific metrics by requiring them to answer questions about real earnings reports from S&P 500 companies. The dataset tests recognition of financial line items (revenue, COGS, operating expenses, net income), ability to distinguish between different financial statements (income statement vs balance sheet), and understanding of financial ratios and metrics without explicit instruction on their definitions.
Uses authentic SEC filings rather than synthetic financial data, exposing models to real-world accounting variations, footnote complexity, and the actual structure of professional financial documents. This tests transfer from general text to a specialized domain without domain-specific pretraining.
More authentic than synthetic financial QA datasets because it uses real earnings reports with their inherent complexity, but narrower than general financial knowledge benchmarks because it focuses only on historical data interpretation.
structured table extraction and reasoning from mixed-format documents
Medium confidence
Enables evaluation of AI systems' ability to extract numerical data from both structured HTML/text tables and unstructured prose within the same document, then reason over the extracted values. The dataset contains questions where relevant data appears in different formats: some figures are in formatted tables with clear row/column headers, while others are embedded in narrative text or footnotes. This requires robust parsing and entity linking before computation can occur.
Combines structured table data with unstructured narrative in the same evaluation, forcing systems to handle format heterogeneity and resolve references across different data representations. Most table QA datasets use clean, isolated tables; this tests real-world document complexity.
More realistic than isolated table QA benchmarks (like SQA or WikiTableQuestions) because it requires handling narrative context and format mixing, but simpler than full document understanding because tables are already in text format (no OCR needed).
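To make the mixed-format extraction concrete, here is a minimal sketch that pulls one figure from a text table and another from a narrative sentence. The table layout, the sentence, the helper names (`to_number`, `table_lookup`, `prose_amount`), and the assumption that both sources report amounts in millions are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical table in text form: header row first, amounts in millions.
table = [
    ["", "2019", "2018"],
    ["revenue", "5,829", "5,735"],
    ["net income", "1,204", "1,150"],
]

# Hypothetical narrative sentence carrying a figure that is not in the table.
prose = "Restructuring charges of $42 million were recorded in 2019."


def to_number(cell: str) -> float:
    """Parse '5,829', '$42' or '(120)'; parentheses denote a negative value."""
    text = cell.strip().replace("$", "").replace(",", "")
    if text.startswith("(") and text.endswith(")"):
        return -float(text[1:-1])
    return float(text)


def table_lookup(rows, row_label: str, column_label: str) -> float:
    """Return the numeric value at the given row and column headers."""
    column = rows[0].index(column_label)
    for row in rows[1:]:
        if row[0] == row_label:
            return to_number(row[column])
    raise KeyError(row_label)


def prose_amount(text: str, keyword: str) -> float:
    """Return the first amount that follows the keyword in the text."""
    pattern = re.escape(keyword) + r"\D*?(\d[\d,]*(?:\.\d+)?)"
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no amount found after {keyword!r}")
    return to_number(match.group(1))


revenue = table_lookup(table, "revenue", "2019")
charges = prose_amount(prose, "Restructuring charges")
print(charges / revenue)  # restructuring charges as a share of revenue
```

A real pipeline would also need to reconcile units and scale words such as "million" or "thousand" between the table and the prose, which this sketch sidesteps by assumption.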
benchmark dataset curation and annotation for financial ai evaluation
Medium confidence
Provides a curated dataset of 8,281 annotated question-answer pairs with multi-step reasoning requirements, enabling systematic evaluation of AI systems on financial numerical reasoning. Quality control comes from annotator review and validation of answers against ground truth, and the questions cover diverse financial metrics and company types within the S&P 500, creating a reproducible evaluation standard for the financial AI community.
Provides a publicly available, reproducible benchmark specifically designed for financial numerical reasoning with real SEC filings, enabling standardized comparison across different financial AI systems. Most financial datasets are proprietary or synthetic; this is open-source and authentic.
More specialized and challenging than generic QA benchmarks (SQuAD, MRQA) because it requires financial domain knowledge and multi-step arithmetic, but narrower in scope than comprehensive financial understanding benchmarks because it focuses only on numerical reasoning.
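As a rough illustration of how numeric answers might be scored against ground truth, the sketch below computes accuracy with a relative tolerance. The tolerance value and the function names are assumptions for this example and should not be read as the benchmark's official metric.

```python
import math


def is_correct(predicted: float, gold: float, rel_tol: float = 1e-3) -> bool:
    """Treat a prediction as correct if it is within rel_tol of the gold value."""
    return math.isclose(predicted, gold, rel_tol=rel_tol, abs_tol=1e-9)


def execution_accuracy(predictions, golds, rel_tol: float = 1e-3) -> float:
    """Fraction of predictions that match their gold answers within tolerance."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(is_correct(p, g, rel_tol) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Hypothetical model outputs and gold answers.
predictions = [0.16391, 12.0, 7.5]
golds = [0.1639, 12.0, 8.0]
print(execution_accuracy(predictions, golds))  # two of three within tolerance
```

A tolerance is used because financial answers are often rounded percentages, so exact floating-point equality would penalize answers that differ only in rounding.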
multi-hop reasoning evaluation across document sections
Medium confidence
Assesses AI systems' ability to perform multi-hop reasoning by requiring them to locate and combine information from different sections of earnings reports. Questions may require finding a figure in the income statement, then locating a related metric in the balance sheet, then performing arithmetic across both. This tests whether models can maintain context across document boundaries and understand relationships between different financial statement sections.
Embeds multi-hop reasoning requirements within authentic financial documents where hops correspond to real relationships between financial statement sections, rather than synthetic reasoning chains. This tests whether models understand domain structure, not just generic multi-hop patterns.
More domain-grounded than general-purpose multi-hop datasets (HotpotQA, 2WikiMultiHopQA) because reasoning hops follow actual financial relationships, but less controlled because document structure varies from filing to filing and the model must discover the reasoning path from the question alone.
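The income-statement-then-balance-sheet pattern described above can be sketched as a list of hops followed by one arithmetic step. The section names, line items, figures, and the `gather` helper are hypothetical; the example only illustrates the shape of a two-hop question such as return on assets.

```python
# Hypothetical report split into sections, each mapping line items to values.
document = {
    "income statement": {"net income": 1204.0, "revenue": 5829.0},
    "balance sheet": {"total assets": 10000.0, "total liabilities": 6200.0},
}


def gather(doc, hops):
    """Follow each (section, line item) hop and collect the values in order."""
    values = []
    for section, item in hops:
        values.append(doc[section][item])
    return values


# Hop 1 reads the income statement, hop 2 reads the balance sheet.
net_income, total_assets = gather(
    document,
    [("income statement", "net income"), ("balance sheet", "total assets")],
)
return_on_assets = net_income / total_assets
print(return_on_assets)
```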
arithmetic operation type classification and execution
Medium confidence
Enables evaluation of whether AI systems can identify which arithmetic operations (addition, subtraction, multiplication, division, comparison) are required to answer financial questions, then execute them correctly. The dataset implicitly tests operation selection: a question asking 'what is the profit margin' requires division (net income / revenue), while 'what is total assets' requires addition. Models must therefore understand financial semantics before applying math.
Embeds arithmetic operation selection within financial domain context, requiring models to understand that 'margin' semantically maps to division and 'total' maps to addition. This tests semantic grounding of operations, not just arithmetic execution.
More semantically grounded than generic math word problem datasets because operation selection is implicit in financial terminology, but harder to diagnose than tasks that state the required operation, because the model must infer the operations from the question itself.
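The mapping from financial terms to operations ('margin' to division, 'total' to addition) can be illustrated with a small lookup table. The `METRICS` table, the `answer` helper, the line-item names, and the figures are assumptions for this sketch; a real system would infer the operation rather than match keywords.

```python
from typing import Callable, Dict

Figures = Dict[str, float]

# Each financial term maps to the arithmetic it implies (hypothetical table).
METRICS: Dict[str, Callable[[Figures], float]] = {
    "profit margin": lambda f: f["net income"] / f["revenue"],
    "total assets": lambda f: f["current assets"] + f["non-current assets"],
}


def answer(question: str, figures: Figures) -> float:
    """Pick the formula whose term appears in the question, then apply it."""
    lowered = question.lower()
    for term, formula in METRICS.items():
        if term in lowered:
            return formula(figures)
    raise LookupError(f"no known metric in question: {question!r}")


# Hypothetical extracted figures.
figures = {
    "net income": 1204.0,
    "revenue": 5829.0,
    "current assets": 3100.0,
    "non-current assets": 6900.0,
}
print(answer("What is the profit margin?", figures))  # division
print(answer("What is total assets?", figures))       # addition
```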
cross-document financial comparison and aggregation
Medium confidence
Provides evaluation capability for AI systems to compare financial metrics across multiple S&P 500 companies or aggregate metrics across different time periods within the same company's earnings reports. While individual questions reference single documents, the dataset structure enables evaluation of systems that can retrieve and compare relevant companies, requiring understanding of which metrics are comparable across entities and how to normalize for company size or accounting differences.
Provides a foundation for evaluating cross-company financial comparison by including diverse S&P 500 companies with different business models and scales, enabling assessment of whether systems can normalize and compare metrics appropriately. Most financial QA datasets focus on single-document questions.
Enables cross-company evaluation unlike single-document QA datasets, but requires external retrieval and comparison logic because the dataset itself contains only single-document questions.
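Since the dataset itself contains only single-document questions, any cross-company comparison has to be built on top of it. The sketch below shows one possible form of that external logic: normalizing for company size by comparing net margin instead of absolute net income. The company names and figures are hypothetical.

```python
# Hypothetical figures extracted from two separate reports.
companies = {
    "Company A": {"revenue": 5829.0, "net income": 1204.0},
    "Company B": {"revenue": 91200.0, "net income": 9120.0},
}


def net_margin(figures) -> float:
    """Net income divided by revenue, a size-independent profitability measure."""
    return figures["net income"] / figures["revenue"]


# Rank by margin, highest first, rather than by absolute net income.
ranked = sorted(
    companies,
    key=lambda name: net_margin(companies[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(net_margin(companies[name]), 4))
```

Here Company B earns more in absolute terms but ranks second once size is normalized away, which is the kind of distinction a comparison system would need to handle.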
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with FinQA, ranked by overlap. Discovered automatically through the match graph.
Eilla AI
Secure AI assistant for document creation and financial...
Athena Intelligence
24/7 Enterprise AI Data Analyst
FinRobot
FinRobot: An Open-Source AI Agent Platform for Financial Analysis using LLMs 🚀 🚀 🚀
GSM8K
8.5K grade school math problems — multi-step reasoning, verifiable solutions, reasoning benchmark.
FinGPT Agent
Open-source AI agent for financial analysis.
FinGPT
FinGPT: Open-Source Financial Large Language Models! Revolutionize 🔥 We release the trained model on HuggingFace.
Best For
- ✓ ML researchers evaluating financial reasoning capabilities of LLMs and smaller language models
- ✓ FinTech teams building automated financial analysis systems that need quantitative accuracy benchmarks
- ✓ AI safety researchers studying numerical hallucination patterns in domain-specific contexts
- ✓ Financial services companies building AI assistants for investor relations or earnings analysis
- ✓ Academic researchers studying domain adaptation and transfer learning in specialized fields
- ✓ FinTech startups evaluating whether general-purpose LLMs have sufficient financial literacy for production use
- ✓ Document AI teams building table extraction and understanding systems
- ✓ Enterprise search/RAG teams evaluating mixed-format document comprehension
Known Limitations
- ⚠ Covers only large-cap US companies (S&P 500); may not generalize to small-cap, private, or international companies, non-US regulatory filings, or sector-specific financial patterns
- ⚠ Questions were written by annotators for the benchmark rather than drawn from real analyst queries, so they may miss real-world ambiguities
- ⚠ Temporal reasoning is limited to the periods shown within a single report; no trend analysis across multiple filings
- ⚠ Limited to English-language documents; no multilingual financial reasoning evaluation
- ⚠ Questions focus on historical financial data interpretation, not forward-looking analysis or guidance interpretation
About
Financial question answering dataset requiring numerical reasoning over real earnings reports from S&P 500 companies. Contains 8,281 questions with structured tables and unstructured text from SEC filings. Each answer requires multi-step mathematical operations (addition, subtraction, multiplication, division, comparisons) over financial data. Tests both financial domain understanding and quantitative reasoning. Critical benchmark for evaluating AI systems intended for financial analysis and automated reporting.