via “answer parsing and correctness evaluation with multiple-choice validation”
Graduate-level expert QA: unsearchable questions in biology, physics, and chemistry that require deep reasoning.
Unique: Centralizes answer parsing logic in a shared utilities module, ensuring consistent evaluation across different prompting strategies and model providers. Handles multiple answer formats (direct selection, spelled-out options, explanations with embedded answers) through heuristic pattern matching.
vs others: More robust than exact string matching: it recognizes the same choice across formatting variations (e.g., 'answer: A' vs 'A' vs 'option A') and inside longer explanations, where a naive evaluation script may mark a correct answer as incorrect.
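The heuristic pattern matching described above might look roughly like the following sketch. It is illustrative only and not the project's actual utilities module: the names `parse_answer` and `is_correct`, the four-option A-D range, and the specific regular expressions are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical patterns, tried from most explicit to least explicit.
# The choice letter itself is matched case-sensitively (A-D) so that
# ordinary words such as "a" or "about" are not mistaken for a choice.
_PATTERNS = [
    # "answer: A", "Answer is B", "the answer is (C)."
    re.compile(r"(?i:answer)\s*(?i:is)?\s*:?\s*\(?([A-D])\)?(?![A-Za-z])"),
    # "option A", "Option (B) is right"
    re.compile(r"(?i:option)\s*\(?([A-D])\)?(?![A-Za-z])"),
    # A bare letter as the whole response: "A", "(B)", "C."
    re.compile(r"^\s*\(?([A-D])\)?[.)]?\s*$"),
]


def parse_answer(response: str) -> Optional[str]:
    """Return the selected choice letter, or None if no pattern matches."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None


def is_correct(response: str, gold: str) -> bool:
    """Compare the parsed choice with the gold letter; unparseable is wrong."""
    return parse_answer(response) == gold.strip().upper()


if __name__ == "__main__":
    for text in ["answer: A", "A", "option A", "so the answer is (C)."]:
        print(repr(text), "->", parse_answer(text))
```

Keeping one such function in a shared module means every prompting strategy and provider is scored by the same rules. A heuristic like this can still misread unusual phrasing, so responses that return `None` are worth logging and reviewing rather than silently counting as wrong.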