via “assertion-based output grading and evaluation metrics”
Test your prompts, agents, and RAG pipelines. Red teaming, pentesting, and vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command-line and CI/CD integration. Used by OpenAI and Anthropic.
Unique: Supports a hybrid grading model combining deterministic assertions (regex, JSON schema) with probabilistic LLM-based graders in a single test case. Graders are composable and can be chained; results are normalized to 0-1 scores for aggregation. Custom graders are first-class citizens, enabling domain-specific evaluation logic without framework modifications.
vs others: More flexible than simple string matching, since it supports semantic similarity and LLM-as-judge grading; more transparent than black-box quality metrics, since each assertion is independently auditable and results are disaggregated by assertion type.