Capability
16 artifacts provide this capability.
via “robustness evaluation via adversarial and distribution-shifted inputs”
Stanford's holistic LLM evaluation — 42 scenarios, 7 metrics including fairness, bias, toxicity.
Unique: Embeds robustness testing into the core evaluation loop by generating multiple perturbed versions of each scenario (typos, paraphrases, out-of-distribution examples) and measuring accuracy degradation. Treats robustness as a first-class metric alongside accuracy rather than a post-hoc analysis.
vs others: More systematic than ad-hoc robustness testing because it applies the same perturbation strategies across all 42 scenarios, enabling fair comparison of robustness profiles across models.
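The perturb-and-measure loop described above can be illustrated with a minimal sketch. This is not the benchmark's actual API: `add_typos`, `robustness_report`, `keyword_model`, and the tiny `EXAMPLES` set are all hypothetical names and data invented for illustration, and a single adjacent-letter swap stands in for the broader family of perturbations (typos, paraphrases, out-of-distribution examples).

```python
import random
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def add_typos(text: str, rate: float, rng: random.Random) -> str:
    """Hypothetical perturbation: swap adjacent letters with probability `rate`."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples for which the model returns the expected label."""
    correct = sum(1 for text, label in examples if model(text) == label)
    return correct / len(examples)


def robustness_report(
    model: Callable[[str], str],
    examples: List[Example],
    perturb: Callable[[str, random.Random], str],
    n_variants: int = 5,
    seed: int = 0,
) -> Dict[str, float]:
    """Score the model on clean inputs and on several perturbed copies of each."""
    rng = random.Random(seed)
    clean = accuracy(model, examples)
    perturbed_examples = [
        (perturb(text, rng), label)
        for text, label in examples
        for _ in range(n_variants)
    ]
    perturbed = accuracy(model, perturbed_examples)
    return {
        "clean_accuracy": clean,
        "perturbed_accuracy": perturbed,
        "degradation": clean - perturbed,
    }


def keyword_model(text: str) -> str:
    """Toy stand-in for a real model: brittle keyword matching."""
    return "positive" if "good" in text.lower() else "negative"


# Illustrative data only; a real evaluation would use a benchmark scenario.
EXAMPLES: List[Example] = [
    ("This movie was good", "positive"),
    ("A good, solid effort", "positive"),
    ("This was terrible", "negative"),
    ("Not worth the time", "negative"),
]

report = robustness_report(
    keyword_model, EXAMPLES, lambda text, rng: add_typos(text, 0.3, rng)
)
print(report)
```

Reporting `degradation` next to `clean_accuracy` is what makes robustness a first-class metric here: two models with the same clean score can be separated by how far they fall under the same perturbation and seed.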
via “model comparison and a/b test analysis framework”
Open-source tool for ML observability that runs in your notebook environment, by Arize. Monitor and fine-tune LLM, CV, and tabular models.
via “model-performance-and-robustness-testing”
via “model-stability-and-robustness-testing”
via “model-robustness-assessment”
via “model-adversarial-robustness-testing”
via “model performance under attack analysis”
via “model-performance-degradation-analysis”
via “model-performance-benchmarking”
via “model performance evaluation and benchmarking”
via “model comparison and evaluation”
via “performance regression testing”
via “a/b testing and model comparison”
via “ab-testing-for-models”
via “model performance benchmarking”
via “model performance metrics and evaluation”
© 2026 Unfragile. The platform for software for agents.