Custom Evaluation Definition And Execution

1

OSWorld (Benchmark) 63/100

via “custom execution-based task evaluation”

Real OS benchmark for multimodal computer agents.

Unique: Uses custom per-task evaluation scripts rather than generic scoring functions, enabling task-specific success criteria that capture domain knowledge (e.g., correct file format, application-specific state changes). This approach is more accurate than generic metrics but requires significant engineering effort and domain expertise per task.

vs others: More accurate than generic scoring functions for complex, multi-step tasks, but less scalable and harder to maintain than standardized evaluation metrics used in simpler benchmarks.
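
The idea of a per-task evaluation script can be sketched as follows. This is a minimal illustration, not OSWorld's actual code: the `CHECKERS` registry, the `checker` decorator, the task id `export-sheet-as-csv`, and the file name `report.csv` are all hypothetical. Each task supplies its own function that inspects the final state and encodes what success means for that task (here, a well-formed CSV with a specific header).

```python
import csv
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: task id -> checker that inspects the final state.
CHECKERS: Dict[str, Callable[[Path], bool]] = {}


def checker(task_id: str):
    """Register a task-specific success check under the given task id."""
    def register(fn: Callable[[Path], bool]) -> Callable[[Path], bool]:
        CHECKERS[task_id] = fn
        return fn
    return register


@checker("export-sheet-as-csv")  # hypothetical task
def check_csv_export(workdir: Path) -> bool:
    """Success: report.csv exists, has the expected header and a data row."""
    target = workdir / "report.csv"
    if not target.is_file():
        return False
    with target.open(newline="") as fh:
        rows = list(csv.reader(fh))
    return bool(rows) and rows[0] == ["name", "total"] and len(rows) > 1


def evaluate(task_id: str, workdir: Path) -> float:
    """Run the checker for one task; 1.0 on success, 0.0 otherwise."""
    return 1.0 if CHECKERS[task_id](workdir) else 0.0
```

The trade-off described above shows up directly: every new task needs its own checker, written by someone who knows what a correct end state looks like.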

2

everything-claude-code (Agent) 63/100

via “eval-driven development workflow with automated testing”

The agent harness performance optimization system. Skills, instincts, memory, security, and research-first development for Claude Code, Codex, Opencode, Cursor and beyond.

Unique: Integrates eval definition, automated test case generation, and skill evolution into a closed-loop workflow that measures agent performance against quantitative metrics and automatically improves skills based on eval results. Evals are first-class citizens in the development process, not afterthoughts.

vs others: Unlike manual testing or post-hoc evaluation, the eval-driven workflow in everything-claude-code (ECC) makes metrics central to development, enabling continuous measurement and automatic skill evolution based on quantitative feedback.

3

WildBench (Benchmark) 61/100

via “custom evaluation prompt configuration”

Real-world user query benchmark judged by GPT-4.

Unique: Enables users to customize GPT-4 judge prompts for domain-specific evaluation criteria, rather than forcing all evaluations to use fixed helpfulness/safety/instruction-following dimensions. Supports experimentation with different evaluation rubrics and alignment with organizational values.

vs others: More flexible than fixed-criteria benchmarks because it allows domain-specific customization; more practical than building custom evaluation infrastructure because it reuses the WildBench query dataset and judge infrastructure; and more transparent than black-box evaluation because users control the evaluation criteria.
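
A customizable judge prompt might look roughly like the sketch below. It is an assumption-laden illustration, not WildBench's configuration format: `Criterion`, `build_judge_prompt`, and the prompt wording are invented for this example, and the call to the judge model is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One rubric dimension supplied by the user (hypothetical structure)."""
    name: str
    description: str


def build_judge_prompt(
    query: str,
    response: str,
    criteria: List[Criterion],
    scale_max: int = 10,
) -> str:
    """Render a judge prompt from a user-defined rubric."""
    rubric = "\n".join(f"- {c.name}: {c.description}" for c in criteria)
    return (
        "You are an impartial judge. Score the response on each criterion "
        f"from 1 to {scale_max} and justify each score briefly.\n\n"
        f"Criteria:\n{rubric}\n\n"
        f"User query:\n{query}\n\n"
        f"Model response:\n{response}\n"
    )
```

Swapping the `criteria` list is the only change needed to move from a general helpfulness rubric to a domain-specific one, while the queries and the judge stay the same.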

4

Athina AI (Dataset) 59/100

via “custom-evaluation-metric-definition”

LLM eval and monitoring with hallucination detection.

Unique: Unknown. There is insufficient data on the custom metric implementation, API surface, and integration with the EvalRunner orchestration system. The documentation does not specify whether custom metrics are Python functions, declarative schemas, or another abstraction.

vs others: Unknown. Without clarity on the implementation approach, it cannot be positioned against alternatives such as Ragas custom metrics or LangSmith's custom evaluators.

5

Galileo Observe (Product) 57/100

AI evaluation platform with automated hallucination detection and RAG metrics.

Unique: Integrates custom evaluation logic directly into production observability pipelines with unlimited custom evaluators on all tiers, rather than requiring separate evaluation frameworks or batch processing jobs.

vs others: Offers unlimited custom evaluators on the free tier, whereas competitors like Arize charge per custom metric, but lacks transparency on its implementation mechanism and performance characteristics.

6

Agenta (Repository) 56/100

via “automated evaluation pipeline with 20+ built-in evaluators”

Open-source LLMOps platform for prompt management and evaluation.

Unique: Decouples evaluator logic from execution via a plugin registry pattern where evaluators are Python classes implementing a standard interface, allowing users to mix built-in evaluators (regex, similarity, LLM-as-judge) with custom evaluators in a single run. Uses JSON schema generation to auto-expose evaluator parameters in the UI without manual form definition.

vs others: More flexible than Ragas because it supports arbitrary custom evaluators and doesn't require LLM calls for all metrics, reducing cost and latency for simple evaluations like exact-match or regex scoring.
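
The plugin registry pattern described for this entry can be sketched as below. This is not Agenta's code: `Evaluator`, `REGISTRY`, `register`, `param_schema`, and `run` are hypothetical names, and the schema is a simplified JSON-Schema-like dict derived from dataclass fields rather than the output of any particular library.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, fields
from typing import Dict, Type

REGISTRY: Dict[str, Type["Evaluator"]] = {}
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


class Evaluator(ABC):
    """Standard interface that built-in and custom evaluators share."""

    @abstractmethod
    def score(self, output: str, expected: str) -> float: ...

    @classmethod
    def param_schema(cls) -> dict:
        # Derive a schema from the dataclass fields so a UI could render
        # a form without a hand-written definition.
        props = {f.name: {"type": _JSON_TYPES[f.type]} for f in fields(cls)}
        return {"type": "object", "properties": props}


def register(name: str):
    def wrap(cls: Type[Evaluator]) -> Type[Evaluator]:
        REGISTRY[name] = cls
        return cls
    return wrap


@register("exact_match")
@dataclass
class ExactMatch(Evaluator):
    case_sensitive: bool = True

    def score(self, output: str, expected: str) -> float:
        if not self.case_sensitive:
            output, expected = output.lower(), expected.lower()
        return 1.0 if output == expected else 0.0


@register("regex")
@dataclass
class RegexMatch(Evaluator):
    pattern: str = ".*"

    def score(self, output: str, expected: str) -> float:
        return 1.0 if re.search(self.pattern, output) else 0.0


def run(config: Dict[str, dict], output: str, expected: str) -> Dict[str, float]:
    """Instantiate each configured evaluator by name and collect scores."""
    return {
        name: REGISTRY[name](**params).score(output, expected)
        for name, params in config.items()
    }
```

Because the runner only looks evaluators up by name, a user-defined class registered the same way would run alongside the built-in ones, and neither evaluator here requires an LLM call.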

7

browser-devtools-mcp (MCP Server) 33/100

via “javascript-execution-and-evaluation”

MCP Server for Browser Dev Tools

Unique: Exposes CDP Runtime.evaluate as an MCP tool with automatic JSON serialization, allowing agents to execute arbitrary JavaScript without managing CDP protocol details or handling serialization errors manually.

vs others: More flexible than DOM-only queries for complex data extraction because it can access JavaScript state and call page functions, but requires careful error handling for non-serializable return values.
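
The handling that such a wrapper has to do around `Runtime.evaluate` can be sketched as follows. This is not the server's implementation: the helper names are hypothetical and the WebSocket transport is omitted. The `expression`, `returnByValue`, and `awaitPromise` parameters and the `exceptionDetails` field are part of the Chrome DevTools Protocol; the fallback for values that come back without a `value` field is an assumption about how a wrapper might report them.

```python
import json
from typing import Any


def build_evaluate_request(msg_id: int, expression: str) -> str:
    """Build the CDP message asking the page to evaluate an expression."""
    return json.dumps({
        "id": msg_id,
        "method": "Runtime.evaluate",
        "params": {
            "expression": expression,
            "returnByValue": True,   # ask for a JSON value, not an object handle
            "awaitPromise": True,    # resolve promises before returning
        },
    })


def unpack_evaluate_response(raw: str) -> Any:
    """Turn a CDP response into a plain Python value or raise on failure."""
    msg = json.loads(raw)
    if "error" in msg:  # protocol-level failure
        raise RuntimeError(msg["error"].get("message", "protocol error"))
    payload = msg["result"]
    if "exceptionDetails" in payload:  # the JavaScript itself threw
        raise RuntimeError(payload["exceptionDetails"].get("text", "evaluation failed"))
    remote = payload["result"]
    if "value" in remote:
        return remote["value"]
    if remote.get("type") == "undefined":
        return None
    # Assumed fallback: report functions and similar values by description.
    return {"unserializable": remote.get("description", remote.get("type"))}
```

Returning a small marker object for non-serializable results, instead of raising, is one design choice among several; it keeps the agent's tool call from failing outright when a page function returns a function or a node.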

8

DeepChecks (Product)

via “custom evaluation criteria configuration”

9

Promptfoo (Product)

via “custom evaluator integration”

10

Athina (Product)

via “custom evaluation rule creation and execution”

11

Agenta (Product)

via “custom-evaluation-metric-definition”

12

Maxim AI (Product)

via “custom evaluation metric definition and tracking”

13

Promptmetheus (Prompt)

via “manual completion rating and custom evaluator execution”

Unique: Combines manual human-in-the-loop rating with automated custom evaluators in a unified evaluation framework, allowing both subjective quality assessment and objective constraint validation in the same workflow without context switching.

vs others: More flexible than rule-based alternatives because custom evaluators support arbitrary validation logic, whereas fixed metric sets may not capture domain-specific quality criteria.

14

Parea AI (Product)

via “custom-metric-definition-and-scoring”

15

Query Vary (Product)

via “evaluation-metric-definition”
