Capability
20 artifacts provide this capability.
via “metric composition and custom criteria evaluation”
RAG evaluation framework — faithfulness, relevancy, context precision/recall metrics.
Unique: Metric system uses inheritance hierarchy (Metric → SingleTurnMetric → specific implementations) with PromptMixin for dynamic prompt management and Instructor adapter for structured output. Supports metric training/alignment workflows to calibrate custom metrics against human judgments.
vs others: More flexible than fixed metric suites because metrics are composable Python objects with pluggable LLM backends, enabling domain-specific evaluation without forking the framework.
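To illustrate the hierarchy and composability described above, here is a minimal sketch in plain Python. It does not use the framework's actual classes: `Sample`, `ContextOverlap`, `AnswerNonEmpty`, and `evaluate` are hypothetical names, and the word-overlap score is a toy stand-in for an LLM-backed faithfulness metric.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical single-turn sample: a question, an answer, retrieved contexts."""
    question: str
    answer: str
    contexts: list


class Metric(ABC):
    """Root of the hierarchy: every metric carries a name."""
    name = "metric"


class SingleTurnMetric(Metric):
    """Metrics that score one question/answer exchange at a time."""

    @abstractmethod
    def score(self, sample: Sample) -> float:
        ...


class ContextOverlap(SingleTurnMetric):
    """Toy metric: share of distinct answer words that also occur in the contexts."""
    name = "context_overlap"

    def score(self, sample: Sample) -> float:
        answer_words = set(sample.answer.lower().split())
        context_words = set(" ".join(sample.contexts).lower().split())
        if not answer_words:
            return 0.0
        return len(answer_words & context_words) / len(answer_words)


class AnswerNonEmpty(SingleTurnMetric):
    """Toy metric: 1.0 when the answer contains any non-whitespace text."""
    name = "answer_non_empty"

    def score(self, sample: Sample) -> float:
        return 1.0 if sample.answer.strip() else 0.0


def evaluate(sample, metrics):
    """Compose any list of metric objects into one result dictionary."""
    return {metric.name: metric.score(sample) for metric in metrics}
```

Because each metric is an ordinary object, a domain-specific metric can be added by subclassing `SingleTurnMetric` and passing an instance to `evaluate`, without touching the other metrics.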
via “custom metric submission and ingestion”
Query Datadog metrics, logs, and monitors via MCP.
Unique: Exposes Datadog's metrics API through MCP, so Claude can submit custom metrics as part of automation workflows; metric type selection and tag formatting are handled transparently.
vs others: More integrated than external metric submission tools because Claude can decide which metrics to submit based on incident context or workflow state.
via “custom metric provider system for domain-specific validation”
Data quality validation framework with declarative expectations.
Unique: Implements a MetricProvider registry in which a custom metric is defined once and then executed across multiple engines (Pandas, SQL, Spark) through engine-specific compute methods. This enables domain-specific validation without modifying core GX code.
vs others: More extensible than fixed expectation sets because custom metrics can implement arbitrary validation logic; more maintainable than custom validation scripts because metrics are registered and reusable across expectations.
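The "define once, run on several engines" idea can be sketched with a small registry keyed by metric name and engine. This is an illustration under stated assumptions, not the framework's own API: `metric_impl`, `compute`, and the engine names `"rows"` and `"sqlite"` are hypothetical, and SQLite stands in for a SQL engine so the example needs only the standard library.

```python
import sqlite3

# Registry: (metric_name, engine_name) -> compute function.
_REGISTRY = {}


def metric_impl(metric_name, engine):
    """Decorator that registers one engine-specific implementation of a metric."""
    def decorator(fn):
        _REGISTRY[(metric_name, engine)] = fn
        return fn
    return decorator


@metric_impl("column.null_count", engine="rows")
def _null_count_rows(data, column):
    # `data` is a list of dicts held in memory; a missing key counts as null.
    return sum(1 for row in data if row.get(column) is None)


@metric_impl("column.null_count", engine="sqlite")
def _null_count_sqlite(data, column):
    # `data` is a (connection, table_name) pair. Identifiers are quoted here,
    # which is acceptable for a sketch but not a defense against untrusted input.
    conn, table = data
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(query).fetchone()[0]


def compute(metric_name, engine, data, **kwargs):
    """Look up the implementation for this engine and run it."""
    try:
        fn = _REGISTRY[(metric_name, engine)]
    except KeyError:
        raise NotImplementedError(
            f"{metric_name!r} has no {engine!r} implementation"
        ) from None
    return fn(data, **kwargs)


# Usage: the same metric name works against either backend.
rows = [{"email": "a@example.com"}, {"email": None}, {}]
in_memory_nulls = compute("column.null_count", "rows", rows, column="email")
```

An expectation built on `column.null_count` would call `compute` with whatever engine is active, so the validation logic stays engine-agnostic.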
via “custom metric definition with schema-based validation”
LLM evaluation framework — 14+ metrics, faithfulness/hallucination detection, Pytest integration.
Unique: Provides a BaseMetric abstract class with a standardized measure() interface and optional schema validation, allowing custom metrics to be plugged into the evaluation pipeline without modifying core code; includes helper functions (e.g., G-Eval prompt templates) to reduce boilerplate for common metric patterns.
vs others: More extensible than Ragas because it provides clear extension points (BaseMetric subclass) and helper utilities for common patterns, reducing the friction of implementing custom metrics.
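A rough sketch of this extension pattern follows. The `BaseMetric` below is a stand-in written for illustration, so its attributes (`required_fields`, `threshold`) and the `is_successful` helper are assumptions; the real framework's signatures may differ. Test cases are plain dictionaries here for simplicity.

```python
from abc import ABC, abstractmethod


class BaseMetric(ABC):
    """Hypothetical base class with a measure() hook and optional schema check."""

    required_fields = ()  # optional schema: keys every test case must provide
    threshold = 0.5       # minimum score that counts as a pass

    def validate(self, test_case: dict) -> None:
        missing = [f for f in self.required_fields if f not in test_case]
        if missing:
            raise ValueError(f"test case is missing fields: {missing}")

    @abstractmethod
    def measure(self, test_case: dict) -> float:
        ...

    def is_successful(self, test_case: dict) -> bool:
        # Validate first so malformed cases fail loudly instead of scoring 0.
        self.validate(test_case)
        return self.measure(test_case) >= self.threshold


class LengthRatioMetric(BaseMetric):
    """Toy custom metric: how close the output length is to the expected length."""

    required_fields = ("actual_output", "expected_output")
    threshold = 0.75

    def measure(self, test_case: dict) -> float:
        actual = len(test_case["actual_output"])
        expected = len(test_case["expected_output"])
        longest = max(actual, expected)
        if longest == 0:
            return 1.0
        return min(actual, expected) / longest
```

A pipeline that only knows about `BaseMetric` can run `LengthRatioMetric` unchanged, which is the point of the extension hook.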
via “metric-score-aggregation-and-statistical-analysis”
LLM eval and monitoring with hallucination detection.
Unique: Automatically computes statistical summaries and supports grouping by custom dimensions, enabling teams to understand metric distributions without manual analysis. Likely integrates with visualization to surface insights.
vs others: More convenient than manual statistical analysis (e.g., using Pandas), but less flexible than general-purpose statistical tools because aggregation functions and grouping options are likely limited to pre-defined sets.
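For comparison, the kind of grouped summary described here takes only a few lines with the Python standard library. The record layout and the `summarize` helper are assumptions made for this sketch, not part of the product.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(records, metric, group_by):
    """Group records by one dimension and compute basic statistics per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_by]].append(record[metric])
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


# Hypothetical evaluation records: one score per (model, prompt) pair.
records = [
    {"model": "a", "faithfulness": 0.5},
    {"model": "a", "faithfulness": 1.0},
    {"model": "b", "faithfulness": 0.25},
]
summary = summarize(records, metric="faithfulness", group_by="model")
```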
via “custom metrics definition and aggregation with tags and thresholds”
Developer-centric load testing tool by Grafana Labs.
Unique: Implements custom metrics as first-class objects (Counter, Gauge, Trend, Rate) with tag-based dimensional filtering and integration with the threshold system, enabling business-logic metrics to be treated as SLO criteria without custom scripting.
vs others: More flexible than JMeter's custom metrics because metrics are code-based and support tags; more integrated than Locust because custom metrics are automatically exported to backends and included in threshold evaluation.
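The tool's own scripts are written in JavaScript; the following is only a language-neutral sketch of the concept in Python. The `Trend` and `Rate` classes, their methods, and `evaluate_thresholds` are hypothetical and do not mirror the tool's actual API. They show how tagged samples can be filtered by dimension and then checked against pass/fail criteria.

```python
class Trend:
    """Collects tagged numeric samples and reports an average."""

    def __init__(self, name):
        self.name = name
        self.samples = []  # list of (value, tags) pairs

    def add(self, value, tags=None):
        self.samples.append((value, tags or {}))

    def values(self, **tag_filter):
        # Keep samples whose tags match every key/value in the filter.
        return [
            value for value, tags in self.samples
            if all(tags.get(k) == want for k, want in tag_filter.items())
        ]

    def avg(self, **tag_filter):
        vals = self.values(**tag_filter)
        return sum(vals) / len(vals) if vals else 0.0


class Rate:
    """Tracks the share of truthy observations."""

    def __init__(self, name):
        self.name = name
        self.samples = []  # list of (bool, tags) pairs

    def add(self, passed, tags=None):
        self.samples.append((bool(passed), tags or {}))

    def rate(self, **tag_filter):
        vals = [
            passed for passed, tags in self.samples
            if all(tags.get(k) == want for k, want in tag_filter.items())
        ]
        return sum(vals) / len(vals) if vals else 0.0


def evaluate_thresholds(thresholds):
    """Return pass/fail per named threshold; each value is a zero-argument predicate."""
    return {name: bool(check()) for name, check in thresholds.items()}


# Usage: a business metric recorded with a region tag, then used as a criterion.
checkout_time = Trend("checkout_time_ms")
checkout_time.add(120, {"region": "eu"})
checkout_time.add(180, {"region": "eu"})
checkout_time.add(400, {"region": "us"})

results = evaluate_thresholds({
    "eu checkout avg < 200 ms": lambda: checkout_time.avg(region="eu") < 200,
    "overall checkout avg < 200 ms": lambda: checkout_time.avg() < 200,
})
```

Filtering by tag lets one metric back several criteria, for example a stricter limit for one region than for the overall population.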
via “metric and scalar logging with real-time streaming and aggregation”
Open-source MLOps — experiment tracking, pipelines, data management, auto-logging, self-hosted.
Unique: Provides flexible metric logging with hierarchical organization, real-time streaming with local buffering, and custom aggregation functions for distributed training, integrated with the Task context.
vs others: More flexible than framework-specific logging (PyTorch TensorBoard), but less standardized than OpenTelemetry for observability.
via “custom metric definition and tracking”
Formo makes analytics simple for DeFi apps so you can focus on growth. Get the best of web, product, and onchain analytics in one place. Understand who your users are, where they come from, and what they do onchain. The Formo MCP Server enables AI tools like Cursor, Claude Desktop, Claude Code, and
Unique: Empowers users to define their own metrics through a simple interface, allowing for highly personalized analytics that reflect specific business goals.
vs others: More flexible than rigid metric systems that only allow predefined KPIs, enabling businesses to adapt their analytics as they grow.
via “custom metric definition and composition framework”
Evaluation framework for RAG and LLM applications
Unique: Implements a simple base class extension pattern for custom metrics with automatic integration into evaluation pipelines, enabling users to define domain-specific metrics without understanding internal framework architecture; supports metric-specific configuration through constructor parameters.
vs others: Lower barrier to entry than building evaluation frameworks from scratch; provides scaffolding and integration points while remaining flexible enough for novel metric implementations.
via “custom metric implementation with GEval base class”
The LLM Evaluation Framework
Unique: Provides a GEval base class that abstracts LLM-as-judge metric implementation, handling prompt templating, response parsing, and score normalization. Custom metrics inherit caching and provider abstraction from the base class.
vs others: More extensible than fixed metric libraries and more integrated than standalone evaluation scripts because custom metrics inherit framework capabilities (caching, provider abstraction, result aggregation).
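The responsibilities listed above (prompt templating, response parsing, score normalization, caching) can be sketched as a small base class. Everything here is an assumption for illustration: `JudgeMetric` is not the framework's `GEval` class, and `judge` is any callable that maps a prompt string to a text reply, so a real LLM client or a stub can be plugged in.

```python
import re


class JudgeMetric:
    """Hypothetical LLM-as-judge base class with templating, parsing, and caching."""

    template = (
        "Rate the answer from 1 to 5.\n"
        "Question: {question}\nAnswer: {answer}\nScore:"
    )
    scale = (1, 5)  # raw score range expected from the judge

    def __init__(self, judge):
        self.judge = judge  # callable: prompt text -> reply text
        self._cache = {}    # prompt text -> reply text

    def measure(self, question, answer):
        prompt = self.template.format(question=question, answer=answer)
        if prompt not in self._cache:
            # Only call the (possibly slow or costly) judge for unseen prompts.
            self._cache[prompt] = self.judge(prompt)
        raw = self._parse(self._cache[prompt])
        low, high = self.scale
        raw = min(max(raw, low), high)     # clamp out-of-range replies
        return (raw - low) / (high - low)  # normalize to the 0..1 range

    @staticmethod
    def _parse(reply):
        # Take the first number in the reply as the raw score.
        match = re.search(r"\d+(?:\.\d+)?", reply)
        if match is None:
            raise ValueError(f"no numeric score in judge reply: {reply!r}")
        return float(match.group())


class PolitenessMetric(JudgeMetric):
    """A custom metric only needs to supply its own prompt template."""

    template = (
        "Rate how polite the answer is from 1 to 5.\n"
        "Question: {question}\nAnswer: {answer}\nScore:"
    )
```

A subclass such as `PolitenessMetric` inherits caching, parsing, and normalization and changes only the prompt, which mirrors the inheritance benefit described above.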
via “segment analytics and metrics computation”
Customer segmentation MCP App Server with filtering
Unique: Provides segment-level analytics as an MCP tool, enabling LLM clients to request metrics in natural language and receive structured results for downstream reasoning or visualization.
vs others: Faster than querying a data warehouse for segment metrics, and more flexible than pre-computed dashboards because metrics are computed on-demand for any segment definition.
Unique: Extensible metric system enabling custom metric definition and aggregation alongside built-in observability, with automatic correlation to experiments and model changes.
vs others: More flexible than provider-native metrics (which are fixed) and more integrated than external analytics tools (which require manual data integration).
via “custom metric definition and tracking”
via “custom metric calculation”
via “custom metric definition and tracking for chatbot quality”
Unique: Supports conditional, context-aware metric definitions that activate based on conversation state rather than treating all conversations uniformly. This enables business-aligned quality measurement instead of generic accuracy proxies.
vs others: More flexible than standard text-overlap metrics (BLEU, ROUGE) because it allows domain-specific KPI composition; more accessible than building custom evaluation pipelines from scratch.
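One way to picture a conditional metric is as a pair of functions: a predicate that decides whether the metric applies to a conversation, and a scorer that runs only when it does. The sketch below assumes a conversation is a plain dictionary; `ConditionalMetric`, `evaluate`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConditionalMetric:
    """A metric that is scored only for conversations it applies to."""
    name: str
    applies: Callable[[dict], bool]  # does this metric apply to the conversation?
    score: Callable[[dict], float]   # score in the 0..1 range


def evaluate(conversation, metrics):
    """Score only the metrics whose condition matches the conversation state."""
    return {
        metric.name: metric.score(conversation)
        for metric in metrics
        if metric.applies(conversation)
    }


# Hypothetical business-aligned metrics for a support chatbot.
metrics = [
    ConditionalMetric(
        name="refund_resolved",
        applies=lambda c: c.get("intent") == "refund",
        score=lambda c: 1.0 if c.get("resolved") else 0.0,
    ),
    ConditionalMetric(
        name="upsell_offered",
        applies=lambda c: c.get("intent") == "purchase",
        score=lambda c: 1.0 if c.get("upsell_offered") else 0.0,
    ),
]

refund_scores = evaluate({"intent": "refund", "resolved": True}, metrics)
```

Metrics that do not apply are left out of the result rather than scored as zero, so averages are not diluted by irrelevant conversations.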
via “custom-metric-definition”
via “metric definition and management”
via “metric-definition-and-calculation”
via “custom metric and KPI definition”
via “custom-metric-collection”
© 2026 Unfragile. The platform for software for agents.