opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%.
- Best for: benchmark scoring analysis, version performance comparison
- Type: Benchmark · Free
- Score: 42/100
- Best alternative: v0
Capabilities (2 decomposed)
benchmark scoring analysis
Medium confidence
Opus 4.7 is evaluated against the NYT Connections extended benchmark by analyzing the output of a scoring algorithm that compares word associations and connections. The implementation uses statistical models to determine the accuracy of the connections made, which yields a single clear performance metric. What distinguishes this capability is the detailed breakdown of scoring discrepancies between versions, such as the drop from 94.7% (Opus 4.6) to 41.0% (Opus 4.7, high).
Utilizes a comparative scoring system that highlights performance shifts between versions, providing insights into model evolution.
Offers a more detailed version comparison than typical benchmarks, which often provide only aggregate scores.
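The size of the drop quoted above can be expressed two ways: as an absolute change in percentage points and as a change relative to the earlier score. The following is a minimal sketch of that arithmetic using only the two scores stated in this listing; the function name `score_change` is hypothetical and is not part of any benchmark tooling.

```python
def score_change(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = new - old               # percentage points
    relative = absolute / old * 100.0  # percent of the old score
    return absolute, relative


# The two scores stated in this listing: Opus 4.6 and Opus 4.7 (high).
abs_pp, rel_pct = score_change(94.7, 41.0)
print(f"{abs_pp:+.1f} pp, {rel_pct:+.1f}% relative")  # -53.7 pp, -56.7% relative
```

Reporting both figures avoids ambiguity: "a 53.7-point drop" and "a 56.7% drop" describe the same pair of scores.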
version performance comparison
Medium confidence
Opus 4.7's performance metrics can be compared against those of previous versions, specifically on the NYT Connections benchmark. This is done through a structured logging system that captures and analyzes historical performance data, so users can visualize trends and identify regression points. What distinguishes this capability is its emphasis on version-to-version analysis rather than absolute performance alone.
Focuses on historical performance data to provide insights into model regressions and improvements over time.
More granular in tracking performance changes than standard benchmarking tools, which often lack historical context.
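As an illustration of identifying regression points from historical data, the sketch below scans an ordered score history and flags any version-to-version drop at or above a threshold. The record type `Run`, the function `find_regressions`, and the 5-point threshold are all assumptions made for this example; the listing does not document the actual logging format. The history contains only the two scores stated here.

```python
from typing import NamedTuple


class Run(NamedTuple):
    version: str
    score: float  # percent correct on the benchmark


def find_regressions(
    history: list[Run], threshold_pp: float = 5.0
) -> list[tuple[str, str, float]]:
    """Flag consecutive versions whose score dropped by at least threshold_pp points."""
    regressions = []
    for prev, curr in zip(history, history[1:]):
        drop = prev.score - curr.score
        if drop >= threshold_pp:
            regressions.append((prev.version, curr.version, round(drop, 1)))
    return regressions


# Hypothetical log, oldest first, using the two scores from this listing.
history = [Run("opus 4.6", 94.7), Run("opus 4.7 (high)", 41.0)]
regressions = find_regressions(history)
print(regressions)
```

With only two entries the result is a single flagged pair; as the limitations on this page note, the comparison is only as useful as the history available.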
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts sharing capabilities
Artifacts that share capabilities with opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%., ranked by overlap. Discovered automatically through the match graph.
Open LLM Leaderboard
Hugging Face open-source LLM leaderboard — standardized benchmarks, automatic evaluation.
open_llm_leaderboard
open_llm_leaderboard — AI demo on HuggingFace
Pgrammer
Revolutionize coding interview prep with AI-driven, personalized challenges and real-time...
LangSmith
LangChain's LLMOps platform — tracing, evaluation, prompt hub, dataset management, annotation.
PromptBench
Microsoft's unified LLM evaluation and prompt robustness benchmark.
Stable Beluga
A finetuned LLaMA 65B...
Best For
- ✓ Data scientists and AI researchers analyzing model performance
- ✓ Developers maintaining AI models and seeking to understand performance changes
Known Limitations
- ⚠ The benchmark only evaluates specific types of word connections, limiting its applicability to broader language tasks.
- ⚠ Requires historical data to be effective; without it, comparisons are limited.
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.