chinese-llm-benchmark
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, including commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond the leaderboards, it also provides a defect library of more than 2 million entries.
Capabilities (11 decomposed)
Multi-domain LLM performance evaluation across 8 specialized domains
Medium confidence: Evaluates Chinese LLMs across 8 major domains (Medical, Education, Finance, Law, Administrative Affairs, Psychological Health, Reasoning & Math, Language & Instruction Following) using approximately 300 specific evaluation dimensions. Each domain assessment aggregates task-specific scores (1-5 scale per question) normalized to a 0-100 point scale, then combines domain scores to produce overall model rankings. The framework uses domain-specific test questions designed to measure real-world capability rather than general language understanding.
Combines 8 specialized domain evaluations (Medical, Finance, Law, etc.) with ~300 evaluation dimensions specifically designed for Chinese LLMs, rather than generic language benchmarks. Aggregates individual question scores (1-5 scale) into normalized domain scores (0-100) then composite rankings, enabling cross-domain capability comparison. Maintains 2M+ defect library linking model failures to specific domains for root-cause analysis.
Deeper domain specialization than MMLU or C-Eval (which focus on general knowledge) and Chinese-specific evaluation design vs English-centric benchmarks like HELM or LMSys Chatbot Arena
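A minimal sketch of the aggregation described above, assuming a linear 1-5 to 0-100 mapping and equal domain weights; the domain names, scores, and weighting are illustrative and not taken from the repository's actual scoring code:

```python
from statistics import mean

# Hypothetical per-question results: each question is scored 1-5 within a domain.
results = {
    "Medical": [5, 4, 3, 5, 4],
    "Finance": [4, 4, 5, 3, 4],
    "Law":     [3, 5, 4, 4, 4],
}

def domain_score(question_scores):
    """Normalize a list of 1-5 question scores to a 0-100 domain score."""
    # Linear mapping of the 1-5 range onto 0-100: 1 -> 0, 5 -> 100.
    return (mean(question_scores) - 1) / 4 * 100

def overall_score(per_domain, weights=None):
    """Combine normalized domain scores into one composite ranking score."""
    weights = weights or {d: 1.0 for d in per_domain}   # assume equal weights
    total_weight = sum(weights.values())
    return sum(per_domain[d] * weights[d] for d in per_domain) / total_weight

per_domain = {d: domain_score(s) for d, s in results.items()}
print(per_domain)                 # e.g. {'Medical': 80.0, 'Finance': 75.0, 'Law': 75.0}
print(overall_score(per_domain))  # composite score used for ranking
```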
Multi-tier model leaderboard organization with category-based filtering
Medium confidence: Organizes 298 evaluated models into hierarchical leaderboards using primary classification (commercial vs open-source) and secondary tiers (price tier for commercial models, parameter size for open-source models). The system maintains separate ranked lists for each category, enabling users to compare models within similar cost/capability profiles. Leaderboard data is stored in markdown files (commerce2.md, reasonmodel.md, alldata.md) with model metadata (name, version, provider, parameters, pricing) and performance scores aggregated from domain evaluations.
Implements multi-dimensional leaderboard organization (commercial/open-source primary split, then price tier or parameter size secondary split) with separate ranked lists for reasoning-specialized models. Uses markdown-based leaderboard storage (commerce2.md, reasonmodel.md, alldata.md) enabling version control and community contributions. Maintains model metadata (provider, parameters, pricing) alongside evaluation scores for context-aware comparison.
More granular category-based filtering than MMLU leaderboards (which use single global ranking) and explicit price-tier organization vs Hugging Face Model Hub (which lacks domain-specific performance context)
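A rough illustration of category-based filtering over markdown-stored leaderboard data; the table layout and column names below are assumptions for the sketch, not the actual format of commerce2.md or alldata.md:

```python
# Hypothetical leaderboard rows as they might appear in a markdown table.
markdown_table = """\
| model | type | tier | score |
|-------|------|------|-------|
| gpt-x | commercial | premium | 87.4 |
| small-7b | open-source | 7B | 71.2 |
| cheap-api | commercial | ultra-cheap | 68.9 |
"""

def parse_markdown_table(text):
    """Parse a pipe-delimited markdown table into a list of dicts."""
    lines = [l.strip().strip("|") for l in text.splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].split("|")]
    rows = []
    for line in lines[2:]:                  # skip the |---| separator row
        cells = [c.strip() for c in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

rows = parse_markdown_table(markdown_table)
# Category filter: commercial models only, ranked by score within that category.
commercial = sorted(
    (r for r in rows if r["type"] == "commercial"),
    key=lambda r: float(r["score"]),
    reverse=True,
)
print([r["model"] for r in commercial])
```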
Model metadata management and comprehensive model information system
Medium confidence: Maintains comprehensive metadata for 298+ evaluated models including name, version, provider/developer organization, model type (commercial/open-source), parameter count, pricing information, release date, and availability status. Metadata is stored alongside evaluation scores in leaderboard files and enables filtering, sorting, and comparison based on model attributes. The system tracks model evolution (versions, updates) and maintains historical metadata for deprecated or superseded models.
Maintains comprehensive metadata for 298+ models (name, version, provider, parameters, pricing, availability) alongside evaluation scores in leaderboard files. Enables attribute-based filtering and comparison (by provider, parameter size, pricing tier). Tracks model versions and evolution over time within version-controlled repository.
Integrated metadata with evaluation scores vs separate model registries (Hugging Face, OpenRouter) and version-controlled metadata history vs static model information
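A small sketch of attribute-based filtering over model metadata; the record fields and example entries are hypothetical and do not reflect the repository's real schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRecord:
    """Illustrative metadata record; field names are assumptions, not the repo's schema."""
    name: str
    provider: str
    model_type: str                  # "commercial" or "open-source"
    parameters_b: Optional[float]    # parameter count in billions, None if undisclosed
    price_per_mtok: Optional[float]  # API price, None for self-hosted models
    score: float

catalog = [
    ModelRecord("example-72b", "ExampleLab", "open-source", 72, None, 79.3),
    ModelRecord("example-api", "ExampleCo", "commercial", None, 2.5, 84.1),
]

# Attribute-based filtering: open-source models at or above 70B parameters, sorted by score.
large_open = sorted(
    (m for m in catalog
     if m.model_type == "open-source" and m.parameters_b and m.parameters_b >= 70),
    key=lambda m: m.score,
    reverse=True,
)
print([m.name for m in large_open])
```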
Defect library indexing and error pattern analysis across 2M+ model failures
Medium confidence: Maintains a defect library containing over 2 million documented model errors collected during evaluation across all domains and models. The system indexes failures by model, domain, question type, and error category, enabling researchers to identify systematic failure patterns. Defect records link specific model errors to evaluation questions, domain context, and error classification, supporting root-cause analysis and model improvement research. The library serves as a queryable knowledge base for understanding model weaknesses rather than just performance scores.
Aggregates 2M+ model failures into indexed defect library linked to specific evaluation questions, domains, and models — enabling systematic error pattern analysis rather than just aggregate scores. Supports cross-model error comparison to identify shared weaknesses and domain-specific failure distributions. Provides raw failure examples for fine-tuning and adversarial testing rather than only summary statistics.
More comprehensive failure documentation than MMLU or C-Eval (which report only aggregate accuracy) and enables error-driven model improvement vs score-only benchmarks
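A toy sketch of how defect records could be indexed and queried for cross-model error patterns; the record fields and example defects are invented for illustration (the real library reportedly holds over 2 million entries):

```python
from collections import defaultdict

# Hypothetical defect records.
defects = [
    {"model": "model-a", "domain": "Medical", "question_id": "med-0412", "error": "hallucinated dosage"},
    {"model": "model-a", "domain": "Law",     "question_id": "law-0031", "error": "wrong statute cited"},
    {"model": "model-b", "domain": "Medical", "question_id": "med-0412", "error": "hallucinated dosage"},
]

# Build simple inverted indexes so failures can be queried by model or by domain.
by_model = defaultdict(list)
by_domain = defaultdict(list)
for d in defects:
    by_model[d["model"]].append(d)
    by_domain[d["domain"]].append(d)

# Example query: which questions do multiple models fail on? (shared weaknesses)
shared = defaultdict(set)
for d in defects:
    shared[d["question_id"]].add(d["model"])
print({q: models for q, models in shared.items() if len(models) > 1})
```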
Chinese language-specific evaluation with Gaokao-level academic assessment
Medium confidence: Implements specialized evaluation for Chinese language understanding and instruction following, including Gaokao (Chinese college entrance exam) level questions that test reading comprehension, writing quality, and complex reasoning in Chinese. The evaluation framework includes domain-specific language tasks (medical terminology understanding, legal document interpretation, financial report analysis) alongside general Chinese language proficiency assessment. Scoring incorporates both accuracy and response quality (1-5 scale) to capture nuanced language performance beyond binary correctness.
Incorporates Gaokao (Chinese college entrance exam) level questions into evaluation framework, testing academic-level Chinese language understanding and writing quality. Combines general language proficiency assessment with domain-specific language tasks (medical terminology, legal documents, financial reports in Chinese). Uses 1-5 quality scale for response evaluation rather than binary correctness, capturing nuanced language performance.
Chinese-specific academic assessment vs English-centric benchmarks (MMLU, HELM) and Gaokao-level difficulty calibration vs generic language benchmarks
Mathematical reasoning and logic problem evaluation with specialized scoring
Medium confidence: Evaluates models on mathematical computation, logical reasoning, and complex problem-solving through domain-specific test questions in the 'Reasoning & Math' category. The evaluation framework assesses both correctness of final answers and quality of reasoning steps (1-5 scale), capturing partial credit for correct methodology with computational errors. Supports multi-step reasoning problems, symbolic manipulation, and logical inference tasks designed to test mathematical capability beyond simple arithmetic.
Evaluates mathematical reasoning with 1-5 quality scale for reasoning steps rather than binary correctness, enabling partial credit for correct methodology with computational errors. Combines final answer accuracy with reasoning quality assessment to capture mathematical thinking capability. Includes multi-step reasoning problems and logical inference tasks beyond simple arithmetic.
More nuanced mathematical assessment than MMLU (binary correctness) and captures reasoning quality vs answer-only evaluation
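One way such partial-credit scoring could be expressed, assuming a simple weighted blend of final-answer correctness and the 1-5 reasoning-quality rating; the 0.6/0.4 weighting is an assumption, not the benchmark's published rubric:

```python
def partial_credit_score(answer_correct: bool, reasoning_quality: int,
                         answer_weight: float = 0.6) -> float:
    """Blend final-answer correctness with a 1-5 reasoning-quality rating.

    The weights here are assumptions for illustration; the benchmark's actual
    rubric weighting is not documented publicly.
    """
    if not 1 <= reasoning_quality <= 5:
        raise ValueError("reasoning_quality must be on the 1-5 scale")
    answer_part = 1.0 if answer_correct else 0.0
    reasoning_part = (reasoning_quality - 1) / 4   # normalize 1-5 to 0-1
    return answer_weight * answer_part + (1 - answer_weight) * reasoning_part

# A correct method with a slipped arithmetic step still earns partial credit.
print(partial_credit_score(answer_correct=False, reasoning_quality=4))  # 0.3
print(partial_credit_score(answer_correct=True, reasoning_quality=5))   # 1.0
```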
Professional domain-specific knowledge evaluation (Medical, Finance, Law, Administrative)
Medium confidence: Implements specialized evaluation across four professional domains (Medical, Finance, Law, Administrative Affairs) with domain-expert-designed test questions requiring specialized knowledge and reasoning. Each domain assessment uses realistic scenarios (medical case studies, financial analysis problems, legal document interpretation, administrative policy questions) to evaluate practical professional capability rather than general knowledge. Scoring incorporates domain-specific rubrics reflecting professional standards and best practices in each field.
Evaluates four professional domains (Medical, Finance, Law, Administrative) using domain-expert-designed test questions with realistic scenarios (medical case studies, financial analysis, legal document interpretation) rather than generic knowledge questions. Incorporates domain-specific scoring rubrics reflecting professional standards and best practices. Enables cross-domain comparison to identify models suitable for professional applications.
More specialized domain assessment than general benchmarks (MMLU, C-Eval) and realistic professional scenarios vs academic knowledge questions
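A hedged sketch of domain-specific rubric weighting; the criterion names and weights below are placeholders, not the benchmark's actual rubrics:

```python
# Illustrative rubric definitions; criteria and weights are assumptions.
RUBRICS = {
    "Medical": {"factual_accuracy": 0.5, "safety_of_advice": 0.3, "clarity": 0.2},
    "Law":     {"statute_accuracy": 0.5, "reasoning": 0.3, "citation_quality": 0.2},
    "Finance": {"numerical_accuracy": 0.6, "analysis_depth": 0.4},
}

def rubric_score(domain: str, criterion_scores: dict) -> float:
    """Weighted 0-100 score for one response under a domain-specific rubric."""
    rubric = RUBRICS[domain]
    return sum(rubric[c] * criterion_scores[c] for c in rubric)

print(rubric_score("Medical",
                   {"factual_accuracy": 90, "safety_of_advice": 80, "clarity": 70}))  # 83.0
```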
Psychological health and mental health knowledge assessment
Medium confidence: Evaluates models on psychological health concepts, mental health counseling knowledge, and psychological reasoning through specialized test questions in the 'Psychological Health' domain. Assessment covers mental health terminology, therapeutic approaches, psychological assessment, and ethical counseling practices. Scoring incorporates both knowledge accuracy and quality of psychological reasoning (1-5 scale) to evaluate capability for mental health support applications.
Specialized evaluation of psychological health knowledge and mental health counseling capability using domain-specific test questions. Incorporates 1-5 quality scale for psychological reasoning assessment. Addresses sensitive domain requiring both knowledge accuracy and ethical appropriateness in responses.
Dedicated mental health domain assessment vs general benchmarks lacking psychological expertise, and explicit safety consideration for sensitive mental health applications
Real-time leaderboard updates and continuous model evaluation pipeline
Medium confidence: Implements a continuous evaluation pipeline that regularly re-evaluates models and updates leaderboards with new results, consistent with the 'Really Reliable Live Evaluation' (ReLE) name. The pipeline processes new model versions, newly released models, and periodic re-evaluation of existing models to keep rankings current. Updates are published to markdown leaderboard files (commerce2.md, reasonmodel.md, alldata.md), enabling version-controlled tracking of ranking changes over time.
Implements 'Really Reliable Live Evaluation' (ReLE) with continuous evaluation pipeline that regularly re-evaluates models and updates leaderboards, maintaining current rankings as new models and versions emerge. Uses version-controlled markdown files (commerce2.md, reasonmodel.md, alldata.md) to track ranking changes over time. Enables tracking of model capability evolution rather than static one-time benchmarking.
Continuous evaluation vs one-time benchmarks (MMLU, C-Eval) and version-controlled leaderboard history vs static rankings
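A minimal sketch of the re-rendering step such a pipeline might perform after each evaluation cycle, assuming a simple rank/model/score table; the column layout is illustrative rather than the exact format of the repository's leaderboard files:

```python
from datetime import date

def render_leaderboard(rows, path):
    """Re-render a ranked markdown leaderboard from the latest evaluation results.

    `rows` is a list of (model_name, score) pairs; the layout is an assumption,
    not the exact format of commerce2.md or alldata.md.
    """
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    lines = [f"<!-- updated {date.today().isoformat()} -->",
             "| rank | model | score |",
             "|------|-------|-------|"]
    lines += [f"| {i} | {name} | {score:.1f} |"
              for i, (name, score) in enumerate(ranked, start=1)]
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")

render_leaderboard([("model-a", 84.2), ("model-b", 79.5)], "alldata_example.md")
```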
Commercial vs open-source model comparison with price-performance analysis
Medium confidence: Enables direct comparison between commercial models (ChatGPT, Claude, Gemini, Qwen, etc.) and open-source models (DeepSeek, Llama, Phi, etc.) by organizing leaderboards with separate commercial and open-source tiers. Commercial models are further categorized by pricing tier (e.g., ultra-cheap, standard, premium), while open-source models are categorized by parameter size (7B, 13B, 70B, etc.). This structure enables price-performance analysis comparing commercial API costs against open-source deployment costs.
Organizes leaderboards with explicit commercial vs open-source separation, then further categorizes commercial models by pricing tier and open-source models by parameter size. Enables direct price-performance comparison between commercial API costs and open-source deployment options. Maintains separate ranked lists for each category enabling cost-constrained model selection.
Explicit price-tier organization vs Hugging Face Model Hub (which lacks pricing context) and commercial/open-source comparison vs single-model-type benchmarks
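A simple score-per-dollar heuristic of the kind such price-performance comparison implies; the models, scores, and prices below are hypothetical, and the leaderboard's own tier-assignment rules are not documented:

```python
def score_per_dollar(score: float, price_per_mtok: float) -> float:
    """Crude value metric: composite score divided by API price per million tokens.

    A deliberately simple heuristic for illustration only.
    """
    return score / price_per_mtok

candidates = {   # hypothetical commercial models: (score, USD price per M tokens)
    "premium-api": (88.0, 15.0),
    "standard-api": (82.0, 3.0),
    "ultra-cheap-api": (74.0, 0.5),
}
best_value = max(candidates, key=lambda m: score_per_dollar(*candidates[m]))
print(best_value)   # the cheapest model wins unless the quality gap is large
```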
Reasoning-specialized model identification and separate ranking
Medium confidence: Identifies and separately ranks models with specialized reasoning capabilities (e.g., DeepSeek-R1, o1-mini, reasoning-optimized variants) in a dedicated leaderboard (reasonmodel.md). These models are evaluated on the same domain tasks but ranked separately to highlight their specialized reasoning strengths. The system recognizes that reasoning-specialized models may have different performance profiles (stronger on math/logic, potentially weaker on general knowledge) and enables comparison within the reasoning-specialist category.
Identifies and separately ranks reasoning-specialized models (e.g., DeepSeek-R1, o1-mini) in dedicated leaderboard (reasonmodel.md) rather than mixing with general-purpose models. Recognizes that reasoning-specialized models have distinct performance profiles and enables category-specific comparison. Maintains separate ranking for models optimized for complex reasoning tasks.
Explicit reasoning-specialist categorization vs single global leaderboard (which obscures reasoning-specialization benefits) and dedicated reasoning evaluation vs general benchmarks
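A short sketch of splitting reasoning-specialized models into their own ranked list, mirroring the reasonmodel.md separation; the entries and the `reasoning` flag are illustrative assumptions:

```python
# Hypothetical entries; the `reasoning` flag stands in for however the
# repository tags reasoning-specialized models.
models = [
    {"name": "general-chat", "reasoning": False, "score": 81.0},
    {"name": "r1-style",     "reasoning": True,  "score": 85.5},
    {"name": "o1-style",     "reasoning": True,  "score": 84.0},
]

def split_leaderboards(entries):
    """Rank reasoning-specialized models separately from general-purpose ones."""
    rank = lambda xs: sorted(xs, key=lambda m: m["score"], reverse=True)
    return (rank([m for m in entries if m["reasoning"]]),
            rank([m for m in entries if not m["reasoning"]]))

reasoning_board, general_board = split_leaderboards(models)
print([m["name"] for m in reasoning_board])   # ['r1-style', 'o1-style']
```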
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with chinese-llm-benchmark, ranked by overlap. Discovered automatically through the match graph.
HELM
Stanford's holistic LLM evaluation — 42 scenarios, 7 metrics including fairness, bias, toxicity.
LMSYS Chatbot Arena
Crowdsourced LLM evaluation — side-by-side blind voting, Elo ratings, most trusted LLM benchmark.
open_llm_leaderboard
open_llm_leaderboard — AI demo on HuggingFace
SEAL LLM Leaderboard
Expert-driven LLM benchmarks and updated AI model leaderboards.
DeepChecks
Automates and monitors LLMs for quality, compliance, and...
LiveBench
Continuously updated contamination-free LLM benchmark.
Best For
- ✓ ML researchers and practitioners evaluating Chinese LLM suitability for domain-specific applications
- ✓ Organizations selecting models for regulated industries (healthcare, finance, legal) requiring domain expertise validation
- ✓ Model developers benchmarking improvements against 298 competing models across standardized dimensions
- ✓ Teams with budget constraints needing to identify best-value models within cost tiers
- ✓ Developers choosing between open-source models with similar parameter counts
- ✓ Organizations evaluating commercial model subscriptions with price-performance analysis
- ✓ Teams building model selection tools requiring comprehensive model metadata
- ✓ Researchers analyzing the model landscape by provider, size, or pricing
Known Limitations
- ⚠ Evaluation methodology and scoring rubrics not fully transparent in public documentation — difficult to reproduce exact scores
- ⚠ Domain coverage limited to 8 major areas; specialized domains (robotics, chemistry, biology) not separately evaluated
- ⚠ Evaluation frequency and update cadence not specified — leaderboard staleness risk for rapidly evolving models
- ⚠ No per-sample error analysis or failure mode categorization within domains — only aggregate scores provided
- ⚠ Leaderboard tiers are static categories — no dynamic tier assignment based on real-time pricing changes
- ⚠ Price tier definitions not explicitly documented — unclear how models are assigned to cost brackets
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Repository Details
Last commit: Apr 22, 2026
About
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. It provides not only leaderboards but also a large-model defect library with more than 2 million entries, to facilitate research, analysis, and improvement of large models by the wider community.