chinese-llm-benchmark
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, including commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. Beyond the leaderboards, it also provides a defect library of more than 2 million entries.
Capabilities (11 decomposed)
Multi-domain LLM performance evaluation across 8 specialized domains
Medium confidence: Evaluates Chinese LLMs across 8 major domains (Medical, Education, Finance, Law, Administrative Affairs, Psychological Health, Reasoning & Math, Language & Instruction Following) using approximately 300 specific evaluation dimensions. Each domain assessment aggregates task-specific scores (1-5 scale per question) normalized to a 0-100 point scale, then combines domain scores to produce overall model rankings. The framework uses domain-specific test questions designed to measure real-world capability rather than general language understanding.
Combines 8 specialized domain evaluations (Medical, Finance, Law, etc.) with ~300 evaluation dimensions specifically designed for Chinese LLMs, rather than generic language benchmarks. Aggregates individual question scores (1-5 scale) into normalized domain scores (0-100) then composite rankings, enabling cross-domain capability comparison. Maintains 2M+ defect library linking model failures to specific domains for root-cause analysis.
Deeper domain specialization than MMLU or C-Eval (which focus on general knowledge) and Chinese-specific evaluation design vs English-centric benchmarks like HELM or LMSys Chatbot Arena
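A minimal sketch of the aggregation described above, assuming a linear 1-5 to 0-100 mapping and equal domain weights; the domain names, scores, and weighting are illustrative and not taken from the repository's actual scoring code:

```python
from statistics import mean

# Hypothetical per-question results: each question is scored 1-5 within a domain.
results = {
    "Medical": [5, 4, 3, 5, 4],
    "Finance": [4, 4, 5, 3, 4],
    "Law":     [3, 5, 4, 4, 4],
}

def domain_score(question_scores):
    """Normalize a list of 1-5 question scores to a 0-100 domain score."""
    # Linear mapping of the 1-5 range onto 0-100: 1 -> 0, 5 -> 100.
    return (mean(question_scores) - 1) / 4 * 100

def overall_score(per_domain, weights=None):
    """Combine normalized domain scores into one composite ranking score."""
    weights = weights or {d: 1.0 for d in per_domain}   # assume equal weights
    total_weight = sum(weights.values())
    return sum(per_domain[d] * weights[d] for d in per_domain) / total_weight

per_domain = {d: domain_score(s) for d, s in results.items()}
print(per_domain)                 # e.g. {'Medical': 80.0, 'Finance': 75.0, 'Law': 75.0}
print(overall_score(per_domain))  # composite score used for ranking
```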
Multi-tier model leaderboard organization with category-based filtering
Medium confidence: Organizes 298 evaluated models into hierarchical leaderboards using primary classification (commercial vs open-source) and secondary tiers (price tier for commercial models, parameter size for open-source models). The system maintains separate ranked lists for each category, enabling users to compare models within similar cost/capability profiles. Leaderboard data is stored in markdown files (commerce2.md, reasonmodel.md, alldata.md) with model metadata (name, version, provider, parameters, pricing) and performance scores aggregated from domain evaluations.
Implements multi-dimensional leaderboard organization (commercial/open-source primary split, then price tier or parameter size secondary split) with separate ranked lists for reasoning-specialized models. Uses markdown-based leaderboard storage (commerce2.md, reasonmodel.md, alldata.md) enabling version control and community contributions. Maintains model metadata (provider, parameters, pricing) alongside evaluation scores for context-aware comparison.
More granular category-based filtering than MMLU leaderboards (which use single global ranking) and explicit price-tier organization vs Hugging Face Model Hub (which lacks domain-specific performance context)
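A rough illustration of category-based filtering over markdown-stored leaderboard data; the table layout and column names below are assumptions for the sketch, not the actual format of commerce2.md or alldata.md:

```python
# Hypothetical leaderboard rows as they might appear in a markdown table.
markdown_table = """\
| model | type | tier | score |
|-------|------|------|-------|
| gpt-x | commercial | premium | 87.4 |
| small-7b | open-source | 7B | 71.2 |
| cheap-api | commercial | ultra-cheap | 68.9 |
"""

def parse_markdown_table(text):
    """Parse a pipe-delimited markdown table into a list of dicts."""
    lines = [l.strip().strip("|") for l in text.splitlines() if l.strip()]
    header = [c.strip() for c in lines[0].split("|")]
    rows = []
    for line in lines[2:]:                  # skip the |---| separator row
        cells = [c.strip() for c in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows

rows = parse_markdown_table(markdown_table)
# Category filter: commercial models only, ranked by score within that category.
commercial = sorted(
    (r for r in rows if r["type"] == "commercial"),
    key=lambda r: float(r["score"]),
    reverse=True,
)
print([r["model"] for r in commercial])
```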
Model metadata management and comprehensive model information system
Medium confidence: Maintains comprehensive metadata for 298+ evaluated models including name, version, provider/developer organization, model type (commercial/open-source), parameter count, pricing information, release date, and availability status. Metadata is stored alongside evaluation scores in leaderboard files and enables filtering, sorting, and comparison based on model attributes. The system tracks model evolution (versions, updates) and maintains historical metadata for deprecated or superseded models.
Maintains comprehensive metadata for 298+ models (name, version, provider, parameters, pricing, availability) alongside evaluation scores in leaderboard files. Enables attribute-based filtering and comparison (by provider, parameter size, pricing tier). Tracks model versions and evolution over time within version-controlled repository.
Integrated metadata with evaluation scores vs separate model registries (Hugging Face, OpenRouter) and version-controlled metadata history vs static model information
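A small sketch of attribute-based filtering over model metadata; the record fields and example entries are hypothetical and do not reflect the repository's real schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelRecord:
    """Illustrative metadata record; field names are assumptions, not the repo's schema."""
    name: str
    provider: str
    model_type: str                  # "commercial" or "open-source"
    parameters_b: Optional[float]    # parameter count in billions, None if undisclosed
    price_per_mtok: Optional[float]  # API price, None for self-hosted models
    score: float

catalog = [
    ModelRecord("example-72b", "ExampleLab", "open-source", 72, None, 79.3),
    ModelRecord("example-api", "ExampleCo", "commercial", None, 2.5, 84.1),
]

# Attribute-based filtering: open-source models at or above 70B parameters, sorted by score.
large_open = sorted(
    (m for m in catalog
     if m.model_type == "open-source" and m.parameters_b and m.parameters_b >= 70),
    key=lambda m: m.score,
    reverse=True,
)
print([m.name for m in large_open])
```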
Defect library indexing and error pattern analysis across 2M+ model failures
Medium confidence: Maintains a defect library containing over 2 million documented model errors collected during evaluation across all domains and models. The system indexes failures by model, domain, question type, and error category, enabling researchers to identify systematic failure patterns. Defect records link specific model errors to evaluation questions, domain context, and error classification, supporting root-cause analysis and model improvement research. The library serves as a queryable knowledge base for understanding model weaknesses rather than just performance scores.
Aggregates 2M+ model failures into indexed defect library linked to specific evaluation questions, domains, and models — enabling systematic error pattern analysis rather than just aggregate scores. Supports cross-model error comparison to identify shared weaknesses and domain-specific failure distributions. Provides raw failure examples for fine-tuning and adversarial testing rather than only summary statistics.
More comprehensive failure documentation than MMLU or C-Eval (which report only aggregate accuracy) and enables error-driven model improvement vs score-only benchmarks
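A toy sketch of how defect records could be indexed and queried for cross-model error patterns; the record fields and example defects are invented for illustration (the real library reportedly holds over 2 million entries):

```python
from collections import defaultdict

# Hypothetical defect records.
defects = [
    {"model": "model-a", "domain": "Medical", "question_id": "med-0412", "error": "hallucinated dosage"},
    {"model": "model-a", "domain": "Law",     "question_id": "law-0031", "error": "wrong statute cited"},
    {"model": "model-b", "domain": "Medical", "question_id": "med-0412", "error": "hallucinated dosage"},
]

# Build simple inverted indexes so failures can be queried by model or by domain.
by_model = defaultdict(list)
by_domain = defaultdict(list)
for d in defects:
    by_model[d["model"]].append(d)
    by_domain[d["domain"]].append(d)

# Example query: which questions do multiple models fail on? (shared weaknesses)
shared = defaultdict(set)
for d in defects:
    shared[d["question_id"]].add(d["model"])
print({q: models for q, models in shared.items() if len(models) > 1})
```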
Chinese language-specific evaluation with Gaokao-level academic assessment
Medium confidence: Implements specialized evaluation for Chinese language understanding and instruction following, including Gaokao (Chinese college entrance exam) level questions that test reading comprehension, writing quality, and complex reasoning in Chinese. The evaluation framework includes domain-specific language tasks (medical terminology understanding, legal document interpretation, financial report analysis) alongside general Chinese language proficiency assessment. Scoring incorporates both accuracy and response quality (1-5 scale) to capture nuanced language performance beyond binary correctness.
Incorporates Gaokao (Chinese college entrance exam) level questions into evaluation framework, testing academic-level Chinese language understanding and writing quality. Combines general language proficiency assessment with domain-specific language tasks (medical terminology, legal documents, financial reports in Chinese). Uses 1-5 quality scale for response evaluation rather than binary correctness, capturing nuanced language performance.
Chinese-specific academic assessment vs English-centric benchmarks (MMLU, HELM) and Gaokao-level difficulty calibration vs generic language benchmarks
Mathematical reasoning and logic problem evaluation with specialized scoring
Medium confidence: Evaluates models on mathematical computation, logical reasoning, and complex problem-solving through domain-specific test questions in the 'Reasoning & Math' category. The evaluation framework assesses both correctness of final answers and quality of reasoning steps (1-5 scale), capturing partial credit for correct methodology with computational errors. Supports multi-step reasoning problems, symbolic manipulation, and logical inference tasks designed to test mathematical capability beyond simple arithmetic.
Evaluates mathematical reasoning with 1-5 quality scale for reasoning steps rather than binary correctness, enabling partial credit for correct methodology with computational errors. Combines final answer accuracy with reasoning quality assessment to capture mathematical thinking capability. Includes multi-step reasoning problems and logical inference tasks beyond simple arithmetic.
More nuanced mathematical assessment than MMLU (binary correctness) and captures reasoning quality vs answer-only evaluation
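One way such partial-credit scoring could be expressed, assuming a simple weighted blend of final-answer correctness and the 1-5 reasoning-quality rating; the 0.6/0.4 weighting is an assumption, not the benchmark's published rubric:

```python
def partial_credit_score(answer_correct: bool, reasoning_quality: int,
                         answer_weight: float = 0.6) -> float:
    """Blend final-answer correctness with a 1-5 reasoning-quality rating.

    The weights here are assumptions for illustration; the benchmark's actual
    rubric weighting is not documented publicly.
    """
    if not 1 <= reasoning_quality <= 5:
        raise ValueError("reasoning_quality must be on the 1-5 scale")
    answer_part = 1.0 if answer_correct else 0.0
    reasoning_part = (reasoning_quality - 1) / 4   # normalize 1-5 to 0-1
    return answer_weight * answer_part + (1 - answer_weight) * reasoning_part

# A correct method with a slipped arithmetic step still earns partial credit.
print(partial_credit_score(answer_correct=False, reasoning_quality=4))  # 0.3
print(partial_credit_score(answer_correct=True, reasoning_quality=5))   # 1.0
```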
Professional domain-specific knowledge evaluation (Medical, Finance, Law, Administrative)
Medium confidence: Implements specialized evaluation across four professional domains (Medical, Finance, Law, Administrative Affairs) with domain-expert-designed test questions requiring specialized knowledge and reasoning. Each domain assessment uses realistic scenarios (medical case studies, financial analysis problems, legal document interpretation, administrative policy questions) to evaluate practical professional capability rather than general knowledge. Scoring incorporates domain-specific rubrics reflecting professional standards and best practices in each field.
Evaluates four professional domains (Medical, Finance, Law, Administrative) using domain-expert-designed test questions with realistic scenarios (medical case studies, financial analysis, legal document interpretation) rather than generic knowledge questions. Incorporates domain-specific scoring rubrics reflecting professional standards and best practices. Enables cross-domain comparison to identify models suitable for professional applications.
More specialized domain assessment than general benchmarks (MMLU, C-Eval) and realistic professional scenarios vs academic knowledge questions
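A hedged sketch of domain-specific rubric weighting; the criterion names and weights below are placeholders, not the benchmark's actual rubrics:

```python
# Illustrative rubric definitions; criteria and weights are assumptions.
RUBRICS = {
    "Medical": {"factual_accuracy": 0.5, "safety_of_advice": 0.3, "clarity": 0.2},
    "Law":     {"statute_accuracy": 0.5, "reasoning": 0.3, "citation_quality": 0.2},
    "Finance": {"numerical_accuracy": 0.6, "analysis_depth": 0.4},
}

def rubric_score(domain: str, criterion_scores: dict) -> float:
    """Weighted 0-100 score for one response under a domain-specific rubric."""
    rubric = RUBRICS[domain]
    return sum(rubric[c] * criterion_scores[c] for c in rubric)

print(rubric_score("Medical",
                   {"factual_accuracy": 90, "safety_of_advice": 80, "clarity": 70}))  # 83.0
```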
Psychological health and mental health knowledge assessment
Medium confidence: Evaluates models on psychological health concepts, mental health counseling knowledge, and psychological reasoning through specialized test questions in the 'Psychological Health' domain. Assessment covers mental health terminology, therapeutic approaches, psychological assessment, and ethical counseling practices. Scoring incorporates both knowledge accuracy and quality of psychological reasoning (1-5 scale) to evaluate capability for mental health support applications.
Specialized evaluation of psychological health knowledge and mental health counseling capability using domain-specific test questions. Incorporates 1-5 quality scale for psychological reasoning assessment. Addresses sensitive domain requiring both knowledge accuracy and ethical appropriateness in responses.
Dedicated mental health domain assessment vs general benchmarks lacking psychological expertise, and explicit safety consideration for sensitive mental health applications
Real-time leaderboard updates and continuous model evaluation pipeline
Medium confidence: Implements a continuous evaluation pipeline that regularly re-evaluates models and updates leaderboards with new results, consistent with the 'Really Reliable Live Evaluation' (ReLE) name. The pipeline processes new model versions, newly released models, and periodic re-evaluation of existing models to keep rankings current. Updates are published to markdown leaderboard files (commerce2.md, reasonmodel.md, alldata.md), enabling version-controlled tracking of ranking changes over time.
Implements 'Really Reliable Live Evaluation' (ReLE) with continuous evaluation pipeline that regularly re-evaluates models and updates leaderboards, maintaining current rankings as new models and versions emerge. Uses version-controlled markdown files (commerce2.md, reasonmodel.md, alldata.md) to track ranking changes over time. Enables tracking of model capability evolution rather than static one-time benchmarking.
Continuous evaluation vs one-time benchmarks (MMLU, C-Eval) and version-controlled leaderboard history vs static rankings
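A minimal sketch of the re-rendering step such a pipeline might perform after each evaluation cycle, assuming a simple rank/model/score table; the column layout is illustrative rather than the exact format of the repository's leaderboard files:

```python
from datetime import date

def render_leaderboard(rows, path):
    """Re-render a ranked markdown leaderboard from the latest evaluation results.

    `rows` is a list of (model_name, score) pairs; the layout is an assumption,
    not the exact format of commerce2.md or alldata.md.
    """
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    lines = [f"<!-- updated {date.today().isoformat()} -->",
             "| rank | model | score |",
             "|------|-------|-------|"]
    lines += [f"| {i} | {name} | {score:.1f} |"
              for i, (name, score) in enumerate(ranked, start=1)]
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")

render_leaderboard([("model-a", 84.2), ("model-b", 79.5)], "alldata_example.md")
```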
Commercial vs open-source model comparison with price-performance analysis
Medium confidence: Enables direct comparison between commercial models (ChatGPT, Claude, Gemini, Qwen, etc.) and open-source models (DeepSeek, Llama, Phi, etc.) by organizing leaderboards with separate commercial and open-source tiers. Commercial models are further categorized by pricing tier (e.g., ultra-cheap, standard, premium), while open-source models are categorized by parameter size (7B, 13B, 70B, etc.). This structure enables price-performance analysis comparing commercial API costs against open-source deployment costs.
Organizes leaderboards with explicit commercial vs open-source separation, then further categorizes commercial models by pricing tier and open-source models by parameter size. Enables direct price-performance comparison between commercial API costs and open-source deployment options. Maintains separate ranked lists for each category enabling cost-constrained model selection.
Explicit price-tier organization vs Hugging Face Model Hub (which lacks pricing context) and commercial/open-source comparison vs single-model-type benchmarks
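A simple score-per-dollar heuristic of the kind such price-performance comparison implies; the models, scores, and prices below are hypothetical, and the leaderboard's own tier-assignment rules are not documented:

```python
def score_per_dollar(score: float, price_per_mtok: float) -> float:
    """Crude value metric: composite score divided by API price per million tokens.

    A deliberately simple heuristic for illustration only.
    """
    return score / price_per_mtok

candidates = {   # hypothetical commercial models: (score, USD price per M tokens)
    "premium-api": (88.0, 15.0),
    "standard-api": (82.0, 3.0),
    "ultra-cheap-api": (74.0, 0.5),
}
best_value = max(candidates, key=lambda m: score_per_dollar(*candidates[m]))
print(best_value)   # the cheapest model wins unless the quality gap is large
```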
Reasoning-specialized model identification and separate ranking
Medium confidence: Identifies and separately ranks models with specialized reasoning capabilities (e.g., DeepSeek-R1, o1-mini, reasoning-optimized variants) in a dedicated leaderboard (reasonmodel.md). These models are evaluated on the same domain tasks but ranked separately to highlight their specialized reasoning strengths. The system recognizes that reasoning-specialized models may have different performance profiles (stronger on math/logic, potentially weaker on general knowledge) and enables comparison within the reasoning-specialist category.
Identifies and separately ranks reasoning-specialized models (e.g., DeepSeek-R1, o1-mini) in dedicated leaderboard (reasonmodel.md) rather than mixing with general-purpose models. Recognizes that reasoning-specialized models have distinct performance profiles and enables category-specific comparison. Maintains separate ranking for models optimized for complex reasoning tasks.
Explicit reasoning-specialist categorization vs single global leaderboard (which obscures reasoning-specialization benefits) and dedicated reasoning evaluation vs general benchmarks
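A short sketch of splitting reasoning-specialized models into their own ranked list, mirroring the reasonmodel.md separation; the entries and the `reasoning` flag are illustrative assumptions:

```python
# Hypothetical entries; the `reasoning` flag stands in for however the
# repository tags reasoning-specialized models.
models = [
    {"name": "general-chat", "reasoning": False, "score": 81.0},
    {"name": "r1-style",     "reasoning": True,  "score": 85.5},
    {"name": "o1-style",     "reasoning": True,  "score": 84.0},
]

def split_leaderboards(entries):
    """Rank reasoning-specialized models separately from general-purpose ones."""
    rank = lambda xs: sorted(xs, key=lambda m: m["score"], reverse=True)
    return (rank([m for m in entries if m["reasoning"]]),
            rank([m for m in entries if not m["reasoning"]]))

reasoning_board, general_board = split_leaderboards(models)
print([m["name"] for m in reasoning_board])   # ['r1-style', 'o1-style']
```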
Capabilities are decomposed by AI analysis. Each maps to specific user intents and improves with match feedback.
Related Artifacts (sharing capabilities)
Artifacts that share capabilities with chinese-llm-benchmark, ranked by overlap. Discovered automatically through the match graph.
HELM
Stanford's holistic LLM evaluation — 42 scenarios, 7 metrics including fairness, bias, toxicity.
LMSYS Chatbot Arena
Crowdsourced LLM evaluation — side-by-side blind voting, Elo ratings, most trusted LLM benchmark.
open_llm_leaderboard
open_llm_leaderboard — AI demo on HuggingFace
SEAL LLM Leaderboard
Expert-driven LLM benchmarks and updated AI model leaderboards.
DeepChecks
Automates and monitors LLMs for quality, compliance, and...
LiveBench
Continuously updated contamination-free LLM benchmark.
Best For
- ✓ ML researchers and practitioners evaluating Chinese LLM suitability for domain-specific applications
- ✓ Organizations selecting models for regulated industries (healthcare, finance, legal) requiring domain expertise validation
- ✓ Model developers benchmarking improvements against 298 competing models across standardized dimensions
- ✓ Teams with budget constraints needing to identify best-value models within cost tiers
- ✓ Developers choosing between open-source models with similar parameter counts
- ✓ Organizations evaluating commercial model subscriptions with price-performance analysis
- ✓ Teams building model selection tools requiring comprehensive model metadata
- ✓ Researchers analyzing the model landscape by provider, size, or pricing
Known Limitations
- ⚠ Evaluation methodology and scoring rubrics not fully transparent in public documentation — difficult to reproduce exact scores
- ⚠ Domain coverage limited to 8 major areas; specialized domains (robotics, chemistry, biology) not separately evaluated
- ⚠ Evaluation frequency and update cadence not specified — leaderboard staleness risk for rapidly evolving models
- ⚠ No per-sample error analysis or failure mode categorization within domains — only aggregate scores provided
- ⚠ Leaderboard tiers are static categories — no dynamic tier assignment based on real-time pricing changes
- ⚠ Price tier definitions not explicitly documented — unclear how models are assigned to cost brackets
UnfragileRank
UnfragileRank is computed from adoption signals, documentation quality, ecosystem connectivity, match graph feedback, and freshness. No artifact can pay for a higher rank.
Repository Details
Last commit: Apr 22, 2026
About
ReLE evaluation: a capability benchmark for Chinese AI large models (continuously updated). It currently covers 359 models, spanning commercial models such as ChatGPT, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.6, Wenxin ERNIE-X1.1, ERNIE-5.0, qwen3-max, qwen3.5-plus, Baichuan, iFlytek Spark, and SenseTime SenseChat, as well as open-source models such as step3.5-flash, kimi-k2.5, ernie4.5, MiniMax-M2.5, deepseek-v3.2, Qwen3.5, llama4, Zhipu GLM-5, GLM-4.7, LongCat, gemma3, and mistral. It provides not only leaderboards but also a large-model defect library with more than 2 million entries, to facilitate research, analysis, and improvement of large models by the wider community.