The platform offers a wide range of built-in open-source evaluation sets. The following introduces some of these evaluation sets, the model capabilities they measure, and the actual scenarios in which they apply, as a reference for your evaluations:
Mathematics
1. GSM8K
Introduction:
GSM8K is a data set that contains approximately 8,500 grade-school math problems. It is primarily designed to test basic mathematical inference capabilities. The problems involve elementary arithmetic knowledge such as addition, subtraction, multiplication, division, fractions, and decimals.
Application:
GSM8K is mainly used to evaluate the capabilities of models to solve grade-school math problems, especially the multi-step mathematical inference capabilities. For example, a model must understand the textual description in a math problem, convert it into mathematical expressions, and output a final answer through multi-step calculations.
Actual scenarios:
In the education field, GSM8K can be used to develop intelligent tutoring systems that help students practice basic math problems and provide detailed solution steps.
In the AI research field, GSM8K is an important benchmark for testing whether a model has basic logical inference capabilities.
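For reference, the following is a minimal sketch of how a GSM8K-style answer can be checked automatically. It assumes the common GSM8K convention that the reference solution ends with a line of the form "#### <number>"; the sample problem, model reply, and helper names are hypothetical.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in a model reply, dropping commas and dollar signs."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace("$", ""))
    return matches[-1].replace(",", "") if matches else None

def gsm8k_correct(reference_solution: str, model_reply: str) -> bool:
    # GSM8K reference solutions conventionally end with "#### <answer>".
    gold = reference_solution.split("####")[-1].strip().replace(",", "")
    pred = extract_final_number(model_reply)
    return pred is not None and float(pred) == float(gold)

# Hypothetical item and reply, for illustration only.
reference = "Each box holds 12 eggs, so 4 boxes hold 4 * 12 = 48 eggs.\n#### 48"
reply = "There are 4 boxes with 12 eggs each, so 4 x 12 = 48. The answer is 48."
print(gsm8k_correct(reference, reply))  # True
```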
2. MATH
Introduction:
MATH is a data set that covers math problems with different difficulty ratings from elementary school to senior high school. It contains 12,500 math problems and can be used to test the advanced mathematical capabilities of models.
Application:
MATH is used to evaluate the capabilities of models in solving complex mathematical problems, covering advanced mathematical knowledge such as algebra, geometry, and calculus.
Actual scenarios:
In academic research, MATH can be used to test whether a model can handle math problems in high school or college, such as solving equations or proving geometric theorems.
In the field of educational technology, MATH can be used to develop advanced mathematics learning tools that help students understand complex mathematical concepts.
3. MATH-500
Introduction:
MATH-500 is a curated subset (sourced from international mathematics competitions such as the American Mathematics Competitions (AMC) and the American Invitational Mathematics Examination (AIME)) of the classic mathematical inference data set (MATH). It contains 500 competition-level math problems spanning topics such as algebra, geometry, number theory, and combinatorics from high school to early college levels. The problems emphasize multi-step logical inference and creative problem-solving strategies, and require the application of theorems, constructive proofs, and non-standard solution techniques. Its features include:
MATH-500 provides complete problem-solving steps in LaTeX format for each math problem. Difficulty levels range from 1 to 5, with 5 being the most difficult. Even human experts at the level of International Mathematical Olympiad (IMO) gold medalists reach an accuracy of only about 90%.
Application:
MATH-500 is mainly used to evaluate the extreme capabilities of LLMs in complex mathematical inference. It specifically tests symbolic operations (such as polynomial simplification and inequality proofs), abstract modeling (translating verbal descriptions into mathematical structures), and long-chain inference (maintaining logical consistency beyond 15 steps).
Actual scenarios:
MATH-500 provides adaptive generation of question banks and solution path feedback for competition preparation platforms such as the Art of Problem Solving (AoPS) community.
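Because MATH-500 ships complete LaTeX solutions, graders typically compare the model's final expression against the reference answer wrapped in \boxed{...}. The sketch below only illustrates that idea with a naive string comparison; real graders normalize LaTeX far more carefully, and the sample solution and reply are hypothetical.

```python
import re

def extract_boxed(latex_text: str) -> str | None:
    """Grab the content of the last \\boxed{...} (nested braces are not handled)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", latex_text)
    return matches[-1].strip() if matches else None

def normalize(expr: str) -> str:
    # Very rough normalization: drop spaces and a couple of cosmetic LaTeX tokens.
    return expr.replace(" ", "").replace("\\left", "").replace("\\right", "")

reference = r"Completing the square gives $(x-3)^2 = 16$, so $x = 7$ or $x = -1$. The larger root is $\boxed{7}$."
model_reply = r"The roots are -1 and 7, so the answer is \boxed{ 7 }."
print(normalize(extract_boxed(model_reply)) == normalize(extract_boxed(reference)))  # True
```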
4. AIME2024
Introduction:
The American Invitational Mathematics Examination (AIME) is a high-difficulty invitational competition for American Mathematics Competitions (AMC) contest winners. AIME 2024 (and AIME 2025) is widely used as a gold-standard test set for evaluating the mathematical inference capabilities of LLMs. Its characteristics are as follows: AIME focuses on core Olympiad areas such as number theory, combinatorial optimization, and geometric proofs; AIME emphasizes intuitive insight and calculation efficiency (questions must be answered within a limited time); and every answer is an integer from 0 to 999 filled into a blank, which avoids interference from linguistic expression.
Application:
AIME scores have become a key metric for measuring whether an LLM has achieved a fundamental leap in inference capabilities. For example:
DeepSeek-R1-0528 (updated in May 2025): its accuracy on AIME 2025 jumped from 70% to 87.5%, and the model achieved deep thinking by extending its chain of thought to 23,000 tokens per problem.
Actual scenarios:
Algorithm optimization sandbox: AIME drives innovation in Reinforcement Learning from Human Feedback (RLHF) technology. For example, entropy regularization methods such as Clip-Cov and KL-Cov effectively prevent entropy collapse in reinforcement learning.
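Since every AIME answer is an integer from 0 to 999, scoring a model on AIME 2024 largely reduces to normalizing its reply to such an integer and comparing it with the official key. The sketch below assumes that convention; the sample reply and expected answer are hypothetical.

```python
import re

def normalize_aime_answer(reply: str) -> int | None:
    """Return the last integer in the reply if it is a valid AIME answer (0-999), else None."""
    numbers = re.findall(r"\d+", reply)
    if not numbers:
        return None
    value = int(numbers[-1])
    return value if 0 <= value <= 999 else None

# Hypothetical model reply and answer key.
reply = "Summing the three cases gives 204 + 36 + 1, so the requested count is 241."
print(normalize_aime_answer(reply) == 241)  # True
```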
5. TheoremQA
Introduction:
TheoremQA is a theorem-driven question answering benchmark. It is an evaluation data set focused on mathematical theorem inference, covering more than 350 high-difficulty problems in areas such as geometry, number theory, and calculus. Its unique value lies in requiring models to perform inference strictly using mathematical theorems, formulas, and proof logic, rather than relying on statistical pattern matching, to directly test the strictness of models in mathematical inference.
Application:
TheoremQA is used to evaluate the capabilities of LLMs in the application of complex mathematical theorems, symbolic inference, and formula derivation.
Actual scenarios:
TheoremQA is suitable for developing academic research tools (such as verifying solutions to math problems) and educational intelligent systems (such as advanced math tutoring), and optimizing logical inference modules (such as enhancing the symbolic computation capabilities of Mixture of Experts (MoE)-based architectures).
Knowledge Q&A
1. MMLU
Introduction:
Massive Multitask Language Understanding (MMLU) is a data set covering 57 subjects across a wide range of fields. It contains 15,908 problems to test the interdisciplinary comprehensive understanding and inference capabilities of models.
Application:
MMLU is used to evaluate the interdisciplinary comprehensive understanding and inference capabilities of models, covering fields such as the humanities, social sciences, and Science, Technology, Engineering, and Mathematics (STEM).
Actual scenarios:
In the development of intelligent assistants, MMLU can be used to train a model to answer a wide range of your questions, such as those about historical events, scientific principles, and literary works.
In academic research, MMLU is an important tool for testing whether a model has a broad knowledge base and interdisciplinary inference capabilities.
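MMLU items are four-option multiple-choice questions, so evaluation usually amounts to comparing the predicted letter with the gold letter and aggregating accuracy per subject. A minimal sketch follows; the record format and subject names are hypothetical.

```python
from collections import defaultdict

# Hypothetical predictions: each record carries the subject, the gold letter, and the predicted letter.
records = [
    {"subject": "world_history", "gold": "B", "pred": "B"},
    {"subject": "world_history", "gold": "D", "pred": "A"},
    {"subject": "college_physics", "gold": "C", "pred": "C"},
]

totals, correct = defaultdict(int), defaultdict(int)
for r in records:
    totals[r["subject"]] += 1
    correct[r["subject"]] += int(r["pred"] == r["gold"])

for subject in totals:
    print(f"{subject}: {correct[subject] / totals[subject]:.2%}")
print(f"overall: {sum(correct.values()) / sum(totals.values()):.2%}")
```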
2. ARC-c
Introduction:
AI2 Reasoning Challenge - Challenge Set (ARC-c) is a science question-answering data set containing 2,590 challenging problems. It is used to test the deep inference capabilities of a model on complex scientific questions.
Application:
ARC-c is used to evaluate the deep inference capabilities of a model in solving complex scientific problems, such as high-difficulty problems in fields such as physics, chemistry, and biology.
Actual scenarios:
In scientific research, ARC-c can be used to test whether a model can understand and solve complex scientific problems, such as explaining physical phenomena or deriving chemical reactions.
In the education field, ARC-c can be used to develop advanced science learning tools to help students understand complex scientific concepts easily.
3. ARC-e
Introduction:
AI2 Reasoning Challenge - Easy Set (ARC-e) is a simplified version of ARC-c, containing 5,197 straightforward problems. It is mainly used to test the understanding of fundamental scientific knowledge and the simple inference capabilities of models.
Application:
ARC-e is used to evaluate the understanding of fundamental scientific knowledge and simple inference capabilities of a model, such as explaining fundamental scientific principles or answering simple science questions.
Actual scenarios:
In the education field, ARC-e can be used to develop science learning tools for students from elementary school and middle school, helping them understand fundamental scientific knowledge.
In the science popularization field, ARC-e can be used to train intelligent assistants to answer simple science questions posed by the public.
4. TruthfulQA
Introduction:
TruthfulQA comprises 817 questions designed to elicit imitative falsehoods, covering hallucination-prone fields such as health, law, and history. It directly evaluates the truthfulness of model outputs by testing a model's factual accuracy and its resistance to misinformation on adversarial questions.
Application:
TruthfulQA is used to quantify the truthfulness of content generated by a model and its ability to identify implicit erroneous assumptions.
Actual scenarios:
TruthfulQA is suitable for developing fact-checking systems (such as automatically identifying contradictory statements), enhancing search engines (such as reliability grading), and designing safe mechanisms (such as reducing harmful, misleading outputs).
Language Understanding and Generation
1. Hellaswag
Introduction:
Hellaswag is a commonsense inference data set containing 70,000 problems. It is designed to test the contextual understanding capabilities of models, particularly their comprehension and inference capabilities in daily scenarios.
Application:
Hellaswag is used to evaluate the commonsense inference and contextual understanding capabilities of models, such as understanding the causal relationships in daily scenarios or predicting what might happen next.
Actual scenarios:
In the development of intelligent assistants, Hellaswag can be used to train a model to better understand your daily chat, such as answering questions like "What should I bring if it rains?"
In the chatbot field, Hellaswag can be used to enhance a chatbot's decision-making capabilities in daily environments.
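Hellaswag is typically evaluated by scoring each of the four candidate endings under the model and selecting the most likely one. The sketch below shows only that selection step; the context, endings, and the stand-in log-likelihood scores are hypothetical placeholders for real model outputs.

```python
def pick_ending(scores: list[float]) -> int:
    """Return the index of the highest-scoring candidate ending."""
    return scores.index(max(scores))

# Hypothetical item: a context, four candidate endings, and stand-in model log-likelihoods.
context = "A man grabs an umbrella and opens the front door as rain starts to fall."
endings = [
    "He steps outside and holds the umbrella over his head.",
    "He plants the umbrella in a flowerpot and goes to sleep.",
    "The umbrella turns into a bicycle.",
    "He eats the umbrella for breakfast.",
]
scores = [-12.3, -18.9, -21.4, -22.0]  # stand-ins for per-ending log-probabilities from a model
print(endings[pick_ending(scores)])  # the continuation the model considers most plausible
```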
2. Xsum
Introduction:
Xsum is an extreme summarization data set containing 226,711 news articles and their corresponding summaries. It requires a model to generate concise summaries from long texts to test its summary generation capabilities.
Application:
Xsum is used to evaluate the capabilities of a model in generating concise summaries, and it requires a model to extract key information from long texts and generate brief summaries.
Actual scenarios:
In the field of news media, Xsum can be used to develop automatic summarization tools to help journalists or editors quickly generate news summaries.
In academic research, Xsum is an important benchmark for testing whether a model can accurately extract the core information from a text.
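Summaries generated on Xsum are commonly scored against the reference one-sentence summary with ROUGE. The sketch below assumes the third-party rouge-score package is installed; the reference and model summary are hypothetical.

```python
# Assumes the third-party rouge-score package is available (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "Flooding has forced dozens of families to leave their homes in the valley."
model_summary = "Dozens of families were evacuated after the valley flooded."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, model_summary)  # score(target, prediction)
for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```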
3. TyDiQA
Introduction:
TyDiQA is a multilingual question-answering data set containing 19,000 problems. It is used to test the multilingual reading comprehension and question answering capabilities of models.
Application:
TyDiQA is used to evaluate the capabilities of a model in multilingual reading comprehension and question answering, covering 11 languages.
Actual scenarios:
In global applications, TyDiQA can be used to develop multilingual intelligent assistants, helping you obtain information in different languages.
In cross-cultural communication, TyDiQA can be used to train translation tools or cross-linguistic information search systems.
4. Winogrande
Introduction:
Winogrande is a large-scale commonsense inference benchmark containing 44,000 pronoun resolution problems. Built by manually rewriting and expanding the Winograd Schema Challenge (WSC), it is a much larger data set designed to reduce annotation bias. It focuses on testing a model's deep semantic understanding of entity reference relationships in daily contexts, rather than superficial pattern recognition.
Application:
Winogrande is used to evaluate the inference capabilities of a model in terms of language context, commonsense logic, and entity association.
Actual scenarios:
Winogrande is suitable for optimizing chat systems (such as pronoun resolution and coreference modules), developing accessibility technologies (such as semantic enhancement for texts), and conducting research in cognitive linguistics (such as comparing human and AI inference mechanisms).
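Each Winogrande item is a sentence with a blank ("_") and two candidate referents, and the model must decide which referent fills the blank coherently. The sketch below shows the item format and the accuracy computation; the items are classic Winograd-style examples, and choose_option is a hypothetical placeholder for a real model-based scorer.

```python
# Winogrande-style items: a sentence with a blank, two options, and the gold option (1 or 2).
items = [
    {
        "sentence": "The trophy didn't fit in the suitcase because _ was too large.",
        "option1": "the trophy",
        "option2": "the suitcase",
        "answer": 1,
    },
    {
        "sentence": "The trophy didn't fit in the suitcase because _ was too small.",
        "option1": "the trophy",
        "option2": "the suitcase",
        "answer": 2,
    },
]

def choose_option(item: dict) -> int:
    """Hypothetical placeholder: a real evaluator would score both filled-in sentences with the model."""
    return 1  # always guess option 1, purely for illustration

correct = sum(choose_option(item) == item["answer"] for item in items)
print(f"accuracy: {correct / len(items):.0%}")  # 50% for this trivial guesser
```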
5. IFEval
Introduction:
Instruction Following Eval (IFEval) is an evaluation set that focuses on the fine-grained execution capabilities of LLMs for complex instructions. It contains over 500 human-written prompts and 25 types of verifiable instructions (for example, "output strictly step by step" and "avoid adding explanations"). It directly quantifies the adherence of a model to your explicit or implicit constraints.
Application:
IFEval is used to evaluate the accuracy of a model in following instructions, fine-grained control capabilities, and output reliability.
Actual scenarios:
IFEval is suitable for optimizing intelligent assistants (such as accurately responding to complex requests), developing automated processes (such as precisely generating formatted outputs), and conducting safety alignment research (such as controllability verification).
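The constraints in IFEval are verifiable, meaning each one can be checked by a deterministic function rather than by a judge model. The sketch below shows two such checks for hypothetical constraints (a bullet-count requirement and a forbidden word); the function names and sample response are illustrative, not the benchmark's actual checkers.

```python
def check_bullet_count(response: str, expected: int) -> bool:
    """Verify that the response contains exactly `expected` bullet lines."""
    bullets = [line for line in response.splitlines() if line.lstrip().startswith(("-", "*"))]
    return len(bullets) == expected

def check_forbidden_word(response: str, word: str) -> bool:
    """Verify that the forbidden word does not appear in the response."""
    return word.lower() not in response.lower()

response = "- Pack an umbrella\n- Wear a raincoat\n- Check the forecast"
print(check_bullet_count(response, expected=3))         # True
print(check_forbidden_word(response, word="probably"))  # True
```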
Complex Inference and Comprehensive Capabilities
1. BBH
Introduction:
BIG-Bench Hard (BBH) is a subset of the BIG-Bench data set. It comprises 23 challenging tasks that focus on testing the complex inference capabilities of models.
Application:
BBH is used to evaluate the complex inference capabilities of models, such as logical inference, mathematical inference, and linguistic inference.
Actual scenarios:
In AI research, BBH is an important benchmark for testing whether models have advanced inference capabilities, such as tackling logic puzzles or completing complex tasks.
In the business field, BBH can be used to develop intelligent decision-making systems, helping enterprises analyze complex data and make informed decisions.
2. GPQA Diamond
Introduction:
Graduate-Level Google-Proof Q&A Diamond (GPQA Diamond) is a benchmark data set that focuses on testing the deep inference and professional knowledge application capabilities of models on PhD-level scientific questions. It comprises 198 high-difficulty interdisciplinary questions in fields such as biology, chemistry, and physics. It is designed to evaluate whether models have complex inference capabilities similar to domain experts, rather than simple information search or recall.
Application:
GPQA Diamond is used to evaluate the deep inference capabilities of models on PhD-level scientific questions.
Actual scenarios:
GPQA Diamond can be used to assist in scientific research (such as answering interdisciplinary questions), develop advanced science education tools (such as testing complex concepts), and optimize models (such as enhancing the inference performance of MoE-based architectures).
3. TEval
Introduction:
Text Evaluation (TEval) is a benchmark for evaluating the multi-task capabilities of Chinese LLMs, covering more than 30 task types and more than 500 fine-grained ability dimensions (such as classical Chinese writing, legal clause analysis, and multi-hop inference). Its characteristics lie in integrating academic exam questions, professional scenario questions, and adversarial samples, providing a comprehensive evaluation of Chinese semantic understanding and generation capabilities.
Application:
TEval is used to comprehensively evaluate the performance of Chinese models in professional fields, complex inference, and cultural contexts.
Actual scenarios:
TEval is suitable for optimizing localized AI products (such as assistants in the finance or legal sectors), developing educational assessment tools (such as Chinese proficiency tests), and conducting comparative studies of multilingual models (such as analyzing differences between Chinese and English proficiency).
Code Generation
1. HumanEval
Introduction:
HumanEval is a data set designed to evaluate the code generation capabilities of models, and it contains 164 programming problems.
Application:
HumanEval is used to evaluate the code generation capabilities of models, and it requires models to generate correct code based on problem descriptions.
Actual scenarios:
In software development, HumanEval can be used to develop automated programming tools that help programmers quickly generate code snippets.
In the education field, HumanEval can be used to develop programming learning tools that help students practice writing code and check its correctness.
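HumanEval is scored by executing each generated function against the problem's unit tests and reporting pass@k, the probability that at least one of k sampled solutions passes. The unbiased estimator from the original HumanEval paper is sketched below, where n is the number of samples per problem and c the number that pass; the example numbers are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 30 of which pass the unit tests.
print(f"pass@1  = {pass_at_k(200, 30, 1):.3f}")   # equals c / n = 0.150
print(f"pass@10 = {pass_at_k(200, 30, 10):.3f}")
```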
2. MBPP
Introduction:
MBPP (Mostly Basic Python Problems) is a data set of approximately 1,000 entry-level Python programming problems that test the code generation and programming problem-solving capabilities of models.
Application:
MBPP is used to evaluate the code generation and programming problem-solving capabilities of models. It focuses on Python and covers a variety of basic problem types.
Actual scenarios:
In programming education, MBPP can be used to develop intelligent programming tutoring systems that help students solve programming problems and provide feedback.
In software development, MBPP can be used to train models to automatically generate code or optimize existing code.
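MBPP problems pair a short natural-language description with assert-style test cases, so a generated solution is judged by whether every assert passes. The sketch below runs a candidate solution against such tests in the current process; real harnesses sandbox this execution, and the task, solution, and tests shown are hypothetical.

```python
# Hypothetical MBPP-style item: a task description, a candidate solution, and assert-based tests.
task = "Write a function min_max(nums) that returns a tuple (minimum, maximum) of a non-empty list."
candidate_solution = """
def min_max(nums):
    return (min(nums), max(nums))
"""
test_list = [
    "assert min_max([3, 1, 2]) == (1, 3)",
    "assert min_max([7]) == (7, 7)",
]

def passes_all_tests(solution: str, tests: list[str]) -> bool:
    namespace: dict = {}
    try:
        exec(solution, namespace)   # define the candidate function
        for test in tests:
            exec(test, namespace)   # each assert raises AssertionError on failure
        return True
    except Exception:
        return False

print(passes_all_tests(candidate_solution, test_list))  # True
```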
Appendix
To improve the quality of model answers and make them easier to parse, the platform adds system prompts when evaluating these built-in evaluation sets.
The following table shows the system prompts added based on the language type of the problems:
| Question type | System prompt for Chinese questions | System prompt for English questions |
| --- | --- | --- |
| Single-choice question | You are an AI assistant that provides accurate answers. You are asked to answer a single-choice question. Please explain the question, analyze each option one by one, and then give your answer. Please end your answer with "Therefore, the answer is" followed by the letter representing the option you select. | You are a helpful AI assistant. You are required to answer a single-choice question. Give your answer after you explain how to answer the question. Please end your reply with "Therefore, the answer is" followed by a letter representing your choice. |
| Multiple-choice question | You are an AI assistant that provides accurate answers. You are asked to answer a multiple-choice question. Please answer directly and clearly with only one or more letters you select, without providing any explanation. | Currently, the above-mentioned evaluation sets do not include multiple-choice questions in English. |
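Because the single-choice prompts above require the reply to end with "Therefore, the answer is" followed by an option letter, the selected option can be parsed deterministically. A minimal sketch of such parsing follows; the regular expression and sample reply are illustrative, not the platform's actual parser.

```python
import re

def parse_choice(reply: str) -> str | None:
    """Extract the option letter that follows the last "Therefore, the answer is" marker."""
    matches = re.findall(r"[Tt]herefore,\s*the answer is\s*[:\s]*\(?([A-E])\)?", reply)
    return matches[-1].upper() if matches else None

reply = (
    "Option A ignores the second condition, and option C contradicts the premise. "
    "Therefore, the answer is B."
)
print(parse_choice(reply))  # B
```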