
Tencent Cloud TI Platform


Evaluation Set Format Requirements

Last updated: 2026-01-23 17:01:11
Model Evaluation supports both automated evaluation and manual evaluation. Currently, automated evaluation offers two modes: Inference and Evaluation, and Evaluation Only.
Inference and Evaluation mode: upload an evaluation set that contains only questions and reference answers. The Automated Evaluation module runs inference and then scores the results with the judge model.
Evaluation Only mode: upload an evaluation set that already contains model inference results (questions, reference answers, and inference results). The Automated Evaluation module only scores the results with the judge model.
The following sections describe the evaluation set format requirements for each mode.

Inference and Evaluation

Input Parameter Description

| Parameter Name | Required | Parameter Description |
| --- | --- | --- |
| messages | Required | Main content of an evaluation chat: a list of messages, each with a role (system, user, or assistant) and content. A message with the system role carries the system settings. |
| ref_answer | Optional | Reference answer for the question. |
| Other Fields | Optional | User-defined fields. |
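
The parameter requirements above can be checked before uploading. The following is a minimal sketch, not the platform's own validation logic: the function name `validate_record`, the set of allowed roles, and the rules that `messages` must be non-empty and that `content` and `ref_answer` must be strings are assumptions inferred from the examples in this document.

```python
import json

# Assumption: these are the only roles that appear in the documented examples.
ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_record(record):
    """Return a list of problems found in one evaluation-set record (empty if none)."""
    if not isinstance(record, dict):
        return ["each record must be a JSON object"]
    problems = []
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        problems.append("messages is required and must be a non-empty list")
    else:
        for i, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append(f"messages[{i}] must be an object")
                continue
            if msg.get("role") not in ALLOWED_ROLES:
                problems.append(f"messages[{i}].role must be one of {sorted(ALLOWED_ROLES)}")
            if not isinstance(msg.get("content"), str):
                problems.append(f"messages[{i}].content must be a string")
    # ref_answer is optional; only its type is checked when it is present.
    if "ref_answer" in record and not isinstance(record["ref_answer"], str):
        problems.append("ref_answer, when present, must be a string")
    return problems

line = '{"messages": [{"role": "user", "content": "What is the sum of 1 and 1"}], "ref_answer": "The answer is 2"}'
print(validate_record(json.loads(line)))                   # no problems
print(validate_record({"ref_answer": "The answer is 2"}))  # messages is missing
```

User-defined fields are deliberately left unchecked here, since the table only states that they are optional.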

File Formats

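Only the JSONL format is supported, and each record must be a valid JSON object. The following examples are spread across multiple lines for readability; in a JSONL file, each record occupies a single line.
1. Multi-turn chat (messages do not include gt, the ground truth):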

{
  "messages": [
    {
      "role": "system",
      "content": "Intelligent Assistant is an LLM developed by xxx. xxx is a Chinese technology company that has been dedicated to research related to LLMs."
    },
    {
      "role": "user",
      "content": "Hello!"
    },
    {
      "role": "assistant",
      "content": "Hello! Is there anything I can do for you"
    },
    {
      "role": "user",
      "content": "What is the sum of 1 and 1"
    }
  ],
  "ref_answer": "The answer is 2",
  "max_tokens": 4096,
  "extra_content": "a user-defined field. It can be referenced in a judge model"
}
2. Multi-turn chat (the last message in messages has the assistant role, and its content is automatically parsed as gt):

{
  "messages": [
    {
      "role": "system",
      "content": "ht is an LLM developed by xxx. xxx is a Chinese technology company that has been dedicated to research related to LLMs."
    },
    {
      "role": "user",
      "content": "Hello!"
    },
    {
      "role": "assistant",
      "content": "Hello! Is there anything I can do for you"
    },
    {
      "role": "user",
      "content": "What is the sum of 1 and 1"
    },
    {
      "role": "assistant",
      "content": "2"
    }
  ],
  "ref_answer": "The answer is 2",
  "extra_content": "a user-defined field. It can be referenced in a judge model"
}
3. Single-turn chat:

{
  "messages": [
    {
      "role": "user",
      "content": "What is the sum of 1 and 1"
    }
  ],
  "ref_answer": "The answer is 2",
  "extra_content": "a user-defined field. It can be referenced in a judge model"
}
You can download evaluation set examples. For details, see Examples of Evaluation Sets in JSONL Format for Automated Evaluation.
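
Because the examples above are shown across several lines while a JSONL file holds one record per line, it can help to generate the file programmatically. The sketch below uses only the Python standard library; the helper name `to_jsonl` and the sample records are illustrative assumptions, not part of the platform.

```python
import json

def to_jsonl(records):
    """Serialize records as JSONL text: one compact JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)

records = [
    {
        "messages": [{"role": "user", "content": "What is the sum of 1 and 1"}],
        "ref_answer": "The answer is 2",
        "extra_content": "a user-defined field. It can be referenced in a judge model",
    },
    {
        "messages": [
            {"role": "user", "content": "Hello!"},
            {"role": "assistant", "content": "Hello! Is there anything I can do for you"},
            {"role": "user", "content": "What is the sum of 1 and 1"},
        ],
        "ref_answer": "The answer is 2",
    },
]

text = to_jsonl(records)

# Round-trip check: every non-empty line must parse back to the original record.
lines = [line for line in text.split("\n") if line]
parsed = [json.loads(line) for line in lines]
print(len(lines), parsed == records)
```

`ensure_ascii=False` keeps non-ASCII text readable in the output file; drop it if the consumer expects ASCII-only escapes.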

In addition, the platform is compatible with the original formats for manual evaluation. Supported formats: JSONL and CSV. The evaluation set format is as follows:
For evaluation sets in JSONL format: The format for each piece (row) of data is as follows:
{"system": "You are helpful.", "conversation": [{"prompt": "712+165+223+711=","response": "1811"}]}
The system, prompt, and response fields correspond to the system input, prompt, and expected response respectively.
If a piece (row) of data does not have a system input, the following format can be used:
{"conversation": [{"prompt": "712+165+223+711=","response": "1811"}]}
For evaluation sets in CSV format: An evaluation set consists of 3 columns with column names system, prompt, and response respectively. The system field is optional. If a piece (row) of data does not have a system input, the corresponding position should be left blank.
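
The platform accepts the original formats directly, so no conversion is required. For readers who prefer to migrate an existing set to the messages format, the following sketch shows one possible mapping. The function names are hypothetical, and treating the final response as ref_answer is an assumption made for illustration, not documented platform behavior.

```python
import csv
import io
import json

def legacy_to_messages(record):
    """Convert one original-format record to the messages format (assumed mapping)."""
    messages = []
    if record.get("system"):
        messages.append({"role": "system", "content": record["system"]})
    turns = record["conversation"]
    # Earlier turns become chat history: a user prompt followed by the assistant reply.
    for turn in turns[:-1]:
        messages.append({"role": "user", "content": turn["prompt"]})
        messages.append({"role": "assistant", "content": turn["response"]})
    last = turns[-1]
    messages.append({"role": "user", "content": last["prompt"]})
    # Assumption: the expected response of the final turn serves as the reference answer.
    return {"messages": messages, "ref_answer": last["response"]}

def csv_rows_to_legacy(csv_text):
    """Yield original-format JSONL records from CSV text with system, prompt, response columns."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"conversation": [{"prompt": row["prompt"], "response": row["response"]}]}
        if row.get("system"):  # a blank system cell means no system input
            record["system"] = row["system"]
        yield record

legacy = json.loads('{"system": "You are helpful.", "conversation": [{"prompt": "712+165+223+711=","response": "1811"}]}')
converted = legacy_to_messages(legacy)
print(json.dumps(converted))

from_csv = list(csv_rows_to_legacy("system,prompt,response\n,712+165+223+711=,1811\n"))
print(json.dumps(from_csv[0]))
```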

Evaluation Only

Input Parameter Description

| Parameter Name | Required | Parameter Description |
| --- | --- | --- |
| messages | Required | Main content of an evaluation chat: a list of messages, each with a role (system, user, or assistant) and content. A message with the system role carries the system settings. |
| model_outputs | Required | Inference results of the models to be scored. |
| ref_answer | Optional | Reference answer for the question. |
| Other Fields | Optional | User-defined fields. |

File Formats

Only the JSONL format is supported, and each record must be a valid JSON object. The following example is spread across multiple lines for readability; in a JSONL file, each record occupies a single line. The fields before model_outputs come from the original data set. The model_outputs field records the outputs of each model: it is a list in which each object contains model_name (name of the model) and responses (outputs of the model).

{
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    },
    {
      "role": "assistant",
      "content": "Hello! Is there anything I can do for you"
    },
    {
      "role": "user",
      "content": "What is the sum of 16 and 16"
    }
  ],
  "ref_answer": "The answer is 32",
  "id": "141234314",
  "max_tokens": 1024,
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 50,
  "model_outputs": [
    {
      "model_name": "llama3",
      "responses": [
        {
          "content": "The answer is 32."
        },
        {
          "content": "32 is the answer."
        }
      ]
    },
    {
      "model_name": "qwen3",
      "responses": [
        {
          "content": "The answer is 32.",
          "reasoning_content": "Wait..."
        },
        {
          "content": "The answer is 32.",
          "reasoning_content": "Thinking..."
        }
      ]
    }
  ]
}



Detailed Description of the model_outputs Field

This field is a list used to store inference results of different models. Each element in the list is an object, representing all outputs of a model. The object contains the following:
model_name (string): name specified for the model to be tested.
responses (list): contains a list of one or more inference results of the model.
Each element in the list represents a complete inference output. The output contains the following:
content (string): final answer of the model.
reasoning_content (string, optional): thinking process before the model generates an answer.
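
The structure described above can be assembled with two small helpers. This is a sketch under stated assumptions: `make_response` and `make_model_output` are hypothetical names introduced for illustration, and the sample values are taken from the example in this section.

```python
import json

def make_response(content, reasoning_content=None):
    """Build one inference result; reasoning_content is omitted when not provided."""
    response = {"content": content}
    if reasoning_content is not None:
        response["reasoning_content"] = reasoning_content
    return response

def make_model_output(model_name, responses):
    """Group all inference results of one model under its model_name."""
    return {"model_name": model_name, "responses": list(responses)}

record = {
    "messages": [{"role": "user", "content": "What is the sum of 16 and 16"}],
    "ref_answer": "The answer is 32",
}
record["model_outputs"] = [
    make_model_output("llama3", [
        make_response("The answer is 32."),
        make_response("32 is the answer."),
    ]),
    make_model_output("qwen3", [
        make_response("The answer is 32.", "Wait..."),
    ]),
]

# One record per line, ready to append to a JSONL file.
print(json.dumps(record))
```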

Internal structure of the inference result object of the model (model_outputs):
| Key Name | Type | Required | Description |
| --- | --- | --- | --- |
| model_name | String | Yes | Unique identifier of the model to be tested, for example, llama3-8b. Note: Use the same name for the same model throughout the data set so that the report can correctly summarize all the performance data of the model. |
| responses | Array | Yes | An array containing one or more inference results of the model. It supports scenarios where n is greater than 1, that is, where inference is performed multiple times for the same input prompt. |

Internal structure of the inference result object (responses):
| Key Name | Type | Required | Description |
| --- | --- | --- | --- |
| content | String | Yes | Final result generated by the model to be tested. It is referenced during scoring by the judge model. |
| reasoning_content | String | No | Stores the thinking process, chain of thought, or any intermediate drafts produced before the model to be tested generates the final content. This content is provided to help with deeper analysis. If the model does not output this content, this key can be omitted. |
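
The key requirements in the two tables above can be expressed as a structural check. The sketch below is an illustration rather than the platform's validator; the function name `validate_model_outputs` and the rule that the lists must be non-empty are assumptions.

```python
def validate_model_outputs(model_outputs):
    """Return a list of problems found in a model_outputs value (empty if none)."""
    if not isinstance(model_outputs, list) or not model_outputs:
        return ["model_outputs is required and must be a non-empty list"]
    problems = []
    for i, entry in enumerate(model_outputs):
        if not isinstance(entry, dict):
            problems.append(f"model_outputs[{i}] must be an object")
            continue
        if not isinstance(entry.get("model_name"), str):
            problems.append(f"model_outputs[{i}].model_name must be a string")
        responses = entry.get("responses")
        if not isinstance(responses, list) or not responses:
            problems.append(f"model_outputs[{i}].responses must be a non-empty list")
            continue
        for j, response in enumerate(responses):
            where = f"model_outputs[{i}].responses[{j}]"
            if not isinstance(response, dict) or not isinstance(response.get("content"), str):
                problems.append(f"{where}.content must be a string")
            elif "reasoning_content" in response and not isinstance(response["reasoning_content"], str):
                # reasoning_content is optional, but must be a string when present.
                problems.append(f"{where}.reasoning_content must be a string")
    return problems

good = [{"model_name": "llama3", "responses": [{"content": "The answer is 32."}]}]
bad = [{"model_name": "llama3", "responses": []}]
print(validate_model_outputs(good))
print(validate_model_outputs(bad))
```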

Complete Example


The lines beginning with // are annotations for the reader. JSON does not allow comments, so omit them from the actual file.
// Line 1: contains the inference results of two models, where qwen2 has reasoning_content.
{"messages": [{"role": "user", "content": "What is the sum of 16 and 16"}], "ref_answer": "The answer is 32", "id": "141234314", "model_outputs": [{"model_name": "llama3", "responses": [{"content": "The answer is 32."}]}, {"model_name": "qwen2", "responses": [{"content": "The answer is 32.", "reasoning_content": "The user is asking a simple math question. 16 + 16 equals 32."}]}]}
// Line 2: contains two different inference results of a model (n=2).
{"messages": [{"role": "user", "content": "Write a poem praising programmers"}], "ref_answer": "Code is like poetry, expressing the beauty of logic and shaping a world using your fingertips.", "id": "141234315", "model_outputs": [{"model_name": "llama3", "responses": [{"content": "Ten lines of code can change the world."}, {"content": "Fingers dance and code is like a song."}]}]}
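
Since the report groups results by model_name, a quick tally of responses per model can reveal inconsistent naming before upload. The following sketch reads JSONL text and counts responses per model; the function name `count_responses` is hypothetical, and the sample is a shortened version of the two records above.

```python
import json
from collections import Counter

def count_responses(jsonl_text):
    """Count inference results per model_name across all records in JSONL text."""
    counts = Counter()
    for line in jsonl_text.split("\n"):
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        for entry in record.get("model_outputs", []):
            counts[entry["model_name"]] += len(entry["responses"])
    return dict(counts)

sample = "\n".join([
    '{"messages": [{"role": "user", "content": "What is the sum of 16 and 16"}], '
    '"model_outputs": [{"model_name": "llama3", "responses": [{"content": "The answer is 32."}]}, '
    '{"model_name": "qwen2", "responses": [{"content": "The answer is 32."}]}]}',
    '{"messages": [{"role": "user", "content": "Write a poem praising programmers"}], '
    '"model_outputs": [{"model_name": "llama3", "responses": ['
    '{"content": "Ten lines of code can change the world."}, '
    '{"content": "Fingers dance and code is like a song."}]}]}',
])

print(count_responses(sample))  # {'llama3': 3, 'qwen2': 1}
```

A name that appears with an unexpectedly small count (for example, both llama3 and llama3-8b in the same set) usually indicates that one model was recorded under two names.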

