Automated Evaluation

Last updated: 2026-01-23 16:59:48

Overview

Automated evaluation provides a wizard-based approach for building and submitting evaluation tasks. It allows you to quickly run general effectiveness tests based on the platform's built-in evaluation sets, or to start evaluation tasks with custom evaluation sets and evaluation metrics. Its detailed features are as follows:
Entry 1: creating an automated evaluation task under Task-based Modeling > CheckPoint
You can conduct a lightweight evaluation on the checkpoint of a training task. To enable automated evaluation, you only need to configure the basic task information and select a built-in evaluation set provided by the platform.
You can view evaluation results.
Entry 2: creating an automated evaluation task on the Automated Evaluation tab
You can quickly create an evaluation task through built-in open-source evaluation sets and automatic metrics (such as pass@1, ROUGE, and F1).
You can initiate an evaluation task by using custom evaluation sets and evaluation metrics. Three evaluation modes are supported: Evaluation Only, Inference and Evaluation, and Custom Evaluation.
Evaluation Only
You upload an evaluation set that already contains model inference results, and the Automated Evaluation module performs the scoring only.
Inference and Evaluation
You upload an evaluation set that contains only queries (questions), and the Automated Evaluation module generates the inference results and then performs the scoring.
In these two modes, you can customize evaluation metrics, debug metrics, and view both the overall results and individual evaluation results:
Customize metrics: when you customize evaluation metrics, you need to configure the scoring method of each metric. For example, when you use a judge model for scoring, you need to configure the judge model and the scoring prompts, and customize preprocessing and postprocessing scripts that process the inputs and outputs to obtain the metric results.
Debug metrics: before you formally initiate an evaluation task, you can run a small number of evaluations on the prediction sample. During debugging, you can adjust the scoring prompts and the preprocessing and postprocessing scripts until the expected evaluation results are achieved.
View the overall results: you can view the evaluation results of each model on each evaluation set.
View a specific evaluation result: you can view the result of each scoring step performed on each piece of evaluation data.
Custom Evaluation (this feature is in beta testing)
You can perform evaluations through custom evaluation images. Evaluation sets, custom images, storage mounts, and other elements can be combined into a task configuration. Each configuration includes evaluation sets, image selection, version selection, mount path settings, startup commands, parameter settings, and environment variables. By selecting only images and versions, or by also configuring mount paths, you can run evaluations with images only or with images plus mount paths.
You can customize evaluation metrics and output evaluation results using images or by mounting evaluation scripts. For specific usage, see 2.2 Configuring a Task.

Prerequisites

When you create an automated evaluation task, you need to prepare an evaluation set (use a built-in set or upload a custom set) and the model or service to be evaluated.

Preparing an Evaluation Set

Using a built-in evaluation set
For specific built-in evaluation sets, see Introduction to Built-in Open-Source Evaluation Sets.
Uploading a custom evaluation set
Unlike a built-in open-source evaluation set, which can be used directly, a custom evaluation set must be prepared in advance, and you need to enter its path in Cloud File Storage (CFS)/GooseFSx/data sources when you create a task.




Download and decompress the objective data set of OpenCompass.
Copy the decompressed objective data set of OpenCompass to the local objective data set folder.
During evaluation, enter the path of the data set on your CFS instance, for example, /test_data/xsum.

Models or Services to Be Evaluated

Models to be evaluated
Built-in LLMs: certain built-in LLMs from Model Hub that are supported by the Automated Evaluation module.
Self-owned LLMs: after you prepare the model to be evaluated, you need to enter the model's path in CFS/GooseFSx/data sources when you create a task. For details on how to obtain the path, see the instructions in Uploading a Custom Evaluation Set under Preparing an Evaluation Set above.
Services to be evaluated
Select a model from Online Services: during evaluation, you can select the model to be evaluated from Online Services of Tencent Cloud TI-ONE Platform (TI-ONE). You need to start the model to be evaluated as a service on TI-ONE in advance. For the deployment guide, see Online Services Deployment.
Enter a service address: you can enter a service address during evaluation. You need to prepare the service in advance and record its address.

Creating an Evaluation Task

There are two entries for creating an evaluation task. You can create an automated evaluation task under Task-based Modeling > CheckPoint or on the Automated Evaluation tab of the Model Evaluation module.

Entry 1: Creating an Automated Evaluation Task Under Task-based Modeling > CheckPoint

Overview: Log in to the TI-ONE console, and choose Training Workshop > Task-based Modeling in the left sidebar to go to the task list page. On the task list page, click the task name to go to the task details page. The prerequisite for using this entry is that you have created a task through Task-based Modeling and the task has produced a checkpoint output.

1. Configuring the Basic Information and an Evaluation Set

Click the CheckPoint tab and select the checkpoint card on which you want to conduct a quick test to preview the model effect.


Click Automated Evaluation. In the pop-up window, enter the task name, and select a built-in evaluation set of the platform and the required resources. Both pay-as-you-go and yearly/monthly subscription billing modes are supported. Currently, you can only perform evaluations on built-in evaluation sets.


2. Viewing Evaluation Results

After entering the information, click Create. The task will enter the Automated Evaluation - Inference in Progress status. Please wait patiently.

Click View Automated Evaluation Results after the automated evaluation is completed.




Entry 2: Creating an Automated Evaluation Task on the Automated Evaluation Tab

Overview: Log in to the TI-ONE console, choose Model Services > Model Evaluation in the left sidebar, and then click the Automated Evaluation tab to go to the task list page. Then click Create Task to go to the task creation page.
To create an automated evaluation task in the Evaluation Only mode or Inference and Evaluation mode, perform two steps: Step 1 is to configure a basic task. In this step, you need to select the evaluation set, the model to be evaluated, and the evaluation resources. Step 2 is to configure and debug metrics. In this step, you need to customize evaluation metrics and their configurations, perform debugging before task submission, and initially view results.
To create an automated evaluation task in the Custom Evaluation mode, you need to complete the basic task configurations and select the model to be evaluated along with the evaluation resources.



1. Evaluation Only and Inference and Evaluation Modes

1.1 Configuring a Basic Task

Task Name: name of the automated evaluation task. Enter it according to the rules shown in the interface prompts.
Remarks: add remarks to the task as needed.
Region: services under the same account are isolated by region. The value of the Region field is entered automatically based on the region you selected on the service list page.
Tag: used for permission isolation between evaluation tasks.
Billing:
Billing mode: you can select the pay-as-you-go mode or the yearly/monthly subscription (resource group) mode only when the evaluation mode is Inference and Evaluation.
In the pay-as-you-go mode, you do not need to purchase a resource group in advance. Fees are charged based on the CVM instance specifications on which the service depends. When the service is started, fees for the first two hours are frozen. After that, fees are charged hourly based on the number of running instances.
In the yearly/monthly subscription (resource group) mode, you use a resource group purchased from the Resource Group Management module. Because computing resource fees are already paid when the resource group is purchased, no fees are charged when the service is started.
Resource group: if you select the yearly/monthly subscription (resource group) mode, you can select a resource group from the Resource Group Management module.

1.2 Configuring an Evaluation Set

When the evaluation mode is Evaluation Only or Inference and Evaluation, you need to configure an evaluation set to perform evaluations. The evaluation set should be selected from built-in evaluation sets/Data Center/CFS/GooseFSx/data sources/resource groups. You can configure the parameters of an evaluation set and preview the configured evaluation set.
Select an evaluation set source.
If you select a built-in evaluation set, you can directly enable the evaluation with one click.
If you select an evaluation set from Data Center, you need to choose Platform Data Center > Data Set to mount the corresponding data set first. For evaluation sets in Data Center, you can filter them by Business Tag to improve the efficiency for selecting an evaluation set.
If you select an evaluation set from data sources, you need to choose Platform Management > Data Source Management to create a data source. Note: The data source mount permissions are divided into read-only mount and read-write mount. For a data source for which you need to output training results, you can configure its mount permission to read-write.
If you select an evaluation set from CFS or GooseFSx, you need to select the CFS or GooseFSx instance from the drop-down list and enter the data directory to be mounted by the platform. The last level of the path is the folder name, for example, /test_data/ceval. When the directory is mounted, make sure that the data set in this directory is the one to be evaluated; otherwise, the evaluation will fail. Currently, you cannot specify a file name when you enter the path.

Configure the parameters of an evaluation set and preview the configured evaluation set.
Configure parameters: You can configure inference hyperparameters and judge model scoring parameters in batches for an evaluation set.
The inference hyperparameters are described as follows:
repetition_penalty: controls the repetition penalty.
max_tokens: controls the maximum length of the output text.
temperature: A higher temperature makes outputs more random; a lower temperature makes outputs more focused and deterministic.
top_p and top_k: control the diversity of the output text. Higher values produce more diverse outputs. It is recommended to configure only one of the temperature, top_p, and top_k parameters.
do_sample: specifies whether sampling is used for model inference. When this parameter is set to true, sampling is used; when it is set to false, greedy search is used, and the top_p, top_k, temperature, and repetition_penalty parameters do not take effect.
The judge model scoring parameters are described as follows:
MAX_JUDGING_CONCURRENCY: indicates the maximum number of requests that can be sent to a judge model simultaneously during scoring after each model to be evaluated has completed inference on the current data set. Setting this value too low may cause the throughput of a model to decrease and lead to a long evaluation time, while setting this value too high may cause request timeout issues.
MAX_JUDGING_RETRY_PER_Q: indicates the maximum number of retries for each piece of data when an exception occurs during scoring, such as a network failure or a scoring request queuing timeout. If the value is 0, no retry is performed. Note that too many retries may result in a longer evaluation time.
INFERENCE_COUNT: number of inference data entries generated by default.
Data Preview: You can preview the selected data set.


1.3 Configuring a Model or Service to Be Evaluated

You need to select the model or service to be evaluated to generate inference results for evaluation. Sources for models to be evaluated include built-in LLMs/training tasks/CFS/GooseFSx/data sources/resource groups, while sources for services include Online Services and entered service addresses. When you configure models/services to be evaluated, the parameter setting feature is provided.
Select a model:
If you select a built-in LLM, you can quickly enable an evaluation.
If you select a model from training tasks, you can select a training task in the current region or the checkpoint of the task.
If you select a model from data sources, you need to choose Platform Management > Data Source Management to create a data source. Note: The data source mount permissions are divided into read-only mount and read-write mount. For a data source for which you need to output results, you can set its mount permission to read-write.
If you select a model from CFS or GooseFSx, you need to select the CFS instance or GooseFSx instance from the drop-down list and enter the directory to be mounted by the platform.
Select a service:
If you select a service from Online Services, you need to select the online service name. To ensure that the evaluation can be initiated successfully, make sure that the service is running normally. In addition, you need to select the corresponding authentication information.
If you choose to enter a service address, you need to enter the complete service call URL along with the corresponding authentication information. If you use this method to select a service, you can check whether the service can run normally through the connectivity test.
Configure parameters: You can configure inference hyperparameters, startup parameters, and performance parameters.
Configure inference hyperparameters as follows:
repetition_penalty: controls the repetition penalty.
max_tokens: controls the maximum length of the output text.
temperature: A higher temperature makes outputs more random; a lower temperature makes outputs more focused and deterministic.
top_p and top_k: control the diversity of the output text. Higher values produce more diverse outputs. It is recommended to configure only one of the temperature, top_p, and top_k parameters.
do_sample: specifies whether sampling is used for model inference. When this parameter is set to true, sampling is used; when it is set to false, greedy search is used, and the top_p, top_k, temperature, and repetition_penalty parameters do not take effect.
Configure startup parameters: for more details, see the Service Deployment Parameter Configuration Guide. MAX_MODEL_LEN is a default parameter configured on the platform that specifies the maximum number of tokens a model can process in a single inference operation. Its default value on the platform is 8192. If you set this parameter to a very high value at startup, GPU out-of-memory or performance degradation issues may occur. You can adjust this value appropriately based on task requirements.
Configure performance parameters: MAX_CONCURRENCY and MAX_RETRY_PER_QUERY are default parameters configured on the platform.
MAX_CONCURRENCY indicates the maximum number of requests that can be sent to a model simultaneously during evaluation. Setting this value too low may reduce the throughput of a model and lead to a long evaluation time, while setting it too high may cause GPU out-of-memory or request timeout issues. Its default value on the platform is 24. You can adjust this value appropriately based on task requirements.
MAX_RETRY_PER_QUERY indicates the maximum number of retries for each piece of data when an exception occurs while requesting the inference service, such as a request timeout or a network failure. If the value is 0, no retry is performed (default value: 0). You can adjust this value appropriately based on task requirements. For an illustrative request sketch that combines the service address, authentication header, inference hyperparameters, and these performance parameters, see the example at the end of this section.
You can reuse previous evaluation results.
When you select a model to be evaluated, you can select the evaluation results of a previous evaluation task with the same model name and the same evaluation set name.

Enable the reuse switch and select the evaluation tasks to be reused.

Directly reuse results without the need to perform inference and scoring again.
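For reference, the sketch below illustrates the kind of request that the settings in this section correspond to, assuming the service to be evaluated exposes an OpenAI-compatible chat completions endpoint. The URL, token, model name, and hyperparameter values are placeholders, not platform defaults; the bounded worker pool and per-query retry loop only mirror what MAX_CONCURRENCY and MAX_RETRY_PER_QUERY control on the platform side, and this is not the platform's implementation.

import json
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholders for illustration only: use your actual service address and credentials.
SERVICE_URL = "https://example.com/v1/chat/completions"  # hypothetical endpoint
AUTH_HEADER_KEY = "Authorization"
AUTH_HEADER_VALUE = "Bearer <your-token>"

MAX_CONCURRENCY = 24      # analogous to the platform's MAX_CONCURRENCY default
MAX_RETRY_PER_QUERY = 2   # analogous to MAX_RETRY_PER_QUERY (platform default is 0)

def ask_model(query: str) -> str:
    """Send one query with the inference hyperparameters and retry on failure."""
    payload = {
        "model": "my-model",  # hypothetical model name
        "messages": [{"role": "user", "content": query}],
        "max_tokens": 1024,
        "temperature": 0.7,   # configure only one of temperature/top_p/top_k
        "repetition_penalty": 1.05,
    }
    headers = {AUTH_HEADER_KEY: AUTH_HEADER_VALUE, "Content-Type": "application/json"}
    last_error = None
    for _ in range(1 + MAX_RETRY_PER_QUERY):
        try:
            resp = requests.post(SERVICE_URL, headers=headers, json=payload, timeout=120)
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except Exception as err:  # e.g. request timeout or network failure
            last_error = err
    return f"<request failed: {last_error}>"

if __name__ == "__main__":
    queries = ["What is 2 + 2?", "Summarize the water cycle in one sentence."]
    # Bounded concurrency, similar in spirit to MAX_CONCURRENCY.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        answers = list(pool.map(ask_model, queries))
    print(json.dumps(dict(zip(queries, answers)), ensure_ascii=False, indent=2))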


1.4 Configuring Metrics

For built-in evaluation sets, you do not need to configure metrics because the platform provides the default metrics for automated evaluation. For custom evaluation sets, you need to configure metrics. You can use the judge model scoring method to customize these metrics. In addition, you can configure preprocessing and postprocessing scripts to highly customize the inputs and outputs of the data to achieve the desired evaluation results.
Add metrics by clicking +Metric or +Copy Existing Metric. After metrics are added successfully, you need to configure them. During configuration, you can customize an evaluation process. The evaluation process includes preprocessing, judge model scoring, and postprocessing. For usage instructions, see Prompt for Judge Model Scoring and Preprocessing/Postprocessing Format Requirements.

Configure evaluation metrics quickly through Quick Configuration. If you use the quick configuration feature, you need to upload files containing an evaluation set, metric names corresponding to the evaluation set, and detailed configuration information (such as the judge model information, scoring prompts, and preprocessing and postprocessing scripts) corresponding to each metric name. You can click Quick Configuration and upload your custom YAML configuration file and the files that need to be referenced. After the upload is completed, you can click Apply. The platform will automatically populate the information on the page based on the metrics in the YAML configuration file corresponding to the evaluation set name you entered, as well as the configuration information of these metrics. For details on how to upload files when you use the quick configuration feature, see Quick Configuration File Specifications for Metrics.


1.5 Debugging Metrics

For metrics configured for custom evaluation sets, you can debug them before a formal evaluation is initiated. You can view automatically generated responses (inference results) and scoring results to determine whether the scoring prompts and processing scripts need to be adjusted. During debugging, you can select a configured model service or model to be evaluated.
Generate the answer from the model to be evaluated.
Select a service to be evaluated.
Click Generate Response to automatically generate the response column.

Select a model to be evaluated.
Manually enter a response. The manually entered response will be scored automatically.

Generate scoring results.
Click Generate Scoring Results to automatically generate the scoring result column. You can hover over a result to view the outcome of each scoring step.

View a concatenated prompt.
Click View Concatenated Prompt to view the effect of the concatenated prompt.


1.6 Viewing Evaluation Results

After you enter the above information, click Finish to start the evaluation task. After an evaluation task is successfully created, the following information will be displayed on the task list page: Task Name, CVM Instance Source, Evaluation Resources, Progress, Tag, Creator, Creation Time, and Operations (Stop, Restart, Delete, and Copy).
Overall evaluation progress
During the evaluation process, you can view the progress:

Note:
Formulas for calculating the progress: Total number of evaluation (inference/scoring) data entries = Number of models × Number of evaluation data entries. Progress (%) = Number of evaluated (inference/scoring) data entries / Total number of evaluation (inference/scoring) data entries × 100%. For example, evaluating 2 models on an evaluation set with 500 entries gives 1,000 data entries in total; when 250 entries have been processed, the progress is 25%.
During the evaluation process, if any data entry fails to be evaluated, the remaining data continues to be evaluated until all the data has been processed.
During the evaluation process, click View Progress to view the detailed progress:

Overall evaluation results
After the evaluation is completed, you can click the Overall Evaluation Results tab to view the detailed progress of the overall evaluation and its final evaluation results. You can view the comprehensive scores and rankings of the models, as well as detailed metrics for each model. You can also click Adjust Weights to adjust the weights of evaluation sets and metrics.

Display of a single result
You can view the scoring results of individual data entries step by step. You can select a reference model to compare the responses and scoring effects of the model to be evaluated with the reference answers.

Log viewing
You can view evaluation logs.


2. Custom Evaluation

2.1 Configuring a Basic Task

Task Name: name of the automated evaluation task. Enter it according to the rules shown in the interface prompts.
Remarks: add remarks to the task as needed.
Region: services under the same account are isolated by region. The value of the Region field is entered automatically based on the region you selected on the service list page.
Tag: used for permission isolation between evaluation tasks. Once task tags are configured, the platform automatically adds them to the evaluation configuration.
Billing:
Billing mode: you can select the pay-as-you-go mode or the yearly/monthly subscription (resource group) mode only when the evaluation mode is Inference and Evaluation.
In the pay-as-you-go mode, you do not need to purchase a resource group in advance. Fees are charged based on the CVM instance specifications on which the service depends. When the service is started, fees for the first two hours are frozen. After that, fees are charged hourly based on the number of running instances.
In the yearly/monthly subscription (resource group) mode, you use a resource group purchased from the Resource Group Management module. Because computing resource fees are already paid when the resource group is purchased, no fees are charged when the service is started.
Resource group: if you select the yearly/monthly subscription (resource group) mode, you can select a resource group from the Resource Group Management module.
Task resource request: different tasks require different resource configurations. When multiple configurations exist, it is recommended to specify the maximum resource requirements.
Model resource request: when the object to be evaluated is a model (rather than a service), you need to specify the model deployment resources according to the actual situation. When multiple models exist, it is recommended to specify the maximum resource configuration.

2.2 Configuring a Task

Evaluation sets, custom images, storage mounts, and other elements can be combined into a task configuration. Each configuration includes evaluation sets, image selection, version selection, mount path settings, startup commands, parameter settings, and environment variables. By selecting only images and versions, or by also configuring mount paths, you can run evaluations with images only or with images plus mount paths.

Below are the usage instructions for task configuration:
Build a custom image containing evaluation sets and processing logic.
You can configure and package the basic environment into the image according to the README of the relevant open-source data sets.
Platform environment variables.
The platform automatically injects the following four environment variables into your image:
EVAL_OUTPUT_DIR: result output directory. The evaluation results must be saved to this directory in the image for result display and visual comparison. Value: /opt/ml/output
EVAL_INFERENCE_URL: model API address. Because a task configuration may evaluate multiple model services, the platform passes the call information of each model service through this environment variable. Value: https://api.openai.com/v1/chat/completions
EVAL_AUTHORIZATION_HEADER_KEY: authentication header name. Value: Authorization
EVAL_AUTHORIZATION_HEADER_VALUE: authentication header value. Value: Bearer sk-xxx...
You can also inject configurations such as judge models and inference results into the execution environment via environment variables or storage mounts to facilitate calls.
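For illustration, the following is a minimal sketch of an evaluation entry script inside a custom image. It assumes the model service behind EVAL_INFERENCE_URL is OpenAI-compatible, uses a toy built-in evaluation set, and computes a simple exact-match accuracy metric; it only demonstrates how the injected environment variables can be read and used, and is not the platform's reference implementation.

import json
import os

import requests

# Environment variables injected by the platform (see the list above).
INFERENCE_URL = os.environ["EVAL_INFERENCE_URL"]
AUTH_KEY = os.environ["EVAL_AUTHORIZATION_HEADER_KEY"]
AUTH_VALUE = os.environ["EVAL_AUTHORIZATION_HEADER_VALUE"]
OUTPUT_DIR = os.environ["EVAL_OUTPUT_DIR"]

# A toy evaluation set baked into (or mounted into) the image; the format is up to you.
EVAL_SET = [
    {"query": "What is the capital of France?", "answer": "Paris"},
    {"query": "How many days are in a week?", "answer": "7"},
]

def call_model(query: str) -> str:
    # Assumes an OpenAI-compatible chat completions API; adapt to your actual service.
    payload = {"messages": [{"role": "user", "content": query}], "max_tokens": 256}
    headers = {AUTH_KEY: AUTH_VALUE, "Content-Type": "application/json"}
    resp = requests.post(INFERENCE_URL, headers=headers, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

correct = sum(1 for item in EVAL_SET if item["answer"].lower() in call_model(item["query"]).lower())
accuracy = correct / len(EVAL_SET)

# Write the metrics to ${EVAL_OUTPUT_DIR}/metrics.json so the platform can display
# them in the overall evaluation results (see 2.4 for the file format).
with open(os.path.join(OUTPUT_DIR, "metrics.json"), "w", encoding="utf-8") as f:
    json.dump({"metrics": [{"name": "accuracy", "value": accuracy}]}, f, indent=2)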


2.3 Configuring a Model/Service to Be Evaluated

Call the model or service according to the environment variables described in section 2.2; the platform passes the information of the selected online services or third-party URLs during execution. You need to select the model or service to be evaluated to generate inference results for evaluation. Sources for models to be evaluated include built-in LLMs/training tasks/CFS/GooseFSx/data sources/resource groups, while sources for services include Online Services and entered service addresses. When you configure models/services to be evaluated, the parameter setting feature is provided.
Select a model:
If you select a built-in LLM, you can quickly enable an evaluation.
If you select a model from training tasks, you can select a training task in the current region or the checkpoint of the task.
If you select a model from data sources, you need to choose Platform Management > Data Source Management to create a data source. Note: The data source mount permissions are divided into read-only mount and read-write mount. For a data source for which you need to output results, you can set its mount permission to read-write.
If you select a model from CFS or GooseFSx, you need to select the CFS instance or GooseFSx instance from the drop-down list and enter the directory to be mounted by the platform.
Select a service:
If you select a service from Online Services, you need to select the online service name. To ensure that the evaluation can be initiated successfully, make sure that the service is running normally. In addition, you need to select the corresponding authentication information.
If you choose to enter a service address, you need to enter the complete service call URL along with the corresponding authentication information. If you use this method to select a service, you can check whether the service can run normally through the connectivity test.
Configure parameters: You can configure inference hyperparameters, startup parameters, and performance parameters.
Configure inference hyperparameters as follows:
repetition_penalty: controls the repetition penalty.
max_tokens: controls the maximum length of the output text.
temperature: A higher temperature makes outputs more random; a lower temperature makes outputs more focused and deterministic.
top_p and top_k: control the diversity of the output text. Higher values produce more diverse outputs. It is recommended to configure only one of the temperature, top_p, and top_k parameters.
do_sample: specifies whether sampling is used for model inference. When this parameter is set to true, sampling is used; when it is set to false, greedy search is used, and the top_p, top_k, temperature, and repetition_penalty parameters do not take effect.
Configure startup parameters: for more details, see the Service Deployment Parameter Configuration Guide. MAX_MODEL_LEN is a default parameter configured on the platform that specifies the maximum number of tokens a model can process in a single inference operation. Its default value on the platform is 8192. If you set this parameter to a very high value at startup, GPU out-of-memory or performance degradation issues may occur. You can adjust this value appropriately based on task requirements.
Configure performance parameters: MAX_CONCURRENCY and MAX_RETRY_PER_QUERY are the default parameters configured on the platform.
MAX_CONCURRENCY indicates the maximum number of requests that can be sent to a model simultaneously during evaluation. Setting this value too low may cause the throughput of a model to decrease and lead to a long evaluation time, while setting this value too high may cause GPU out-of-memory or request timeout issues. Its default value is 24 on the platform. You can adjust this value appropriately based on task requirements.
MAX_RETRY_PER_QUERY indicates the maximum number of retries for each piece of data when an exception occurs in requesting the inference service, such as a request timeout or a network failure. If the value is 0, no retry is performed (default value: 0). You can adjust this value appropriately based on task requirements.
You can reuse previous evaluation results.
Note:
Ensure that the path for outputting metrics to the platform is configured in the custom evaluation. Otherwise, task metrics cannot be obtained.
When you select a model to be evaluated, you can select the evaluation results of a previous evaluation task with the same model name and the same evaluation set name.

Enable the reuse switch and select the evaluation tasks to be reused.

The results can be directly reused without performing inference and scoring again.


2.4 Viewing Evaluation Results

Note:
To view evaluation results, perform the following steps. After you complete them, you can view metrics in the overall evaluation results. Before you do so, you can only view evaluation results via the custom image logs, and in the visual comparison you cannot select this evaluation set's results for the relevant models.
To view the evaluation results for this data set on relevant models on the details page of custom evaluation results, you need to store the specific evaluation results in the platform path ${EVAL_OUTPUT_DIR}. The name of the file is metrics.json, and its format is as follows:
{
  "metrics": [
    {
      "name": "accuracy",
      "value": 0.85
    },
    {
      "name": "f1_score",
      "value": 0.78
    }
  ]
}
You can write the file in the custom image via the following code:
import json
import os

# Obtain the output directory.
output_dir = os.getenv('EVAL_OUTPUT_DIR')

# Construct the result data.
result = {
    "metrics": [
        {"name": "accuracy", "value": 0.85},
        {"name": "f1_score", "value": 0.78}
    ]
}

# Write the result file.
result_file = os.path.join(output_dir, 'metrics.json')
with open(result_file, 'w', encoding='utf-8') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
Click the Overall Evaluation Results tab to view the detailed progress and final results of the overall evaluation. Viewing individual evaluation results is currently not supported.
Overall evaluation progress
Note:
In the custom evaluation mode, the platform cannot obtain the number of data entries in each evaluation set, so progress is calculated at the task level: Progress = Completed evaluation tasks (models × evaluated evaluation sets) / Total evaluation tasks (models × total evaluation sets). Because this granularity is coarse, the evaluation progress can remain unchanged for an extended period; this does not mean that the evaluation has paused or failed. You can check the actual progress via logs.

Overall evaluation results
You can view the comprehensive scores and rankings of the models, as well as detailed metrics for each model. You can also click Adjust Weights to adjust the weights of evaluation sets and metrics.

Log viewing
You can view evaluation logs.


2.5 Custom Evaluation Configuration Example (BFCL Evaluation Set)

The task configuration for each evaluation set is executed serially on each model service.
Background: Evaluation images are customized to perform evaluations by mounting BFCL evaluation sets.
Task configuration: You need to customize evaluation images, storage mounts, environment variables, and startup commands of the BFCL evaluation set.


