The evaluation feature lets users batch-test and assess the performance of Intelligent Agent applications. Application evaluation currently supports two types of evaluation:
Benchmark evaluation: Run the application in batches and score the outputs. Scoring can be based on reference answers or other custom content (imported via custom columns) and supports multiple scoring methods, including judge model, rule, and code scoring.
Comparative evaluation: Configure multiple groups of models or prompt content within the same application and compare them, enabling rapid assessment of the performance differences between configurations on the same task.
Through application evaluation, users can continuously optimize the performance and experience of Intelligent Agent applications.
Open the details page of the knowledge base QA application and click Application Evaluation to enter the application evaluation module. The module consists of two parts, Evaluation Set and Evaluation Task, which are described below.
Evaluation Set
An evaluation set is a test dataset used to batch-test application effectiveness. Evaluation sets are managed centrally and can be reused across tasks.
Upload Evaluation Set
1. Click Evaluation Set to enter the evaluation set management page.
2. Click Upload Evaluation Set; a dialog appears. Build the evaluation set file by referring to the upload template.
Note:
Evaluation set file rules:
Evaluation sets currently support the .xlsx format only.
Each evaluation set file cannot exceed 20 MB.
An evaluation set supports up to 10 custom columns. Custom columns store additional information needed during evaluation and can be referenced as variables in scoring rules. For example, to score against reference answers, add a "reference_output" custom column that holds the reference answers.
3. Click to upload the evaluation set file.
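The file rules above can be checked before uploading. The sketch below is illustrative only: the preset column name ("query") and the helper function are assumptions, not part of the platform; consult the upload template for the actual preset columns.

```python
import os

# Assumed preset column; check the upload template for the real list.
PRESET_COLUMNS = {"query"}

MAX_FILE_SIZE = 20 * 1024 * 1024   # each file must stay under 20 MB
MAX_CUSTOM_COLUMNS = 10            # at most 10 custom columns per set

def check_eval_set(path: str, columns: list) -> list:
    """Return a list of rule violations for an evaluation set file."""
    problems = []
    if not path.endswith(".xlsx"):
        problems.append("only .xlsx format is supported")
    if os.path.exists(path) and os.path.getsize(path) > MAX_FILE_SIZE:
        problems.append("file exceeds the 20 MB limit")
    custom = [c for c in columns if c not in PRESET_COLUMNS]
    if len(custom) > MAX_CUSTOM_COLUMNS:
        problems.append("more than 10 custom columns")
    return problems

# A set with the query column plus a reference_output custom column passes:
print(check_eval_set("qa_eval.xlsx", ["query", "reference_output"]))  # []
```

Running a check like this locally catches format and column-count problems before the upload dialog rejects the file.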
Managing Evaluation Sets
On the evaluation set management page, you can manage uploaded evaluation sets, including downloading and deleting them.
Evaluation Task
The evaluation task is the core component of application evaluation. Users can choose among different evaluation methods, start tasks, and manage them throughout their life cycle.
Creating an Evaluation Task
Creating an evaluation task consists of two main steps, Benchmark Evaluation Configuration and Comparison Evaluation Configuration, which are explained in detail below.
Step 1: Benchmark Evaluation Configuration
In the benchmark evaluation configuration, you select evaluation sets, run them in batches against the application configuration to obtain results, and set the scoring method, which can be judge model, rule, or code scoring. Regardless of the scoring method selected, the results support batch download and manual annotation in the interface.
Note:
Name: The task name, up to 20 characters; names must be unique.
Evaluation Set: A collection of evaluation samples used to run batch evaluation tasks for the application. You can select previously uploaded evaluation sets; up to 5 can be selected at once. The selected evaluation sets must have identical column names, otherwise the system reports an error.
Scoring method: Select how the batch execution results are scored: judge model scoring, rule scoring, code scoring, or output results only without scoring.
1. Judge model scoring: Use a large model to evaluate the batch execution results. This method consists of two parts: the judge model and the judge prompt content.
Judge model: Select the large model that acts as the judge. Different models can be selected, and model parameters are configurable.
Judge prompt content: The prompt used to score the evaluation results. Shortcut keys let you insert system variables such as the query, preset columns, custom columns, and the application output. Make sure the judge prompt matches the evaluation set content so the judge model can score correctly; for example, to score against reference answers, the evaluation set must include a reference-answer column (such as reference_output). You can also click Template to browse suitable prompts in the template library.
Note:
Judge model scoring consumes tokens. Verify the prompt content carefully before starting the task to avoid unnecessary consumption.
2. Rule scoring: Score the output according to predefined rules. Both individual rules and composites of multiple rules are supported.
3. Code scoring: Score the output by writing code. Currently only Python is supported. The code can directly reference the query, preset columns, and custom columns of the evaluation set.
4. Output results only, without scoring: Run the evaluation set in batches without scoring and output the results directly. Selecting this method is equivalent to a plain batch-execution task.
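A code-scoring function might look like the sketch below. The platform's actual entry point and parameter names are not specified in this document; `query`, `output`, and the `reference_output` custom column are illustrative assumptions.

```python
# Hypothetical code-scoring sketch: exact match scores 1.0, otherwise the
# score is the fraction of reference keywords found in the output.
# The function name and parameters are assumptions, not the platform's API.

def score(query: str, output: str, reference_output: str) -> float:
    if output.strip() == reference_output.strip():
        return 1.0                       # exact match with the reference
    keywords = reference_output.split()
    if not keywords:
        return 0.0                       # nothing to compare against
    hits = sum(1 for k in keywords if k in output)
    return round(hits / len(keywords), 2)

print(score("Capital of France?", "Paris", "Paris"))                  # 1.0
print(score("Capital of France?", "The capital is Paris", "Paris France"))  # 0.5
```

Keyword overlap is only one possible rubric; any deterministic Python logic over the query, preset columns, and custom columns would follow the same shape.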
Step 2: Comparison Evaluation Configuration
The comparative evaluation feature compares multiple groups of models or prompt content within the same application, helping you quickly assess the performance differences between configurations on the same task. The supported comparison methods are:
No comparison evaluation
Multi-model comparison evaluation
Multi-prompt content comparison evaluation
This feature is only supported in standard-mode applications. When performing a comparison evaluation, you can choose how the comparison results are processed:
Generate multiple results only, without comparison or scoring.
Comparison and scoring: use the judge model to score the multiple results for each sample.
1. No comparison evaluation (default): Run with the application's test environment configuration, without comparing models or prompt content.
2. Multi-model comparison evaluation: Up to 2 groups of comparison models can be added for comparison with the currently configured model. If a comparison model is identical to the current model, the system reports an error.
3. Multi-prompt content comparison evaluation: Up to 2 groups of comparison prompt content can be added for comparison with the current prompt content. If a group of comparison prompt content is identical to the current prompt content, the system reports an error.
After selecting the comparison method, you can choose whether to score the comparison:
1. Generate multiple results only, without comparison or scoring (default): Output multiple groups of comparison results without judging which is better.
2. For each sample's multiple outputs, use the judge model for comparison and scoring: Based on the execution results of the multiple versions, a judge model compares and scores them to determine which is better. You must select a judge model and comparison prompt content, configured in the same way as for judge model scoring.
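A comparison-scoring prompt assembles the query and the candidate outputs into one judging instruction. The sketch below is purely illustrative: the platform's variable names and template syntax are not specified here, so it uses Python's `string.Template` with assumed placeholder names.

```python
from string import Template

# Hypothetical pairwise-comparison judge prompt; the $-placeholders stand in
# for whatever variable syntax the product's prompt editor actually uses.
COMPARE_PROMPT = Template(
    "You are an impartial judge. Given the user query and two candidate "
    "answers produced by different configurations, decide which is better.\n"
    "Query: $query\n"
    "Answer A: $output_a\n"
    "Answer B: $output_b\n"
    "Reply with 'A', 'B', or 'tie', followed by a one-sentence reason."
)

def build_compare_prompt(query: str, output_a: str, output_b: str) -> str:
    """Fill the judge prompt with one sample's query and two outputs."""
    return COMPARE_PROMPT.substitute(
        query=query, output_a=output_a, output_b=output_b
    )

prompt = build_compare_prompt(
    "What is RAG?",
    "Retrieval-augmented generation.",
    "A type of cleaning cloth.",
)
print(prompt.splitlines()[1])   # Query: What is RAG?
```

As with benchmark judge scoring, the prompt must reference the same columns and outputs the evaluation set actually provides, or the judge model cannot compare correctly.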
Evaluation Task Management
The evaluation task management page displays a list of all evaluation tasks and provides information such as task status, progress, and consumption, as well as operations like start, suspend, annotate, download, view report, copy, and delete.
Note:
Start: Start a saved but not-yet-running evaluation task.
Suspend: Pause a task that is running or queued; paused tasks can be resumed later.
Resume evaluation: Resume a paused task.
Annotation: After batch processing or evaluation completes, the results can be annotated; multiple annotations are supported.
Download: Download the evaluation result file after batch processing or evaluation is completed.
View report: View the report after batch processing or evaluation is completed. The report provides scoring results, performance data, etc.
Copy: Create a new evaluation task that retains the benchmark evaluation and comparison evaluation configuration content of the current task.
Delete: Delete the evaluation task. It cannot be recovered once deleted.
Copy Evaluation Task
Copying creates a new evaluation task that retains the benchmark evaluation and comparison evaluation configuration of the current task, and allows users to modify the configuration.
View Evaluation Report
View the report after batch processing or evaluation is completed. The report shows the evaluation configuration information, benchmark evaluation results, comparison scoring results, benchmark performance report, and more.
Download Evaluation Result
To see the complete evaluation results and records, click the download button to the right of the evaluation task to trigger a download task.
Once the download is ready, click the bell icon in the upper right corner and download the file from the notification center. The file contains the detailed raw data.
Evaluation Result Annotation
After the evaluation task is completed, users can manually annotate the execution results to improve the accuracy and reference value of the evaluation. Annotated results are reflected in the evaluation report and the downloaded result file.
Note:
Do not modify the knowledge base content during the evaluation process, including adding, importing, or deleting content or changing knowledge settings, to avoid affecting the evaluation results.