The application evaluation feature allows users to batch test and assess the performance and effectiveness of Agent applications. Two types of evaluation are currently supported:
Benchmark evaluation: Runs the application in batches against an evaluation set and evaluates the outputs. Reference answers or other custom content can be supplied through custom columns. Multiple scoring methods are supported, including judge model, rule-based, and code-based scoring.
Comparison evaluation: Configures multiple model groups or multiple prompts within the same application for comparison, enabling quick evaluation of performance differences under the same tasks.
With application evaluation, users can continuously optimize the performance and user experience of Agent applications.
After entering the Knowledge Base Q&A application details page, click Application Evaluation to open the evaluation module. This module includes two parts, Evaluation Set and Evaluation Task, which are described below:
Evaluation Set
The evaluation set is a test dataset used to batch test application performance. It supports unified management and can be reused in subsequent tests.
Uploading Evaluation Set
1. Click Evaluation Set to enter the evaluation set management page.
Note:
Evaluation set file rules:
Evaluation set files currently support .xlsx format.
Each evaluation set file size cannot exceed 20 MB.
Up to 10 custom columns can be added. Custom columns store additional information needed during evaluation and can be referenced as variables in evaluation rules. For example, a custom column "reference_output" can store reference answers for scoring; see the example file sketch after these steps.
2. Click to upload the evaluation set file.
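For illustration, a minimal Python sketch of preparing such a file with pandas is shown below. The column names ("query", "reference_output") are assumptions for this example, not the platform's required schema; follow the column requirements shown on the evaluation set page.

```python
# Minimal sketch: build an evaluation set .xlsx with one custom column.
# Column names ("query", "reference_output") are illustrative assumptions.
import pandas as pd

rows = [
    {"query": "What is the refund policy?",
     "reference_output": "Refunds are accepted within 7 days of purchase."},
    {"query": "How do I reset my password?",
     "reference_output": "Use the 'Forgot password' link on the login page."},
]

# Save as .xlsx (the supported format) and keep the file under 20 MB.
pd.DataFrame(rows).to_excel("evaluation_set.xlsx", index=False)
```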
Manage Evaluation Sets
On the evaluation set management page, you can manage uploaded sets, including downloading and deleting them.
Evaluation Task
An evaluation task is the core step in executing application evaluation. It allows users to choose evaluation methods, start tasks, and manage the process.
Creating an Evaluation Task
Creating an evaluation task involves two steps, Benchmark Evaluation Configuration and Comparison Evaluation Configuration, which are described in detail below.
Step One: Benchmark Evaluation Configuration
Benchmark evaluation configuration allows you to select an evaluation set, run it in batches according to the application configuration to obtain results, and set the scoring method. Multiple scoring methods are supported, including judge model, rule-based, or code-based scoring. Regardless of the method selected, the results can be downloaded in bulk and manually labeled through the interface.
Note:
Name: Task name, up to 20 characters, must be unique.
Evaluation Set: Select uploaded evaluation sets (up to 5). The column names of the selected sets must be consistent; otherwise the system reports an error.
Scoring method: Choose how to score the outputs, including: judge model scoring, rule-based scoring, code-based scoring, or output only (no scoring).
1. Judge Model Scoring: Evaluates the batch execution results of the evaluation set via a model. It consists of two parts: the judge model and the judge prompt content.
Judge model: The model acting as judge. Supports parameter configuration.
Judge prompt content: The prompt used for scoring. Supports referencing queries, preset columns, custom columns, application outputs, and system variables. Ensure the prompt aligns with the dataset content (for example, when scoring against reference answers, upload the reference answers as reference_output). Templates are also available; click Template to browse suitable prompts in the template library.
Note:
Judge model scoring consumes tokens. Verify prompt accuracy before running tasks to avoid unnecessary consumption.
2. Rule scoring: Scores outputs according to predefined rules. Supports single or multiple rules.
3. Code scoring: Uses custom code (Python supported) to score outputs. The code can directly reference queries, preset columns, and custom columns; see the sketch after this list.
4. Only output results without scoring: Runs the evaluation set in batch without scoring. Equivalent to batch processing.
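As an illustration of code scoring (item 3 above), here is a minimal Python sketch. The function name, signature, and the assumption that the query, application output, and the custom column reference_output are passed in as strings are all hypothetical; follow the interface shown in the code-scoring editor.

```python
# Minimal sketch of a code-based scoring function.
# The signature (query, output, reference_output) is an assumption for this
# example; use the input/output convention shown in the code-scoring editor.
def score(query: str, output: str, reference_output: str) -> float:
    """Score the application's output against the reference answer (0.0 - 1.0)."""
    if not reference_output:
        return 0.0
    # Simple token-overlap heuristic between the output and the reference answer.
    ref_tokens = set(reference_output.lower().split())
    out_tokens = set(output.lower().split())
    return round(len(ref_tokens & out_tokens) / max(len(ref_tokens), 1), 2)
```

In practice, replace the token-overlap heuristic with whatever check matches your evaluation criteria (exact match, regular expressions, numeric tolerance, and so on).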
Step Two: Comparison Evaluation Configuration
Comparison evaluation allows you to compare multiple models or prompts within the same application to quickly assess performance differences. Supported options include:
No comparison evaluation
Multi-model comparison evaluation
Multi-prompt comparison evaluation
This feature is supported only in Standard Mode applications. When a comparison evaluation is selected, the comparison results can be handled in either of the following ways:
Only output multiple results: Generates multiple results without comparison scoring.
Comparison scoring: Uses a judge model to score each sample's multiple results.
1. No comparison evaluation (default): Runs only with the application's current test configuration.
2. Multi-model comparison evaluation: Add up to 2 comparison models to compare with the configured model. If the same model is selected, the system reports an error.
3. Multi-prompt comparison evaluation: Add up to 2 comparison prompts to compare with the configured prompt. If the same prompt is selected, the system reports an error.
After selecting the comparison method, you can choose whether to perform comparison scoring:
1. Only output multiple results without comparison scoring (default): Outputs the multiple comparison results without judging which is better.
2. Use a judge model to perform comparison scoring on each sample's multiple outputs: A judge model compares the execution results of the different versions and judges which performs better. This requires selecting a comparison-scoring judge model and comparison-scoring judge prompt content, configured in the same way as the judge model and judge prompt content in benchmark evaluation.
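For illustration, a comparison-scoring judge prompt might look like the sketch below. The {{...}} placeholder syntax and variable names are assumptions for this example; use the variables actually offered in the prompt editor (query, custom columns, and the multiple application outputs), or start from a template in the template library.

```python
# Minimal sketch of a pairwise comparison-scoring judge prompt.
# The {{...}} placeholder syntax and variable names are illustrative assumptions.
COMPARISON_JUDGE_PROMPT = """\
You are an impartial judge. Compare the two candidate answers to the user query.

Query: {{query}}
Reference answer: {{reference_output}}
Answer A: {{output_a}}
Answer B: {{output_b}}

Decide which answer is more accurate and complete.
Reply with "A", "B", or "Tie", followed by a one-sentence justification.
"""
```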
Evaluation Task Management
The task management page displays all evaluation tasks with their status, progress, and resource consumption, and supports operations such as start, pause, resume, label, download, view report, copy, and delete.
Note:
Start: Launch a saved but not yet run task.
Pause: Pause a running or queued task (can be resumed).
Resume evaluation: Restore the paused task.
Label: After completion, manually label outputs. Multiple rounds of labeling are supported.
Download: Download evaluation result files after completion.
View Report: Open a detailed report including scoring results and performance data.
Copy: Create a new task with the same benchmark and comparison configuration.
Delete: Deletes the evaluation task. Warning: it cannot be restored once deleted.
Copying an Evaluation Task
Creates a new evaluation task with the same benchmark and comparison configurations as the current one. Users can modify the configuration if needed.
Viewing the Evaluation Report
After batch runs or task completion, you can view the report. Reports display configuration details, benchmark results, comparison results, and benchmark performance reports.
Downloading Evaluation Results
After batch runs or task completion, you can download the evaluation results file, which contains detailed raw data.
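If you want to analyze the downloaded raw data offline, a minimal pandas sketch is shown below. The file name, the assumption that the export is an .xlsx file, and the "score" column are all illustrative, so inspect the exported columns first.

```python
# Minimal sketch: offline analysis of a downloaded result file.
# File name, .xlsx format, and the "score" column are illustrative assumptions.
import pandas as pd

results = pd.read_excel("evaluation_task_results.xlsx")
print(results.columns.tolist())        # inspect the exported columns first
if "score" in results.columns:
    print("Average score:", results["score"].mean())
```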
Labeling Evaluation Results
After an evaluation task completes, users can manually label results to improve accuracy and reliability. Labeled results will appear in both the evaluation report and downloaded files.
Note:
During application evaluation, it is recommended not to modify the Knowledge Base (adding, deleting, or editing knowledge) to avoid affecting evaluation results.