
Creating a Task
Last updated: 2026-03-12 20:38:43

Steps

Filling in Basic Information

1. Log in to the TI-ONE Console. Choose Training Workshop > Task-based Modeling, and click Create to create a training task.
2. On the basic information configuration page, fill in the following information:
Task Name: Only Chinese and English characters, digits, underscores "_", and hyphens "-" are supported. The name must start with a Chinese character, an English letter, or a digit.
Region: The region where the training task is located. It defaults to the region of the current list page.
Training Image: You can select a platform built-in training image, a custom image, or a built-in large model. For the information about the built-in training image, view the Built-in Training Image List. For the custom image, you can select an image from the Tencent Container Registry (TCR) or fill in an external image address (enter a username and password for a private image). For custom image specifications, view Custom Training Image Specifications.
Training Mode: For the training modes supported by different training frameworks, view the Built-in Training Image List.
Billing Mode: You can select pay-as-you-go mode or yearly/monthly subscription. If you select the pay-as-you-go mode, you need to select the computing power specifications and the quantity of nodes. If you select yearly/monthly subscription, you need to create a resource group and purchase nodes first. For related operations, see Resource Group Management. After selecting the resource group, you need to select the corresponding computing resources. For the billing specifications supported by the platform, see Billing Overview.
Note:
1. Once a resource group is selected, an overview of its remaining GPUs is displayed, including the total number of GPUs of each card model, the number of full GPUs, and the number of non-full (fragmented) GPUs. This helps you quickly understand the GPU distribution in the selected resource group. Depending on your task scenario, you can use either full or fragmented resources; using fragmented resources where the task allows can effectively reduce overall resource fragmentation and improve GPU utilization.
2. Click View Details to open the resource dashboard on the right of the current page. The dashboard shows the remaining available resources and total resources of the nodes for each card type. Click the drop-down list to show all tasks or services running on the current node, which helps you quickly understand node resource usage and coordinate it with other users.



Tag: You can create multiple tags for a task.
Description: You can add a description of up to 500 characters.
CLS Log Shipping: Disabled by default. The TI console displays logs for the past 15 days by default. To store logs persistently and use services such as log retrieval, you can enable CLS log shipping (make sure CLS has been activated first). For the product introduction and pricing of CLS, see Billing Overview.
Auto Restart: You can configure an automatic restart policy for the task. If the task exits abnormally while running, an automatic restart is triggered. You need to set the maximum number of restarts, which can be up to 10; once this limit is exceeded, the task is marked as abnormal. Currently, this feature supports only training tasks billed by yearly/monthly subscription with the MPI, DDP, or Horovod training mode. Choose Task Detail > Event to view the task's automatic restart events.

Filling in Task Configuration Information

On the task configuration page, you need to configure the algorithm, data, and input and output information about this training task. Configuration items are described as follows:
1. Storage path settings: Supported storage types are COS, CFS (including CFS Turbo), and EMR (HDFS).
If you select COS, select the COS path where the dataset resides.
If you select CFS or EMR (HDFS), select the CFS file system or EMR cluster from the drop-down list, and fill in the source directory to be mounted by the platform.
For each of the above data sources, you can define the mount path of the data inside the training container; fill in this path in your code to read the data. When creating a task, you can select multiple data paths, set a different container mount path for each, and mount them all into the container for the training algorithm to read.
Precautions for using EMR (HDFS): By default, the platform accesses and mounts HDFS using the Hadoop identity. To use another identity, upload the relevant configuration files in accordance with the following code package specifications. The username and keytab files are provided by the user and placed in the code package:
//<emr_id>/username.txt: contains the username, such as "hadoop/172.0.1.5". When Kerberos authentication is not enabled, the default username "hadoop" is used if this file is missing or invalid; once Kerberos authentication is enabled, the default username is unavailable.
//<emr_id>/emr.keytab: contains the keytab authentication file. The platform supports authenticating against multiple EMR clusters at the same time, so <emr_id> must be included in the directory, with values such as emr-1rnhggsh.
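Inside the training container, the data appears under the container mount path you configured. A minimal sketch of reading it follows; the mount path /opt/ml/input/data in the usage comment is an assumed example you would set yourself, not a fixed platform path:

```python
import os

def iter_dataset_files(mount_path):
    """Yield the path of every file under the configured container mount path."""
    for root, _dirs, files in os.walk(mount_path):
        for name in files:
            yield os.path.join(root, name)

# Example usage (assumed mount path set when creating the task):
# for path in iter_dataset_files("/opt/ml/input/data"):
#     process(path)
```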
2. Code package: You can choose a file path in COS as the code package directory, or leave this option blank and instead read the code from a mounted path configured in the storage path settings. The code package can be uploaded to a COS bucket in advance, or you can click Upload directly in the COS file dialog box of TI-ONE.
3. Startup command: You need to fill in the program entry command. Multiple lines are supported. The default working directory is /opt/ml/code.
4. Tuning parameters: The filled-in hyperparameter JSON is saved as the /opt/ml/input/config/hyperparameters.json file. You need to parse this file in your code.
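A minimal sketch of parsing this file in a training script; the key names follow the predefined parameters documented on this page, and the type coercion is an assumption in case values arrive as strings:

```python
import json

HYPERPARAM_FILE = "/opt/ml/input/config/hyperparameters.json"

def load_hyperparameters(path=HYPERPARAM_FILE):
    """Read the hyperparameter JSON written by the platform and coerce types."""
    with open(path) as f:
        raw = json.load(f)
    # Coerce defensively: JSON values may be delivered as strings.
    return {
        "epochs": int(raw.get("Epoch", 2)),
        "batch_size": int(raw.get("BatchSize", 1)),
        "learning_rate": float(raw.get("LearningRate", 1e-5)),
    }
```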
5. Training output: Select the COS path where the training output should be saved. By default, the platform periodically uploads the data under /opt/ml/output to the output COS path. To release the trained model to the model repository with one click, save the model output to /opt/ml/model; the platform uploads the data under this path to the COS path after training completes. If you choose a file system such as CFS as the training storage, you can also leave the training output unconfigured and write the output directly to the mounted CFS file path.
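For example, a training script might write its artifacts like this. Only the two directory paths come from the documentation; the file names model.bin and metrics.json are illustrative:

```python
import json
import os

MODEL_DIR = "/opt/ml/model"    # uploaded to the output COS path after training completes
OUTPUT_DIR = "/opt/ml/output"  # uploaded to the output COS path periodically during training

def save_artifacts(model_bytes, metrics, model_dir=MODEL_DIR):
    """Write the trained model and its metrics where the platform picks them up."""
    os.makedirs(model_dir, exist_ok=True)
    with open(os.path.join(model_dir, "model.bin"), "wb") as f:
        f.write(model_bytes)
    with open(os.path.join(model_dir, "metrics.json"), "w") as f:
        json.dump(metrics, f)
```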

Additionally, during task configuration, the price of your current configuration is displayed in real time at the bottom of the page. Once all information is configured, confirm to create the task.

Preset Process Description for Built-In Large Models

Task-based modeling includes multiple large model fine-tuning templates. You can directly start a built-in large model fine-tuning task with one click. The following are the built-in field descriptions:

Preset Storage Path Settings

For the first line "Platform CFS": The system has configured the supporting training code for fine-tuning the large model by default.
For the second line "Platform CFS": The system has configured a set of sample data for fine-tuning the large model for you. To use your custom business data to fine-tune this large model, you can delete this line and add other storage sources at the bottom.
For the third line "Platform CFS": The system has configured a built-in model by default.
For the fourth line "User CFS": You need to select your CFS file system and source path here. The "Container Mounting Path" is filled in by the system by default and you do not need to modify it. To use another CFS file system as the training output, you can delete this line and then add a new one.

Note: To use your own business data for fine-tuning, format it as agreed by the platform, or meet the requirements of LlamaFactory's dataset_info.json data configuration file (see the LlamaFactory documentation for details).
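As an orientation, a common LlamaFactory-compatible layout is the "alpaca"-style record list shown below. The field names follow that open-source project's conventions, and the record contents are invented examples, not a platform guarantee:

```json
[
  {
    "instruction": "Summarize the following paragraph.",
    "input": "TI-ONE provides task-based modeling for training large models ...",
    "output": "TI-ONE supports task-based large model training."
  }
]
```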




Preset Startup Command

The platform populates the startup command by default. Generally, you do not need to modify the startup command.

Predefined Tuning Parameters

The platform provides multiple predefined parameters. You can directly modify the hyperparameter JSON to iterate the model. The following shows the definition of hyperparameters:
Epoch (default: 2): Number of iteration rounds in the training process.
BatchSize (default: 1): Number of samples in each training iteration. A larger BatchSize speeds up training but uses more memory.
LearningRate (default: 1e-5): The hyperparameter for updating weights during gradient descent. A value that is too large makes the model hard to converge; a value that is too small makes convergence too slow.
Step (default: 750): Number of steps between model checkpoint saves. Saving more checkpoints requires more storage space.
UseTilearn (default: true): Whether to enable Tencent's self-developed acceleration, with options of "true"/"false". When set to "true", Tencent's self-developed training acceleration framework is enabled by default; 3D parallel acceleration requires more than 8 GPUs and the PP and TP parameters to be set (for details, refer to the angel-tilearn document). When set to "false", the open-source acceleration framework is used for training. This feature is only available for some models.
FinetuningType (default: Lora): Fine-tuning training mode, "Lora" or "Full". In LoRA mode, the pre-trained model's parameters are frozen, the weight matrices are decomposed into low-rank factors, and only the low-rank parameters are updated during training. In Full mode, all model parameters are updated during fine-tuning, which requires more training resources.
MaxSequenceLength (default: 2048): Maximum text sequence length; set it according to the length of your business data. For example, if most of your business data is shorter than 2048, set MaxSequenceLength to 2048; longer data is truncated to 2048, which reduces GPU memory pressure.
GradientAccumulationSteps (default: 1): A Hugging Face Trainer parameter. Increase it to enlarge the effective batch size.
GradientCheckPointing (default: True): A Hugging Face Trainer parameter; a policy of trading time for memory. When enabled, it reduces memory usage but slows down training.
DeepspeedZeroStage (default: z3): DeepSpeed ZeRO stage configuration; available values are ["z0", "z2", "z2_offload", "z3", "z3_offload"]. Available for some models only.
ResumeFromCheckpoint (default: True): Whether to automatically resume training from existing checkpoint files. True means that if checkpoint files exist in the output directory, training resumes from the latest checkpoint; False means retraining from scratch. If it is set to False and the output directory is not empty, an error is reported, so an empty directory is recommended for the training output path. To force overwriting, manually add the "overwrite_output_dir": true parameter.
TilearnHybridTPSize (default: 1): Tilearn 3D parallel parameter: the dimension of TP (tensor) parallelism. Available for some models only.
TilearnHybridPPSize (default: 1): Tilearn 3D parallel parameter: the dimension of PP (pipeline) parallelism. Available for some models only.
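Put together, a hyperparameter JSON built from the documented defaults might look like the sketch below; the subset of keys shown is illustrative, and you should adjust values for your own task:

```json
{
  "Epoch": 2,
  "BatchSize": 1,
  "LearningRate": 1e-5,
  "Step": 750,
  "UseTilearn": "true",
  "FinetuningType": "Lora",
  "MaxSequenceLength": 2048
}
```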