tencent cloud

Guide on Using Distributed Training

Download
Focus Mode
Font Size
Last updated: 2025-05-09 16:13:12
This guide focuses on two points:
1. Usage methods of different distributed training modes in TI-ONE's task-based modeling.
2. How to initiate RDMA multi-machine and multi-card network acceleration training in TI-ONE task-based modeling?


I. How to Use TI-ONE Distributed Training Mode

TI-ONE's task-based modeling supports various distributed training modes, including DDP, MPI, Horovod, and PS-Worker. The following document describes the corresponding usage methods of different training modes on TI-ONE:

Instructions for DDP Distributed Training

The PyTorch Distributed Data Parallel (DDP) training mode supports data parallel training in PyTorch. The data parallel mode can process multiple data batches simultaneously across multiple processes. The input data batches of each process do not overlap. Each process calculates gradients and uses the ring-all-reduce algorithm to synchronize with other processes.

Usage Method

1. When creating a training task on the "Task-Based Modeling" page, select the training mode as DDP and configure the node resource and the number of nodes. The start command will be executed on each node.
2. The DDP training mode includes two types of roles, Master and Worker. The role numbered 0 is Master (corresponding to RANK=0 in the environment variable), which is used to save the model.
3. TI-ONE creates a corresponding instance according to the task configuration and injects relevant environment variables, including the instance group information contained in the task and the role of the current instance. The Worker will wait for the Master to start properly and the network to be connected. The following shows the list of environment variables that are by default injected when task-based modeling starts:
Built-In Environment Variables
Variable Name
Variable Description
Example
NODE_LIST
Common environment variables for training tasks: list of task nodes and info of the number of GPU cards of nodes
NODE_LIST=timaker-xxxyyy-launcher.training-job.svc.cluster.local:1,timaker-xxxyyy-worker-0.training-job.svc.cluster.local:1
INDEX
Common environment variables for training tasks: the index of the current node information in NODE_LIST, starting from 0
INDEX=1
MASTER_ADDR
IP address of the Master node of the DDP training task
MASTER_ADDR=10.35.110.11
MASTER_PORT
Port of the Master node of the DDP training task
MASTER_PORT=23456
WORLD_SIZE
Number of nodes of the DDP training task
WORLD_SIZE=2
RANK
Current node of the DDP training task
RANK=1
GPU_NUM
Total number of GPU cards included in the task
GPU_NUM=2
GPU_NUM_PER_NODE
Number of GPU cards of a single node
GPU_NUM_PER_NODE=1
4. If the exit code of any instance is non-zero during the training process, the training task will fail. If all instances succeed, the training task succeeds.


Startup Method Example

The example of the command to start up DDP distributed training is shown as follows:
python -m torch.distributed.launch --nproc_per_node $GPU_NUM_PER_NODE --nnodes $WORLD_SIZE --node_rank $RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT
The correspondence between DDP distributed training parameters and platform environment variables is shown in the following table:
Variable Name
Variable Description
nproc_per_node
The number of processes running on a single instance (machine) is typically the number of GPU cards on each machine when GPU is used. Use the value of the corresponding environment variable GPU_NUM_PER_NODE for GPU training.
nnodes
The value of the corresponding environment variable WORLD_SIZE.
node_rank
The value of the corresponding environment variable RANK.
master_addr
The value of the corresponding environment variable MASTER_ADDR.
master_port
The value of the corresponding environment variable MASTER_PORT.
For more DDP training parameters, refer to the official website document torchrun (Elastic Launch).

For the built-in environment variables involved in the startup command, the platform will inject them when starting a task-based modeling task. When debugging code in Notebook or locally, developers need to assign values to the corresponding environment variables first. For convenience in debugging, you can set default values for the corresponding environment variables. Example:
MASTER_ADDR=${MASTER_ADDR:-localhost} MASTER_PORT=${MASTER_PORT:-23456} NNODES=${WORLD_SIZE:-1} NODE_RANK=${RANK:-0} GPU_PER_NODE=${GPU_NUM_PER_NODE:-$(nvidia-smi -L | wc -l)} python -m torch.distributed.launch --nproc_per_node $GPU_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT


How to Use MPI and Horovod Distributed Training

Message Passing Interface (MPI) is a messaging standard used for distributed parallel training. The platform allows users to initiate distributed training tasks in MPI mode, and supports common training frameworks based on MPI communication, including Horovod. The Horovod training mode is more natively adapted to tasks trained with the Horovod framework. This document takes the above two training modes as examples to introduce how to initiate distributed training tasks on a machine learning platform.

Usage Method

1. When creating a training task on the "Task-Based Modeling" page, select the training mode as MPI or Horovod and configure the node resource and the number of nodes.
2. Both MPI and Horovod modes contain two types of roles: Launcher and Worker. Both types of roles can execute training tasks. When a task is configured with only one instance, a Launcher instance will be created by default.
3. The MPI mode startup command will be executed on each instance. The Horovod mode startup command will only be executed on the Launcher instance. The Worker instance command is configured as sleep infinity to wait for the Launcher's command. The following shows the list of environment variables injected by default when task-based modeling starts:
Built-In Environment Variables
Variable Name
Variable Description
Example
OMPI_MCA_orte_default_hostfile
Node information file for MPI or Horovod training tasks
OMPI_MCA_orte_default_hostfile=/etc/mpi/hostfile
GPU_NUM
Total number of GPU cards included in the task
GPU_NUM=2
GPU_NUM_PER_NODE
Number of GPU cards of a single node
GPU_NUM_PER_NODE=1
NODE_IP_SLOT_LIST
Node IP addresses and corresponding card number contained in a task (only supported for configuring startup command)
NODE_IP_SLOT_LIST=9.0.255.56:1,9.0.255.118:1
4. TI-ONE creates a corresponding instance group based on task configurations, injects relevant environment variables, and provides the instance group information contained in the task, as well as the role of the current instance.
5. If the exit code of any instance is non-zero during the training process, the training task will fail. If all instances succeed, the training task succeeds.


Startup Method Example

The content of /etc/mpi/hostfile is shown as follows:
train-960258573108964736-7an39bddmfpc-launcher slots=1 train-960258573108964736-7an39bddmfpc-worker-0 slots=1
The content is divided into two columns. The first column is the domain name of the instance, and the second column is the count of processes on the instance.

The example of the command to start up MPI or Horovod distributed training is shown as follows:
# Startup in MPI mode
mpirun --allow-run-as-root -np $GPU_NUM -H $NODE_IP_SLOT_LIST python3 train.py --data-dir /opt/ml/input/data

# Startup in Horovod mode
horovodrun -np $GPU_NUM -H $NODE_IP_SLOT_LIST --network-interface eth0 python3 train.py --data-dir /opt/ml/input/data


How to Use PS-worker Distributed Training?

Parameter Server (PS) training is a common data parallelism method used to scale model training across multiple machines. The training cluster consists of workers and Parameter Server (PS). The parameters are saved on the PS. In each training round, the PS distributes parameters to workers. After calculation, workers transmit gradients back to the PS for an update.

Usage Method

1. When creating a training task on the "Task-Based Modeling" page, select the training mode as PS-Worker and configure the node resources and the number of nodes.
2. The PS-Worker training mode provided by the platform contains two types of roles: PS and worker. PS saves and updates parameters, and the number of instances should be greater than or equal to 1. The workers execute the training, and the number of instances must be greater than or equal to 1.
3. TI-ONE creates a corresponding instance based on task configurations, injects the corresponding environment variable TF_CONFIG, and provides the instance group information included in the task, as well as the role of the current instance. Instances obtain the quantity and address of ps/worker in the task by reading TF_CONFIG, and learn about the role and number of the current instance from type in the task.
Environment Variables
TF_CONFIG
{ "cluster": { "ps": [ "train-960252492096760832-7an13ppfli80-ps-0.train-100031385875.svc:2222", "train-960252492096760832-7an13ppfli80-ps-1.train-100031385875.svc:2222" ], "worker": [ "train-960252492096760832-7an13ppfli80-worker-0.train-100031385875.svc:2222", "train-960252492096760832-7an13ppfli80-worker-1.train-100031385875.svc:2222" ] }, "task": { "type": "ps", "index": 0 }, "environment": "cloud" }
4. If the exit code of any instance is non-zero during the training process, the training task will fail. If all instances succeed, the training task succeeds.


How to Use Ray Distributed Training?

Ray is an open-source distributed computing framework that simplifies the development and deployment of distributed machine learning. Ray provides a set of APIs and infrastructure, enabling developers to easily scale standalone training code to a distributed environment, and supporting multiple parallel policies, including data parallelism and model parallelism.


Usage Method

1. When creating a training task on the "Task-Based Modeling" page, select the training mode as Ray and configure the Head node resources, the Worker node resources, and the count of each group of Worker nodes. A maximum of 5 groups of workers can be configured.
2. The Ray training mode contains two types of roles: Head and Worker. The node with ID 0 is the Head node (corresponding to RANK=0 in the environment variable), and coordinates the computing resources and schedules tasks of the entire cluster. Worker nodes execute specific training tasks.
3. TI-ONE creates a corresponding instance based on task configurations and injects relevant environment variables. The following shows the list of environment variables injected by default when task-based modeling starts:
Variable Name
Variable Description
Example
HEAD_ADDR
Address of the Head node of the Ray cluster
HEAD_ADDR=train-1282671078021627392-9qu5j2n90b9c-head-0
HEAD_PORT
Port of the Head node of the Ray cluster
HEAD_PORT=6379
RANK
The serial number of the current node in the cluster. RANK0 refers to a Head node.
RANK=0


Startup Method Example

You only need to use ray.init() for default initialization in your code, and configure the execution command of your script as the startup command. Your task will be submitted to the cluster on the Head node by default. You are not required to specify the address of the Head node.
Take a simple counting task as an example. Save the following code as job.py.
import ray

ray.init()

# Define the Actor class.
@ray.remote
class Counter:
def __init__(self):
self.value = 0
def increment(self):
self.value += 1
return self.value

# Create an Actor instance.
counter = Counter.remote()

# Concurrently call the Actor method.
futures = [counter.increment.remote() for _ in range(10)]
results = ray.get(futures) # [1, 2, 3, ..., 10]

print("Counter result:", results)

Use job.py as a training task, place it in the specified directory of your CFS, and mount it to the /opt/ml/code directory in task-based modeling. Then, specify the startup command as:
cd /opt/ml/code; python job.py

In a training task, you can specify a node using INDEX. For example, execute the following function on the node where INDEX = 1:
@ray.remote(resources={"Rank:1": 0.001})
def f(a, b, c):
return a + b + c


Note:
1) You cannot view Ray TensorBoard temporarily.
2) Ray is designed based on integer resource scheduling. For example, if a task claims to need 1 CPU, Ray ensures it occupies a complete physical core to avoid resource competition. Therefore, it is recommended to avoid using decimal core numbers when configuring CPU resources of Ray resource groups. For instance, 0.7 cores will be rounded down by Ray to 0, which may cause unexpected program behaviors (for example, the task fails to run).
3) Since TI-ONE supports GPU fragment scheduling, such as 0.2 GPU resources. In a Ray cluster, the Worker will treat 0.2 GPU as a complete GPU to use, which can be set by nums_gpu=1.



II. Launching RDMA Network Acceleration Training

RDMA is a communication technology of kernel bypass, which significantly improves the communication bandwidth in scenarios of multi-machine communication. This document introduces how to use the RDMA network in the task-based modeling of TI-ONE.

Prerequisites

1. The resource group contains at least 2 high-performance GPU nodes that support RDMA.
2. If a submitted distributed task is configured with 2 or more nodes, and each node is configured with 8 fully-idle GPU cards, the platform will by default configure RDMA resources for this task.
3. The built-in LLM images on the platform support common HCC high-performance GPU models by default. For custom images, you need to install a user-mode RDMA driver.

How to Confirm Whether RDMA Takes Effect?

For multi-machine tasks running on the platform, the following logs will be generated if RDMA is enabled:
[0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA

Platform Built-In Environment Variables

Remote Direct Memory Access (RDMA) will be enabled for multi-machine training scenarios of Hyper Computing Cluster (HCC) models, and the following NVIDIA Collective Communications Library (NCCL) environment variables will be built-in. When using TI-ONE, you are not required to set them explicitly.
NCCL_IB_GID_INDEX=3
NCCL_IB_SL=3
NCCL_CHECK_DISABLE=1
NCCL_P2P_DISABLE=0
NCCL_IB_DISABLE=0
NCCL_LL_THRESHOLD=16384
NCCL_IB_CUDA_SUPPORT=1
NCCL_IB_HCA=mlx5_bond
NCCL_NET_GDR_LEVEL=2
NCCL_IB_QPS_PER_CONNECTION=4
NCCL_IB_TC=160
NCCL_PXN_DISABLE=1
NCCL_IB_TIMEOUT=24
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=eth0
GLOO_SOCKET_IFNAME=eth0
TCCL_TOPO_AFFINITY=4
For supported NCCL environment variables, see environment variables.


Help and Support

Was this page helpful?

Help us improve! Rate your documentation experience in 5 mins.

Feedback