```shell
# tilearn-llm>=0.9.3
# tilearn.ops>=0.2.1.172
pip3 uninstall -y tilearn.llm tilearn.ops
pip3 install tilearn-llm==0.9.3 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip3 install tilearn.ops==0.2.1.172 -i https://g-bnvx3728-pypi.pkg.coding.net/tione/tilearn/simple
wget https://tione-public-cos-1308945662.cos.ap-shanghai.myqcloud.com/tilearn/hybrid_parallel/colossalai-0.3.4.1-cp310-cp310-linux_x86_64.whl
pip3 install colossalai-0.3.4.1-cp310-cp310-linux_x86_64.whl
```
```python
### TILEARN.LLM
from tilearn.llm.transformers import LlamaForCausalLM

### The model API is consistent with the standard huggingface
model = LlamaForCausalLM.from_pretrained(...)
```
```python
### TILEARN.LLM
from tilearn.llm.transformers import AutoModelForCausalLM

### The model API is consistent with the standard huggingface
model = AutoModelForCausalLM.from_pretrained(...)
```
```shell
export TILEARN_LLM_BAICHUAN_13B=2
export TILEARN_HYBRID_TP_SIZE=1
export TILEARN_HYBRID_PP_SIZE=2
```
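The two `TILEARN_HYBRID_*` variables set the tensor-parallel (TP) and pipeline-parallel (PP) group sizes; whatever is left of the world size becomes the data-parallel (DP) degree. A minimal sketch of that arithmetic (the 8-GPU world size and the helper name are illustrative assumptions, not part of TILEARN's API):

```python
import os

# Hypothetical settings mirroring the exports above; neither the defaults
# nor the helper below are part of TILEARN itself.
os.environ.setdefault("TILEARN_HYBRID_TP_SIZE", "1")
os.environ.setdefault("TILEARN_HYBRID_PP_SIZE", "2")

def data_parallel_size(world_size: int) -> int:
    """Data-parallel degree left after tensor and pipeline parallelism
    carve up the GPUs: DP = world_size / (TP * PP)."""
    tp = int(os.environ["TILEARN_HYBRID_TP_SIZE"])
    pp = int(os.environ["TILEARN_HYBRID_PP_SIZE"])
    assert world_size % (tp * pp) == 0, "world size must be divisible by TP * PP"
    return world_size // (tp * pp)
```

With TP=1 and PP=2 as exported above, an 8-GPU node yields 4 data-parallel replicas.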
```python
### Computational optimization
from tilearn.llm.transformers import LlamaForCausalLM
from tilearn.llm.transformers import AutoModelForCausalLM

### 3D parallel
import tilearn.llm.hybrid_parallel

def main():
    ### The model API is consistent with the standard huggingface
    model = AutoModelForCausalLM.from_pretrained(...)
    run_exp()
```
```shell
# 8 x A100 40G default parameters
GradienAccumulationSteps=64
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=2
TilearnHybridPPSize=2

# 8 x A800 80G default parameters
GradienAccumulationSteps=32
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=1
TilearnHybridPPSize=2
TilearnHybridZeroStage=1
```

3. Tilearn-Angel training acceleration effect

Images: `ti-acc2.0-torch1.9-py3.8-cuda11.1-gpu`, `ti-acc1.0-tf1.15-py3.6-cuda10.0-gpu`

```shell
python3 -u -m tiacc_training.distributed.launch \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    main.py
```
| Hardware Environment | Model | GPU Quantity | Native DDP (examples/sec per V100) | TI-ACC Communication Optimization (examples/sec per V100) |
| --- | --- | --- | --- | --- |
| Tencent Cloud GN10Xp.20XLARGE320 | resnext50_32x4d | 1 (Standalone) | 227 | 227 |
| | | 8 (Standalone) | 215 | 215 |
| | | 16 (Two-node) | 116 | 158.6 |
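The communication optimization only pays off once training crosses the node boundary: single-node numbers are unchanged, while the two-node run speeds up noticeably. A quick check of the ratio (throughput figures copied from the table; the helper name is ours):

```python
# Per-V100 throughput (examples/sec) copied from the table above.
native_ddp = {1: 227.0, 8: 215.0, 16: 116.0}
tiacc_comm = {1: 227.0, 8: 215.0, 16: 158.6}

def speedup(gpus: int) -> float:
    """TI-ACC throughput relative to native DDP at the same GPU count."""
    return round(tiacc_comm[gpus] / native_ddp[gpus], 2)
```

At 16 GPUs across two nodes this works out to roughly a 1.37x gain, with no change at 1 or 8 GPUs.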
```python
import torch.cuda.amp as amp
import tiacc_training.torch

scaler = amp.GradScaler()

# Instantiate an object of the adaptive mixed precision policy class
policy = tiacc_training.torch.tiacc_torch_warp.MixedPrecision_TrainingPolicy(
    policy, start_step, hold_step, end_step, interval_time, interval_hold_time)

# Determine whether mixed precision needs to be enabled for the current epoch
# based on the input parameters
mixed_precision = policy.enable_mixed_precision(epoch, lr=lr, loss=loss, scaler=scaler)

with amp.autocast(enabled=mixed_precision):
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```
| Hardware Environment | Model | GPU Quantity | Native PyTorch (examples/sec per V100) | TI-ACC Data IO Optimization (examples/sec per V100) | TI-ACC Data IO + Adaptive Mixed Precision Optimization (examples/sec per V100) |
| --- | --- | --- | --- | --- | --- |
| Tencent Cloud GN10Xp.20XLARGE320 | resnet50 (mmcls) | 8 (Standalone) | 70.8 | 350.5 | 379.2 |
| | centernet (mmdet) | 8 (Standalone) | 26.4 | 28.6 | 30.6 |
```shell
# Start the container
docker run -itd --name tiacc-rec-fm --network=host --ipc=host \
    ccr.ccs.tencentyun.com/ti-platform/tensorflow:1.15.5-py3-rec-0121
# Enter the container
docker exec -it tiacc-rec-fm bash
# Native TensorFlow embedding usage
cd wideanddeep && bash start_all.sh --fm
# Optimized usage with the TI-ACC lookup
cd wideanddeep && bash start_all.sh --tiacc --fm
```
| Hardware Environment | Model | GPU Quantity | Native TensorFlow (global_steps/sec per V100) | With TI-ACC Optimization (global_steps/sec per V100) |
| --- | --- | --- | --- | --- |
| Tencent Cloud GN10Xp.20XLARGE320 | DeepFM | 16 (Two-node) | 41.9 - 56 | 96.1 - 103.3 |
| | Wide & Deep | 16 (Two-node) | 49.9 - 69 | 120 - 128 |
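Because both the native and optimized runs are reported as ranges, a conservative way to read the table is to compare the slowest TI-ACC run against the fastest native run. A small sketch of that reading (figures copied from the table; `min_speedup` is our own illustrative helper):

```python
# global_steps/sec per V100 ranges copied from the table above (16 GPUs, two nodes).
results = {
    "DeepFM": {"native": (41.9, 56.0), "tiacc": (96.1, 103.3)},
    "Wide & Deep": {"native": (49.9, 69.0), "tiacc": (120.0, 128.0)},
}

def min_speedup(model: str) -> float:
    """Conservative estimate: slowest TI-ACC run over fastest native run."""
    r = results[model]
    return round(r["tiacc"][0] / r["native"][1], 2)
```

Even under this worst-case comparison, both models retain better than a 1.7x throughput gain.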
```python
### TILEARN.LLM
from tilearn.llm.transformers import LlamaForCausalLM

### The model API is consistent with the standard huggingface
model = LlamaForCausalLM.from_pretrained(...)
```
```shell
export TILEARN_HYBRID_TP_SIZE=1
export TILEARN_HYBRID_PP_SIZE=2
```
```python
### Computational optimization
from tilearn.llm.transformers import LlamaForCausalLM
from tilearn.llm.transformers import AutoModelForCausalLM

### 3D parallel
import tilearn.llm.hybrid_parallel

def main():
    ### The model API is consistent with the standard huggingface
    model = AutoModelForCausalLM.from_pretrained(...)
    run_exp()
```
```shell
# 8 x A100 40G default parameters
GradienAccumulationSteps=64
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=2
TilearnHybridPPSize=2
TilearnHybridZeroStage=1

# 8 x A800 80G default parameters
GradienAccumulationSteps=32
BatchSize=1
GradientCheckPointing=False
TilearnHybridTPSize=1
TilearnHybridPPSize=2
TilearnHybridZeroStage=1
```
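With gradient accumulation and 3D parallelism, the samples consumed per optimizer step are micro-batch × accumulation steps × data-parallel replicas. A quick check (hypothetical helper; DP degrees derived assuming 8 GPUs, so DP = 8 / (TP × PP)) shows both default sets land on the same effective global batch:

```python
def effective_global_batch(micro_batch: int, grad_accum_steps: int, dp_size: int) -> int:
    """Samples consumed per optimizer step:
    micro-batch x accumulation steps x data-parallel replicas."""
    return micro_batch * grad_accum_steps * dp_size

# A100 40G defaults: BatchSize=1, 64 accumulation steps; TP=2 x PP=2 on 8 GPUs leaves DP=2.
a100_global_batch = effective_global_batch(1, 64, 2)
# A800 80G defaults: BatchSize=1, 32 accumulation steps; TP=1 x PP=2 on 8 GPUs leaves DP=4.
a800_global_batch = effective_global_batch(1, 32, 4)
```

Both configurations reach an effective global batch of 128, which is why the A800 setup can halve the accumulation steps.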
| Parameter | Type | Required | Description | Example | Default Value |
| --- | --- | --- | --- | --- | --- |
| policy | INT | Yes | Adaptive mixed precision policy. 0: time-based policy, suitable for common adaptive cases; 1: time and learning-rate based policy, suitable for cases where loss fluctuates abnormally at some stage of training; 2: loss-based policy, suitable for cases where loss decreases too fast or too slow during training. | 0 | None |
| start_time | INT | No | Epoch at which adaptive mixed precision is first enabled. A value of 10 is generally recommended. Required when policy is 0 or 1; optional when policy is 2. | 10 | 10 |
| end_time | INT | No | Epoch at which adaptive mixed precision is disabled. Generally recommended to be the last epoch. Required when policy is 0 or 1; optional when policy is 2. | 1000 | None |
| hold_time | INT | No | Hold window for policy 1, during which a single unified decision (enable or not) is kept. Generally recommended to match the duration of the abnormal loss fluctuation during training. Required when policy is 1; optional when policy is 0 or 2. | 20 | None |
| interval_time | INT | No | Interval for policy 2: the policy is re-applied every interval_time epochs (default 1000). Required when policy is 2; not needed when policy is 0 or 1. | 1000 | 1000 |
| interval_hold_time | INT | No | Hold window after each interval for policy 2 (default 100). For example, with interval_time 1000, policy 2 is enabled from epoch 1000 to 1100, 2000 to 2100, and so on. Required when policy is 2; not needed when policy is 0 or 1. | 100 | 100 |
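To make the enable windows in the table concrete, here is a plain-Python sketch of the timing logic it describes. This is an illustration only, not tiacc's implementation: policies 1 and 2 additionally consult the learning-rate and loss signals, which are omitted here.

```python
def should_enable_amp(policy: int, epoch: int, start_time: int = 10,
                      end_time: int = 1000, hold_time: int = 20,
                      interval_time: int = 1000, interval_hold_time: int = 100) -> bool:
    """Sketch of the enable windows described in the table above.

    Illustration only; the real policy classes also react to lr/loss.
    hold_time is accepted for parity with the table but unused here,
    since the lr-driven part of policy 1 is omitted.
    """
    if policy in (0, 1):
        # Time-based window: enabled between start_time and end_time.
        return start_time <= epoch <= end_time
    if policy == 2:
        # Re-enabled every interval_time epochs for interval_hold_time
        # epochs, e.g. 1000-1100, 2000-2100, ...
        return epoch >= interval_time and epoch % interval_time <= interval_hold_time
    raise ValueError("policy must be 0, 1 or 2")
```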
| Object | Type | Description |
| --- | --- | --- |
| policy | MixedPrecision_TrainingPolicy class | Instantiated object of the adaptive policy for automatic mixed precision during training |
| Parameter | Type | Required | Description | Example | Default Value |
| --- | --- | --- | --- | --- | --- |
| epoch | INT | Yes | Current epoch | 20 | None |
| scaler | torch.cuda.amp.GradScaler | Yes | Instantiated gradient scaling object | scaler | None |
| lr | float | No | Learning rate of the current epoch | 0.01 | None |
| loss | float | No | Loss of the previous epoch | 0.1 | None |
| Output Parameter | Type | Description |
| --- | --- | --- |
| mixed_precision | BOOL | Whether automatic mixed precision should be enabled for the current epoch, determined from the input parameters. Returns TRUE if it should be enabled, otherwise FALSE. |