Using GPU Instance to Train ViT Model

Last updated: 2024-01-11 17:11:13
    Note:
    This document is written by a Cloud GPU Service user and is for study and reference only.

    Overview

    This document describes how to use a GPU instance to train a ViT model offline to complete a simple image classification task.

    ViT Model Overview

    The Vision Transformer (ViT) model was proposed by Alexey Dosovitskiy et al. and achieves state-of-the-art (SOTA) results on multiple image recognition tasks.
    
    
    
    For an input image, ViT splits it into multiple sub-image patches. Each patch is combined with a position embedding, concatenated with a class token, and fed into the Transformer encoder. The encoder output at the class-token position is then passed through a classification head to produce the ViT prediction. During pretraining, the prediction target can instead be the content of a masked patch.
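    For intuition, below is a minimal sketch (not part of the original article) that uses the vit_tiny_patch16_224 model from pytorch-image-models (timm), which this document trains later, to show how an image is turned into patch embeddings and a classification output. The shapes in the comments assume the default tiny configuration (16x16-pixel patches, 192-dim embeddings).
    import torch
    from timm.models import vit_tiny_patch16_224

    model = vit_tiny_patch16_224(num_classes=5)
    x = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
    patches = model.patch_embed(x)       # (1, 196, 192): 14x14 = 196 patches, each embedded to 192 dims
    logits = model(x)                    # (1, 5): class scores read from the class-token position
    print(patches.shape, logits.shape)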

    Instance Environment

    Instance type: In this document, you can select a GN7 or GN8 instance. Based on the GPU performance comparison provided in Tesla P40 vs Tesla T4, the T4 (Turing architecture) outperforms the P40 (Pascal architecture). Therefore, GN7.5XLARGE80 is selected in this document.
    Region: As large datasets may need to be uploaded, we recommend that you select the region with the lowest latency. This document uses an online ping tool for testing; because the latency from the test location to the Chongqing region, where GN7 instances are available, is the lowest, the Chongqing region is selected in this example.
    System disk: 100 GB Premium Cloud Storage disk.
    Operating system: Ubuntu 18.04.
    Bandwidth: 5 Mbps.
    Local operating system: macOS.

    Directions

    Setting passwordless login for your instance (optional)

    1. (Optional) You can configure a server alias for the GPU instance in ~/.ssh/config on your local machine, as shown in the sketch after these steps. In this document, the alias tcg is used.
    2. Run the ssh-copy-id command to copy the SSH public key of the local machine to the GPU instance.
    3. Run the following command in the GPU instance to disable password login to enhance security:
    echo 'PasswordAuthentication no' | sudo tee -a /etc/ssh/sshd_config
    4. Run the following command to restart the SSH service:
    sudo systemctl restart sshd
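    Below is a hypothetical example of the alias configuration mentioned in step 1; the IP address, user name, and key path are placeholders that you should replace with your own values.
    # ~/.ssh/config on the local machine
    Host tcg
        HostName 203.0.113.10        # public IP of the GPU instance (placeholder)
        User ubuntu                  # login user of the instance (placeholder)
        IdentityFile ~/.ssh/id_rsa   # private key whose public key is copied in step 2
    With this entry in place, step 2 can be run simply as ssh-copy-id tcg, and the instance can then be reached with ssh tcg.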

    Configuring the PyTorch-GPU development environment

    To use PyTorch with GPU support for development, you need to further configure the environment as follows:
    1. Install the NVIDIA graphics card driver.
    Run the following command to install the NVIDIA graphics card driver:
    sudo apt install nvidia-driver-418
    After the installation is completed, run the following command to check whether the installation is successful:
    nvidia-smi
    If the following result is returned, the installation is successful.
    
    
    
    2. Configure the conda environment.
    Run the following commands to configure the conda environment:
    wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh
    chmod +x Miniconda3-py39_4.11.0-Linux-x86_64.sh
    ./Miniconda3-py39_4.11.0-Linux-x86_64.sh
    rm Miniconda3-py39_4.11.0-Linux-x86_64.sh
    3. Edit the ~/.condarc file to add the following software source information, which replaces the default conda software sources with the Tsinghua mirror.
    channels:
      - defaults
    show_channel_urls: true
    default_channels:
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
      - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
    custom_channels:
      conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
      simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
    4. Run the following command to set the pip source to the Tencent Cloud image source.
    pip config set global.index-url https://mirrors.cloud.tencent.com/pypi/simple
    5. Install PyTorch.
    Run the following command to install PyTorch:
    conda install pytorch torchvision cudatoolkit=11.4 -c pytorch --yes
    Run the following commands to check whether PyTorch is installed successfully:
    python
    import torch
    print(torch.cuda.is_available())
    If the following result is returned, PyTorch is installed successfully:
    
    
    

    Preparing the experiment data

    The test task in this training is an image classification task and uses the flower image classification dataset provided in the Tencent Cloud online documentation. The dataset contains five classes of flowers and is 218 MB in size. Below are sample images from each class:
    
    
    
    In the raw dataset, the images of each class are stored in a folder named after that class. You need to convert the data to the standard ImageNet directory format and split it into training and validation sets at a ratio of 4:1. Use the following code to do the conversion:
    # split data into train set and validation set, train:val = scale:1
    import shutil
    import os
    import math

    scale = 4
    data_path = '../raw'
    data_dst = '../train_val'

    # create imagenet directory structure
    os.mkdir(data_dst)
    os.mkdir(os.path.join(data_dst, 'train'))
    os.mkdir(os.path.join(data_dst, 'validation'))

    for item in os.listdir(data_path):
        item_path = os.path.join(data_path, item)
        if os.path.isdir(item_path):
            train_dst = os.path.join(data_dst, 'train', item)
            val_dst = os.path.join(data_dst, 'validation', item)
            os.mkdir(train_dst)
            os.mkdir(val_dst)
            files = os.listdir(item_path)
            print(f'Class {item}:\n\t Total sample count is {len(files)}')
            split_idx = math.floor(len(files) * scale / (1 + scale))
            print(f'\t Train sample count is {split_idx}')
            print(f'\t Val sample count is {len(files) - split_idx}\n')
            for idx, file in enumerate(files):
                file_path = os.path.join(item_path, file)
                if idx < split_idx:
                    shutil.copy(file_path, train_dst)
                else:
                    shutil.copy(file_path, val_dst)

    print(f'Split Complete. File path: {data_dst}')
    Below is the dataset overview:
    Class roses:
        Total sample count is 641
        Train sample count is 512
        Validation sample count is 129
    Class sunflowers:
        Total sample count is 699
        Train sample count is 559
        Validation sample count is 140
    Class tulips:
        Total sample count is 799
        Train sample count is 639
        Validation sample count is 160
    Class daisy:
        Total sample count is 633
        Train sample count is 506
        Validation sample count is 127
    Class dandelion:
        Total sample count is 898
        Train sample count is 718
        Validation sample count is 180
    To accelerate the training process, you also need to convert the dataset to a GPU-friendly format that the NVIDIA Data Loading Library (DALI) can read efficiently (TFRecord files plus index files). DALI offloads data preprocessing from the CPU to the GPU. Once the data is in ImageNet format, you can run the following commands to convert it for DALI:
    git clone https://github.com/ver217/imagenet-tools.git
    cd imagenet-tools && python3 make_tfrecords.py \
        --raw_data_dir="../train_val" \
        --local_scratch_dir="../train_val_tfrecord" && \
    python3 make_idx.py --tfrecord_root="../train_val_tfrecord"

    Model training result

    To facilitate subsequent training of large distributed models, this document describes how to train and develop a model based on the distributed training framework Colossal-AI. Colossal-AI provides a set of easy-to-use APIs that let you perform data, model, pipeline, and mixed parallel training.
    Based on the demo provided by Colossal-AI, this document uses the ViT implementation integrated in the pytorch-image-models (timm) repository. The smallest model, vit_tiny_patch16_224, is used at a resolution of 224x224, where each image is split into 16x16-pixel patches.
    1. Run the following commands to install Colossal-AI and pytorch-image-models as instructed in Start Locally:
    pip install colossalai==0.1.5+torch1.11cu11.3 -f https://release.colossalai.org
    pip install timm
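    After the installation, you can optionally run a quick check (not part of the original steps) to confirm that both packages import correctly and report their versions:
    python
    import colossalai, timm
    print(colossalai.__version__, timm.__version__)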
    2. Write the following model training code based on the demo provided by Colossal-AI:
    from pathlib import Path
    from colossalai.logging import get_dist_logger
    import colossalai
    import torch
    import os
    from colossalai.core import global_context as gpc
    from colossalai.utils import get_dataloader, MultiTimer
    from colossalai.trainer import Trainer, hooks
    from colossalai.nn.metric import Accuracy
    from torchvision import transforms
    from colossalai.nn.lr_scheduler import CosineAnnealingLR
    from tqdm import tqdm
    from titans.utils import barrier_context
    from colossalai.nn.lr_scheduler import LinearWarmupLR
    from timm.models import vit_tiny_patch16_224
    from titans.dataloader.imagenet import build_dali_imagenet
    from mixup import MixupAccuracy, MixupLoss


    def main():
        parser = colossalai.get_default_parser()
        args = parser.parse_args()
        colossalai.launch_from_torch(config='./config.py')

        logger = get_dist_logger()

        # build model
        model = vit_tiny_patch16_224(num_classes=5, drop_rate=0.1)

        # build dataloader
        root = os.environ.get('DATA', '../train_val_tfrecord')
        train_dataloader, test_dataloader = build_dali_imagenet(
            root, rand_augment=True)

        # build criterion
        criterion = MixupLoss(loss_fn_cls=torch.nn.CrossEntropyLoss)

        # optimizer
        optimizer = torch.optim.SGD(
            model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)

        # lr_scheduler
        lr_scheduler = CosineAnnealingLR(
            optimizer, total_steps=gpc.config.NUM_EPOCHS)

        engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
            model,
            optimizer,
            criterion,
            train_dataloader,
            test_dataloader,
        )

        # build a timer to measure time
        timer = MultiTimer()

        # create a trainer object
        trainer = Trainer(engine=engine, timer=timer, logger=logger)

        # define the hooks to attach to the trainer
        hook_list = [
            hooks.LossHook(),
            hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
            hooks.AccuracyHook(accuracy_func=MixupAccuracy()),
            hooks.LogMetricByEpochHook(logger),
            hooks.LogMemoryByEpochHook(logger),
            hooks.LogTimingByEpochHook(timer, logger),
            hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
            hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
        ]

        # start training
        trainer.fit(train_dataloader=train_dataloader,
                    epochs=gpc.config.NUM_EPOCHS,
                    test_dataloader=test_dataloader,
                    test_interval=1,
                    hooks=hook_list,
                    display_progress=True)


    if __name__ == '__main__':
        main()
    Below is the model configuration, saved as config.py (the file referenced by launch_from_torch above):
    from colossalai.amp import AMP_TYPE

    BATCH_SIZE = 128
    DROP_RATE = 0.1
    NUM_EPOCHS = 200

    CONFIG = dict(fp16=dict(mode=AMP_TYPE.TORCH))

    gradient_accumulation = 16
    clip_grad_norm = 1.0

    dali = dict(
        gpu_aug=True,
        mixup_alpha=0.2
    )
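    Save the training code as a script (assumed here to be named train.py) in the same directory as config.py. Because the code calls colossalai.launch_from_torch, it can be started through PyTorch's distributed launcher; a possible single-GPU launch command is:
    torchrun --standalone --nproc_per_node=1 train.py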
    Below is the model training process. Each epoch takes less than 20 seconds:
    
    
    
    The result shows that the highest accuracy of the model on the validation dataset is 66.62%. You can also increase the number of model parameters, for example, by switching from the tiny model to a larger ViT variant in pytorch-image-models.

    Summary

    The biggest problem encountered in this example was that cloning from GitHub was very slow. To work around it, a tunnel and ProxyChains were initially used for acceleration. However, this violated the CVM use rules and caused the instance to be unavailable for a period of time. The problem was eventually solved by removing the proxy and submitting a ticket. Using a public network proxy does not comply with the CVM use regulations; to guarantee the stable operation of your business, do not violate them.

    References

    [1] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv preprint arXiv:2010.11929 (2020).
    [2] NVIDIA/DALI
    [3] Bian, Zhengda, et al. "Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training." arXiv preprint arXiv:2110.14883 (2021).