Deployment and Practices

Last updated: 2022-12-09 15:08:50

    This document describes how to deploy and use TACO-Training on GPU instances.

    Notes

    • Currently, TACO-Training is supported only by Cloud GPU Service.
    • Currently, the three acceleration components of TACO-Training have been integrated into the same Docker image, which can be pulled at the following address:
      ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      

    Directions

    Preparing the instance environment

    1. Create at least two instances meeting the following requirements as instructed in Purchasing NVIDIA GPU Instance:
    • Instance: We recommend you select the Computing GT4 GT4.41XLARGE948 8-card model.
    • Image: Select CentOS 7.8, Ubuntu 18.04, or a later version, and select Automatically install GPU driver on the backend to have the GPU driver installed automatically.
      Automatic installation of CUDA and cuDNN is not required for this deployment; enable it as needed.
    • System disk: We recommend a system disk of 100 GB or larger to store the Docker image and the intermediate files generated during training.
    2. Install Docker as instructed in the document for your instance's operating system:
      OS       Description
      CentOS   For detailed directions, see Install Docker Engine on CentOS.
      Ubuntu   For detailed directions, see Install Docker Engine on Ubuntu.
    3. Install nvidia-docker as instructed in Docker.

    Using TTF

    Installation

    1. Run the following command as the root user to install TTF through the Docker image:

      docker pull ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      
    2. Run the following command to start Docker:

      docker run -it --rm --gpus all --shm-size=32g --ulimit memlock=-1 --ulimit stack=67108864 --name ttf1.15-gpu ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      
    3. Run the following command to view the TTF version:

      pip show ttensorflow
      

    Model adaptation

    Dynamic embedding

    Below is the code for the native static embedding of TF, for comparison with the dynamic embedding of TTF:

    deep_dynamic_variables = tf.get_variable(
       name="deep_dynamic_embeddings",
       initializer=tf.compat.v1.random_normal_initializer(0, 0.005),
       shape=[100000000, self.embedding_size])
    deep_sparse_weights = tf.nn.embedding_lookup(
       params=deep_dynamic_variables,
       ids=ft_sparse_val,
       name="deep_sparse_weights")
    deep_embedding = tf.gather(deep_sparse_weights, ft_sparse_idx)
    deep_embedding = tf.reshape(
       deep_embedding,
       shape=[self.batch_size, self.feature_num * self.embedding_size])
    

    By comparing the two versions, you can see that TTF mainly modifies two parts: how the embedding variable is created (the vocabulary size no longer needs to be fixed up front) and how the embedding lookup is performed.
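
    The TTF dynamic-embedding counterpart is not reproduced in this document. The sketch below is only an assumption of what it might look like: the module path tf.dynamic_embedding and the argument names are not confirmed TTF interfaces, so refer to the demo code shipped in the image under /opt/dynamic-embedding-demo for the exact calls.

    # Hypothetical sketch: `tf.dynamic_embedding.get_variable` and
    # `tf.dynamic_embedding.embedding_lookup` are assumed names, not confirmed
    # TTF API; see /opt/dynamic-embedding-demo in the image for the real usage.
    deep_dynamic_variables = tf.dynamic_embedding.get_variable(
       name="deep_dynamic_embeddings",
       initializer=tf.compat.v1.random_normal_initializer(0, 0.005),
       dim=self.embedding_size)  # no fixed vocabulary size: entries are created on demand
    deep_sparse_weights = tf.dynamic_embedding.embedding_lookup(
       params=deep_dynamic_variables,
       ids=ft_sparse_val,
       name="deep_sparse_weights")
    deep_embedding = tf.gather(deep_sparse_weights, ft_sparse_idx)
    deep_embedding = tf.reshape(
       deep_embedding,
       shape=[self.batch_size, self.feature_num * self.embedding_size])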

    Mixed precision

    Mixed precision can be implemented by rewriting the code of the optimizer or modifying environment variables:

    • Through code modification:

      opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
      
    • Through environment variable modification:

      export TF_ENABLE_AUTO_MIXED_PRECISION=1
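
    A minimal, self-contained sketch of the code-modification approach in TF 1.15-style training code (the toy model and data below are illustrative only, not part of the original document):

      import numpy as np
      import tensorflow as tf

      # Build a toy graph; any TF 1.x model and loss would work here.
      x = tf.placeholder(tf.float32, shape=[None, 32])
      labels = tf.placeholder(tf.float32, shape=[None, 1])
      logits = tf.layers.dense(x, 1)
      loss = tf.losses.sigmoid_cross_entropy(labels, logits)

      opt = tf.train.AdamOptimizer(learning_rate=1e-3)
      # Rewrite the graph to run eligible ops in float16 and add automatic loss scaling.
      opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
      train_op = opt.minimize(loss)

      with tf.Session() as sess:
          sess.run(tf.global_variables_initializer())
          sess.run(train_op, feed_dict={
              x: np.random.rand(16, 32).astype(np.float32),
              labels: np.random.randint(0, 2, (16, 1)).astype(np.float32)})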
      

    The loss change curve of ResNet-50 trained on the ImageNet dataset with TTF mixed precision is as follows:

    XLA

    XLA can be configured through the code or environment variables:

    • Through code modification:

      config = tf.ConfigProto()
      config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
      sess = tf.Session(config=config)
      
    • Through environment variable modification:

      TF_XLA_FLAGS=--tf_xla_auto_jit=1
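
    For context, a minimal, self-contained sketch (TF 1.x style, toy shapes only, not from the original document) that applies the global JIT setting shown above when creating the session:

      import numpy as np
      import tensorflow as tf

      # Enable global XLA JIT compilation for the session.
      config = tf.ConfigProto()
      config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

      x = tf.placeholder(tf.float32, shape=[None, 1024])
      w = tf.Variable(tf.random_normal([1024, 256]))
      y = tf.nn.relu(tf.matmul(x, w))

      with tf.Session(config=config) as sess:
          sess.run(tf.global_variables_initializer())
          out = sess.run(y, feed_dict={x: np.ones((8, 1024), dtype=np.float32)})
          print(out.shape)  # (8, 256)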
      

    Demo

    Before running the demo:

    1. Run the following command to enter the demo directory:

      cd /opt/dynamic-embedding-demo
      
    2. Run the following command to download the dataset:

      bash download_dataset.sh 
      

    You can get started with TTF quickly by using the following demos:

    benchmark


    This demo is used to compare and test the performance of the dynamic embedding and the native static embedding:

    # Enter the `benchmark` directory
    cd benchmark
    # Run the demo according to the default configuration
    python train.py
    # You need to delete the local dataset cache files every time you modify `batch size`.
    rm -f .index .data-00000-of-00001
    python train.py --batch_size=16384
    # Use the static embedding and dynamic embedding to train the DeepFM model respectively
    python train.py --batch_size=16384 --is_dynamic=False
    python train.py --batch_size=16384 --is_dynamic=True
    # Adjust the number of fully connected layers of the deep part
    python train.py --batch_size=16384 --dnn_layer_num=12
    

    ps

    This demo shows how to use the dynamic embedding in PS (parameter server) mode.

    cd ps && bash start.sh
    

    Estimator


    This demo shows how to use the dynamic embedding in estimator mode.

    cd estimator && bash start.sh
    

    Using LightCC

    Installation

    1. Run the following command as the root user to install LightCC through the Docker image:

      docker pull ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      
    2. Run the following command to start Docker:

      docker run --network host -it --rm --gpus all --privileged --shm-size=32g --ulimit memlock=-1 --ulimit stack=67108864 --name lightcc ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      
    3. Run the following command to view the LightCC version:

      pip show light-horovod
      
    Note

    When NCCL network communication uses the kernel protocol stack, or when the runtime environment of the HARP protocol stack has not been configured, move the /usr/lib/x86_64-linux-gnu/libnccl-net.so file in the image to a path outside the system lib directory, such as /root. This is because, during initialization, the plugin checks for the HARP configuration files under a directory in the lib path and reports an error if they do not exist.

    Environment variable configuration

    The LightCC environment variables are as detailed below and can be configured as needed:

    Environment Variable           Default Value   Description
    LIGHT_2D_ALLREDUCE             0               Whether to use the 2D-Allreduce algorithm
    LIGHT_INTRA_SIZE               8               Number of GPUs in a 2D-Allreduce group
    LIGHT_HIERARCHICAL_THRESHOLD   1073741824      Threshold for 2D-Allreduce in bytes. Only data smaller than this threshold uses 2D-Allreduce.
    LIGHT_TOPK_ALLREDUCE           0               Whether to use TOPK to compress the communication data
    LIGHT_TOPK_RATIO               0.01            Compression ratio of TOPK
    LIGHT_TOPK_THRESHOLD           1048576         Threshold for TOPK compression in bytes. Only communication data of at least this size is compressed through TOPK.
    LIGHT_TOPK_FP16                0               Whether to convert the values of the compressed communication data to FP16
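
    These variables are normally passed on the mpirun command line with -x, as in the demo below. If you prefer to set them from the training script itself, the following sketch is one option; it assumes that light-horovod keeps the standard horovod.tensorflow import and reads the variables from the process environment when Horovod initializes:

      import os

      # Configure LightCC before Horovod is initialized (assumption: the
      # variables are read from the environment at init time; the documented
      # way is `mpirun -x VAR=value`).
      os.environ["LIGHT_2D_ALLREDUCE"] = "1"                     # enable 2D-Allreduce
      os.environ["LIGHT_TOPK_ALLREDUCE"] = "1"                   # enable TOPK compression
      os.environ["LIGHT_TOPK_THRESHOLD"] = str(2 * 1024 * 1024)  # only compress data >= 2 MB

      import horovod.tensorflow as hvd
      hvd.init()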

    Demo

    1. After creating two GPU instances, install LightCC and configure the environment variables as described in the steps above.
    2. Run the following commands in the container to configure passwordless SSH login for the instances:
      # Allow the root user to use the SSH service and start the service (default port: 22)
      sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
      service ssh start && netstat -tulpn
      
    # Change the default SSH port in the container to 2222 to avoid conflicts with the host
    sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
    service ssh restart && netstat -tulpn
    
    # Set `root passwd`
    passwd root
    
    # Generate an SSH key
    ssh-keygen
    
    # Configure SSH to use port 2222 by default
    # Create `~/.ssh/config`, add the following content, and save and exit the file:
    # Note: The IP used here is the IP displayed in `ifconfig eth0` of the two instances.
    Host gpu1
    hostname 10.0.2.8
    port 2222
    Host gpu2
    hostname 10.0.2.9
    port 2222
    
    # Configure mutual passwordless login for the instances and local passwordless login for each instance
    ssh-copy-id gpu1
    ssh-copy-id gpu2
    
    # Test whether passwordless login is configured successfully
    ssh gpu1 
    ssh gpu2
    
    3. Run the following command to download the benchmark test script of Horovod:

      wget https://raw.githubusercontent.com/horovod/horovod/master/examples/tensorflow/tensorflow_synthetic_benchmark.py
      
    4. Run the following command to start the multi-server training benchmark based on ResNet-50:

      /usr/local/openmpi/bin/mpirun -np 16 -H gpu1:8,gpu2:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0  -x HOROVOD_CYCLE_TIME=0 -x LIGHT_2D_ALLREDUCE=1 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=256
      

    The command above is for two 8-card instances. For a different configuration, modify the -np and -H parameters. The other parameters are as detailed below:

    • NCCL_ALGO=RING: Select the ring algorithm as the communication algorithm in NCCL.
    • NCCL_DEBUG=INFO: Enable debugging output in NCCL.
    • -mca btl_tcp_if_include eth0: Select the eth0 device as the network device for MPI multi-server communication. If the instance has multiple ENIs, specify the one to use, because some ENIs cannot communicate and MPI will report an error if it picks one of them.

    The reference throughput of the LightCC multi-server training benchmark on two GT4.41XLARGE948 instances is as follows:

    Model: CVM GT4.41XLARGE948 (A100 * 8 + 50G VPC)
    GPU driver: 460.27.04
    Container: ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2

    Network                 ResNet50 Throughput (images/sec)
                            horovod 0.21.3    lightcc 3.0.0
    NCCL + HARP             8970.7            10229.9
    NCCL + kernel socket    5421.9            7183.5

    Using HARP

    Preparing the instance environment

    1. Run the following command as the root user to modify the kernel cmdline and configure 50 GB of huge page memory:
      sed -i '/GRUB_CMDLINE_LINUX/ s/"$/ default_hugepagesz=1GB hugepagesz=1GB hugepages=50"/' /etc/default/grub
      
    Note

    Configure hugepages=50 for an 8-card instance, or hugepages = (number of GPU cards × 5 + 10) for other models; for example, a 4-card instance needs hugepages=30.

    2. Run the following command based on your operating system version to make the configuration take effect and restart the instance:

    sudo update-grub2 && sudo reboot
    

    3. Bind ENIs as follows. The number of ENIs should be the same as the number of GPU cards.
      3.1 Log in to the CVM console, select More > IP/ENI > Bind ENI on the right of the target instance.
      3.2 In the Bind ENI pop-up window, select created ENIs or create new ones and bind them as needed.
      3.3 Click OK.
    4. Run the following command to initialize the configuration information:
      curl -s -L http://mirrors.tencent.com/install/GPU/taco/harp_setup.sh | bash
      

    If the output contains "Set up HARP successfully" and ztcp*.conf configuration files have been generated in the /usr/local/tfabric/tools/config directory, the configuration is complete.

    Installation

    1. Run the following command as the root user to install HARP through the Docker image:

      docker pull ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      
    2. Run the following command to start Docker:

      docker run -it --rm --gpus all --privileged --net=host -v /sys:/sys -v /dev/hugepages:/dev/hugepages -v /usr/local/tfabric/tools:/usr/local/tfabric/tools ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
      

    Demo

    1. After creating two GPU instances, configure the instance environment and install HARP as described in the steps above.
    2. Run the following commands in Docker to configure passwordless SSH login for the instances:
      # Allow the root user to use the SSH service and start the service (default port: 22)
      sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
      service ssh start && netstat -tulpn
      
    # Change the default SSH port in the container to 2222 to avoid conflicts with the host
    sed -i 's/#Port 22/Port 2222/' /etc/ssh/sshd_config
    service ssh restart && netstat -tulpn
    
    # Set `root passwd`
    passwd root
    
    # Generate an SSH key
    ssh-keygen
    
    # Configure SSH to use port 2222 by default
    # Create `~/.ssh/config`, add the following content, and save and exit the file:
    # Note: The IP used here is the IP displayed in `ifconfig eth0` of the two instances.
    Host gpu1
    hostname 10.0.2.8
    port 2222
    Host gpu2
    hostname 10.0.2.9
    port 2222
    
    # Configure mutual passwordless login for the instances and local passwordless login for each instance
    ssh-copy-id gpu1
    ssh-copy-id gpu2
    
    # Test whether passwordless login is configured successfully
    ssh gpu1 
    ssh gpu2
    
    3. Run the following command to download the benchmark script of Horovod:

      wget https://raw.githubusercontent.com/horovod/horovod/master/examples/tensorflow/tensorflow_synthetic_benchmark.py
      
    4. Run the following command to start the multi-server training benchmark based on ResNet-50:

      /usr/local/openmpi/bin/mpirun -np 16 -H gpu1:8,gpu2:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0  -x HOROVOD_CYCLE_TIME=0 -x LIGHT_2D_ALLREDUCE=1 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 tensorflow_synthetic_benchmark.py --model=ResNet50 --batch-size=256
      

    The command above is for two 8-card instances. For a different configuration, modify the -np and -H parameters. The other parameters are as detailed below:

    • NCCL_ALGO=RING: Select the ring algorithm as the communication algorithm in NCCL.
    • NCCL_DEBUG=INFO: Enable debugging output in NCCL. With this enabled, the NCCL log shows whether the HARP plugin was loaded and, after initialization, which network transport is in use.
    • -mca btl_tcp_if_include eth0: Select the eth0 device as the network device for MPI multi-server communication. If the instance has multiple ENIs, specify the one to use, because some ENIs cannot communicate and MPI will report an error if it picks one of them.
    5. HARP is integrated into NCCL as a plugin and is enabled automatically, with no configuration required. To disable HARP, run the following command in the container:
      mv /usr/lib/x86_64-linux-gnu/libnccl-net.so /usr/lib/x86_64-linux-gnu/libnccl-net.so_bak
      