Running TF Training Job

Last updated: 2023-05-19 17:12:40

    This document describes how to run a TensorFlow (TF) training job with TF-Operator.

    Prerequisites

    • TF-Operator is installed in your AI environment.
    • Your AI environment has GPU resources.

    Directions

    The following steps are based on the official TF-Operator example of distributed training in parameter server/worker mode.

    Preparing the training code

    This example uses the dist_mnist.py code sample from the official Kubeflow website.

    Creating a training image

    To create the training image, start from an official TensorFlow 1.5.0 image, copy the training code above into it, and configure the entrypoint to run that code.
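The image described above can be sketched as a short script that writes a Dockerfile. This is only an illustration under assumptions: the base tag `tensorflow/tensorflow:1.5.0` (a `-gpu` variant may suit GPU training better), the in-image path `/var/tf_dist_mnist`, and the presence of `dist_mnist.py` in the current directory are not specified by this document.

```shell
#!/bin/sh
set -eu

# Assumption: dist_mnist.py has already been downloaded into the current directory.
BASE_IMAGE="tensorflow/tensorflow:1.5.0"   # assumed official tag, not named in this document
CODE_DIR="/var/tf_dist_mnist"              # hypothetical location of the code inside the image

# Write a Dockerfile that copies the training code and sets it as the entrypoint.
cat > Dockerfile <<EOF
FROM ${BASE_IMAGE}
COPY dist_mnist.py ${CODE_DIR}/dist_mnist.py
ENTRYPOINT ["python", "${CODE_DIR}/dist_mnist.py"]
EOF

echo "Wrote Dockerfile based on ${BASE_IMAGE}"
```

After the Dockerfile is written, the image would typically be built with `docker build -t <training image> .` and uploaded with `docker push <training image>`, where `<training image>` is the address in your own image registry.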

    Note

    If the entrypoint is not configured in the image, you can instead specify the container startup command when submitting the TFJob.
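As a rough sketch of the note above, the script below writes a container entry in which the Kubernetes `command` field takes the place of the image entrypoint. The script path `/var/tf_dist_mnist/dist_mnist.py` and the output file name `container-command.yaml` are hypothetical; the resulting fragment is meant to be merged by hand into the `template.spec` of each replica in the TFJob.

```shell
#!/bin/sh
set -eu

IMAGE="<training image>"                    # placeholder; replace with your image address
SCRIPT="/var/tf_dist_mnist/dist_mnist.py"   # hypothetical path of the code inside the image

# Emit a container entry whose `command` overrides the image's ENTRYPOINT.
cat > container-command.yaml <<EOF
containers:
  - name: tensorflow
    image: ${IMAGE}
    command: ["python", "${SCRIPT}"]
EOF

cat container-command.yaml
```

Setting `command` in the pod template keeps the image generic, at the cost of repeating the startup command for both the PS and Worker replicas.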

    Submitting the job

    1. Prepare a TFJob YAML file (for example, tf_job_mnist.yaml) that defines two parameter servers and four workers.

      Note

      You need to replace the <training image> placeholder with the address of the uploaded training image.

      apiVersion: "kubeflow.org/v1"
      kind: "TFJob"
      metadata:
        name: "dist-mnist-for-e2e-test"
      spec:
        tfReplicaSpecs:
          PS:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: <training image>
          Worker:
            replicas: 4
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: <training image>
      
    2. Run the following kubectl command to submit the TFJob:

      kubectl create -f ./tf_job_mnist.yaml
      
    3. Run the following commands to view the status of the job and its pods:

      kubectl get tfjob dist-mnist-for-e2e-test -o yaml
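      # The pod label key depends on the operator version (for example, training.kubeflow.org/job-name in newer releases).
      kubectl get pods -l tf-job-name=dist-mnist-for-e2e-test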
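Once the job is submitted, a few helper functions can make status checks easier. The following is a minimal sketch, not part of the official example: the function names are hypothetical, and it assumes that the operator records status conditions such as Running and Succeeded on the TFJob and names pods `<job>-<replica type>-<index>`.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting the TFJob submitted above.
JOB_NAME="dist-mnist-for-e2e-test"   # name taken from the TFJob YAML file

# Print the type of the most recent status condition (for example, Running or Succeeded).
tfjob_last_condition() {
  kubectl get tfjob "$1" -o jsonpath='{.status.conditions[-1:].type}'
}

# Block until the job reports the Succeeded condition or the timeout (default 30m) expires.
tfjob_wait() {
  kubectl wait --for=condition=Succeeded "tfjob/$1" --timeout="${2:-30m}"
}

# Print the logs of one replica pod, assuming the <job>-<replica type>-<index> naming scheme.
tfjob_logs() {
  kubectl logs "${1}-${2}-${3}"
}
```

For example, after sourcing the script, `tfjob_wait "$JOB_NAME" 1h` waits for up to one hour, and `tfjob_logs "$JOB_NAME" worker 0` prints the training output of the first worker.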
      