Obtaining GPU Monitoring Metrics

Last updated: 2024-12-25 15:00:17

Add-On Overview

Tencent Kubernetes Engine (TKE) provides the elastic-gpu-exporter add-on for collecting GPU-related monitoring metrics, including:
GPU utilization
Pod/Container GPU resource utilization

Deployment Mode

elastic-gpu-exporter is deployed to a cluster as a DaemonSet using the following manifest (an example of applying it follows the manifest):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  labels:
    app: elastic-gpu-exporter
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      name: gpu-manager-ds
      app: nano-gpu-exporter
  template:
    metadata:
      name: elastic-gpu-exporter
      labels:
        name: gpu-manager-ds
        app: nano-gpu-exporter
    spec:
      nodeSelector:
        qgpu-device-enable: enable
      serviceAccount: elastic-gpu-exporter
      hostNetwork: true
      hostPID: true
      hostIPC: true
      containers:
        - image: ccr.ccs.tencentyun.com/tkeimages/elastic-gpu-exporter:v1.0.8
          imagePullPolicy: Always
          args:
            - --node=$(NODE_NAME)
          env:
            - name: "PORT"
              value: "5678"
            - name: "NODE_NAME"
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          name: elastic-gpu-exporter
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: cgroup
              readOnly: true
              mountPath: "/host/sys"
      volumes:
        - name: cgroup
          hostPath:
            type: Directory
            path: "/sys"
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - bindings
      - pods/binding
    verbs:
      - create
  - apiGroups:
      - ""
    resources:
      - configmaps
    verbs:
      - get
      - list
      - watch
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: elastic-gpu-exporter
subjects:
  - kind: ServiceAccount
    name: elastic-gpu-exporter
    namespace: kube-system
---
apiVersion: v1
kind: Service
metadata:
  name: elastic-gpu-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
spec:
  clusterIP: None
  ports:
    - name: elastic-gpu-exporter
      port: 5678
      protocol: TCP
      targetPort: 5678
  selector:
    app: nano-gpu-exporter
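The following commands are a minimal sketch of applying this manifest, assuming it is saved locally as elastic-gpu-exporter.yaml (a file name chosen here for illustration). The nodeSelector above only matches nodes carrying the qgpu-device-enable=enable label, so add that label to your GPU nodes first if it is not already present:
$ kubectl label node <gpu-node-name> qgpu-device-enable=enable
$ kubectl apply -f elastic-gpu-exporter.yaml
Replace <gpu-node-name> with the name of each GPU node on which the exporter should run.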


Checking Running Status

After deployment, a DaemonSet named elastic-gpu-exporter is created in the cluster:
NAME                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
elastic-gpu-exporter   1         1         1       1            1           <none>          3m36s
A running elastic-gpu-exporter Pod will be present on each node that matches the nodeSelector:
NAME                         READY   STATUS    RESTARTS   AGE
elastic-gpu-exporter-dblqm   1/1     Running   0          6s
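The outputs above can be obtained with standard kubectl commands, for example (the app=nano-gpu-exporter label comes from the Pod template in the manifest):
$ kubectl get daemonset elastic-gpu-exporter -n kube-system
$ kubectl get pods -n kube-system -l app=nano-gpu-exporter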

Obtaining Monitoring Metrics

On each node running the elastic-gpu-exporter service, monitoring metrics are exposed at the /metrics path on port 5678. Run the following command to obtain them (replace NodeIP with the node's IP address):
$ curl NodeIP:5678/metrics
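Because the Service above carries the prometheus.io/scrape: "true" annotation, a Prometheus instance configured to honor that annotation can also collect these metrics automatically. To inspect a single group of metrics manually, you can filter the exporter output, for example:
$ curl -s NodeIP:5678/metrics | grep '^gpu_'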

GPU Metrics

GPU metrics (gpu_xxx):
gpu_core_usage: Actual computing power usage of the GPU
gpu_mem_usage: Actual video memory usage of the GPU
gpu_core_utilization_percentage: GPU computing power utilization
gpu_mem_utilization_percentage: GPU video memory utilization
The GPU metrics format is as follows: 
gpu_core_usage{card="0",node="10.0.66.4"} 0
Note:
"card" represents the GPU serial number, and "node" represents the node where the GPU is located.

Pod Metrics

Pod metrics (pod_xxx):
pod_core_usage: Actual computing power usage of the Pod
pod_mem_usage: Actual video memory usage of the Pod
pod_core_utilization_percentage: Computing power used by the Pod as a percentage of its requested computing power
pod_mem_utilization_percentage: Video memory used by the Pod as a percentage of its requested video memory
pod_core_occupy_node_percentage: Computing power used by the Pod as a percentage of the node's total computing power
pod_mem_occupy_node_percentage: Video memory used by the Pod as a percentage of the node's total video memory
pod_core_request: Computing power requested by the Pod
pod_mem_request: Video memory requested by the Pod
The Pod metrics format is as follows: 
pod_core_usage{namespace="default",node="10.0.66.4",pod="7a2fa737-eef1-4801-8937-493d7efb16b7"} 0
Note:
"namespace" represents the namespace of the Pod, "node" represents the node where the Pod is located, and "pod" represents the name of the Pod.

Container Metrics

Container metrics (container_xxx):
container_gpu_utilization: Actual computing power usage of the container
container_gpu_memory_total: Actual video memory usage of the container
container_core_utilization_percentage: Computing power used by the container as a percentage of its requested computing power
container_mem_utilization_percentage: Video memory used by the container as a percentage of its requested video memory
container_request_gpu_memory: Video memory requested by the container
container_request_gpu_utilization: Computing power requested by the container
The container metrics format is as follows: 
container_gpu_utilization{container="cuda",namespace="default",node="10.0.66.4",pod="cuda"} 0
Note:
"container" represents the container name, "namespace" represents the namespace of the container, "node" represents the node where the container is located, and "pod" represents the name of the Pod where the container is located.
