

| Metric Name | Meaning | Unit | Level |
| --- | --- | --- | --- |
| GpuUtil | Evaluates the compute capacity consumed by the workload (percentage of non-idle time) | % | per-GPU |
| gpu_mem_used | Evaluates the amount of GPU memory used by the workload | MB | per-GPU |
| GpuMemUsage | Evaluates GPU memory usage as a percentage of total GPU memory | % | per-GPU |
| GpuPowDraw | Evaluates GPU power consumption | W | per-GPU |
| GpuTemp | Evaluates GPU thermal status | °C | per-GPU |
| GpuEncUtil | Evaluates encoder usage percentage | % | per-GPU |
| GpuDecUtil | Evaluates decoder usage percentage | % | per-GPU |
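
These per-GPU metrics correspond closely to fields exposed by the nvidia-smi CLI, so their values can be cross-checked directly on the instance. The following is a minimal sketch, assuming Python 3 and nvidia-smi are available on the instance; the function name read_gpu_metrics and the derived memory.usage_pct key are illustrative, while the field names passed to --query-gpu are standard nvidia-smi query fields (the console metric names from the table are noted in the comments).

```python
import csv
import subprocess

# nvidia-smi query fields and the console metrics they correspond to:
#   utilization.gpu -> GpuUtil, memory.used -> gpu_mem_used,
#   power.draw -> GpuPowDraw, temperature.gpu -> GpuTemp.
# GpuMemUsage (%) is derived below from memory.used / memory.total.
# Encoder/decoder utilization (GpuEncUtil/GpuDecUtil) can be viewed with
# `nvidia-smi -q -d UTILIZATION`.
FIELDS = [
    "index",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "power.draw",
    "temperature.gpu",
]

def read_gpu_metrics():
    """Yield one dict of raw readings per GPU (illustrative helper)."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    for row in csv.reader(out.splitlines()):
        values = dict(zip(FIELDS, (v.strip() for v in row)))
        # GpuMemUsage (%) = gpu_mem_used (MB) / total GPU memory (MB) * 100
        values["memory.usage_pct"] = round(
            100.0 * float(values["memory.used"]) / float(values["memory.total"]), 2
        )
        yield values

if __name__ == "__main__":
    for gpu in read_gpu_metrics():
        print(gpu)
```

Each printed dictionary holds one GPU's readings, with the derived memory.usage_pct value showing how GpuMemUsage relates to gpu_mem_used and total GPU memory.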

| Metric Name | Display Name | Metric Description | Unit | Level |
| --- | --- | --- | --- | --- |
| GpuMemUsage | GPU memory utilization | Evaluates GPU memory usage as a percentage of total GPU memory | % | per-GPU |
| gpu_mem_used | GPU memory usage | Evaluates the amount of GPU memory used by the workload | MB | per-GPU |
| GpuPowDraw | GPU power usage | Evaluates GPU power consumption | W | per-GPU |
| GpuTemp | GPU temperature | Evaluates GPU thermal status | °C | per-GPU |
| GpuUtil | GPU utilization | Evaluates compute capacity consumed by the workload (percentage of non-idle time) | % | per-GPU |
| GpuEncUtil | GPU encoder utilization | Evaluates encoder usage percentage | % | per-GPU |
| GpuDecUtil | GPU decoder utilization | Evaluates decoder usage percentage | % | per-GPU |


| Metric Name | Recommended Alarm Threshold | Description | Suggested Action |
| --- | --- | --- | --- |
| GpuPowDraw | <= 0 | A power reading less than or equal to 0 indicates a potential "Unknown Error" in the power reading, which affects normal GPU operation. | Run the nvidia-smi command to check whether the GPU power shows ERR, or run nvidia-smi -i <target gpu> -q \| grep "Power Draw" to check whether it shows Unknown Error. If it does, try restarting the machine to recover, update the driver, and monitor the status. If the issue persists after the restart, Submit a Ticket to contact Tencent Cloud support. |
| GpuTemp | > 80 (sustained for 5 minutes) | An excessively high GPU temperature may trigger GPU slowdown, impacting business performance. | A high load may cause high GPU temperature. Try restarting the instance to recover. If it cannot be recovered, Submit a Ticket to contact Tencent Cloud support. |
| gpu_retired_page_pending | = 1 | A GPU with an architecture earlier than Ampere encountered an ECC error, application processes were killed, and the GPU card is in a pending state. | Run the nvidia-smi -i <target gpu> -q -d PAGE_RETIREMENT command to check whether any GPU card is in a pending state. Reset the GPU card or restart the instance to recover. If restarting does not resolve the issue, Submit a Ticket to contact Tencent Cloud support. |
| gpu_ecc_teminate_app | = 1 | A GPU with the Ampere or a later architecture encountered an ECC error, application processes were killed, and the GPU card is in a pending state. | Run the nvidia-smi -i <target gpu> -q -d ROW_REMAPPER command to check whether any GPU card is in a pending state. Reset the GPU card or restart the instance to recover. If restarting does not resolve the issue, Submit a Ticket to contact Tencent Cloud support. |
| GpuMemUsage | Monitor only | - | Evaluate the impact of the workload on GPU memory usage. |
| GpuUtil | Monitor only | - | Evaluate the impact of the workload on GPU streaming multiprocessor (SM) usage. |
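
Before configuring alarms, the power and temperature thresholds above can be checked on the instance with the same nvidia-smi fields. Below is a minimal sketch, assuming Python 3 and the nvidia-smi CLI; the threshold constants reflect the values in the table, the check_alarms function name and alert messages are illustrative, and the "sustained for 5 minutes" condition is left to the alarm platform rather than handled by this script.

```python
import subprocess

# Illustrative thresholds taken from the alarm table above.
POWER_ERR_THRESHOLD_W = 0    # power draw <= 0 suggests an "Unknown Error" power reading
TEMP_ALARM_THRESHOLD_C = 80  # temperature > 80°C (sustained) warrants an alarm

def query(fields):
    """Return one list of values per GPU for the given nvidia-smi query fields."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(fields),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split(", ") for line in out.strip().splitlines()]

def check_alarms():
    alerts = []
    for index, power, temp in query(["index", "power.draw", "temperature.gpu"]):
        # A non-numeric or non-positive power reading corresponds to the
        # "Unknown Error" case described for the power metric.
        try:
            power_w = float(power)
        except ValueError:
            power_w = -1.0
        if power_w <= POWER_ERR_THRESHOLD_W:
            alerts.append(f"GPU {index}: abnormal power reading ({power!r}); "
                          "check Power Draw in nvidia-smi -q and consider a restart.")
        if float(temp) > TEMP_ALARM_THRESHOLD_C:
            alerts.append(f"GPU {index}: temperature {temp} C exceeds "
                          f"{TEMP_ALARM_THRESHOLD_C} C; sustained high temperature "
                          "may cause GPU slowdown.")
    return alerts

if __name__ == "__main__":
    for alert in check_alarms():
        print(alert)
```

The ECC-related checks (gpu_retired_page_pending and gpu_ecc_teminate_app) are better verified with the exact commands listed in the table (nvidia-smi -q -d PAGE_RETIREMENT or ROW_REMAPPER), since their output is textual rather than a simple numeric threshold.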