Troubleshooting Common Xid Errors

Last updated: 2026-01-19 12:03:00
This document explains what Xid messages are and provides explanations and troubleshooting methods for common Xid errors.

What Are Xid Messages?

Xid messages are error reports that the NVIDIA driver prints to the operating system's kernel log or event log. An Xid message indicates that a GPU error has occurred, most often because the driver programmed the GPU incorrectly or because commands sent to the GPU were corrupted.

How to Check Xid Error Information?

When using a GPU instance, run the following command to check the kernel log for Xid-related errors, and save the output for later reference.
dmesg | grep -i xid
If the command returns no output, no Xid messages have been logged.
If the command returns Xid messages, follow the recommended solution for each Xid code below or contact online support.
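When the log contains many Xid lines, it can help to reduce them to a per-code count before looking up the recommendations. The sketch below assumes the usual NVRM line shape, `NVRM: Xid (PCI:<address>): <code>, ...`; `xid_summary` is a hypothetical helper name, not part of any NVIDIA or Tencent Cloud tool.

```shell
#!/bin/sh
# Summarize the Xid codes found in kernel-log text read from stdin.
# Assumed line shape: "NVRM: Xid (PCI:0000:00:08): 79, pid=1234, ..."
xid_summary() {
    # Keep only the numeric code that follows "Xid (<pci address>): "
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ print "xid=" $2 " count=" $1 }'
}

# Typical use on a GPU instance (dmesg may require root):
#   dmesg | xid_summary
```

Each output line pairs one Xid code with the number of times it was logged, which makes it easier to match the codes against the sections below.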

Troubleshooting Common Xid Errors

Different Xid errors indicate different issues. Based on whether the user can resolve the issue independently, common Xid errors and their corresponding recommendations are classified into two categories below. For a complete list of Xid errors, please refer to the NVIDIA Xid Documentation.

Attempting Self-Resolution

When encountering the following Xid events, you can attempt to resolve them using the recommended solutions below. If the issue persists, you may provide feedback through online support, where Tencent Cloud engineers are available 24/7 to assist you.

XID 48 Error

XID 48: Double Bit ECC Error
This error occurs when the GPU encounters an uncorrectable error. It is also reported to the user application. Typically, resetting the GPU or restarting the CVM instance is required to clear this error.
Recommendation: Restart the instance to recover. If the issue persists after a restart, please contact the platform for troubleshooting. If your workload cannot tolerate Xid 48 errors, you can request a GPU replacement directly.
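To see whether uncorrectable errors are still being recorded after a restart, the ECC counters can be read with `nvidia-smi`. This is a minimal sketch, assuming the installed driver accepts the `ecc.errors.uncorrected.volatile.total` query field (field availability varies by driver version and GPU model; `nvidia-smi --help-query-gpu` lists what is supported). `flag_uncorrected` is a hypothetical helper that filters the CSV output.

```shell
#!/bin/sh
# Print every GPU whose uncorrected ECC count is above zero.
# Input: CSV lines "<index>, <count>" on stdin. Non-numeric counts such as
# "[N/A]" (ECC unsupported or disabled) are skipped.
flag_uncorrected() {
    awk -F', *' '$2 ~ /^[0-9]+$/ && $2 > 0 {
        print "GPU " $1 ": " $2 " uncorrected ECC error(s)"
    }'
}

# Typical use on a GPU instance:
#   nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
#       --format=csv,noheader | flag_uncorrected
```

No output means no GPU reported a non-zero uncorrected count in the data that was read.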

XID 79 Error

XID 79: GPU has fallen off the bus
This error is generally caused by GPU driver or hardware issues. Users may observe that the GPU has detached from the instance (GPU loss).
Recommendation: Restart the instance to recover. If the issue persists after a restart, please contact the platform for troubleshooting.
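After the restart, a quick way to confirm that no GPU is still missing is to compare the number of devices the driver lists with the number the instance type should have. This is a minimal sketch, assuming `nvidia-smi -L` prints one `GPU <n>: ...` line per visible device; `check_gpu_count` is a hypothetical helper, and the expected count is a value you supply.

```shell
#!/bin/sh
# Compare the number of GPUs listed on stdin with an expected count.
# Usage: nvidia-smi -L | check_gpu_count <expected>
check_gpu_count() {
    expected=$1
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true"
    # keeps the function usable under "set -e".
    found=$(grep -c '^GPU [0-9][0-9]*:' || true)
    if [ "$found" -eq "$expected" ]; then
        echo "ok: $found of $expected GPUs visible"
    else
        echo "mismatch: $found of $expected GPUs visible"
        return 1
    fi
}
```

A mismatch after a restart suggests the GPU is still detached, which is the point at which the recommendation above says to contact the platform.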

XID 94 Error

XID 94: Contained ECC error
This indicates an uncorrectable ECC error that the GPU contained to the affected application. That application will stop; other applications on the GPU are generally not affected.
Recommendation: Restart the application to verify if the service returns to normal. If the application fails again, restart the instance. If the issue persists after a restart, please contact the platform for troubleshooting.

XID 95 Error

XID 95: Uncontained ECC error
This indicates an uncorrectable ECC error that the GPU could not contain to a single application. All applications using the GPU will stop.
Recommendation: Restart the instance to recover. If the issue persists after a restart, please contact the platform for troubleshooting.

XID 119 Error

XID 119: GSP RPC Timeout
This error is generally caused by the GPU driver triggering a bug in the GPU System Processor (GSP).
Recommendation:
1. Disable GSP. In newer generation instances, NVIDIA GPUs include GSP firmware functionality. GSP is designed to offload GPU initialization and other management tasks. You can follow the steps below to disable GSP: (For more details, please refer to Disabling GSP on the NVIDIA website.)
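# Disable GSP firmware through a module option
echo "options nvidia NVreg_EnableGpuFirmware=0" > /etc/modprobe.d/nvidia-gsp.conf
# Back up the current initramfs before regenerating it
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak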
If you are using CentOS, Tlinux, or Red Hat systems:
dracut -f --kver $(uname -r)
If you are using Ubuntu or Debian systems:
sudo update-initramfs -u
Restart the instance for the change to take effect.
Then check whether GSP was disabled: if the EnableGpuFirmware value in the output of the following command is 0, GSP has been disabled.
grep EnableGpuFirmware /proc/driver/nvidia/params
2. If you prefer not to disable GSP, you can attempt to resolve the issue by switching the driver version:
Upgrade: if you are using a 535-series driver, update to version 535.216.01 or later; if you are using a 550-series driver, update to version 550.144.03 or later. These versions fix the GSP issue that causes XID 119 errors.
Downgrade: alternatively, switch to the latest stable 470-series driver (470.223.02). This series does not enable GSP by default, which avoids XID 119 errors.
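The GSP check and the driver version thresholds above can be scripted. The sketch below reads the GSP setting from driver parameter text and tests whether a driver version carries the fix for its series. It assumes `/proc/driver/nvidia/params` uses `Name: value` lines and that `sort -V` (GNU coreutils) is available; `gsp_param`, `version_ge`, and `has_gsp_fix` are hypothetical helper names.

```shell
#!/bin/sh
# Print the EnableGpuFirmware value from driver params text on stdin.
# Matching the key exactly avoids similarly named keys that a plain grep
# for "EnableGpuFirmware" would also return.
gsp_param() {
    awk -F': *' '$1 == "EnableGpuFirmware" { print $2 }'
}

# Succeed if version $1 is greater than or equal to version $2.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed if the driver version meets the minimum named in this document
# for its series (535.216.01 or 550.144.03). Other series return 2.
has_gsp_fix() {
    case "$1" in
        535.*) version_ge "$1" 535.216.01 ;;
        550.*) version_ge "$1" 550.144.03 ;;
        *) return 2 ;;
    esac
}

# Typical use on a GPU instance:
#   gsp_param < /proc/driver/nvidia/params      # 0 means GSP is disabled
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   has_gsp_fix "$drv" && echo "driver $drv includes the fix"
```

A return code of 2 from `has_gsp_fix` only means the series is not one of the two covered here; it says nothing about whether that driver is affected.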

Errors Requiring Platform Support

When encountering the following Xid errors, we recommend reporting them directly via Online Support. Tencent Cloud engineers are available 24/7 to assist you.
Recommendation: Refer to Collecting Log for GPU Instances to gather the necessary logs, and then contact the platform for troubleshooting.

XID 74 Error

XID 74: NVLink ERROR
This error indicates that the GPU has detected an issue with the connection from one GPU to another GPU or through an NVSwitch via NVLink. The issue could stem from the GPU itself or an interconnected GPU card.

XID 92 Error

XID 92: High single-bit ECC error rate
This error indicates a high rate of single-bit ECC errors, which may be caused by a hardware or driver failure.
