Tencent Cloud Incident Report (tcop)
Problem Statement
On Wednesday, April 18, 2023, from 17:00 to 17:43 (UTC+8), a problem with the IP address of a database instance used by a production environment of the cloud monitoring platform caused parts of the cloud monitoring console to malfunction. We sincerely apologize for the disruption to the cloud monitoring service and for its impact on your experience.
Incident Background
In a production environment of the cloud monitoring platform, the migration identifier was mistakenly removed from a database instance before its migration had completed. Without that identifier, the instance could no longer be reached through its old IP address once a high-availability (HA) switch took place under high load. Services in the production environment that still connected through the old IP then experienced abnormal database connections.
What was the specific reason for the unsuccessful switch?
High load triggered a routine high-availability (HA) master-slave switch on the CDB instance. Because the instance had been mistakenly labelled as migrated although its migration was incomplete, the switch invalidated the previous virtual IP (VIP) and left only the new VIP reachable. Any traffic that had not yet moved to the new VIP could no longer reach the database, which caused connection failures and, in turn, service unavailability.
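The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions: the `Instance` class, the `marked_migrated` flag, the service names, and the IP addresses are hypothetical and do not reflect the actual CDB implementation. It assumes that an instance still flagged as migrating keeps both VIPs across an HA switch, while one flagged as migrated keeps only the new VIP.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Instance:
    old_vip: str
    new_vip: str
    marked_migrated: bool  # hypothetical flag: True means the old VIP is dropped on switch


def reachable_vips_after_switch(inst: Instance) -> Set[str]:
    """Return the VIPs that still route to the instance after an HA switch."""
    if inst.marked_migrated:
        # Assumption: a "migrated" instance no longer carries its old VIP.
        return {inst.new_vip}
    return {inst.old_vip, inst.new_vip}


def failed_clients(inst: Instance, client_targets: Dict[str, str]) -> List[str]:
    """List the clients whose configured VIP is unreachable after the switch."""
    reachable = reachable_vips_after_switch(inst)
    return sorted(name for name, vip in client_targets.items() if vip not in reachable)


# Hypothetical services: one still on the old VIP, one already moved.
clients = {"alarm-console": "10.0.0.1", "dashboard": "10.0.1.1"}
mislabelled = Instance(old_vip="10.0.0.1", new_vip="10.0.1.1", marked_migrated=True)
print(failed_clients(mislabelled, clients))  # ['alarm-console']
```

In this model, the same switch on an instance that is still flagged as migrating leaves both clients connected, which is why the premature label matters more than the switch itself.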
Was there any data loss?
We want to assure you that no data loss occurred during the incident.
What happened during the incident?
Impact
1. The alarm console became non-functional: alarm history could not be displayed and users could not perform regular console operations.
2. Alarm notification message content could not be retrieved, so alarm notifications were not sent as intended.
3. The Dashboard console became inaccessible: users could not view monitoring data and saw error messages indicating operation failures.
Next Steps and Action Plan
The following measures will be implemented to prevent a recurrence of the incident.
1. The migration status of every database instance that requires migration will be reviewed thoroughly. An instance will be marked as completed only when no access records remain for its old IP.
2. The migration will be accelerated so that all pending database instance migrations are completed within the first half of 2023.
3. Database usage will be standardized, monitoring will be enhanced, high-load instances will be expanded proactively, and slow queries will be optimized.
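The completion rule in the first measure (mark an instance as completed only when no access records are associated with the old IP) could be expressed as a simple gate. The following is a minimal sketch, not the actual tooling: the record format, the function name `can_mark_migrated`, the IP addresses, and the `quiet_period` window are all assumptions made for illustration. In particular, the report does not state how far back access records are checked.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Hypothetical record shape: (time of connection, IP the client connected to)
AccessRecord = Tuple[datetime, str]


def can_mark_migrated(
    records: Iterable[AccessRecord],
    old_ip: str,
    now: datetime,
    quiet_period: timedelta = timedelta(days=7),  # assumed look-back window
) -> bool:
    """Return True only if nothing has connected to old_ip within quiet_period."""
    cutoff = now - quiet_period
    return not any(ip == old_ip and ts >= cutoff for ts, ip in records)


now = datetime(2023, 5, 1, 12, 0)
records = [
    (now - timedelta(days=10), "10.0.0.1"),  # last use of the old IP, outside the window
    (now - timedelta(hours=2), "10.0.1.1"),  # recent traffic on the new IP
]
print(can_mark_migrated(records, "10.0.0.1", now))  # True

records.append((now - timedelta(hours=1), "10.0.0.1"))  # a client still uses the old IP
print(can_mark_migrated(records, "10.0.0.1", now))  # False
```

A gate of this kind makes the "completed" label depend on observed traffic rather than on a manual decision, which is the condition that was missing in this incident.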