Tencent Cloud Incident Report (EdgeOne)
Impact
During the incident, some nodes serving customers' Layer 7 services were unable to respond properly to customer requests, disrupting customers' regular business operations. We sincerely apologize for any inconvenience this may have caused. We have carried out a thorough technical review of the entire incident.
Issue Summary
| 17:25 Pacific Time | Acting on customer reports of the issue, Tencent Cloud engineers identified the cause of the abnormal Layer 7 service access: during observation of system call security policy distribution, a security interception policy had been erroneously dispatched to network production nodes. |
| 17:35 Pacific Time | Tencent Cloud engineers assessed the scope of the issue, compiled a list of affected services, removed the security interception policies that had been erroneously issued to the network production nodes, and then verified the restoration of business operations. |
| 17:50 Pacific Time | Following observation and verification, the first set of customer services began a gradual return to normal. |
| 18:08 Pacific Time | After further observation and verification, the majority of customer services had resumed normal access. Tencent Cloud engineers began assisting customers with switching back, systematically confirming the restoration status of businesses within the impacted scope, and verifying the business logic of the monitoring system. |
| 19:14 Pacific Time | Upon final observation and verification, all affected customer services had been restored to normal access, and the operational logic of the observation system had been verified as functioning properly. |
Root Cause
To ensure the smooth deployment of default security policies, the EO observation system periodically executes automated scripts on test nodes to validate the functionality and effectiveness of the security policy distribution interface. During one such execution, some network production nodes outside the intended scope were incorrectly identified and mistakenly included in the verification range. As a result, the monitoring system erroneously issued security interception policies to these network production nodes, which caused abnormal access to customers' Layer 7 services.
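For illustration only, the kind of scope check that guards against this failure mode can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of the EdgeOne implementation: the `Node` type, its `environment` label, the `ScopeError` exception, and the `check_verification_scope` function are all hypothetical names introduced for this example. The idea is that a verification run is rejected as a whole if any target is not both labeled as a test node and present on an explicit allowlist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    # Hypothetical node record; field names are assumptions for this sketch.
    node_id: str
    environment: str  # assumed labels: "test" or "production"


class ScopeError(Exception):
    """Raised when a verification run targets a node outside the test scope."""


def check_verification_scope(targets, allowed_test_ids):
    """Return the targets unchanged only if every one is an allowlisted test node.

    A node must pass two independent checks: its environment label must be
    "test", and its ID must appear in the explicit allowlist. Failing either
    check for any node rejects the whole run instead of silently skipping it.
    """
    out_of_scope = [
        node.node_id
        for node in targets
        if node.environment != "test" or node.node_id not in allowed_test_ids
    ]
    if out_of_scope:
        raise ScopeError(f"refusing to run; out-of-scope nodes: {out_of_scope}")
    return list(targets)
```

Requiring both the label and the allowlist entry means that a single misidentification, such as a production node that ends up in the target list, is not enough on its own to let a policy reach that node.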
Improvement Action Plans
After analysis, we will strengthen and improve the following aspects:
1. Short-term: Activate a circuit breaker mechanism that temporarily suspends the monitoring system's automated verification of the security policy issuance interface's effectiveness, and use manual verification instead.
-- Completed
2. Long-term: Optimize the logic for security policy effectiveness verification, strictly limit the scope of batch activation, and introduce verification and approval steps for high-risk security policies to prevent similar issues from recurring.
-- To be completed by 08/30/2023
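As a rough sketch of what the long-term action could look like in practice, the check below blocks a policy push that exceeds a batch-size cap or that is high-risk and lacks an approver. Everything here is an assumption made for illustration: the `PolicyPush` type, the `validate_push` function, the `approved_by` field, and the cap value of 5 are hypothetical and do not reflect the actual design or limits.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

MAX_BATCH_SIZE = 5  # hypothetical cap on nodes per push; not a real limit


@dataclass(frozen=True)
class PolicyPush:
    # Hypothetical request to issue one security policy to a set of nodes.
    policy_id: str
    high_risk: bool
    target_node_ids: Tuple[str, ...]
    approved_by: Optional[str] = None  # approver name; None means unapproved


def validate_push(push, max_batch_size=MAX_BATCH_SIZE):
    """Return the reasons the push must be blocked; an empty list means allowed."""
    problems = []
    if len(push.target_node_ids) > max_batch_size:
        problems.append(
            f"batch of {len(push.target_node_ids)} nodes exceeds cap of {max_batch_size}"
        )
    if push.high_risk and not push.approved_by:
        problems.append("high-risk policy requires an approver")
    return problems


# Usage: a high-risk interception policy aimed at two nodes, with no approval yet.
push = PolicyPush("intercept-policy", True, ("n1", "n2"))
blocked_reasons = validate_push(push)
```

Returning the full list of reasons, rather than stopping at the first failure, lets an operator see every condition that must be resolved before the push is retried.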
Tencent Cloud sincerely apologizes for any inconvenience caused by this incident. We will ensure that the lessons learned are applied and that all of the actions above are carried out, so that we continue to provide the best service to Tencent Cloud customers. Thank you, as always, for your support and trust in Tencent Cloud.
August 22, 2023