TSA Helps CDFG Improve Cloud Business Stability

Last updated: 2026-04-01 18:15:55

Background

Hainan Zhike, a subsidiary of China Duty Free Group (CDFG) and a leading enterprise in China's duty-free industry, has built its core systems on Tencent Cloud products such as Cloud Virtual Machine (CVM), Tencent Kubernetes Engine (TKE), and cloud databases. To cope with exponential business growth and rapidly increasing system complexity, Tencent Cloud and Zhike jointly launched the "Cloud Business Stability Leap Program". Through architecture optimization, risk governance, and technological innovation, they established a highly available, stable, and reliable cloud business system, setting a new benchmark for digital transformation in the industry.

Business Challenges and Optimization Paths

CDFG Hainan Zhike's core business systems run on Tencent Cloud CVM, Cloud Load Balancer (CLB), TKE, and databases. As the business expands and system complexity grows, the stability and failover capabilities of these online core systems have become critical.
The Tencent Cloud and Zhike teams analyzed the system's pain points and requirements in depth. Building on the existing foundation, they decided to improve three areas: the deployment architecture of the business on the live network, risk governance, and the failover capabilities of certain products. They then developed targeted optimization paths.

Pain Point and Challenge 1: Resources Are Scattered Across Multiple Products, and Manual Ops Coordination Demands Significant Effort

As CDFG Hainan Zhike adopts more cloud products, clear architecture diagrams and an accurate picture of resource distribution become critical. Architecture visualization is needed to integrate and display network resource information, align role information across all stages, and show cloud product availability, potential capacity hazards, and resource distribution at a glance. This makes Ops more efficient and reduces the manpower it requires.
Optimization paths:
Build a dynamic digital twin architecture diagram to visualize system architecture management.
Improve Ops resource troubleshooting and coordination at the architecture level by providing a unified, real-time view of resources, their utilization levels, and risk points.
Provide architecture-level intelligent assessment to make scanning for potential hazards more efficient.

Pain Point and Challenge 2: AZ-Level Failover and System Disaster Recovery Capabilities Need to Be Improved for Recovery Within 30 Seconds to 5 Minutes

A risk scan by Tencent Cloud Smart Advisor (TSA) identified six product lines that require enhanced disaster recovery capabilities (based on business and cost requirements). Improving cross-AZ disaster recovery strengthens their failover capabilities and makes the business more robust.
Optimization paths:
| Architecture Level | Product | Optimization Path |
| --- | --- | --- |
| Access layer | Public network CLB | Deploy instances across AZs to enhance business robustness. |
| Business layer | TKE | Add failover capabilities to enhance business robustness. |
| Middleware | CKafka | Deploy instances across AZs to enhance business robustness. |
| Data layer | MySQL | Deploy instances across AZs to enhance business robustness. |
| Data layer | CRS | Deploy instances across AZs to enhance business robustness. |
| Data layer | MongoDB | Deploy instances across AZs to enhance business robustness. |

Pain Point and Challenge 3: Major Promotional Events Require Assurance Capabilities, with Proactive Detection of Risks and Utilization Levels

In accordance with its business operations plan, CDFG Hainan Zhike needs to support key promotional events such as the "CNY 9.9 Flash Sale", "Spring Festival Holiday Mega Sales", and "Double 11 Mega Sales". Before each event, the company must identify and fix potential hazards in critical cloud resources so that sufficient resources are available during the event. During the event, real-time monitoring must detect and resolve risks promptly. The goal is to reduce risks on the live network by more than 90% and keep the events running smoothly.
Optimization paths:
Establish a "Three-Stage System for TSA-Infrastructure Event Management (TSA-IEM)" to ensure system stability during major business events:
Before the event: 100% risk governance through product assessment and resource elasticity prediction.
During the event: real-time node monitoring for business assurance, with ChatBI dashboards for flexible access to multidimensional core business metrics.
After the event: a TSA-IEM summary report that feeds continuous architecture governance based on the Well-Architected Framework.

Pain Point and Challenge 4: Zero Defects Are Required in Critical Business Data Recovery, But Zero RPO Capability of Database Products Is Difficult to Verify

For scenarios such as an unexpected network disconnection of an existing instance (instance isolation) or accidental loss of data in an existing database (business data loss), the recovery point objective (RPO) requirement is zero, so fault experiments and recovery drills must be conducted. Running these experiments on database-related businesses validates the zero-RPO objective and builds confidence in the product's fault recovery capabilities.
Optimization paths:
Through in-depth practice of Chaos Engineering, conduct experiments in database fault scenarios:
Simulate two scenarios: database network disconnection and database deletion.
Verify second-level data recovery, that is, recovery within seconds (zero RPO achieved).
Improve emergency plan effectiveness by 50%.

Solutions

Using TSA-Cloud Architecture for One-Stop Cloud Governance Based on Architecture Visualization

The architecture visualization feature of TSA-Cloud Architecture is used to create business architecture diagrams that digitalize CDFG Hainan Zhike's business system architecture and show the relationships between business sub-modules at a glance. Cloud resources are bound precisely to each node in the diagram, so the actual deployment status of each node is visible and the dependencies between cloud products are identified, which keeps resource management efficient. In addition, TSA-Cloud Risk Assessment performs one-click scanning and governance of availability hazards in the business architecture, providing support for subsequent chaos experiments.
Cloud architecture diagram: Tencent Cloud product resources and instances are bound to refine the architecture topology.
Architecture risk assessment view: High-risk nodes are identified in real time and the governance order is determined.
Automated assessment: Core business components in the cloud architecture are covered.
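
The idea of binding resources to architecture nodes can be illustrated with a minimal Python sketch: once each node lists its bound instances and their AZs, a simple pass can flag nodes that have no cross-AZ redundancy. The class names, instance IDs, zone names, and the single-AZ rule are all illustrative assumptions, not TSA's data model or assessment logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    instance_id: str  # hypothetical instance identifier
    product: str      # e.g. "CLB", "CKafka"
    zone: str         # hypothetical AZ name


@dataclass
class ArchNode:
    name: str
    resources: List[Resource] = field(default_factory=list)
    depends_on: List[str] = field(default_factory=list)


def single_az_nodes(nodes: List[ArchNode]) -> List[str]:
    """Return the names of nodes whose bound resources all sit in one AZ."""
    flagged = []
    for node in nodes:
        zones = {resource.zone for resource in node.resources}
        if node.resources and len(zones) < 2:
            flagged.append(node.name)
    return flagged


# Hypothetical architecture: the access layer spans two AZs, the middleware does not.
nodes = [
    ArchNode("access-layer", [
        Resource("clb-demo-1", "CLB", "zone-a"),
        Resource("clb-demo-2", "CLB", "zone-b"),
    ]),
    ArchNode("middleware", [
        Resource("ckafka-demo-1", "CKafka", "zone-a"),
    ], depends_on=["access-layer"]),
]

print(single_az_nodes(nodes))
```

In this sketch only the middleware node is flagged, because its single bound instance lives in one AZ.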

Stability Governance Improves Product Failover Capabilities Against AZ-Level Faults, Achieving Recovery Within 30 Seconds to 5 Minutes

Improve failover capabilities through three governance measures:
1. Implement multi-AZ high availability deployment for CLBs to enable intelligent traffic scheduling and ensure that the cross-AZ switch time is less than 30 seconds.
Under normal circumstances, traffic flows in from the primary AZ and is forwarded to the backend CVMs.
When the primary AZ becomes unavailable, traffic is automatically switched to the secondary AZ, and the primary-secondary switch takes approximately 30 seconds.
2. Optimize TKE container orchestration: migrate from self-built Rancher to a high-availability TKE deployment, so that workloads are automatically migrated in less than 2 minutes when a node fails.
Ensure that the nodes of the business container infrastructure are evenly distributed across two AZs.
Deploy core business workloads across AZs and nodes, evenly distribute Pods across AZs, and further spread them across nodes in an AZ.
3. Implement multi-instance high availability deployment for databases, three-node disaster recovery, and self-healing architecture reconstruction in less than 60 seconds.
VIP (virtual IP): Cross-AZ access is implemented with failover capabilities, achieving a switch within 60 seconds.
Three-node disaster recovery in two AZs: One primary node and one secondary node are deployed in Zone A, and one secondary node is deployed in Zone B. The primary node provides external services and synchronizes data to the secondary nodes via global transaction identifier (GTID) replication. When the primary node fails, the arbitration module re-elects a leader, prioritizing nodes in the same AZ. The module then attempts to restore the original primary node as a secondary node. If the restoration is unsuccessful, a new secondary node is automatically provisioned to restore the three-node architecture.
RO groups: They are used for read/write splitting. Different RO groups can be deployed across AZs.
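
The three-node failover flow described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the arbitration module's actual implementation: "same AZ" is read as the failed primary's AZ, the replacement secondary is placed in that same AZ, and all names (`DbNode`, `elect_new_primary`, `fail_over`, the node and zone names) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DbNode:
    name: str
    zone: str
    role: str            # "primary" or "secondary"
    healthy: bool = True


def elect_new_primary(nodes: List[DbNode], failed: DbNode) -> Optional[DbNode]:
    """Pick a healthy secondary, preferring one in the failed primary's AZ."""
    candidates = [n for n in nodes if n.role == "secondary" and n.healthy]
    if not candidates:
        return None
    # False sorts before True, so same-AZ candidates come first.
    candidates.sort(key=lambda n: n.zone != failed.zone)
    return candidates[0]


def fail_over(nodes: List[DbNode], restore_ok: bool) -> List[DbNode]:
    """Simulate a primary failure and return the resulting three-node cluster."""
    old_primary = next(n for n in nodes if n.role == "primary")
    old_primary.healthy = False

    new_primary = elect_new_primary(nodes, old_primary)
    if new_primary is None:
        raise RuntimeError("no healthy secondary available")
    new_primary.role = "primary"

    if restore_ok:
        # The original primary rejoins the cluster as a secondary.
        old_primary.role = "secondary"
        old_primary.healthy = True
        return nodes

    # Restoration failed: provision a replacement secondary
    # (placing it in the old primary's AZ is an assumption).
    replacement = DbNode(old_primary.name + "-replacement", old_primary.zone, "secondary")
    return [n for n in nodes if n is not old_primary] + [replacement]


cluster = [
    DbNode("db-1", "zone-a", "primary"),
    DbNode("db-2", "zone-a", "secondary"),
    DbNode("db-3", "zone-b", "secondary"),
]
for node in fail_over(cluster, restore_ok=False):
    print(node.name, node.zone, node.role)
```

In both branches the cluster ends with one primary and two secondaries, which is the self-healing property the text describes.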

Using TSA-Cloud Risk Assessment and TSA-IEM for 100% Risk Detection Ahead of Major Events and Real-Time Monitoring of Resource Utilization

TSA-Cloud Risk Assessment and TSA-IEM perform a comprehensive scan of cloud resources on the live network. Resources are assessed against the traffic volume expected during major events, and risks such as resource risks, deployment risks, and resource utilization risks are detected in advance. Before TSA-IEM is executed, assessment reports are output together with suggestions for optimizing and handling each risk, so that a major promotional event starts only once no risks are detected.
With TSA-Cloud Risk Assessment and TSA-IEM, the cloud resource health score of CDFG Hainan Zhike increased from 76% to 90%, enabling the company to get through the peak event period and meet its sales targets.

Using TSA-CFG to Conduct Fault Experiments on Databases and Validate the Zero RPO Capability of Databases

The Tencent Cloud expert team worked with the CDFG business Ops team to formulate a database instance fault experiment plan. The plan covers the most common database instance fault scenarios and combines them into experiments that verify the stability and reliability of TencentDB products, giving CDFG sufficient confidence to use these products.
Scenario 1
Configure a security group to simulate an unavailable MySQL instance, then re-create the instance and import backup data (instance-level recovery).
Scenario 2
The customer deletes business data and then restores it (data-level recovery).
Through these experiments, CDFG Hainan Zhike's business Ops team became familiar with the recovery procedures for common database faults that cause data loss. The team verified that, when data is lost due to an unexpected fault, the database instance restores data within seconds with 100% accuracy. This validates the stability and reliability of TencentDB products and provides sufficient confidence for future use.
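
One way to picture what "zero RPO" and "100% accuracy" mean in such an experiment is to compare the writes acknowledged before the fault with the data present after recovery. The following Python sketch is an illustration under assumptions, not the procedure TSA-CFG or TencentDB uses: the function names, the order keys, and the in-memory dictionaries standing in for database contents are all hypothetical.

```python
from typing import Dict


def lost_writes(acknowledged: Dict[str, str], recovered: Dict[str, str]) -> Dict[str, str]:
    """Return acknowledged writes that are missing or altered after recovery."""
    return {
        key: value
        for key, value in acknowledged.items()
        if key not in recovered or recovered[key] != value
    }


def recovery_accuracy(acknowledged: Dict[str, str], recovered: Dict[str, str]) -> float:
    """Fraction of acknowledged writes that survived recovery intact."""
    if not acknowledged:
        return 1.0
    return 1 - len(lost_writes(acknowledged, recovered)) / len(acknowledged)


# Hypothetical writes acknowledged to the application before the fault is injected.
acknowledged = {
    "order-1001": "paid",
    "order-1002": "shipped",
    "order-1003": "paid",
}

# Hypothetical state read back from the re-created instance after recovery.
recovered = dict(acknowledged)

missing = lost_writes(acknowledged, recovered)
print("zero RPO met:", not missing)
print("accuracy:", recovery_accuracy(acknowledged, recovered))
```

Zero RPO corresponds to an empty `lost_writes` result: every write that was acknowledged before the fault is present and unchanged after recovery.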

Customer Benefits

Through full-lifecycle governance of its cloud architecture and this business stability improvement practice, CDFG Hainan Zhike validated the availability of its core business systems and achieved the following main benefits:
1. More efficient cross-team asset collaboration: Cloud resource topologies are visualized to establish multidimensional mappings between businesses and resources. Intelligent associative search and drill-down positioning are supported, and cross-team asset collaboration management efficiency is improved by 40%.
2. More efficient risk handling and more stable business systems: The intelligent risk management system integrates automated assessment and real-time monitoring to accurately identify architecture vulnerabilities. Combined with an end-to-end governance solution, risk handling efficiency is improved by 50% and the business system stability metric is improved by 30%.
3. More reliable cloud services: Experiment validation with TSA-CFG strengthens the team's fault prediction and emergency response capabilities, improving cloud service reliability by 35%.
4. Reduced recovery time objective (RTO): With multi-AZ disaster recovery deployment, critical business systems achieve an RTO of 15 minutes or less in the event of an AZ-level fault.
5. Optimized emergency response system: With an enhanced knowledge base and contingency plan library for fault handling, the team achieves a 60% improvement in standardized emergency response speed and a 98% accuracy rate in handling major incidents.
System health score: increased from 76% to 90%
Risk occurrence rate during major promotional events: decreased by 92%
Data recovery accuracy rate: 100%
Annual Ops costs: decreased by 40%
This practice has established a new benchmark for cloud stability in the retail industry, laying a solid digital foundation for the group's global expansion.
