Background
In intra-city active-active or cross-region multi-active disaster recovery scenarios, disaster recovery resources are typically deployed in different availability zones (AZs) or regions from the primary resources, each belonging to different subnets. When a fault occurs at the AZ or region level, a disaster recovery switch can be initiated. To verify the effectiveness of your disaster recovery architecture, you can use the Virtual Private Cloud (VPC) subnet network isolation action provided by TSA-Chaotic Fault Generator (TSA-CFG) to block the subnet where the primary resources are located, simulating the inaccessibility of resources due to a fault.
Fault Effect
VPC subnet network isolation is a fault injection method that provides two injection modes: single-subnet isolation and all-subnet isolation.
Single Subnet
When the fault action parameter Isolation Scope is set to Single Subnet, inbound and outbound traffic of each selected subnet is blocked, including traffic between the selected subnets. Traffic within each subnet is not affected.
All Subnets
When the fault action parameter Isolation Scope is set to All Subnets, the inbound and outbound traffic of the selected subnets will be blocked, but access between the selected subnets will not be affected. You can simultaneously inject faults into the subnets belonging to the same AZ to simulate network isolation faults between this AZ and other AZs.
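The difference between the two modes can be expressed as a small reachability model. The following is a minimal sketch for reasoning about expected results only. It is not TSA-CFG code: the names `IsolationScope` and `is_blocked` and the subnet IDs are hypothetical, and the model assumes that Single Subnet mode also blocks traffic between the selected subnets, as described above.

```python
from enum import Enum
from typing import AbstractSet


class IsolationScope(Enum):
    # Values mirror the two options of the Isolation Scope parameter.
    SINGLE_SUBNET = "Single Subnet"
    ALL_SUBNETS = "All Subnets"


def is_blocked(src: str, dst: str, selected: AbstractSet[str],
               scope: IsolationScope) -> bool:
    """Return True if traffic from subnet `src` to subnet `dst` is expected to be blocked."""
    if src == dst:
        # Traffic inside a single subnet is not affected in either mode.
        return False
    src_selected = src in selected
    dst_selected = dst in selected
    if not (src_selected or dst_selected):
        # Neither endpoint belongs to an isolated subnet.
        return False
    if scope is IsolationScope.ALL_SUBNETS and src_selected and dst_selected:
        # The selected subnets are isolated as a group and still reach each other.
        return False
    return True
```

For example, with the hypothetical selection `{"subnet-a", "subnet-b"}`, `is_blocked("subnet-a", "subnet-b", ...)` is `True` in Single Subnet mode and `False` in All Subnets mode, while traffic from either subnet to an unselected `"subnet-c"` is blocked in both modes.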
Experiment Preparation
Create two subnets in the same VPC, and deploy private network Cloud Load Balancer (CLB), Cloud Virtual Machine (CVM), and MySQL resources in the subnets. Because instances in the same VPC can communicate with each other by default, resources in different subnets can access each other before fault injection. When the subnet of the primary AZ is blocked, instances in that subnet can still access each other normally, but access to those instances from outside the subnet will fail.
Note:
Subnet network isolation is achieved by setting network Access Control List (ACL) rules for the subnet, and any existing persistent connections will be immediately disconnected.
When network ACL rules exist on the target subnet, they will be temporarily unbound during fault injection and re-bound during recovery. Do not manually modify or delete network ACL rules during the experiment.
When Isolation Scope is set to All Subnets, you can add up to 20 subnets. To isolate more than 20 subnets, contact Tencent Cloud Assistant to submit a ticket.
This fault action does not block external Ping probes to CLBs within the subnet.
Experiment Execution
Step 1: Creating an Experiment
2. Click Create Experiment, enter the basic information about the experiment, and click Next.
3. Choose Network > Subnet from the Experiment Instance drop-down list, and click Add Instance.
After you click Add Instance, information about all subnets in the target region is displayed. You can filter the subnets by subnet ID, VPC instance ID, tag, or AZ key.
Note:
Subnet network isolation has a significant impact. Exercise caution when selecting the instance scope for fault injection.
4. After selecting the target subnet, click Add Immediately to add experiment actions.
Select Network Isolation as the experiment action, and then click Next to configure the Isolation Scope parameter. After the configuration is completed, click OK.
5. After completing the parameter configuration, set Execution Mode and Guardrail Policy, and add metrics under Observability Metrics in the Global Configuration section. Then click Submit to create the experiment.
Step 2: Executing the Experiment
In the experiment action group, click Execute to start the experiment. Because the experiment is manually controlled, the fault action must be executed manually.
Step 3: Verifying the Injection Effect
On the Network Topology page of the VPC console, you can see that network ACL rules have been added for the corresponding subnet in the network topology.
Test instance access. Instances inside the isolated subnets of the faulty AZ are expected to reach each other as before, while access to those instances from outside the isolated subnets is expected to fail.
CVM-to-CVM access
For instances in the same subnet, access is normal.
For instances in different subnets, access fails and the command hangs.
CVM-to-MySQL access
For instances in the same subnet, access is normal.
For instances in different subnets, access fails.
CVM-to-CLB access
Because CLB uses VPCGW for Ping detection, subnet network isolation does not block Ping to the CLB. Use telnet to probe the service port instead.
For instances in the same subnet, access is normal.
For instances in different subnets, access fails and the command hangs.
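The access checks in this step can also be scripted instead of being run by hand. The following is a minimal sketch of a telnet-style TCP port probe that uses only the Python standard library. The function name `tcp_port_open` and the default timeout are assumptions for illustration. A probe that times out corresponds to the hanging command observed when the traffic is blocked.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

Run from a CVM, a call such as `tcp_port_open("10.0.2.15", 3306)` (a hypothetical MySQL address in another subnet) is expected to return `False` during the fault, while the same probe against an instance in the same subnet is expected to return `True`.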
Step 4: Executing the Fault Recovery Action
Click Execute to run the recovery action, and wait until the action completes successfully.
Step 5: Verifying the Recovery Effect
Refer to Step 3 for verification. It is expected that access between instances in the same subnet and between subnets is normal.
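If you script the recovery check, it can help to retry for a short period rather than test once. The following is a minimal sketch of a polling helper. The name `wait_until` and the default timeout and interval are assumptions for illustration, and `check` stands for any reachability test you define yourself.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 60.0,
               interval: float = 2.0) -> bool:
    """Call `check` repeatedly until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            # Give up: the check never succeeded within the allowed time.
            return False
        time.sleep(interval)
```

A return value of `True` for a cross-subnet check indicates that access between subnets has been restored. A return value of `False` means the check never succeeded within the timeout and the recovery should be investigated.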