Self-Heal Rules

Last updated: 2023-05-05 11:05:32

    Overview

    Unstable infrastructure and unpredictable environments often cause system failures at different levels. To reduce the Ops workload, the Tencent Kubernetes Engine (TKE) team developed the self-heal feature for the Node-Problem-Detector-Plus add-on. The feature helps Ops engineers locate system exceptions and takes minimal self-heal actions for various check items based on preset, experience-based Ops rules. The self-heal feature has the following characteristics:
    The system detects, in real time, persistent faults that require human intervention.
    Detection covers dozens of check items across the operating system, the Kubernetes environment, and the runtime.
    The feature responds quickly to faults based on preset, experience-based rules, for example by executing a fix script or restarting an add-on.

    Check Items

    | Check Item | Description | Risk Level | Self-Heal Action |
    |---|---|---|---|
    | FDPressure | Too many open files. Checks whether the number of file descriptors on the server has reached 90% of the maximum value. | low | - |
    | RuntimeUnhealthy | Listing containerd tasks failed | low | RestartRuntime |
    | KubeletUnhealthy | Call to kubelet healthz failed | low | RestartKubelet |
    | ReadonlyFilesystem | Filesystem is read-only | high | - |
    | OOMKilling | Process has been OOM-killed | high | - |
    | TaskHung | Task blocked for longer than the threshold | high | - |
    | UnregisterNetDevice | Net device unregister | high | - |
    | KernelOopsDivideError | Kernel oops with divide error | high | - |
    | KernelOopsNULLPointer | Kernel oops with NULL pointer | high | - |
    | Ext4Error | Ext4 filesystem error | high | - |
    | Ext4Warning | Ext4 filesystem warning | high | - |
    | IOError | I/O error | high | - |
    | MemoryError | Memory error | high | - |
    | DockerHung | Task blocked for longer than the threshold | high | - |
    | KubeletRestart | Kubelet restart | low | - |
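This page does not say how the add-on measures file descriptor usage. As a rough illustration of the FDPressure check, the sketch below assumes the Linux system-wide counters in /proc/sys/fs/file-nr and applies the 90% threshold from the Check Items section. The function names and the choice of data source are assumptions made for this example, not the add-on's actual implementation.

```python
from pathlib import Path
from typing import Tuple

# Threshold from the Check Items section: 90% of the maximum value.
FD_PRESSURE_PERCENT = 90


def parse_file_nr(text: str) -> Tuple[int, int]:
    """Parse /proc/sys/fs/file-nr content.

    The file holds three numbers: allocated file handles, allocated but
    unused file handles, and the system-wide maximum.
    """
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fd_pressure(allocated: int, maximum: int, percent: int = FD_PRESSURE_PERCENT) -> bool:
    """Return True when usage has reached `percent` of the maximum.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return allocated * 100 >= maximum * percent


if __name__ == "__main__":
    # Assumption: running on a Linux node where this file exists.
    allocated, maximum = parse_file_nr(Path("/proc/sys/fs/file-nr").read_text())
    print(f"allocated={allocated} max={maximum} FDPressure={fd_pressure(allocated, maximum)}")
```

Running the script on a node prints the current counters and whether the 90% threshold has been reached.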

    Enabling the Self-Heal Feature for Nodes

    Enabling the feature in the TKE console

    1. Log in to the TKE console and select Cluster in the left sidebar.
    2. On the cluster list page, click the ID of the target cluster to go to the details page.
    3. Choose Node management > Fault self-heal rule in the left sidebar to go to the Fault self-heal rule list page.
    4. Click Create rule to create a new self-heal rule. See the figure below:
    
    5. Return to the node pool list page.
    6. Click the ID of the target node pool to go to the details page of the node pool.
    7. In the Ops information section of the details page, click Edit to enable the self-heal feature for the node pool.
    8. View the details of real-time fault detection in the Ops records section. A status of Failed means that the check item did not pass.

    Enabling the feature by using YAML

    1. Create self-heal rules.

    Save the following YAML configuration as demo-HealthCheckPolicy.yaml and run the kubectl create -f demo-HealthCheckPolicy.yaml command to create self-heal rules for the cluster:
    apiVersion: config.tke.cloud.tencent.com/v1
    kind: HealthCheckPolicy
    metadata:
      name: test-all
      namespace: cls-xxxxxxxx  # the ID of the cluster
    spec:
      machineSetSelector:
        matchLabels:
          key: fake-label
      rules:
      - action: RestartKubelet
        enabled: true
        name: FDPressure
      - action: RestartKubelet
        autoRepairEnabled: true
        enabled: true
        name: RuntimeUnhealthy
      - action: RestartKubelet
        autoRepairEnabled: true
        enabled: true
        name: KubeletUnhealthy
      - action: RestartKubelet
        enabled: true
        name: ReadonlyFilesystem
      - action: RestartKubelet
        enabled: true
        name: OOMKilling
      - action: RestartKubelet
        enabled: true
        name: TaskHung
      - action: RestartKubelet
        enabled: true
        name: UnregisterNetDevice
      - action: RestartKubelet
        enabled: true
        name: KernelOopsDivideError
      - action: RestartKubelet
        enabled: true
        name: KernelOopsNULLPointer
      - action: RestartKubelet
        enabled: true
        name: Ext4Error
      - action: RestartKubelet
        enabled: true
        name: Ext4Warning
      - action: RestartKubelet
        enabled: true
        name: IOError
      - action: RestartKubelet
        enabled: true
        name: MemoryError
      - action: RestartKubelet
        enabled: true
        name: DockerHung
      - action: RestartKubelet
        enabled: true
        name: KubeletRestart
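The rules list in this manifest repeats the same three fields for every check item, so it can be generated rather than typed by hand. The sketch below is one possible way to do that. It reproduces the field values of the sample manifest, including autoRepairEnabled on only the two rules that set it, and emits JSON, which kubectl create -f also accepts. The function name build_policy and the output file name are hypothetical.

```python
import json

# Check item names, in the order used by the sample manifest.
CHECK_ITEMS = [
    "FDPressure", "RuntimeUnhealthy", "KubeletUnhealthy", "ReadonlyFilesystem",
    "OOMKilling", "TaskHung", "UnregisterNetDevice", "KernelOopsDivideError",
    "KernelOopsNULLPointer", "Ext4Error", "Ext4Warning", "IOError",
    "MemoryError", "DockerHung", "KubeletRestart",
]

# Rules for which the sample manifest sets autoRepairEnabled: true.
AUTO_REPAIR = {"RuntimeUnhealthy", "KubeletUnhealthy"}


def build_policy(name: str, cluster_id: str) -> dict:
    """Build a HealthCheckPolicy object mirroring the sample manifest."""
    rules = []
    for item in CHECK_ITEMS:
        # The sample manifest uses RestartKubelet as the action on every rule.
        rule = {"action": "RestartKubelet", "enabled": True, "name": item}
        if item in AUTO_REPAIR:
            rule["autoRepairEnabled"] = True
        rules.append(rule)
    return {
        "apiVersion": "config.tke.cloud.tencent.com/v1",
        "kind": "HealthCheckPolicy",
        # The namespace is the ID of the cluster.
        "metadata": {"name": name, "namespace": cluster_id},
        "spec": {
            "machineSetSelector": {"matchLabels": {"key": "fake-label"}},
            "rules": rules,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_policy("test-all", "cls-xxxxxxxx"), indent=2))
```

Redirecting the output to a file such as demo-HealthCheckPolicy.json and passing that file to kubectl create -f should create the same object as the YAML version, assuming the cluster ID placeholder is replaced with a real one.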
    

    2. Enable the self-heal feature.

    In the MachineSet YAML configuration file, set the healthCheckPolicyName field under spec to test-all, the name of the policy created in step 1:
    apiVersion: node.tke.cloud.tencent.com/v1beta1
    kind: MachineSet
    spec:
      type: Hosted
      displayName: demo-machineset
      replicas: 2
      autoRepair: true
      deletePolicy: Random
      healthCheckPolicyName: test-all
      instanceTypes:
      - C3.LARGE8
      subnetIDs:
      - subnet-xxxxxxxx
      - subnet-yyyyyyyy
      ......
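Because the MachineSet refers to the policy only by name, a typo in healthCheckPolicyName is easy to miss. The sketch below shows one way to check the reference before applying the manifests. It works on already-parsed objects (plain dictionaries) and is only an illustration: check_reference is a hypothetical helper, and how the cluster itself reacts to an unknown policy name is not described on this page.

```python
from typing import List


def check_reference(machine_set: dict, policies: List[dict]) -> List[str]:
    """Return a list of problems; an empty list means the reference resolves."""
    wanted = machine_set.get("spec", {}).get("healthCheckPolicyName")
    if not wanted:
        return ["spec.healthCheckPolicyName is not set"]
    # Collect the names of all known HealthCheckPolicy objects.
    known = {p.get("metadata", {}).get("name") for p in policies}
    if wanted not in known:
        return [f"no HealthCheckPolicy named {wanted!r}"]
    return []


if __name__ == "__main__":
    machine_set = {
        "kind": "MachineSet",
        "spec": {"displayName": "demo-machineset", "healthCheckPolicyName": "test-all"},
    }
    policies = [{"kind": "HealthCheckPolicy", "metadata": {"name": "test-all"}}]
    problems = check_reference(machine_set, policies)
    print("OK" if not problems else "\n".join(problems))
```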
    
    