
Fault Self-Healing Rules

Last updated: 2023-05-05 11:15:36

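Overview
Unstable infrastructure and unpredictable environments often cause system faults at different levels. To free staff from heavy operations work, the Tencent Cloud TKE team developed the fault self-healing feature in-house. It helps operations engineers locate problems quickly and, drawing on preset platform operations experience, provides a minimal self-healing action for each detection item. The capability extends the NPD Plus add-on and offers the following features:
- The system detects, in real time, persistent faults that require human intervention to resolve.
- Detection covers dozens of items across the operating system, the Kubernetes environment, the runtime, and more.
- Preset expert experience (running repair scripts, restarting components) enables a rapid response to faults.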
Detection Items
Detection item	Description	Risk level	Self-healing action
FDPressure	Too many files opened (checks whether the host's file descriptor count has reached 90% of the maximum)	low	-
RuntimeUnhealthy	List containerd task failed	low	RestartRuntime
KubeletUnhealthy	Call kubelet healthz failed	low	RestartKubelet
ReadonlyFilesystem	Filesystem is readonly	high	-
OOMKilling	Process has been oom-killed	high	-
TaskHung	Task blocked for longer than the threshold	high	-
UnregisterNetDevice	Net device unregister	high	-
KernelOopsDivideError	Kernel oops with divide error	high	-
KernelOopsNULLPointer	Kernel oops with NULL pointer	high	-
Ext4Error	Ext4 filesystem error	high	-
Ext4Warning	Ext4 filesystem warning	high	-
IOError	IOError	high	-
MemoryError	MemoryError	high	-
DockerHung	Task blocked for longer than the threshold	high	-
KubeletRestart	Kubelet restart	low	-
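To illustrate the FDPressure condition in the table, here is a minimal Python sketch of a 90% threshold check. It assumes a Linux host where /proc/sys/fs/file-nr reports the allocated and maximum file handle counts. The function names are hypothetical, and this is not the NPD Plus implementation, whose actual data source and comparison are not described here.

```python
from typing import Tuple


def parse_file_nr(text: str) -> Tuple[int, int]:
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three integers: allocated handles, allocated but
    unused handles, and the system-wide maximum.
    Returns (allocated, maximum).
    """
    fields = text.split()
    return int(fields[0]), int(fields[2])


def fd_pressure(allocated: int, maximum: int, threshold: float = 0.9) -> bool:
    """Return True when usage has reached the threshold share of the maximum.

    Assumption: "reached" is read as >=; the real check may differ.
    """
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return allocated >= threshold * maximum
```

On a Linux node, passing the text of /proc/sys/fs/file-nr to parse_file_nr and the resulting pair to fd_pressure gives a rough local approximation of the check.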
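Enabling Fault Self-Healing for Nodes
Via the Console
1. Log in to the TKE console and select Cluster in the left sidebar.
2. On the cluster list page, click the cluster ID to open the cluster details page.
3. In the left menu, choose Node Management > Fault Self-Healing Rules to open the Fault Self-Healing Rules page.
4. Click Create Fault Self-Healing Rule to create a new rule.
5. After the rule is created, return to the node pool list page.
6. Click the node pool ID to open the node pool details page.
7. In the "Operations Information" module of the node pool details page, click Edit to enable fault self-healing for the node pool.
8. Once it is enabled, you can view real-time fault detection details under "Operations Records". A status of "Failed" means the detection item did not pass.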
Via YAML
1. Create a fault self-healing rule
Run kubectl create -f demo-HealthCheckPolicy.yaml to create the self-healing rule in the cluster. The YAML configuration is as follows:
apiVersion: config.tke.cloud.tencent.com/v1
kind: HealthCheckPolicy
metadata:
  name: test-all
  namespace: cls-xxxxxxxx  # cluster ID
spec:
  machineSetSelector:
    matchLabels:
      key: fake-label
  rules:
  - action: RestartKubelet
    enabled: true
    name: FDPressure
  - action: RestartKubelet
    autoRepairEnabled: true
    enabled: true
    name: RuntimeUnhealthy
  - action: RestartKubelet
    autoRepairEnabled: true
    enabled: true
    name: KubeletUnhealthy
  - action: RestartKubelet
    enabled: true
    name: ReadonlyFilesystem
  - action: RestartKubelet
    enabled: true
    name: OOMKilling
  - action: RestartKubelet
    enabled: true
    name: TaskHung
  - action: RestartKubelet
    enabled: true
    name: UnregisterNetDevice
  - action: RestartKubelet
    enabled: true
    name: KernelOopsDivideError
  - action: RestartKubelet
    enabled: true
    name: KernelOopsNULLPointer
  - action: RestartKubelet
    enabled: true
    name: Ext4Error
  - action: RestartKubelet
    enabled: true
    name: Ext4Warning
  - action: RestartKubelet
    enabled: true
    name: IOError
  - action: RestartKubelet
    enabled: true
    name: MemoryError
  - action: RestartKubelet
    enabled: true
    name: DockerHung
  - action: RestartKubelet
    enabled: true
    name: KubeletRestart
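The manifest above is repetitive, so it can be convenient to generate it. The following Python sketch builds an equivalent HealthCheckPolicy as a dictionary and serializes it to JSON, which kubectl accepts as well as YAML. The field names are taken from the YAML above. The helper build_policy and its parameters are hypothetical, and the sketch uses RestartKubelet as the action for every rule only because the sample manifest does.

```python
import json
from typing import Dict, Iterable, List

# Detection items, in the order used by the sample manifest.
CHECKS: List[str] = [
    "FDPressure", "RuntimeUnhealthy", "KubeletUnhealthy", "ReadonlyFilesystem",
    "OOMKilling", "TaskHung", "UnregisterNetDevice", "KernelOopsDivideError",
    "KernelOopsNULLPointer", "Ext4Error", "Ext4Warning", "IOError",
    "MemoryError", "DockerHung", "KubeletRestart",
]


def build_policy(name: str, cluster_id: str, labels: Dict[str, str],
                 auto_repair: Iterable[str] = ()) -> dict:
    """Build a HealthCheckPolicy dict mirroring the sample manifest.

    Hypothetical helper: rules named in auto_repair get
    autoRepairEnabled: true, as in the sample.
    """
    repair = set(auto_repair)
    rules = []
    for check in CHECKS:
        rule = {"action": "RestartKubelet", "enabled": True, "name": check}
        if check in repair:
            rule["autoRepairEnabled"] = True
        rules.append(rule)
    return {
        "apiVersion": "config.tke.cloud.tencent.com/v1",
        "kind": "HealthCheckPolicy",
        "metadata": {"name": name, "namespace": cluster_id},
        "spec": {"machineSetSelector": {"matchLabels": labels}, "rules": rules},
    }


# Placeholder cluster ID and label, copied from the sample manifest.
policy = build_policy("test-all", "cls-xxxxxxxx", {"key": "fake-label"},
                      auto_repair=["RuntimeUnhealthy", "KubeletUnhealthy"])
manifest = json.dumps(policy, indent=2)
```

Writing manifest to a .json file and passing that file to kubectl create -f should have the same effect as applying the YAML version.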
2. Enable the self-healing switch
In the MachineSet, set the field healthCheckPolicyName: test-all. The YAML configuration is as follows:
apiVersion: node.tke.cloud.tencent.com/v1beta1
kind: MachineSet
spec:
  type: Hosted
  displayName: demo-machineset
  replicas: 2
  autoRepair: true
  deletePolicy: Random
  healthCheckPolicyName: test-all
  instanceTypes:
  - C3.LARGE8
  subnetIDs:
  - subnet-xxxxxxxx
  - subnet-yyyyyyyy
......
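Because the two objects are linked only by name, a typo in healthCheckPolicyName is easy to miss. The following Python sketch checks a MachineSet dictionary against a set of known policy names before it is applied. The function check_machineset is hypothetical. The field names come from the manifest above, and treating autoRepair: true as required alongside the policy name is an assumption based on the sample.

```python
from typing import Iterable, List


def check_machineset(machineset: dict, policy_names: Iterable[str]) -> List[str]:
    """Return a list of problems found in the MachineSet's self-healing settings."""
    spec = machineset.get("spec", {})
    problems = []
    policy = spec.get("healthCheckPolicyName")
    if not policy:
        problems.append("spec.healthCheckPolicyName is not set")
    elif policy not in set(policy_names):
        problems.append(f"policy {policy!r} does not exist")
    # Assumption: the sample sets autoRepair to true together with the policy name.
    if spec.get("autoRepair") is not True:
        problems.append("spec.autoRepair is not true")
    return problems


# A trimmed-down MachineSet containing only the fields the check reads.
machineset = {
    "apiVersion": "node.tke.cloud.tencent.com/v1beta1",
    "kind": "MachineSet",
    "spec": {"autoRepair": True, "healthCheckPolicyName": "test-all"},
}
```

An empty list from check_machineset(machineset, ["test-all"]) means the reference resolves and the switch is on.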
