Out of Memory (OOM) occurs when programs need more memory than the maximum available, either because memory cannot be reclaimed or because the application uses too much of it. When a cgroup runs short of memory, the Linux kernel triggers a cgroup OOM and kills some processes to reclaim memory and keep the system running. Because the Linux kernel (especially earlier versions such as v3.10) has many bugs in its cgroup OOM handling, frequent cgroup OOMs may result in node failures (crashes, restarts, and abnormal processes that cannot be killed).
OOM-Guard is an add-on provided by TKE for handling container cgroup OOM in user mode. When a cgroup OOM is about to occur, OOM-Guard kills the container that exceeds its memory limit in user space, before the kernel kills the container process. This reduces the chance of node failures triggered by memory reclaim failures in kernel mode.
Before the OOM threshold is triggered, OOM-Guard writes to memory.force_empty to trigger memory reclaim for the relevant cgroup. If memory.stat still shows a large amount of cache data when the threshold is reached, no subsequent processing policy is triggered. After a container is killed due to cgroup OOM, the add-on reports the OomGuardKillContainer event to Kubernetes. You can query the event by running the kubectl get event command.
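As a rough illustration of querying these events from a script, the sketch below shells out to kubectl and filters its JSON output. It assumes that the name OomGuardKillContainer appears in the event's reason field, which this page does not spell out. The helper names are hypothetical.

```python
import json
import subprocess

# Assumption: the event name is carried in the Event object's "reason" field.
EVENT_REASON = "OomGuardKillContainer"


def filter_oom_guard_events(event_list: dict, reason: str = EVENT_REASON) -> list:
    """Return the items of a `kubectl get events -o json` document that match reason."""
    return [e for e in event_list.get("items", []) if e.get("reason") == reason]


def fetch_oom_guard_events() -> list:
    """Query events in all namespaces; kubectl must be installed and configured."""
    out = subprocess.run(
        ["kubectl", "get", "events", "--all-namespaces", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return filter_oom_guard_events(json.loads(out))
```

Filtering on the client side keeps the sketch independent of which server-side selectors a given cluster version supports.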
The core idea is to kill containers that exceed their memory limit in user space before the kernel kills the container processes due to cgroup OOM. This reduces the chance of hitting kernel errors in the code paths that handle cgroup memory reclaim failures.
OOM-Guard uses the "threshold notify" mechanism of the memory cgroup to receive notifications from the kernel. For more information, see threshold notify.
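For readers unfamiliar with the mechanism, here is a minimal sketch of how a user-space program can register a memory threshold with cgroup v1. It shows the generic kernel interface (an eventfd registered through cgroup.event_control), not OOM-Guard's actual source. The function names and the cgroup_dir parameter are hypothetical, and os.eventfd requires Linux and Python 3.10 or later.

```python
import os


def event_control_line(event_fd: int, usage_fd: int, threshold_bytes: int) -> str:
    """Build the registration string expected by cgroup v1's cgroup.event_control."""
    return f"{event_fd} {usage_fd} {threshold_bytes}"


def wait_for_threshold(cgroup_dir: str, threshold_bytes: int) -> None:
    """Block until the cgroup's memory usage crosses threshold_bytes (cgroup v1 only)."""
    event_fd = os.eventfd(0)  # Linux only, Python 3.10+
    usage_fd = os.open(os.path.join(cgroup_dir, "memory.usage_in_bytes"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(event_control_line(event_fd, usage_fd, threshold_bytes))
        os.eventfd_read(event_fd)  # returns once the kernel signals the eventfd
    finally:
        os.close(usage_fd)
        os.close(event_fd)
```

The kernel signals the eventfd when usage crosses the threshold in either direction, so a real handler would re-read memory.usage_in_bytes before acting.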
For example, if the memory limit set for a pod is 1000M, OOM-Guard calculates the margin from its configuration parameters:
margin = 1000M * margin_ratio = 20M // the default value of margin_ratio is 0.02
In addition, the margin has a minimum value of min_margin (1M) and a maximum value of max_margin (50M). If the calculated margin falls outside this range, min_margin or max_margin is applied instead.
Calculate the threshold:
threshold = limit - margin // i.e. 1000M - 20M = 980M
980M is the threshold registered with the kernel. When the memory used by the pod reaches 980M, OOM-Guard receives a notification from the kernel.
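The calculation above can be sketched as follows. The parameter names margin_ratio, min_margin, and max_margin and their defaults come from the text. The function name, the use of bytes, and the reading of "M" as MiB are assumptions made for illustration.

```python
MIB = 1024 * 1024  # assumption: "M" in the text means MiB


def compute_threshold(limit: int, margin_ratio: float = 0.02,
                      min_margin: int = 1 * MIB, max_margin: int = 50 * MIB) -> int:
    """Return the notification threshold in bytes for a memory limit in bytes."""
    margin = round(limit * margin_ratio)
    # Clamp the margin to the range [min_margin, max_margin].
    margin = max(min_margin, min(margin, max_margin))
    return limit - margin


print(compute_threshold(1000 * MIB) // MIB)  # 980
```

With these defaults, a 10M limit yields a 9M threshold (the margin is raised to min_margin), and a 10000M limit yields 9950M (the margin is capped at max_margin).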
Before the threshold is triggered, OOM-Guard writes to memory.force_empty to trigger memory reclaim for the relevant cgroup. In addition, if the threshold is triggered and memory.stat of the relevant cgroup still shows a large amount of cache data, the subsequent processing policies are not triggered. In that case, the kernel still triggers a cgroup OOM when the cgroup's memory reaches the limit.
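The reclaim-and-check step can be sketched as follows, assuming cgroup v1 file layouts. The text does not say how much cache counts as "a large amount", so the cache_ratio cutoff below is a purely hypothetical placeholder, as are the function names.

```python
import os


def parse_memory_stat(text: str) -> dict:
    """Parse the "key value" lines of a cgroup v1 memory.stat file."""
    stat = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stat[key] = int(value)
    return stat


def cache_is_large(stat: dict, usage_bytes: int, cache_ratio: float = 0.5) -> bool:
    """Hypothetical check: does page cache make up at least cache_ratio of usage?"""
    return usage_bytes > 0 and stat.get("cache", 0) >= usage_bytes * cache_ratio


def force_empty(cgroup_dir: str) -> None:
    """Ask the kernel to reclaim as many pages as possible from the cgroup (cgroup v1)."""
    with open(os.path.join(cgroup_dir, "memory.force_empty"), "w") as f:
        f.write("0")
```

A caller would run force_empty first, then read memory.stat and skip the kill policy when cache_is_large returns True.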
You can control the processing policy by setting the --policy parameter. The following three policies are currently available. The default policy is "container".
|Policy||Description|
|process||Uses the same policy as the kernel's cgroup OOM killer: OOM-Guard selects the process with the highest oom_score inside the cgroup and kills it by sending SIGKILL.|
|container||Selects a Docker container under this cgroup and kills the whole container.|
|noop||Only writes logs and takes no action.|
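To make the "process" policy concrete, here is a minimal sketch of selecting and killing the process with the highest oom_score in a cgroup. It relies on the standard cgroup.procs and /proc/<pid>/oom_score files. The function names are hypothetical and this is not OOM-Guard's actual implementation.

```python
import os
import signal
from typing import Optional


def read_oom_scores(cgroup_dir: str) -> dict:
    """Map each PID listed in cgroup.procs to its /proc/<pid>/oom_score."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = [int(line) for line in f if line.strip()]
    scores = {}
    for pid in pids:
        try:
            with open(f"/proc/{pid}/oom_score") as f:
                scores[pid] = int(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # the process exited in the meantime
    return scores


def pick_victim(scores: dict) -> Optional[int]:
    """Return the PID with the highest oom_score, or None if there are no processes."""
    return max(scores, key=scores.get) if scores else None


def kill_highest_oom_score(cgroup_dir: str) -> Optional[int]:
    """Send SIGKILL to the selected process and return its PID, if any."""
    victim = pick_victim(read_oom_scores(cgroup_dir))
    if victim is not None:
        os.kill(victim, signal.SIGKILL)
    return victim
```

Keeping pick_victim free of file access makes the selection rule easy to check in isolation, for example pick_victim({10: 5, 11: 900}) returns 11.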
|Kubernetes Object||Type||Required Resources||Namespaces|
|oom-guard||DaemonSet||0.02-core CPU, 120 MB memory||kube-system|
This add-on is suitable for Kubernetes clusters where node memory pressure is high and OOMs in business containers often cause node failures.