Overview

Last updated: 2024-04-24 15:55:36

    Component Overview

    Kubernetes' scheduling logic operates on Pod Requests: a node's schedulable resources are reserved by the Request amounts of the Pods on it and cannot be reclaimed, even when actual usage is low. The native node dedicated scheduler is a scheduling plugin developed by Tencent Kubernetes Engine (TKE) based on Kubernetes' native Kube-scheduler Extender mechanism. It can virtually amplify a node's capacity, resolving the issue of node resources being fully requested while actual utilization remains low.
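    As background, the Kube-scheduler Extender mechanism lets an external HTTP service take part in filtering and scoring nodes. Below is a minimal sketch of an extender policy in the standard legacy Policy format; the service endpoint and verbs shown are illustrative assumptions, not the component's actual configuration.

    apiVersion: v1
    kind: Policy
    extenders:
    - urlPrefix: "http://crane-scheduler.kube-system.svc:8080/scheduler"  # illustrative endpoint
      filterVerb: "filter"          # called to filter out nodes above the watermark
      prioritizeVerb: "prioritize"  # called to score the remaining nodes
      weight: 1
      enableHttps: false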

    Kubernetes objects deployed in a cluster

    | Kubernetes Object Name | Type | Requested Resources | Namespace |
    |---|---|---|---|
    | crane-scheduler-controller | Deployment | 200m CPU and 200Mi memory per instance, 1 instance in total | kube-system |
    | crane-descheduler | Deployment | 200m CPU and 200Mi memory per instance, 1 instance in total | kube-system |
    | crane-scheduler | Deployment | 200m CPU and 200Mi memory per instance, 3 instances in total | kube-system |
    | crane-scheduler-controller | Service | - | kube-system |
    | crane-scheduler | Service | - | kube-system |
    | crane-scheduler | ClusterRole | - | kube-system |
    | crane-descheduler | ClusterRole | - | kube-system |
    | crane-scheduler | ClusterRoleBinding | - | kube-system |
    | crane-descheduler | ClusterRoleBinding | - | kube-system |
    | crane-scheduler-policy | ConfigMap | - | kube-system |
    | crane-descheduler-policy | ConfigMap | - | kube-system |
    | ClusterNodeResourcePolicy | CRD | - | - |
    | CraneSchedulerConfiguration | CRD | - | - |
    | NodeResourcePolicy | CRD | - | - |
    | crane-scheduler-controller-mutating-webhook | MutatingWebhookConfiguration | - | - |

    Application Scenarios

    Scenario 1: Resolving the issue of high node box rate but low utilization

    Note:
    The fundamental concepts are as follows.
    Box rate: the ratio of the sum of the Requests of all Pods on a node to the node's actual specifications.
    Utilization: the ratio of the total actual usage of all Pods on a node to the node's actual specifications.
    The native Kubernetes scheduler schedules based on Pod Requests. Even if actual usage on a node is low, no new Pods can be scheduled once the sum of the Requests of all Pods on the node approaches the node's actual specifications, resulting in substantial resource waste. Moreover, businesses tend to request surplus resources (a large Request) to ensure service stability, so node resources are occupied and cannot be reclaimed. The node's box rate ends up high while its actual resource utilization stays low.
    In this case, you can use the native node dedicated scheduler to virtually amplify a node's CPU and memory specifications, enlarging its schedulable resources so that more Pods can be scheduled onto it.
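    For intuition: a 4-core node whose Pods request 3.8 cores in total while actually using only 1 core has a 95% box rate but 25% utilization; amplifying its CPU specification by 1.5x lets the scheduler treat it as a 6-core node and place further Pods. The sketch below only illustrates the idea of a per-node amplification policy; the NodeResourcePolicy CRD itself is listed in the component table above, but the fields shown are assumptions, not its actual schema.

    # Hypothetical sketch: the field names below are assumptions, not the
    # actual NodeResourcePolicy schema shipped with the component.
    apiVersion: scheduling.crane.io/v1alpha1
    kind: NodeResourcePolicy
    metadata:
      name: amplify-example
    spec:
      nodeSelector:
        kubernetes.io/hostname: 10.0.0.1   # target native node (illustrative)
      amplificationRatios:
        cpu: "1.5"      # schedule as if the node had 1.5x its real CPU
        memory: "1.2"   # schedule as if the node had 1.2x its real memory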

    Scenario 2: Setting node watermarks

    A node's watermark is its target utilization, set to safeguard node stability:
    Watermark control at scheduling time: determines the target resource utilization of native nodes at scheduling time. When Pods are scheduled, nodes whose utilization already exceeds this watermark are filtered out; among the nodes that satisfy the watermark, those with lower actual load are preferred, balancing the utilization distribution across cluster nodes.
    Watermark control at runtime: determines the target resource utilization of native nodes at runtime. Nodes whose utilization exceeds this watermark at runtime can trigger Pod eviction. Since eviction is a high-risk action, bear in mind the notes below.
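    The scheduling-time watermark is held in the crane-scheduler-policy ConfigMap listed above. The sketch below is modeled on the open-source Crane scheduler's dynamic scheduling policy format; the metric names and thresholds are illustrative assumptions, not the component's shipped defaults.

    # Sketch modeled on the open-source Crane scheduler's policy format;
    # metric names and thresholds are illustrative assumptions.
    apiVersion: scheduler.policy.crane.io/v1alpha1
    kind: DynamicSchedulerPolicy
    spec:
      syncPolicy:                   # utilization metrics synced for scheduling decisions
      - name: cpu_usage_avg_5m
        period: 3m
      predicate:                    # scheduling-time watermark: filter out busy nodes
      - name: cpu_usage_avg_5m
        maxLimitPercent: 0.65
      priority:                     # prefer nodes with lower actual load
      - name: cpu_usage_avg_5m
        weight: 0.2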

    Notes

    1. To avoid draining important Pods, this feature does not evict Pods by default. For Pods that can be safely drained, users must explicitly mark the workload the Pod belongs to. For example, a StatefulSet, Deployment, or other object can be given the drainable annotation (see the example after this list):
    descheduler.alpha.kubernetes.io/evictable: 'true'
    2. It is recommended to enable event persistence for the cluster to better monitor component exceptions and troubleshoot problems. When a Pod is evicted, corresponding events are generated; you can use the Descheduled event to check whether a Pod is being repeatedly evicted.
    3. Eviction places requirements on nodes: the cluster must have 3 or more low-load native nodes, where low-load means the node's load is below its runtime watermark.
    4. After filtering at the node dimension, draining begins on the workloads on the node. This requires that the workload's ready replica count be greater than or equal to 2, or at least half of the replicas declared in the workload's spec.
    5. At the Pod dimension, a Pod whose load exceeds the node's eviction watermark is not evicted, to avoid overloading other nodes by relocating it there.
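    As an example of note 1, the sketch below marks a Deployment as drainable using the annotation above. Only the annotation itself comes from this document; the workload is illustrative.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx-example
      annotations:
        descheduler.alpha.kubernetes.io/evictable: 'true'  # allow the descheduler to drain these Pods
    spec:
      replicas: 3          # per note 4: keep 2 or more replicas so draining is permitted
      selector:
        matchLabels:
          app: nginx-example
      template:
        metadata:
          labels:
            app: nginx-example
        spec:
          containers:
          - name: nginx
            image: nginx:1.25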

    Scenario 3: Scheduling Pods in specified Namespaces only to native nodes

    Native nodes are a new node type launched by Tencent Cloud's TKE team. Built on the technical accumulation from Tencent Cloud's operation of tens of millions of container cores, they provide native-like, highly stable, fast-responding Kubernetes node management. Native nodes offer amplifiable node specifications and Request recommendation, so it is highly advisable to schedule your workloads to them to exploit these advantages fully. When enabling the native node scheduler, you can select Namespaces; from then on, Pods in the specified Namespaces are scheduled exclusively to native nodes.
    Note:
    If native node resources are insufficient at that point, Pods will become Pending.
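    The CraneSchedulerConfiguration CRD listed in the component table is a plausible carrier for this Namespace scope (in the console, the Namespaces are simply selected when enabling the scheduler). The sketch below is hypothetical; the field names are assumptions, not the actual schema.

    # Hypothetical sketch: the field names below are assumptions, not the
    # actual CraneSchedulerConfiguration schema.
    apiVersion: scheduling.crane.io/v1alpha1
    kind: CraneSchedulerConfiguration
    metadata:
      name: default
    spec:
      nativeNodeOnlyNamespaces:   # Pods in these Namespaces go only to native nodes
      - production
      - middleware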

    Limits

    This feature is supported only on native nodes. For more information, see Native Node Overview.
    The cluster's Kubernetes version must be v1.22.5-tke.8, v1.20.6-tke.24, v1.18.4-tke.28, v1.16.3-tke.30, or later. To upgrade a cluster version, see Upgrading a Cluster.

    Risk Control

    After this component is uninstalled, only the scheduling logic of the native node dedicated scheduler is removed; the scheduling capability of the native Kube-Scheduler is unaffected. Pods already scheduled to native nodes are not affected, because their placement has already been decided. However, if the kubelet on a native node restarts, Pod eviction may be triggered, since the sum of the Pods' Requests on the node may exceed the node's real specifications.
    If the amplification factor is adjusted downwards, existing Pods on native nodes are likewise unaffected, because their placement has already been decided. However, if the kubelet on a native node restarts, Pod eviction may be triggered, since the sum of the Pods' Requests on the node may exceed the node's newly amplified specifications.
    Users will observe that Node resources in the Kubernetes cluster do not match the resources of the corresponding CVM nodes.
    Excessive load and instability issues may arise later as a result.
    After node specifications are amplified, the node's kubelet layer and resource QoS-related modules may be affected. For example, with kubelet CPU core binding, if a 4-core node is scheduled as an 8-core node, the Pods' core binding may be impacted.

    Component Permission Description

    Crane Scheduler Permission

    Permission Description

    The permissions of this component are the minimal set required for the current features to operate.

    Permission Scenarios

    | Feature | Involved Object | Involved Operation Permission |
    |---|---|---|
    | Track node updates and changes, and access node utilization. | nodes | get/watch/list |
    | Track Pod updates and changes, and determine the scheduling priority of nodes based on recent Pod scheduling within the cluster. | pods/namespaces | get/watch/list |
    | Update node utilization to node resources, decoupling the scheduling logic from the query logic. | nodes/status | patch |
    | Support multiple replicas to ensure component availability. | leases | create/get/update |
    | Track ConfigMap updates and changes to implement scheduling of specified Pods to native nodes. | configmaps | get/list/watch |

    Permission Definition

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: crane-scheduler
    rules:
    - apiGroups:
      - ""
      resources:
      - pods
      - nodes
      - namespaces
      verbs:
      - list
      - watch
      - get
    - apiGroups:
      - ""
      resources:
      - nodes/status
      verbs:
      - patch
    - apiGroups:
      - ""
      resources:
      - configmaps
      verbs:
      - get
      - list
      - watch
    - apiGroups:
      - extensions
      - apps
      resources:
      - deployments/scale
      verbs:
      - get
      - update
    - apiGroups:
      - coordination.k8s.io
      resources:
      - leases
      verbs:
      - create
      - get
      - update
    - apiGroups:
      - "scheduling.crane.io"
      resources:
      - clusternoderesourcepolicies
      - noderesourcepolicies
      - craneschedulerconfigurations
      verbs:
      - get
      - list
      - watch
      - update
      - create
      - patch
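    The ClusterRoleBinding listed in the component table grants this role to the scheduler's identity. A sketch is shown below; the ServiceAccount name is an assumption based on the component name.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: crane-scheduler
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: crane-scheduler
    subjects:
    - kind: ServiceAccount
      name: crane-scheduler   # assumed ServiceAccount name
      namespace: kube-system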

    Crane Descheduler Permission

    Permission Description

    The permissions of this component are the minimal set required for the current features to operate.

    Permission Scenarios

    | Feature | Involved Object | Involved Operation Permission |
    |---|---|---|
    | Track node updates and changes, and access node utilization. | nodes | get/watch/list |
    | Track Pod updates and changes, and determine which Pods to evict first based on Pod information within the cluster. | pods | get/watch/list |
    | Drain Pods. | pods/eviction | create |
    | Determine whether the ready replicas of the workload a Pod belongs to constitute half or more of the desired replicas, to decide whether the Pod can be drained. | replicasets/deployments/statefulsets/statefulsetpluses/jobs | get |
    | Report events when draining Pods. | events | create |

    Permission Definition

    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: crane-descheduler
    rules:
    - apiGroups: [""]
      resources: ["nodes"]
      verbs: ["get", "watch", "list"]
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch", "list"]
    - apiGroups: [""]
      resources: ["nodes/status"]
      verbs: ["patch"]
    - apiGroups: [""]
      resources: ["pods/eviction"]
      verbs: ["create"]
    - apiGroups: ["*"]
      resources: ["replicasets"]
      verbs: ["get"]
    - apiGroups: ["*"]
      resources: ["deployments"]
      verbs: ["get"]
    - apiGroups: ["apps"]
      resources: ["statefulsets"]
      verbs: ["get"]
    - apiGroups: ["platform.stke"]
      resources: ["statefulsetpluses"]
      verbs: ["get"]
    - apiGroups: [""]
      resources: ["events"]
      verbs: ["create"]
    - apiGroups: ["*"]
      resources: ["jobs"]
      verbs: ["get"]
    - apiGroups: ["coordination.k8s.io"]
      resources: ["leases"]
      verbs: ["create", "get", "update"]
    