Kubernetes designs and provides two types of native resources at the cluster access layer, 'Service' and 'Ingress', which are responsible for the network access layer configurations of layer 4 and layer 7, respectively. The traditional solution is to create an Ingress- or LoadBalancer-type service to bind Tencent Cloud CLBs and open services to the public. In this way, user traffic is loaded on the NodePort of the user node, and then forwarded to the container network through the KubeProxy component. This solution has some limitations in business performance and capabilities.
To address these limitations, the Tencent Cloud TKE team provides a new network mode for users who use independent or managed clusters. that is, TKE directly connects to the CLB of pods based on the ENI. This mode provides enhanced performance and business capabilities. This document describes the differences between the two modes and how to use the direct connection mode.
|Comparison Item||Direct Connection||NodePort Forwarding||Local Forwarding|
|Performance||Zero loss||NAT forwarding and inter-node forwarding||Minor loss|
|Pod update||The access layer backend automatically synchronizes updates, so the update process is stable||The access layer backend NodePort remains unchanged||Services may be interrupted without update synchronization|
|Cluster dependency||Cluster version and VPC-CNI network requirements||-||-|
|Business capability restriction||Least restriction||Unable to obtain the source IP address or implement session persistence||Conditional session persistence|
In a cluster,
KubeProxy forwards the traffic from user
NodePort through NAT to the cluster network. This process has the following problems:
KubeProxyis random and does not support session persistence.
KubeProxyhas independent load-balancing capabilities. As such capabilities cannot be concentrated in one place, global load balancing is difficult to achieve.
To address the preceding problems, the technical suggestion previously provided to users was to adopt local forwarding to avoid the problems caused by
KubeProxy NAT forwarding. However, due to the randomness of forwarding, session persistence remains unsupported when multiple replicas are deployed on a node. Moreover, when local forwarding coincides with rolling updates, services can be easily interrupted. This places higher requirements on the rolling update policies and downtime of businesses.
When a service is accessed through NodePorts, the design of NodePorts is highly fault-tolerant. The CLB binds the NodePorts of all nodes in the cluster as the backend. When any node of the cluster accesses the service, the traffic will be randomly allocated to the workloads of the cluster. Therefore, the unavailability of NodePorts or pods does not affect the traffic access of the service.
Similar to local access, in cases where the backend of the CLB is directly connected to user pods, if the CLB cannot be promptly bound to the new pod when the service is processing a rolling update, the number of CLB backends of the service entry may be seriously insufficient or even exhausted as a result of rapid rolling updates. Therefore, when the service is processing a rolling update, the security and stability of the rolling update can be ensured if the CLB of the access layer is healthy.
The control plane APIs of the CLB include APIs for creating, deleting, and modifying layer-4 and layer-7 listeners, creating and deleting layer-7 rules, and binding each listener or the rule backend. Most of these APIs are asynchronous APIs, which require the polling of request results, and API calls are time-consuming. When the scale of the user cluster is large, the synchronization of a large amount of access layer resources can impose high latency pressure on components.
TKE has launched the direct pod connection mode, which optimizes the control plane of the CLB. In the overall synchronization process, this new mode mainly optimizes batch calls and backend instance queries where remote calls are relatively frequent. After the optimization, the performance of the control plane in a typical ingress scenario is improved by 95% to 97% compared with the previous version. At present, the synchronization time is mainly the waiting time of asynchronous APIs.
For cluster scaling, the relevant data is as follows:
|Layer-7 Rule Quantity||Cluster Node Quantity||Cluster Node Quantity (Update)||Performance Before Optimization (s)||Optimized Batch Calling Performance (s)||Re-optimized Backend Instance Query Performance (s)||Time Consumption Reduction (%)|
For first-time activation and deployment of services in the cluster, the relevant data is as follows:
|Layer-7 Rule Quantity||Layer-7 Rule Quantity (Update)||Cluster Node Quantity||Performance Before Optimization (s)||Optimized Batch Calling Performance (s)||Re-optimized Backend Instance Query Performance (s)||Time Consumption Reduction (%)|
The following figure shows the comparison:
In addition to control plane performance optimization, the CLB can directly access the pods of the container network, which is the integral part of component business capabilities. This not only prevents the loss of NAT forwarding performance, but also eliminates the impact of NAT forwarding on the business features in the cluster. However, the support for optimal access to the container network remains unavailable when the project is launched.
The new mode integrates the feature that allows pods to have an ENI entry under the cluster CNI network mode in order to implement direct access to the CLB. CCN solutions are already available for implementing direct CLB backend access to the container network.
In addition to the capability of direct access, availability during rolling updates must be ensured. To implement this, we use the official feature
ReadinessGate, which was officially released in version 1.12 and is mainly used to control the conditions of pods.
By default, a pod has three possible conditions: PodScheduled, Initialized, and ContainersReady. When the state of all pods is Ready, Pod Ready also becomes ready. However, in cloud-native scenarios, the status of pods needs to be determined in combination with other factors.
ReadinessGate allows us to add fences for pod status determination so that the pod status can be determined and controlled by a third party, and the pod status can be associated with a third party.
The request process is as follows:
The request process is as follows:
To introduce ReadinessGate, the cluster version must be 1.12 or later.
When users start the rolling update of an app,
Kubernetes performs the rolling update according to the update policy. However, the identifications that it uses to determine whether a batch of pods has started only includes the statuses of the pods, but does not consider whether a health check is configured for the pods in the CLB and the pods have passed the check. If such pods cannot be scheduled in time when the access layer components experience a heavy load, the pods that have successfully completed the rolling update may not be providing services to external users, resulting in service interruption.
In order to associate the backend status of the CLB and rolling update, the new feature
ReadinessGate, which was introduced in Kubernetes 1.12, was introduced into the TKE access-layer components. With this feature, only after the TKE access-layer components confirm that the backend binding is successful and the health check is passed, will the state of
ReadinessGate be configured to enable the pods to enter the Ready state, thus facilitating the rolling update of the entire workload.
Kubernetes clusters provide a service registration mechanism. With this mechanism, you only need to register your services to a cluster as
MutatingWebhookConfigurations resources. When a pod is created, the cluster will deliver notifications to the configured callback path. At this time, the pre-creation operation can be performed for the pod, that is,
ReadinessGate can be added to the pod.
This callback process must be based on HTTPS. That is, the CA that issues requests must be configured in
MutatingWebhookConfigurations, and a certificate issued by the CA must be configured on the server.
Service registration or certificates in user clusters may be deleted by users, but these system component resources should not be modified or destroyed by users. However, such problems will inevitably occur due to users’ exploration of clusters or improper operations.
The access layer components will check the integrity of the resources above during launch. If their integrity is compromised, the components will rebuild these resources to enhance the robustness of the system.
Direct connection and NodePorts are the access layer solutions for service applications. In fact, the workloads deployed by users are the ultimate workers, and therefore the capabilities of user workloads directly determine the QPS and other metrics of services.
For these two access-layer solutions, we performed some comparative tests on network link latency under low workload pressure. The latency of direct connection on the network link of the access layer can be reduced by 10%, and traffic in the VPC network was greatly reduced. During the tests, the cluster size was gradually increased from 20 nodes to 80 nodes, and the wrk tool was used to test the network latency of the cluster. The comparison of QPS and network latency between direct connection and NodePorts is shown in the following figure:
KubeProxy has some disadvantages, but based on the various features of CLB and VPC network, we have a more localized access layer solution.
KubeProxy offers a universal and fault-tolerant design for the cluster access layer. It is basically applicable to clusters in all business scenarios. As an official component, this design is very appropriate.
Workload example: nginx-deployment-eni.yaml
tke.cloud.tencent.com/networks: tke-route-eni, meaning that the workload uses the VPC-CNI ENI mode.
apiVersion: apps/v1 kind: Deployment metadata: labels: app: nginx name: nginx-deployment-eni spec: replicas: 3 selector: matchLabels: app: nginx template: metadata: annotations: tke.cloud.tencent.com/networks: tke-route-eni labels: app: nginx spec: containers: - image: nginx:1.7.9 name: nginx ports: - containerPort: 80 protocol: TCP
Service example: nginx-service-eni.yaml
service.cloud.tencent.com/direct-access: "true", meaning that, when synchronizing the CLB, the service configures the access backend by using the direct connection method.
apiVersion: v1 kind: Service metadata: annotations: service.cloud.tencent.com/direct-access: "true" labels: app: nginx name: nginx-service-eni spec: externalTrafficPolicy: Cluster ports: - name: 80-80-no port: 80 protocol: TCP targetPort: 80 selector: app: nginx sessionAffinity: None type: LoadBalancer
Deploying the preceding items in a cluster
In the deployment environment, you must first connect to a cluster (if you do not have a cluster, create one.) You can refer to the Help Document to configure kubectl to connect to a cluster.
➜ ~ kubectl apply -f nginx-deployment-eni.yaml deployment.apps/nginx-deployment-eni created ➜ ~ kubectl apply -f nginx-service-eni.yaml service/nginx-service-eni configured ➜ ~ kubectl get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx-deployment-eni-bb7544db8-6ljkm 1/1 Running 0 24s 172.17.160.191 172.17.0.3 <none> 1/1 nginx-deployment-eni-bb7544db8-xqqtv 1/1 Running 0 24s 172.17.160.190 172.17.0.46 <none> 1/1 nginx-deployment-eni-bb7544db8-zk2cx 1/1 Running 0 24s 172.17.160.189 172.17.0.9 <none> 1/1 ➜ ~ kubectl get service -o wide NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR kubernetes ClusterIP 10.187.252.1 <none> 443/TCP 6d4h <none> nginx-service-eni LoadBalancer 10.187.254.62 22.214.171.124 80:32693/TCP 6d1h app=nginx
Currently, TKE uses ENI to implement the direct pod connection mode. We will further optimize this feature, including in the following respects:
Comparison with similar solutions in the industry: