Cluster kube-proxy Troubleshooting

Last updated: 2022-11-02 14:35:48

    Access to a TKE cluster Service may fail in some cases. If you have confirmed that the backend Pods are healthy, the cause may be that the kube-proxy add-on version is earlier than required, so the iptables or IPVS forwarding rules on the node cannot be delivered successfully. This document describes common problems caused by earlier kube-proxy versions and how to fix them. If you still have problems, contact us for assistance.

    kube-proxy was not correctly adapted to the iptables backend of the node

    Sample error message

    Failed to execute iptables-restore: exit status 2 (iptables-restore v1.8.4 (legacy): Couldn't load target 'KUBE-MARK-DROP':No such file or directory
    

    Cause

    1. When kube-proxy executes iptables-restore, the KUBE-MARK-DROP chain it depends on doesn't exist, so the rule sync fails and kube-proxy exits. The KUBE-MARK-DROP chain is maintained by kubelet.
    2. On some later OS versions, the default iptables backend is nft, while earlier kube-proxy versions use the legacy backend. When an earlier kube-proxy runs on a later OS, the two backends do not match, so the KUBE-MARK-DROP chain written by kubelet cannot be read (you can confirm this with the check after the list). Later OS versions include:
      • TLinux 2.6 (TK4)
      • TLinux 3.1
      • TLinux 3.2
      • CentOS 8
      • Ubuntu 20
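
    To confirm the mismatch on a node, you can compare what the two backends see. This is a minimal diagnostic sketch, assuming the node ships iptables 1.8 or later (which provides the iptables-legacy-save and iptables-nft-save wrappers):

    iptables --version                        # prints "(nf_tables)" or "(legacy)" on iptables >= 1.8
    iptables-legacy-save 2>/dev/null | wc -l  # number of rule lines visible to the legacy backend
    iptables-nft-save 2>/dev/null | wc -l     # number of rule lines visible to the nft backend

    If kubelet's chains (such as KUBE-MARK-DROP) only show up in the nft output while kube-proxy writes through the legacy backend, you have hit the mismatch described above.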

    Fix guide

    Upgrade kube-proxy based on the TKE cluster version:

    TKE Cluster Version    Fix Policy
    > 1.18                 No fixes are required, as the problem doesn't exist.
    1.18                   Upgrade kube-proxy to v1.18.4-tke.26 or later.
    1.16                   Upgrade kube-proxy to v1.16.3-tke.28 or later.
    1.14                   Upgrade kube-proxy to v1.14.3-tke.27 or later.
    1.12                   Upgrade kube-proxy to v1.12.4-tke.31 or later.
    1.10                   Upgrade kube-proxy to v1.10.5-tke.20 or later.
    Note:

    For more information on the latest TKE versions, see TKE Kubernetes Revision Version History.
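
    To check which kube-proxy version a cluster currently runs, you can read the image tag of the kube-proxy workload. A minimal sketch, assuming kube-proxy is deployed as a DaemonSet named kube-proxy in the kube-system namespace:

    kubectl -n kube-system get daemonset kube-proxy \
      -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
    # Compare the image tag (for example v1.18.4-tke.26) against the table above.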


    iptables lock of kube-proxy

    Concurrent write failure because another add-on does not mount the iptables lock

    Sample error message

    Failed to execute iptables-restore: exit status 1 (iptables-restore: line xxx failed)
    

    Cause

    1. When writing iptables rules to the kernel, iptables commands (such as iptables-restore) use a file lock for synchronization so that multiple instances do not write concurrently. On Linux, this file is generally /run/xtables.lock.
    2. For a Pod that needs to call iptables commands, such as kube-proxy, kube-router, or a hostNetwork business Pod, if this lock file is not mounted into the Pod, the concurrent write conflict above can occur.

    Fix guide

    For a Pod that needs to call iptables commands, you need to mount the host /run/xtables.lock file to the Pod as follows:

           volumeMounts:
           - mountPath: /run/xtables.lock
             name: xtables-lock
             readOnly: false
         volumes:
         - hostPath:
             path: /run/xtables.lock
             type: FileOrCreate
           name: xtables-lock
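
    After the change, you can verify that the lock file is visible inside the Pod. A minimal sketch, assuming a hypothetical Pod name my-pod in the kube-system namespace:

    kubectl -n kube-system exec my-pod -- ls -l /run/xtables.lock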
    

    Writes are blocked due to an earlier iptables-restore version

    Sample error message

    Failed to execute iptables-restore: exit status 4 (Another app is currently holding the xtables lock. Perhaps you want to use the -w option?)
    

    Cause

    1. When writing iptables rules to the kernel, iptables commands (such as iptables-restore) use a file lock for synchronization so that multiple instances do not write concurrently. When an earlier iptables-restore is executed, it tries to acquire the file lock once and exits immediately if the lock is held by another process.
    2. The error is a soft error: kube-proxy tries to acquire the lock again in the next sync cycle (or when the next Service event is triggered). However, if the lock cannot be obtained after several attempts, rule sync is noticeably delayed.
    3. Later iptables-restore versions provide a -w (--wait) option. With -w 5, iptables-restore waits for up to five seconds to acquire the lock; if another process releases the lock during this period, iptables-restore can continue its operation. You can check whether the node's iptables-restore supports this option as shown below.
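
    A quick check on the node for whether the installed iptables-restore supports the wait option (the exact iptables version that added it depends on the distribution, so treat this as a sketch):

    iptables-restore --version
    iptables-restore --help 2>&1 | grep -- '--wait'   # no output means the option is not supported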

    Fix guide

    1. If kube-proxy is deployed as a binary on the node, you can get a later iptables-restore by upgrading the node OS:

      Node OS          Target Version
      CentOS           7.2 or later
      Ubuntu           20.04 or later
      Tencent Linux    2.4 or later

    2. If kube-proxy is deployed as a DaemonSet in the cluster, you can get a later iptables-restore by upgrading kube-proxy:

      TKE Cluster Version    Fix Policy
      > 1.12                 No fixes are required, as the problem doesn't exist.
      1.12                   Upgrade kube-proxy to v1.12.4-tke.31 or later.
      < 1.12                 Upgrade the TKE cluster.
    Note:

    For more information on the latest TKE versions, see TKE Kubernetes Revision Version History.

    Another add-on holding the iptables lock for too long

    Sample error message

    Failed to ensure that filter chain KUBE-SERVICES exists: error creating chain "KUBE-EXTERNAL-SERVICES": exit status 4: Another app is currently holding the xtables lock. Stopped waiting after 5s.
    

    Cause

    1. When writing iptables rules to the kernel, iptables commands (such as iptables-restore) use a file lock for synchronization so that multiple instances do not write concurrently. When iptables-restore is executed, it tries to acquire the file lock; if the lock is held by another process, it waits for a certain period of time (controlled by the -w value, five seconds by default), then continues once it gets the lock or exits if the wait times out.
    2. The error indicates that another add-on held the iptables file lock for more than five seconds. You can check which process is holding the lock as shown below.
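
    To see which process currently has the lock file open on the node, you can use lsof or fuser (assuming either tool is installed); run the check while the error is occurring, since the lock is usually held only briefly:

    lsof /run/xtables.lock
    # or
    fuser -v /run/xtables.lock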

    Fix guide

    Reduce, as much as possible, the time for which other add-ons hold the iptables file lock. In particular, earlier versions of the NetworkPolicy (kube-router) add-on provided on the add-on management page in the TKE console hold the iptables lock for a long time; upgrade the add-on to the latest version v1.3.2.


    kube-proxy to kube-apiserver connection exception

    Sample error message

    Failed to list *core.Endpoints: Stream error http2.StreamError{StreamID:0xea1, Code:0x2, Cause:error(nil)} when reading response body, may be caused by closed connection. Please retry.
    

    Cause

    Earlier Kubernetes versions have a bug in how they call the Go HTTP/2 package: the client may keep using a connection to kube-apiserver that has already been closed. When kube-proxy hits this bug, rule sync fails. For more information, see (1.17) Kubelet won't reconnect to Apiserver after NIC failure (use of closed network connection) #87615 and Enables HTTP/2 health check #95981.

    Fix guide

    Upgrade kube-proxy based on the TKE cluster version:

    TKE Cluster Version    Fix Policy
    > 1.18                 No fixes are required, as the problem doesn't exist.
    1.18                   Upgrade kube-proxy to v1.18.4-tke.26 or later.
    < 1.18                 Upgrade the TKE cluster.
    Note:

    For more information on the latest TKE versions, see TKE Kubernetes Revision Version History.

    kube-proxy panicked after the first startup and became normal after restart

    Sample error message

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x1514fb8]
    

    Cause

    1. The upstream kube-proxy code has a bug: during initialization, it does not record the kernel modules that have already been loaded, which leads to the use of an uninitialized variable and a panic.
    2. In addition, the log output is not detailed enough and does not report whether the IPVS mode can be used. For more information, see kube-proxy panics with SIGSEGV on first run #89729, Do not forget recording loaded modules #89823, and ipvs: log err from CanUseIPVSProxier #89785.
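
    Independently of the upstream fix, you can check on the node whether the kernel modules needed by the IPVS mode are loaded. This is a diagnostic sketch, not part of the fix itself:

    lsmod | grep -E 'ip_vs|nf_conntrack'   # the ip_vs* modules should be listed when IPVS mode is usable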

    Fix guide

    Upgrade kube-proxy based on the TKE cluster version:

    TKE Cluster Version    Fix Policy
    > 1.18                 No fixes are required, as the problem doesn't exist.
    1.18                   Upgrade kube-proxy to v1.18.4-tke.26 or later.
    < 1.18                 No fixes are required, as the problem doesn't exist.
    Note:

    For more information on the latest TKE versions, see TKE Kubernetes Revision Version History.


    kube-proxy kept panicking

    Sample error message

    Observed a panic: "slice bounds out of range" (runtime error: slice bounds out of range)
    

    Cause

    The upstream kube-proxy code has a bug: when kube-proxy runs iptables-save, it directs standard output and standard error to the same buffer, and the order in which the two arrive is not deterministic. The buffer can therefore contain data in an unexpected format, which causes a panic when the output is parsed. For more information, see kube-proxy panics when parsing iptables-save output #78443 and Fix panic in kube-proxy when iptables-save prints to stderr #78428.
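
    To see whether iptables-save on a node emits anything on standard error (the trigger for this panic), you can capture the two streams separately. A minimal sketch using hypothetical temporary file paths:

    iptables-save -t nat > /tmp/iptables-stdout 2> /tmp/iptables-stderr
    cat /tmp/iptables-stderr   # any warnings here are what older kube-proxy versions mixed into the parsed output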

    Fix guide

    Upgrade kube-proxy based on the TKE cluster version:

    TKE Cluster Version    Fix Policy
    > 1.14                 No fixes are required, as the problem doesn't exist.
    1.14                   Upgrade kube-proxy to v1.14.3-tke.27 or later.
    1.12                   Upgrade kube-proxy to v1.12.4-tke.31 or later.
    < 1.12                 No fixes are required, as the problem doesn't exist.
    Note:

    For more information on the latest TKE versions, see TKE Kubernetes Revision Version History.


    kube-proxy periodically consumed high CPU in IPVS mode

    Cause

    This happens because kube-proxy refreshes the Service forwarding rules on the node too frequently; a quick way to check the sync frequency is shown after the list. Typical reasons are:

    • kube-proxy frequently performs periodic rule syncs.
    • The business Service or Pod is frequently changed.
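
    kube-proxy exposes sync-related metrics on its metrics endpoint. The sketch below assumes the default metrics address 127.0.0.1:10249; metric names can differ slightly across kube-proxy versions:

    curl -s http://127.0.0.1:10249/metrics | grep sync_proxy_rules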

    Fix guide

    If the problem is caused by frequent periodic rule syncs by kube-proxy, you need to modify relevant parameters. Below are default parameters of kube-proxy on an earlier version:

    --ipvs-min-sync-period=1s (minimum refresh interval of one second)
    --ipvs-sync-period=5s (periodic refresh every five seconds)
    

    Therefore, kube-proxy refreshes the node forwarding rules every five seconds, which consumes a lot of CPU. You can change the configuration to:

    --ipvs-min-sync-period=0s (real-time refresh upon event occurrence)
    --ipvs-sync-period=30s (periodic refresh every 30 seconds) 
    

    The values above are suggested values and can be adjusted as needed.
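
    How to change the parameters depends on how kube-proxy is deployed. The sketch below assumes kube-proxy runs as a DaemonSet named kube-proxy in the kube-system namespace; if it runs as a binary managed by systemd, edit the service unit or its configuration file instead:

    kubectl -n kube-system edit daemonset kube-proxy
    # In the container args, set for example:
    #   - --ipvs-min-sync-period=0s
    #   - --ipvs-sync-period=30s
    # The DaemonSet then re-creates the kube-proxy Pods with the new parameters.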
