kubernetes: kube-proxy programs wrong iptables configuration for externalTrafficPolicy: Local services in large clusters

What happened:

In our cluster of about 260 nodes and 9000 pods, kube-proxy sometimes configures iptables incorrectly for services with externalTrafficPolicy: Local during phases of high endpoint churn (rolling redeploys). This happens only occasionally, so it is probably some kind of race condition, or is caused by intermittent errors such as connection resets.

The node in question has (according to the API) several local pods, so kube-proxy should configure iptables to forward traffic to them, but instead it drops the traffic.

The problem manifests as the following iptables configuration for an ingress controller with externalTrafficPolicy: Local. kube-proxy programs the rules as if there were no local endpoints (even though they are running), so it drops all non-local traffic:

-A KUBE-SERVICES -d ipaddress/32 -p tcp -m comment --comment "ingress-controller/traefik-ingress-service-team-xyz:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-LWUTJB5ULEARB6TO
-A KUBE-FW-LWUTJB5ULEARB6TO -m comment --comment "ingress-controller/traefik-ingress-service-team-xyz:http loadbalancer IP" -j KUBE-XLB-LWUTJB5ULEARB6TO
-A KUBE-FW-LWUTJB5ULEARB6TO -m comment --comment "ingress-controller/traefik-ingress-service-team-xyz:http loadbalancer IP" -j KUBE-MARK-DROP
-A KUBE-XLB-LWUTJB5ULEARB6TO -s 100.64.0.0/13 -m comment --comment "Redirect pods trying to reach external loadbalancer VIP to clusterIP" -j KUBE-SVC-LWUTJB5ULEARB6TO
-A KUBE-XLB-LWUTJB5ULEARB6TO -m comment --comment "masquerade LOCAL traffic for ingress-controller/traefik-ingress-service-team-xyz:http LB IP" -m addrtype --src-type LOCAL -j KUBE-MARK-MASQ
-A KUBE-XLB-LWUTJB5ULEARB6TO -m comment --comment "route LOCAL traffic for ingress-controller/traefik-ingress-service-team-xyz:http LB IP to service chain" -m addrtype --src-type LOCAL -j KUBE-SVC-LWUTJB5ULEARB6TO
-A KUBE-XLB-LWUTJB5ULEARB6TO -m comment --comment "ingress-controller/traefik-ingress-service-team-xyz:http has no local endpoints" -j KUBE-MARK-DROP

-A KUBE-SVC-LWUTJB5ULEARB6TO -m statistic --mode random --probability 0.02127659554 -j KUBE-SEP-WNJ5QKGUHPD3B7I5
-A KUBE-SVC-LWUTJB5ULEARB6TO -m statistic --mode random --probability 0.02173913037 -j KUBE-SEP-2LQWVHEBN6SLQ7SO
... more KUBE-SEP jumps for endpoints on other nodes

In this case the node had 5 local pods available as endpoints, yet none of them show up in the KUBE-SEP rules; only the 48 non-local endpoints appear, in the KUBE-SVC chain that only src-local traffic reaches.
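On an affected node the mismatch is visible directly in the ruleset, since healthy local endpoints show up as KUBE-SEP jumps at the end of the XLB chain. A quick check (the chain hash is the one from the dump above):

iptables-save -t nat | grep 'KUBE-XLB-LWUTJB5ULEARB6TO'

Healthy output ends in one "-j KUBE-SEP-..." rule per local endpoint; in the broken state the chain ends in the "has no local endpoints" KUBE-MARK-DROP rule shown above.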

The Endpoints object in the API had the 5 local endpoints listed as ready (in .subsets[].addresses[]). The pod objects were also Ready.
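For completeness, the ready addresses can be listed straight from the Endpoints object (a sketch using the names from the rules above; .nodeName is part of the v1 EndpointAddress schema):

kubectl -n ingress-controller get endpoints traefik-ingress-service-team-xyz -o jsonpath='{range .subsets[*].addresses[*]}{.ip} {.nodeName}{"\n"}{end}'

The 5 addresses on the affected node were present here even while the XLB chain contained none of them.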

The node still receives traffic because the IP announcer (MetalLB) does see the local endpoints in the API and announces the IP, but kube-proxy/iptables drops it.

The kube-proxy logs show no errors or anything else helpful for the issue, just the usual "adding service" and "opening healthcheck/nodeport" messages, including for the service with the wrong configuration.

How to reproduce it (as minimally and precisely as possible):

It is hard to reproduce; it happens randomly and only in our largest clusters. We have not reproduced it in smaller test environments yet.
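One way to hunt for it would be a loop like this on a node carrying local endpoints (a sketch; the deployment name is a hypothetical stand-in for whatever backs an externalTrafficPolicy: Local service):

while true; do
  kubectl -n ingress-controller rollout restart deployment/traefik-ingress
  kubectl -n ingress-controller rollout status deployment/traefik-ingress
  # 0 local KUBE-SEP jumps in a KUBE-XLB chain while local pods are Ready would indicate the bug
  iptables-save -t nat | grep 'KUBE-XLB' | grep -c 'KUBE-SEP'
done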

Anything else we need to know?:

Environment:

Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.14", GitCommit:"89182bdd065fbcaffefec691908a739d161efc03", GitTreeState:"clean", BuildDate:"2020-12-18T12:02:35Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

kube-proxy in iptables mode with default settings

  • Cloud provider or hardware configuration: bare metal
  • OS: Flatcar Linux 2605.12.0
  • Kernel: 5.4
  • Install tools:
  • Network plugin and version (if this is a network-related bug): kube-proxy 1.18.14, Calico 3.17.1

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 25 (17 by maintainers)

Most upvoted comments

/assign @VigneshSP94

Some tips: kube-proxy exposes some Prometheus metrics and, when run with log level > 4, logs much more detailed information. I think that we should leverage that to try to solve this particular problem, and improve our logging and metrics based on this experience.
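For example (a sketch; this assumes kube-proxy's default metrics bind address of 127.0.0.1:10249 and a kube-proxy DaemonSet in kube-system whose first container runs the kube-proxy command):

# on the affected node: rule-sync durations and related counters
curl -s http://127.0.0.1:10249/metrics | grep kubeproxy_sync_proxy_rules

# raise verbosity for the next occurrence
kubectl -n kube-system patch daemonset kube-proxy --type=json -p='[{"op":"add","path":"/spec/template/spec/containers/0/command/-","value":"--v=5"}]'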

@andrewsykim is working on better handling of terminating pods. It's not clear whether that is what is happening here.

It might be worthwhile to spend some time inspecting the code and adding logs and/or metrics that could help catch this in the wild…

Sorry, I misunderstood. No, the service is supposed to be externalTrafficPolicy: Local and it does not change; it is the iptables configuration that kube-proxy generates for this type of service that is wrong.