kubernetes: Client TCP connections have to wait the full timeout when a service's set of endpoints goes from empty to non-empty

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

  • I started a TCP client and a server pod in parallel.
  • The client tries to access the server through a service IP.
  • The client manages to send its SYN packet before kube-proxy finishes setting up the NAT rules for the service.
  • That first packet establishes a conntrack entry for the TCP connection, but the entry isn't NATted (see the conntrack check below).
  • All subsequent SYN retries match the stale entry and get blackholed rather than NATted.
  • The client waits for the full connection timeout.
  • Subsequent TCP connections are NATted correctly.
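
For reference, one way to observe the stale entry from the node, assuming conntrack-tools is installed (the service IP is illustrative):

```
# List TCP conntrack entries whose original destination is the service IP.
conntrack -L -p tcp --orig-dst 10.102.151.4
# A blackholed connection shows up with an un-DNATted reply tuple: the reply
# source is still the service IP rather than a backend pod IP, e.g.
#   tcp 6 ... SYN_SENT src=<client> dst=10.102.151.4 dport=80 [UNREPLIED] src=10.102.151.4 dst=<client> sport=80
```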

We hit this while testing Calico against the Kubernetes e2e tests. However, using the stop/continue trick on Calico to slow down our policy rendering doesn't reproduce the issue; only preventing the NAT rule from being in place has an effect: https://github.com/projectcalico/felix/issues/1490

A closely related issue exists for UDP: https://github.com/kubernetes/kubernetes/issues/48370. However, the impact for UDP is more severe because UDP has no connection timeout.

What you expected to happen:

Ideally, the initial TCP connection should connect as soon as the NAT rules are inserted. Failing that, it’d be good to get a timely rejection rather than sending the traffic into a black hole.

Flushing the (TCP) conntrack entries for a service IP, as described in https://github.com/kubernetes/kubernetes/issues/48370, may fix this. I also noticed an iptables rule that looks like it was designed to help in this case, but it only applies in the OUTPUT chain: `-A KUBE-SERVICES -d 10.102.151.4/32 -p tcp -m comment --comment "default/nginx: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable`.
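
For illustration, a minimal sketch of that flush, assuming conntrack-tools is available on the node (the service IP/port come from the rule above):

```
# Delete stale TCP conntrack entries whose original destination is the service
# IP/port; the next SYN retry then re-traverses the now-complete NAT rules.
conntrack -D -p tcp --orig-dst 10.102.151.4 --orig-port-dst 80
```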

How to reproduce it (as minimally and precisely as possible):

  • send kube-proxy a SIGSTOP to pause it
  • create a server (e.g. nginx) pod
  • create a client (e.g. busybox) pod
  • create a service for the server pod
  • from the client, try to wget the server via its service
  • send kube-proxy a SIGCONT
  • there should now be a conntrack entry with orig src/dst = client/service and reply src/dst = service/client (i.e. un-NATted); the wget should time out
  • a subsequent wget should return quickly (a scripted version of these steps is sketched below)
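
A scripted sketch of the steps above; pod/image names and timeouts are illustrative, and the kill commands must run on the node hosting kube-proxy:

```
# Pause kube-proxy so the service's NAT rules are not installed yet.
sudo kill -STOP "$(pgrep -f kube-proxy)"

kubectl run nginx --image=nginx --port=80 --restart=Never
kubectl run client --image=busybox --restart=Never --command -- sleep 3600
kubectl expose pod nginx --port=80 --name=nginx

# Once the client pod is Running, attempt the connection while the NAT
# rules are missing; this first SYN creates the stale conntrack entry.
kubectl exec client -- wget -T 60 -O- http://nginx &

# Resume kube-proxy; it now installs the NAT rules for the service.
sudo kill -CONT "$(pgrep -f kube-proxy)"

wait  # the first wget still times out (its conntrack entry is blackholed)

# A fresh connection is NATted correctly and returns promptly.
kubectl exec client -- wget -T 60 -O- http://nginx
```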

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.7
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux smc-ubuntu 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:


Most upvoted comments

I would think that immediately rejecting is better from a user-experience/debugging perspective of least surprise. The current behaviour also creates unnecessary, variable delays for code that sits in a loop waiting to reach a service.

For UDP there is no handshake mechanism, so without some kind of ICMP response the user is completely blind to the fact that anything is wrong.

In Calico (and, I imagine, other implementations) if nothing handles the packet, it ends up being routed to the default gateway, which seems completely wrong and, again, very confusing.

I agree that tcp-reset seems more natural.
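
For concreteness, a hypothetical variant of the "has no endpoints" rule quoted above that answers TCP with an RST rather than an ICMP port-unreachable (note that iptables only accepts REJECT in the INPUT, FORWARD and OUTPUT chains, so this still wouldn't cover traffic NATted in PREROUTING):

```
# Hypothetical: reject TCP connections to an endpoint-less service with a
# TCP RST instead of ICMP port-unreachable (service IP/port illustrative).
iptables -A KUBE-SERVICES -d 10.102.151.4/32 -p tcp -m tcp --dport 80 \
  -m comment --comment "default/nginx: has no endpoints" \
  -j REJECT --reject-with tcp-reset
```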

@caseydavenport @dcbw Any thoughts on this? I'm not sure if this is something pod readiness is meant to solve, but it seems like there are a couple of iptables enhancements we might be able to make.