kubernetes: Client TCP connections have to wait the full timeout when a service's set of endpoints goes from empty to non-empty

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:

  • I started a TCP client and a server pod in parallel.
  • The client tries to access the server through a service IP.
  • The client manages to send its SYN packet before kube-proxy finishes setting up the NAT rules for the service.
  • That first packet establishes a conntrack entry for the TCP connection, but the entry isn't NATted (see the conntrack check below).
  • All subsequent SYN retries match the stale entry and get blackholed rather than NATted.
  • The client waits for the full connection timeout.
  • Subsequent TCP connections are NATted correctly.
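
For reference, one way to observe the stale entry from the node, assuming conntrack-tools is installed (the service IP is illustrative):

```
# List TCP conntrack entries whose original destination is the service IP.
conntrack -L -p tcp --orig-dst 10.102.151.4
# A blackholed connection shows up with an un-DNATted reply tuple: the reply
# source is still the service IP rather than a backend pod IP, e.g.
#   tcp 6 ... SYN_SENT src=<client> dst=10.102.151.4 dport=80 [UNREPLIED] src=10.102.151.4 dst=<client> sport=80
```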

We hit this while testing Calico against the Kubernetes e2e tests. However, using the stop/continue trick on Calico to slow down our policy rendering doesn't reproduce the issue; only preventing the NAT rule from being in place has an effect: https://github.com/projectcalico/felix/issues/1490

A closely related issue exists for UDP: https://github.com/kubernetes/kubernetes/issues/48370. However, the impact for UDP is more severe because UDP has no connection timeout.

What you expected to happen:

Ideally, the initial TCP connection should connect as soon as the NAT rules are inserted. Failing that, it’d be good to get a timely rejection rather than sending the traffic into a black hole.

Flushing the (TCP) conntrack entries for a service IP, as described in https://github.com/kubernetes/kubernetes/issues/48370, may fix this. I also noticed an iptables rule that looks like it was designed to help in this case, but it only applies in the OUTPUT chain: `-A KUBE-SERVICES -d 10.102.151.4/32 -p tcp -m comment --comment "default/nginx: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable`.
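
For illustration, a minimal sketch of that flush, assuming conntrack-tools is available on the node (the service IP/port come from the rule above):

```
# Delete stale TCP conntrack entries whose original destination is the service
# IP/port; the next SYN retry then re-traverses the now-complete NAT rules.
conntrack -D -p tcp --orig-dst 10.102.151.4 --orig-port-dst 80
```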

How to reproduce it (as minimally and precisely as possible):

  • send kube-proxy a SIGSTOP to pause it
  • create a server (e.g. nginx) pod
  • create a client (e.g. busybox) pod
  • create a service for the server pod
  • from the client, try to wget the server via its service
  • send kube-proxy a SIGCONT
  • there should now be a conntrack entry with orig src/dst = client/service and reply src/dst = service/client (i.e. un-NATted); the wget should time out
  • a subsequent wget should return quickly (a scripted version of these steps is sketched below)
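
A scripted sketch of the steps above; pod/image names and timeouts are illustrative, and the kill commands must run on the node hosting kube-proxy:

```
# Pause kube-proxy so the service's NAT rules are not installed yet.
sudo kill -STOP "$(pgrep -f kube-proxy)"

kubectl run nginx --image=nginx --port=80 --restart=Never
kubectl run client --image=busybox --restart=Never --command -- sleep 3600
kubectl expose pod nginx --port=80 --name=nginx

# Once the client pod is Running, attempt the connection while the NAT
# rules are missing; this first SYN creates the stale conntrack entry.
kubectl exec client -- wget -T 60 -O- http://nginx &

# Resume kube-proxy; it now installs the NAT rules for the service.
sudo kill -CONT "$(pgrep -f kube-proxy)"

wait  # the first wget still times out (its conntrack entry is blackholed)

# A fresh connection is NATted correctly and returns promptly.
kubectl exec client -- wget -T 60 -O- http://nginx
```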

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.7
  • Cloud provider or hardware configuration: GCE
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux smc-ubuntu 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:


Most upvoted comments

I would think that immediately rejecting is better from a user-experience/debugging perspective of least surprise. The current behaviour also creates unnecessary, variable delays for code that sits in a loop waiting to reach a service.

For UDP there is no handshake mechanism, so without some kind of ICMP response the user is completely blind to the fact that anything is wrong.

In Calico (and, I imagine, other implementations) if nothing handles the packet, it ends up being routed to the default gateway, which seems completely wrong and, again, very confusing.

I agree that tcp-reset seems more natural.
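
For concreteness, a hypothetical variant of the "has no endpoints" rule quoted above that answers TCP with an RST rather than an ICMP port-unreachable (note that iptables only accepts REJECT in the INPUT, FORWARD and OUTPUT chains, so this still wouldn't cover traffic NATted in PREROUTING):

```
# Hypothetical: reject TCP connections to an endpoint-less service with a
# TCP RST instead of ICMP port-unreachable (service IP/port illustrative).
iptables -A KUBE-SERVICES -d 10.102.151.4/32 -p tcp -m tcp --dport 80 \
  -m comment --comment "default/nginx: has no endpoints" \
  -j REJECT --reject-with tcp-reset
```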

@caseydavenport @dcbw Any thoughts on this? I'm not sure if this is something pod readiness is meant to solve, but it seems like there are a couple of iptables enhancements we might be able to make.