kubernetes: Client TCP connections have to wait full timeout when set of endpoints goes from empty to non-empty
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
- I started a TCP client and server pod in parallel.
- The client tries to access the server through a service IP.
- Client manages to get its SYN packet sent before kube-proxy finishes setting up the NAT rules for the service.
- The first packet establishes a conntrack entry for the TCP connection but the conntrack entry isn’t NATted.
- All subsequent SYN retries get blackholed rather than NATted.
- Client waits for connection timeout.
- Subsequent TCP connections get NATted correctly.
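For reference, the stale entry is straightforward to observe with the conntrack CLI; a minimal sketch (the service IP here is just the example one from the rule quoted further down, substitute your own):

```
# List TCP conntrack entries whose original destination is the service IP.
# The broken entry shows the reply source as the service IP itself, i.e.
# no DNAT to a pod IP was applied; entries created after kube-proxy
# catches up show the pod IP in the reply direction instead.
sudo conntrack -L -p tcp --orig-dst 10.102.151.4
```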
We hit this while testing Calico against the k8s e2e tests, but doing the same stop/continue trick with Calico to slow down our policy rendering doesn't reproduce the issue; only preventing the NAT rule from being in place seems to have an effect: https://github.com/projectcalico/felix/issues/1490
This issue is closely related, for UDP: https://github.com/kubernetes/kubernetes/issues/48370. However, the impact for UDP is more severe due to the lack of a connection timeout.
What you expected to happen:
Ideally, the initial TCP connection should connect as soon as the NAT rules are inserted. Failing that, it’d be good to get a timely rejection rather than sending the traffic into a black hole.
Flushing the (TCP) conntrack entries for a service IP, as described in https://github.com/kubernetes/kubernetes/issues/48370, may work as a fix here too. I also noticed an iptables rule that looks like it was designed to help in this case, but it only applies in the OUTPUT chain: `-A KUBE-SERVICES -d 10.102.151.4/32 -p tcp -m comment --comment "default/nginx: has no endpoints" -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable`.
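For illustration, this is roughly what such a flush looks like with the conntrack CLI; the service IP is the example one from the rule above, and whether kube-proxy should issue this automatically when endpoints appear is exactly the open question:

```
# Drop any TCP conntrack entries whose original destination is the service IP.
# The client's next SYN retry is then evaluated against the freshly installed
# DNAT rules instead of matching the stale, un-NATted entry.
sudo conntrack -D -p tcp --orig-dst 10.102.151.4
```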
How to reproduce it (as minimally and precisely as possible):
- send `kube-proxy` a SIGSTOP to pause it
- create server (e.g. nginx) pod
- create client (e.g. busybox) pod
- create service for server pod
- from client, try to `wget` the server via its service
- send `kube-proxy` a SIGCONT
- should have a conntrack entry with orig src/dst = client/service and reply src/dst = service/client; wget should time out
- subsequent wget should return quickly
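For convenience, the steps above roughly as commands. The pod/service names, the busybox wget invocation, and the assumption that kube-proxy runs as a single local process (rather than in a pod) are illustrative; adjust to your setup:

```
# Pause kube-proxy so it cannot program DNAT rules for the new service
sudo kill -STOP "$(pgrep -o kube-proxy)"

# Server pod, client pod, and a service in front of the server
kubectl run nginx --image=nginx --port=80 --restart=Never
kubectl run busybox --image=busybox --restart=Never --command -- sleep 3600
kubectl expose pod nginx --port=80
SVC_IP="$(kubectl get svc nginx -o jsonpath='{.spec.clusterIP}')"

# From the client, hit the service in the background; the SYN leaves
# before any DNAT rule exists, creating an un-NATted conntrack entry
kubectl exec busybox -- wget -O- "http://${SVC_IP}" &

# Give the first SYN a moment to go out, then resume kube-proxy; the
# DNAT rules now appear, but the existing conntrack entry keeps
# blackholing the SYN retries until the client times out
sleep 5
sudo kill -CONT "$(pgrep -o kube-proxy)"
wait

# A second attempt (new source port, new conntrack entry) succeeds
kubectl exec busybox -- wget -O- "http://${SVC_IP}"
```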
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): 1.7
- Cloud provider or hardware configuration: GCE
- OS (e.g. from /etc/os-release): Ubuntu 16.04
- Kernel (e.g. `uname -a`): Linux smc-ubuntu 4.4.0-78-generic #99-Ubuntu SMP Thu Apr 27 15:29:09 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
- Install tools: kubeadm
- Others:
About this issue
- State: closed
- Created 7 years ago
- Comments: 33 (22 by maintainers)
I would think that immediately rejecting is better from a user-experience/debugging perspective, on the principle of least surprise. I also imagine that blackholing creates unnecessary, variable delays for code that sits in a loop waiting to reach a service.
For UDP there is no handshake mechanism, so without some kind of ICMP response the user is completely blind to the fact that anything is wrong.
In Calico (and I imagine others as well), if nothing handles the packet it ends up being routed to the default gateway, which seems completely wrong and, again, very confusing.
I agree that tcp-reset seems more natural.
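For concreteness, the change being discussed is just the reject type on the existing "has no endpoints" rule; a sketch based on the rule quoted earlier (not a proposed patch, and it still only helps where the filter-table REJECT rules apply):

```
# Current behaviour for a TCP service with no endpoints: ICMP port unreachable
iptables -A KUBE-SERVICES -d 10.102.151.4/32 -p tcp \
  -m comment --comment "default/nginx: has no endpoints" \
  -m tcp --dport 80 -j REJECT --reject-with icmp-port-unreachable

# tcp-reset variant: answer the SYN with a RST so TCP clients fail fast
# instead of depending on how their stack treats the ICMP error
iptables -A KUBE-SERVICES -d 10.102.151.4/32 -p tcp \
  -m comment --comment "default/nginx: has no endpoints" \
  -m tcp --dport 80 -j REJECT --reject-with tcp-reset
```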
@caseydavenport @dcbw Any thoughts on this? I’m not sure if this is something podreadiness is meant to solve, but it seems like there are a couple of iptables enhancements we might be able to make.