kubernetes: kube-proxy ipvs mode cannot access clusterip:port after node reboot
What happened:
Because of the known issue https://github.com/kubernetes/kubernetes/issues/71071, we updated kube-proxy to 1.12.5 while running Kubernetes 1.12.3. That did solve the problem of kube-proxy getting stuck after the cluster had been running normally for a while. However, when a node reboots, we sometimes cannot access any clusterip:port. It does not happen every time, but it has happened in many clusters. If I manually restart kube-proxy, everything recovers and works fine until the node reboots again.
For example:
kubectl get svc | grep kubernetes
kubernetes ClusterIP 192.168.0.1 <none> 443/TCP 51d
curl 192.168.0.1:443
shows
Failed connect to 192.168.0.1:443; Connection refused
However, curl {masterip}:6443 works fine.
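For reference, restarting kube-proxy on the affected node is the manual workaround mentioned above. A minimal sketch, assuming kube-proxy is deployed as a DaemonSet in kube-system (as kubespray does); the pod name below is a placeholder:
# find the kube-proxy pod running on the rebooted node
kubectl -n kube-system get pods -o wide | grep kube-proxy
# delete it; the DaemonSet recreates the pod and it resyncs the IPVS rules
kubectl -n kube-system delete pod kube-proxy-xxxxx
# once the new pod is Running, the ClusterIP answers again
curl 192.168.0.1:443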
Detailed information:
kube-proxy in ipvs mode
lsmod | grep -e ipvs -e nf_conntrack_ipv4
output:
nf_conntrack_ipv4 16384 261
nf_defrag_ipv4 16384 1 nf_conntrack_ipv4
nf_conntrack 135168 11 xt_conntrack,nf_nat_masquerade_ipv4,nf_conntrack_ipv6,nf_conntrack_ipv4,nf_nat,nf_nat_ipv6,ipt_MASQUERADE,nf_nat_ipv4,xt_nat,nf_conntrack_netlink,ip_vs
cut -f1 -d " " /proc/modules | grep -e ip_vs -e nf_conntrack_ipv4
output:
nf_conntrack_ipv4
ip_vs_sh
ip_vs_wrr
ip_vs_rr
ip_vs
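One thing worth ruling out on a rebooted node is whether these modules are available early enough at boot. A minimal sketch for pinning them with systemd-modules-load on CentOS 7 (the file name is arbitrary, and in normal operation kube-proxy loads these modules itself):
# persist the IPVS-related modules across reboots
cat <<'EOF' > /etc/modules-load.d/ipvs.conf
ip_vs
ip_vs_rr
ip_vs_wrr
ip_vs_sh
nf_conntrack_ipv4
EOF
systemctl restart systemd-modules-load
lsmod | grep -e ip_vs -e nf_conntrack_ipv4   # confirm they are loaded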
The clusterip:port does exist in the IPVS table:
ipvsadm -ln |grep 192.168.0.1 -C 2
output:
TCP {masterip}:21177 rr
TCP {masterip}:25598 rr
TCP 192.168.0.1:443 rr
-> {masterip}:6443 Masq 1 10 0
TCP 192.168.0.3:53 rr
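Since the virtual server is present yet connections are refused, another check that may be worth doing (an assumption on my part, not something captured in this report) is whether the ClusterIP is still bound to the kube-ipvs0 dummy interface and present in the ipsets that kube-proxy maintains in IPVS mode:
# kube-proxy in IPVS mode binds every ClusterIP to the kube-ipvs0 dummy device
ip addr show dev kube-ipvs0 | grep 192.168.0.1
# the ClusterIP should also appear in the KUBE-CLUSTER-IP ipset
ipset list KUBE-CLUSTER-IP | grep 192.168.0.1
# and the KUBE-SERVICES chain kube-proxy installs should exist in the nat table
iptables -t nat -L KUBE-SERVICES -n | head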
kube-proxy logs:
server_others.go:189] Using ipvs Proxier.
proxier.go:314] missing br-netfilter module or unset sysctl br-nf-call-iptables; proxy may not work as intended
proxier.go:368] IPVS scheduler not specified, use rr by default
server_others.go:216] Tearing down inactive rules.
server.go:447] Version: v1.12.5
conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 917504
conntrack.go:52] Setting nf_conntrack_max to 917504
conntrack.go:83] Setting conntrack hashsize to 229376
conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
config.go:102] Starting endpoints config controller
controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
config.go:202] Starting service config controller
controller_utils.go:1027] Waiting for caches to sync for service config controller
controller_utils.go:1034] Caches are synced for endpoints config controller
controller_utils.go:1034] Caches are synced for service config controller
graceful_termination.go:160] Trying to delete rs: 192.168.13.138:44134/TCP/192.168.111.49:44134
graceful_termination.go:174] Deleting rs: 192.168.13.138:44134/TCP/192.168.111.49:44134
graceful_termination.go:160] Trying to delete rs: 192.168.0.3:53/TCP/192.168.111.43:53
graceful_termination.go:171] Not deleting, RS 192.168.0.3:53/TCP/192.168.111.43:53: 0 ActiveConn, 1 InactiveConn
graceful_termination.go:160] Trying to delete rs: 192.168.0.3:9153/TCP/192.168.111.43:9153
graceful_termination.go:174] Deleting rs: 192.168.0.3:9153/TCP/192.168.111.43:9153
What you expected to happen: curl {clusterip}:{port} should work.
How to reproduce it (as minimally and precisely as possible): it happens intermittently when a node reboots; it has occurred in many clusters.
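A crude way to try to reproduce it is to reboot the node in a loop and probe the ClusterIP after each boot. A sketch, where NODE is a placeholder for the affected node:
for i in $(seq 1 20); do
  ssh root@NODE reboot || true
  sleep 120   # wait for the node and kube-proxy to come back up
  # exit code 7 (connection refused) reproduces the bug; 0 means the ClusterIP answered
  ssh root@NODE "curl -m 5 -sS -o /dev/null 192.168.0.1:443; echo attempt $i exit \$?"
done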
Environment:
- Kubernetes version: 1.12.3 (only kube-proxy is at 1.12.5)
- OS: centos 7.5.1804
- Kernel: 4.17.11-1
- Install tools: kubespray
- Network plugin and version (if this is a network-related bug): calico v3.1.3
@zh168654
Unfortunately, this is not the case for me. I recently upgraded my cluster to v1.15.9 and the issue still exists.