kubernetes: kube-proxy ipvs mode hangs after a few hours, needs manual restart

What happened: upgraded from v1.11.0 to v1.12.2, now kube-proxy hangs, usually in less than a day

What you expected to happen: it should continue working

How to reproduce it (as minimally and precisely as possible): just have a normal kube-proxy running in ipvs mode. It gets stuck usually in less than a day.

Anything else we need to know?:

I’ve sent SIGABRT to the kube-proxy process to get a stack trace, and the only thing interesting is that when the bug happens goroutine 1 is reading from netlink, whereas when I send SIGABRT and the process is not stuck it is not reading from netlink.

Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: goroutine 1 [syscall]:
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.Syscall6(0x2d, 0x3, 0xc42081c000, 0x1000, 0x0, 0xc42083deb0, 0xc42083dea4, 0x40fb86, 0x7f2ab9788aa8, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.recvfrom(0x3, 0xc42081c000, 0x1000, 0x1000, 0x0, 0xc42083deb0, 0xc42083dea4, 0x101ffffffffffff, 0x0, 0x1000)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/zsyscall_linux_amd64.go:1665 +0xa6
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: syscall.Recvfrom(0x3, 0xc42081c000, 0x1000, 0x1000, 0x0, 0x1000, 0x0, 0x0, 0x16ad4c0, 0x1f4d508)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /usr/local/go/src/syscall/syscall_unix.go:252 +0xaf
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xc4204d3440, 0x0, 0x0, 0x0, 0x16ad4c0, 0x1f4d508)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:613 +0x9b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.execute(0xc4204d3440, 0xc42083e150, 0x0, 0x1, 0x2, 0xc420423740, 0x1, 0x2)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:219 +0xd3
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).doCmdwithResponse(0xc420573990, 0xc420990750, 0x0, 0xc42003d501, 0xa, 0x10, 0xc420990701, 0xc42083e230, 0x118dd6b)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:140 +0x20b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).doCmd(0xc420573990, 0xc420990750, 0x0, 0x1, 0x60000020fc6e0, 0xc420990750)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/netlink.go:149 +0x48
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs.(*Handle).NewService(0xc420573990, 0xc420990750, 0x0, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/docker/libnetwork/ipvs/ipvs.go:117 +0x43
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/ipvs.(*runner).AddVirtualServer(0xc42052a340, 0xc420b0e550, 0x1e, 0xc42083e3b8)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/ipvs/ipvs_linux.go:61 +0x60
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).syncService(0xc4201b3e00, 0xc420d53620, 0x21, 0xc420b0e550, 0xc420357901, 0x5, 0xc420d53620)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:1454 +0x635
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).syncProxyRules(0xc4201b3e00)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:803 +0x1081
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).(k8s.io/kubernetes/pkg/proxy/ipvs.syncProxyRules)-fm()
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:395 +0x2a
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).tryRun(0xc42042e3f0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:217 +0xb6
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/util/async.(*BoundedFrequencyRunner).Loop(0xc42042e3f0, 0xc4200b0120)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/util/async/bounded_frequency_runner.go:179 +0x211
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/pkg/proxy/ipvs.(*Proxier).SyncLoop(0xc4201b3e00)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/proxy/ipvs/proxier.go:589 +0x4e
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.(*ProxyServer).Run(0xc4207f2000, 0xc4207f2000, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:568 +0x3dd
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.(*Options).Run(0xc4201082c0, 0xc42063e230, 0x0)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:238 +0x69
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/cmd/kube-proxy/app.NewProxyCommand.func1(0xc4200f2500, 0xc42063e230, 0x0, 0x7)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kube-proxy/app/server.go:360 +0x156
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0xc4200f2500, 0xc4200c4010, 0x7, 0x7, 0xc4200f2500, 0xc4200c4010)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:760 +0x2c1
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc4200f2500, 0xc4205dc090, 0x16076a0, 0xc4207a7ee8)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:846 +0x30a
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(0xc4200f2500, 0x1608b88, 0x20dcf40)
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]:         /workspace/anago-v1.12.2-beta.0.59+17c77c78982180/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:794 +0x2b
Nov 15 11:42:05 hex-48b-pm kube-proxy[5389]: main.main()

The last lines that it logged were:

Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392942    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392952    5389 graceful_termination.go:160] Trying to delete rs: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: E1115 07:48:52.392978    5389 graceful_termination.go:89] Try delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432" err: Failed to delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432", can't 
find the real server
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: I1115 07:48:52.392985    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:48:52 hex-48b-pm kube-proxy[5389]: E1115 07:48:52.392992    5389 graceful_termination.go:183] Try flush graceful termination list err
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393147    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393260    5389 graceful_termination.go:89] Try delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125" err: Failed to delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393272    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393281    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393310    5389 graceful_termination.go:89] Try delete rs "10.128.33.142:6379/TCP/10.125.6.13:6379" err: Failed to delete rs "10.128.33.142:6379/TCP/10.125.6.13:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393317    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393327    5389 graceful_termination.go:160] Trying to delete rs: 10.128.34.237:6379/TCP/10.125.6.11:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393355    5389 graceful_termination.go:89] Try delete rs "10.128.34.237:6379/TCP/10.125.6.11:6379" err: Failed to delete rs "10.128.34.237:6379/TCP/10.125.6.11:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393362    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.34.237:6379/TCP/10.125.6.11:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393371    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.200:5432/TCP/10.125.6.27:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393397    5389 graceful_termination.go:89] Try delete rs "10.128.33.200:5432/TCP/10.125.6.27:6432" err: Failed to delete rs "10.128.33.200:5432/TCP/10.125.6.27:6432", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393404    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.200:5432/TCP/10.125.6.27:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393412    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.228:6379/TCP/10.125.2.63:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393440    5389 graceful_termination.go:89] Try delete rs "10.128.33.228:6379/TCP/10.125.2.63:6379" err: Failed to delete rs "10.128.33.228:6379/TCP/10.125.2.63:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393447    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.228:6379/TCP/10.125.2.63:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393455    5389 graceful_termination.go:160] Trying to delete rs: 10.128.34.247:5672/TCP/10.125.2.6:5672
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393482    5389 graceful_termination.go:89] Try delete rs "10.128.34.247:5672/TCP/10.125.2.6:5672" err: Failed to delete rs "10.128.34.247:5672/TCP/10.125.2.6:5672", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393489    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.34.247:5672/TCP/10.125.2.6:5672
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393497    5389 graceful_termination.go:160] Trying to delete rs: 10.128.35.47:4007/TCP/10.125.2.15:4007
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393523    5389 graceful_termination.go:89] Try delete rs "10.128.35.47:4007/TCP/10.125.2.15:4007" err: Failed to delete rs "10.128.35.47:4007/TCP/10.125.2.15:4007", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393530    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.35.47:4007/TCP/10.125.2.15:4007
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393538    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393563    5389 graceful_termination.go:89] Try delete rs "10.128.33.241:4059/TCP/10.125.2.26:4059" err: Failed to delete rs "10.128.33.241:4059/TCP/10.125.2.26:4059", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393571    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.241:4059/TCP/10.125.2.26:4059
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393578    5389 graceful_termination.go:160] Trying to delete rs: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393605    5389 graceful_termination.go:89] Try delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432" err: Failed to delete rs "10.128.32.132:5432/TCP/10.125.2.13:6432", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393613    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.32.132:5432/TCP/10.125.2.13:6432
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393621    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.252:6379/TCP/10.125.6.26:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393648    5389 graceful_termination.go:89] Try delete rs "10.128.33.252:6379/TCP/10.125.6.26:6379" err: Failed to delete rs "10.128.33.252:6379/TCP/10.125.6.26:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393656    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.252:6379/TCP/10.125.6.26:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393664    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393690    5389 graceful_termination.go:89] Try delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379" err: Failed to delete rs "10.128.33.237:6379/TCP/10.125.6.34:6379", can't find the real server
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: I1115 07:49:52.393697    5389 graceful_termination.go:93] lw: remote out of the list: 10.128.33.237:6379/TCP/10.125.6.34:6379
Nov 15 07:49:52 hex-48b-pm kube-proxy[5389]: E1115 07:49:52.393704    5389 graceful_termination.go:183] Try flush graceful termination list err
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: I1115 07:50:52.393838    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.156:8125/TCP/10.125.2.21:8125
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: E1115 07:50:52.393940    5389 graceful_termination.go:89] Try delete rs "10.128.33.156:8125/TCP/10.125.2.21:8125" err: device or resource busy
Nov 15 07:50:52 hex-48b-pm kube-proxy[5389]: I1115 07:50:52.393957    5389 graceful_termination.go:160] Trying to delete rs: 10.128.33.142:6379/TCP/10.125.6.13:6379
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420734    5389 proxier.go:1496] Failed to list IPVS destinations, error: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420763    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.33.156:8125/TCP, err: invalid argument
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420889    5389 proxier.go:1485] Failed to get IPVS service, error: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420903    5389 proxier.go:809] Failed to sync endpoint for service: 10.128.35.150:6379/TCP, err: Expected only one service obtained=0
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420950    5389 proxier.go:1455] Failed to add IPVS service "monitoring/extradata-inserter:": file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.420961    5389 proxier.go:812] Failed to sync service: 10.128.34.54:80/TCP, err: file exists
Nov 15 07:51:22 hex-48b-pm kube-proxy[5389]: E1115 07:51:22.421886    5389 proxier.go:1544] Failed to add destination: 10.125.27.66:80, error: file exists

If I restart kube-proxy (manually, ugh), it all recovers and works fine. But it will get stuck again, eventually.

Environment:

/kind bug

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 1
  • Comments: 142 (70 by maintainers)

Commits related to this issue

Most upvoted comments

The PR has been merged in all release branches, so it will be present in:

  • 1.11.7
  • 1.12.5
  • 1.13.2

Giving it a go today. Will report results, if any.

The issue is graceful termination related and I think we have fixed the bug in v1.13.1.

Hi, all. Thanks for the report, I’ll make a research and try to figure out what happened, any progress I’ll report here.

@m1093782566 , @thockin I think we can close this issue (no new reports in about 2 months)

For reference, bug impacts initial releases with graceful termination:

  • 1.11.5 - 1.11.6
  • 1.12.2 - 1.12.4
  • 1.13.0 - 1.13.1

Fix is present after:

  • 1.11.7
  • 1.12.5
  • 1.13.2

I’ve been refreshing the releases pages pretty much every hour for the last couple of days in the hope 1.12.5 will appear so i can patch this bug on multiple clusters. 😃

Its here 😛

Why is it not there for 1.11? 😦

I’ve been refreshing the releases pages pretty much every hour for the last couple of days in the hope 1.12.5 will appear so i can patch this bug on multiple clusters. 😃

Update: after 24 hours, still no issues, I’m stopping my experiment now & consider this a full success. Looking forward to 1.12.5!

No issues after 12+ hours on 1.12.4 with https://github.com/DataDog/kubernetes/commit/9121f0122d805f55f85f42d26c294898fe29a92f – without that I got a lockup in 1.5 hours, so this looks very good!

Ok let’s wait until monday so we also get feedback from @matthiasr and @bjornryden (I want to avoid rushing the cherry-picks and make sure that we have a clean fix this time)

We haven’t had any issues with accessing services while kube-proxy was(and is) running in the IPVS mode and we’ve had quite a few deployments and changes to the services. So far I’d say the patch is working well for us.

@lbernail Great job on fixing it! Much appreciated!

@lbernail my pleasure! So it’s been about two days since I launched kube-proxy with the 1.11.7-beta.1 image. It hasn’t hung on us yet. I can see proxier.go and other goroutines are still alive and printing to the logs. I’ll give it some more time since the last couple of days were really slow and quiet, not a lot of deployments going on. Either way, I have a feeling the mutexes that @lbernail added are helping 😃

I’ll update this issue/thread by the end of the week.

@emptywee Yes it confirms the hypothesis that we have a deadlock between the two goroutines sharing the netlink socket. They are both stuck waiting to receive messages and keep retrying without success (due to the 3s receive timeout).

I’ll try to validate the hypothesis and will start working on a fix as soon as I can

@emptywee here you go: lbernail/hyperkube:v1.11.6-beta.1

@emptywee I created a cherry-pick for 1.11: https://github.com/kubernetes/kubernetes/pull/71848 Hopefully it will get merged very soon and will be in 1.11.6

It has already been merged in 1.12 (will be in 1.12.4) and 1.13 (will be in 1.13.1)

Yes I forgot to add “IPVS” in the release-note in the PR, sorry about that Let us know if it works ok now! (we’ll leave the issue open for a weeks just in case)

Yes it is https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.11.md#v1117:

Fix race condition introduced by graceful termination which can lead to a deadlock in kube-proxy (#72361, @lbernail)

@daigong : the full fix will actually be in 1.13.2 (it depends on how fast #72426 gets merged)

@daigong v1.13.1 have fixed this. In my case, no issues after more than one week.

@VincentFF : there are two issues:

  • failed deletions: fixed in 1.12.4
  • kube-proxy getting stuck due to a deadlock (which is the issue you are describing): hopefully fixed with the PR from yesterday. We are waiting for feedback from @emptywee and @bjornryden who are currently testing the patch. As soon as we have confirmation that the issue is not present anymore I’ll create the cherrypicks for 1.13, 1.12 and 1.11 (if the patch works, it will probably be merged in all branches at the beginning of next week and will be part of the next patch releases)

@li-sen definitely a bug. We are working on it

@emptywee and @bjornryden thanks a lot for testing!

@emptywee : here you go : lbernail/hyperkube:v1.12.4-beta.1 Let us know how it works for you!

These are just information logs:

  • Trying to delete rs : a backend pod from removed, trying to delete it. If it has existing connections, make it enter graceful termination, otherwise delete id
  • Deleting rs : the backend has no connection, removing it from IPVS

@andrewsykim Maybe we could now decrease the level of these logs so they only show above v=5 for instance ?

We’ve had no issues since upgrading to 1.12.5 🎉

any chance for a new build in v1.11.x ?

I build it and deploy to test env, over 24 hours have no issues. Happy new year.

I can join the choir of happy testers. 4 days of no trouble happening.

Thanks a lot to all involved, especially @lbernail. Happy New Year!

@VincentFF Hopefully it will make it to 1.12.5 yes. You can run kube-proxy 1.12.5 with other units running 1.12.3, no problem

I’m also re-running my experiment https://github.com/kubernetes/kubernetes/issues/71071#issuecomment-449402484 again, with the patch back-ported onto 1.12.4. Will report back!

/reopen (to keep an eye on feedback from tests and before creating cherry-picks if everything works ok)

I’ve also upgraded to the beta version and starting our load tests again shortly. We usually break quickly (less than a day) and quite consistently… fingers crossed.

I have a potential fix for the bug, the code is available here: https://github.com/DataDog/kubernetes/commit/9121f0122d805f55f85f42d26c294898fe29a92f The idea is to protect all netlink calls with a Mutex to avoid the deadlock. I’m not sure this is the best solution but it could solve the issue (and if it works confirm that we’ve identified the bug).

If you want/can test it, I created two images with the fix:

  • lbernail/hyperkube:v1.11.7-beta.1
  • lbernail/hyperkube:v1.12.5-beta.1

If you test, let us know if it works and if it doesn’t, please share interesting logs and look at the goroutines to see if some seem stuck in netlink calls.

(I did some testing on my side but limited to a small cluster)

Next time, try sending SIGABRT to the process, so that it prints tasks states.

When I had the problem, it was always doing something inside libnetwork/ipvs/netlink.go, which leads me to suspect that the netlink system call gets stuck for some reason.

@m1093782566 @lbernail After one night obversation, the ipvs sync Error log doesn’t seen anymore, while
it occurs at least once one night before. But still need more time to verify it, thanks a lot, it made me annoyed long time.

@emptywee yes it should work without any problem (I have been running kube-proxy 1.11 on 1.10 clusters for months)

I can also do a 1.11 build later today

The cherry-pick just got merged in 1.11 too We just need to wait for the next patch releases in the 3 branches (but if you want to test it before I can easily create a custom kube-proxy build (or, even easier, an hyperkube docker image with the patch)