weave: weave-net hangs - leaking IP addresses?

What you expected to happen?

/status (and the other endpoints) to work

What happened?

/status (and all other endpoints) hang

Issuance of IP addresses to pods also seemed to stop on the affected nodes
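A quick way to confirm the hang on a suspect node is to probe the router's local API with a timeout, so a hung endpoint fails fast instead of blocking the shell (a sketch; the 5-second limit is an arbitrary choice):

# Run on an affected node: the weave router API listens on 127.0.0.1:6784.
curl --max-time 5 http://127.0.0.1:6784/status || echo "weave /status did not respond"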

How to reproduce it?

Unknown. Happened on multiple nodes in our cluster

Anything else we need to know?

Versions:

$ weave --local version
weave 2.5.0
$ docker version
$ uname -a
Linux kubernetes-kubernetes-cr0-17-1547767704 4.14.67-coreos #1 SMP Mon Sep 10 23:14:26 UTC 2018 x86_64 Linux
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.10", GitCommit:"098570796b32895c38a9a1c9286425fb1ececa18", GitTreeState:"clean", BuildDate:"2018-08-02T17:19:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.10", GitCommit:"098570796b32895c38a9a1c9286425fb1ececa18", GitTreeState:"clean", BuildDate:"2018-08-02T17:11:51Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

If using Kubernetes:

$ kubectl logs -n kube-system <weave-net-pod> weave

(After triggering a kill -ABRT)

weave-logs-1548156065-weave-net-4mdf6.txt
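For reference, a sketch of how a dump like that can be captured; the pod name is the one above, and it assumes the router process inside the weave container is called weaver and that pidof is available in the image:

# Send SIGABRT to the weave router so the Go runtime prints a full goroutine dump,
# then capture the log of the crashed container (it restarts, hence --previous).
kubectl exec -n kube-system weave-net-4mdf6 -c weave -- sh -c 'kill -ABRT $(pidof weaver)'
kubectl logs -n kube-system weave-net-4mdf6 -c weave --previous > weave-goroutine-dump.txt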

Network:

$ ip route
$ ip -4 -o addr
$ sudo iptables-save

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 29 (15 by maintainers)

Commits related to this issue

Most upvoted comments

Yes, that worked. If I return the error from the CNI plugin, then kubelet's retry fixes the leak. Kubelet does seem to be stateful: even if I restart kubelet, it re-attempts the CNI DEL. It seems we could rely on kubelet instead of implementing a PruneOwned that works for non-Docker runtimes.

Generally we do see kubelet call CNI to remove pods on restart, even dead ones. I wonder if, given crashes, restarts, etc., Weave Net can get more and more out of sync over time.

I am able to reproduce the leaking-IP scenario easily.

So I run kubectl delete pods weave-net-bkxln -n kube-system; kubectl delete pods frontend-69859f6796-nq8d6 for two pods running on the same node.
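Spelled out as a sketch (the node and pod names are from my setup and are just placeholders):

# 1. Find the weave-net pod and an application pod scheduled on the same node.
kubectl get pods --all-namespaces -o wide | grep weave-node2
# 2. Delete both, so the CNI DEL for the application pod runs while the weave
#    router on that node is down.
kubectl delete pods weave-net-bkxln -n kube-system
kubectl delete pods frontend-69859f6796-nq8d6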

The kubelet/CNI interaction fails as expected:

May 01 10:13:47 weave-node2 kubelet[1674]: weave-cni: error removing interface "eth0": Link not found
May 01 10:13:47 weave-node2 kubelet[1674]: weave-cni: unable to release IP address: Delete http://127.0.0.1:6784/ip/8bc16eac402e41046ab40817a4ebd4aa00156824a6156b0c6c5d33b5d3389abb: dial tcp 127.0.0.1:6784: connect: connection refused
May 01 10:13:47 weave-node2 kubelet[1674]: E0501 10:13:47.465731    1674 remote_runtime.go:109] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to set up sandbox container "8bc16eac402e41046ab40817a4ebd4aa00156824a6156b0c6c5d33b5d3389abb" network 
May 01 10:13:47 weave-node2 kubelet[1674]: E0501 10:13:47.465764    1674 kuberuntime_sandbox.go:68] CreatePodSandbox for pod "frontend-69859f6796-ps54z_default(b44a7622-6bcb-11e9-ab4a-08002737ffe1)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "8bc16e
May 01 10:13:47 weave-node2 kubelet[1674]: E0501 10:13:47.465781    1674 kuberuntime_manager.go:693] createPodSandbox for pod "frontend-69859f6796-ps54z_default(b44a7622-6bcb-11e9-ab4a-08002737ffe1)" failed: rpc error: code = Unknown desc = failed to set up sandbox container "8bc16
May 01 10:13:47 weave-node2 kubelet[1674]: E0501 10:13:47.465819    1674 pod_workers.go:190] Error syncing pod b44a7622-6bcb-11e9-ab4a-08002737ffe1 ("frontend-69859f6796-ps54z_default(b44a7622-6bcb-11e9-ab4a-08002737ffe1)"), skipping: failed to "CreatePodSandbox" for "frontend-6985
May 01 10:13:47 weave-node2 kubelet[1674]: E0501 10:13:47.716332    1674 kubelet_pods.go:1093] Failed killing the pod "frontend-69859f6796-nq8d6": failed to "KillContainer" for "php-redis" with KillContainerError: "rpc error: code = Unknown desc = Error: No such container: 835adc9c
May 01 10:13:48 weave-node2 kubelet[1674]: W0501 10:13:48.210269    1674 cni.go:309] CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "8bc16eac402e41046ab40817a4ebd4aa00156824a6156b0c6c5d33b5d3389abb"
May 01 10:13:48 weave-node2 kubelet[1674]: weave-cni: unable to release IP address: 400 Bad Request: Delete: no addresses for 8bc16eac402e41046ab40817a4ebd4aa00156824a6156b0c6c5d33b5d3389abb

There is no further attempt from kubelet to retry the CNI DEL command, which leaks the IP.

So when the weave-net pod has crashed or is unresponsive, IPs are leaked if pods are deleted.
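A leaked allocation can also be released by hand through the same endpoint the CNI plugin calls in the kubelet log above; a sketch, using the container ID from that log as a placeholder:

# Run on the node that owns the leaked allocation, once the weave router is reachable again.
# DELETE /ip/<container-id> is the same call weave-cni makes to release an address.
CONTAINER_ID=8bc16eac402e41046ab40817a4ebd4aa00156824a6156b0c6c5d33b5d3389abb
curl -X DELETE "http://127.0.0.1:6784/ip/${CONTAINER_ID}"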

For info, we’ve ended up writing and deploying this: https://github.com/ocadotechnology/weave-wiper

The IP address leak is continuing; we are leaking around 300 addresses per day. (screenshot from 2019-02-04 17-38-07)