flannel: kube-flannel: host->remote-pod networking does not work on first node
I’ve been trying to use kube-flannel as part of bootkube, but I’m still seeing an issue on bootstrap.
What I’m seeing is that on some nodes, host-network --> remote-pod traffic (seemingly) hits some kind of race condition when bootstrapping with kube-flannel.
Pod-to-pod traffic (both on the same node and to remote nodes) works fine; only host --> remote pod is affected, and it starts working after restarting kube-flannel on the remote node.
Repro steps:
Launch a cluster:
go get github.com/kubernetes-incubator/bootkube
git remote add aaronlevy https://github.com/aaronlevy/bootkube
cd $GOPATH/src/github.com/kubernetes-incubator/bootkube
git fetch aaronlevy
git checkout -b v1.4.0+flannel aaronlevy/v1.4.0+flannel
make clean all
cd hack/multi-node
./bootkube-up
Start nginx pods for testing:
kubectl --kubeconfig=cluster/auth/kubeconfig run n1 --image=nginx --replicas=2
You should have one nginx pod on each node; you can verify by checking which node each pod is assigned to:
kubectl --kubeconfig=cluster/auth/kubeconfig get pods -owide
Make note of the podIP for each of the pods above, and which node they are assigned to. E.g.:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE
default n1-1110676790-e0w0c 1/1 Running 0 12m 10.2.1.2 172.17.4.201
default n1-1110676790-s3qfi 1/1 Running 0 12m 10.2.0.5 172.17.4.101
- The 10.2.0.5 is assigned to the “controller” (c1) node (172.17.4.101)
- The 10.2.1.2 is assigned to the “worker” (w1) node (172.17.4.201)
Test the routability from each host:
- note: This seems to fail in different ways at times (e.g. both directions, or just one). The most common case I’ve seen is the master (host) --> worker (pod) path failing.
From controller node (route to remote pod likely doesn’t work):
vagrant ssh c1
$ curl 10.2.0.5 # same node, should work
$ curl 10.2.1.2 # different node, probably doesn't work
From worker node (all routes should work):
vagrant ssh w1
$ curl 10.2.0.5 # different node, probably works
$ curl 10.2.1.2 # same node, should work
Now to resolve the issue:
Kill the kube-flannel pod on the failing “remote destination” side (in this case, worker)
kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system delete pod kube-flannel-abcde
Wait until the kube-flannel daemonset re-launches a replacement pod; then you should be able to re-do the tests above and all networking should work.
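For reference, a minimal sketch of watching for the replacement pod and re-running the failing test (standard kubectl --watch and vagrant ssh -c usage; the pod IP is the worker pod from the example above):
# Watch until the daemonset has rescheduled the pod and it is Running (Ctrl-C to stop)
kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system get pods -w
# Then retry the host -> remote-pod path that was failing
vagrant ssh c1 -c "curl 10.2.1.2"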
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Comments: 26 (22 by maintainers)
To add some more details.
I ran an audit script while broken and again after the fix (restarting the kube-flannel pod on the worker node).
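The audit script itself isn’t included here, but a minimal sketch of the kind of state worth capturing on the worker in both the broken and fixed states (assumed commands, not the original script):
# Capture routing, vxlan fdb, and neighbor state, then diff broken vs. fixed
ip route                        # main routing table (flannel.1 routes)
ip route list table local       # local/broadcast entries
bridge fdb show dev flannel.1   # vxlan forwarding database (MAC -> VTEP)
ip neigh show dev flannel.1     # ARP/neighbor entries on the vxlan device
ip -d link show flannel.1       # device details, including its MAC address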
Then diffing the output I saw that:
- Broken: ip route showed a flannel.1 route with scope link and src 10.2.1.0
- Working: ip route showed a flannel.1 route with global scope and no src
- Broken: ip route list table local has two extra broadcast entries

So a manual resolution that worked, on the worker machine:
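The exact commands aren’t captured above, so the following is only a hedged sketch of what the diff implies - restoring the flannel.1 route to global scope with no src, and removing the extra local-table broadcast entries (prefixes and addresses are placeholders):
# Hypothetical reconstruction - adjust prefixes/addresses to match your own diff output
ip route replace 10.2.0.0/16 dev flannel.1 scope global
ip route del table local broadcast 10.2.1.0 dev flannel.1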
Dug deeper, and my hunch now is that after a reboot, the MAC address for the vxlan device has changed. You can see that the MAC address was updated in the kubernetes node annotations, but it doesn’t seem to be reflected in
bridge fdb show
and the stale MAC address also gets added to the ARP table on L2 misses.
So this makes me think flannel is caching the old MAC address and never updating it - but if you restart the flannel process on the node with the stale entries, everything starts working.
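A sketch of how to compare the published and in-use MAC addresses, assuming the flannel.alpha.coreos.com node annotations written by flannel’s kube subnet manager (verify the key names against your flannel version):
# MAC addresses flannel published for each node via annotations
kubectl --kubeconfig=cluster/auth/kubeconfig get nodes -o yaml | grep flannel.alpha.coreos.com
# On the node suspected of holding stale entries: what the vxlan fdb and neighbor table actually contain
bridge fdb show dev flannel.1
ip neigh show dev flannel.1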
My hunch is the issue is here: https://github.com/coreos/flannel/blob/master/subnet/watch.go#L142
We are only comparing existing leases against the subnet - but not against a changed MAC address. We would likely need to extend this to also consider changes to the vxlan MAC address (which should be in the lease attrs).
However, I’m now wondering: if this is the issue, how has this ever worked? The vxlan MAC address will always change regardless of using the kube backend…
Anyway – enough digging for tonight, I’ll try and look some more tomorrow.