flannel: kube-flannel: host->remote-pod networking does not work on first node

I’ve been trying to use kube-flannel as part of bootkube, but I’m still seeing an issue on bootstrap.

What I’m seeing is that on some nodes, host-network --> remote-pod traffic (seemingly) hits some kind of race condition when bootstrapping with kube-flannel.

Pod-to-pod traffic (on the same node and to remote nodes) works fine; this only affects host --> remote pod, and it starts working after restarting kube-flannel on the remote node.

Repro steps:

Launch a cluster:

go get github.com/kubernetes-incubator/bootkube
cd $GOPATH/src/github.com/kubernetes-incubator/bootkube
git remote add aaronlevy https://github.com/aaronlevy/bootkube
git fetch aaronlevy
git checkout -b v1.4.0+flannel aaronlevy/v1.4.0+flannel
make clean all
cd hack/multi-node
./bootkube-up
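
Optionally, confirm both nodes (172.17.4.101 and 172.17.4.201) have registered and are Ready before starting the test workload:

kubectl --kubeconfig=cluster/auth/kubeconfig get nodes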

Start nginx pods for testing:

kubectl --kubeconfig=cluster/auth/kubeconfig run n1 --image=nginx --replicas=2

You should have one nginx pod on each node, but you can verify by checking which node each pod is assigned to:

kubectl --kubeconfig=cluster/auth/kubeconfig get pods -owide

Make note of the podIP for each of the pods above, and which node they are assigned to. E.g.:

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE       IP             NODE
default       n1-1110676790-e0w0c                        1/1       Running   0          12m       10.2.1.2       172.17.4.201
default       n1-1110676790-s3qfi                        1/1       Running   0          12m       10.2.0.5       172.17.4.101
  • The 10.2.0.5 is assigned to the “controller” (c1) node (172.17.4.101)
  • The 10.2.1.2 is assigned to the “worker” (w1) node (172.17.4.201)

Test the routability from each host:

  • note: This seems to fail in different ways (e.g. both sides, or just one side). From what I’ve seen, the most common case is that master (host) --> worker (pod) fails.

From controller node (route to remote pod likely doesn’t work):

vagrant ssh c1
$ curl 10.2.0.5 # same node, should work
$ curl 10.2.1.2 # different node, probably doesn't work

From worker node (all routes should work):

vagrant ssh w1
$ curl 10.2.0.5 # different node, probably works
$ curl 10.2.1.2 # same node, should work
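
To run the whole host --> pod matrix in one pass, a small loop over vagrant ssh works; the pod IPs below are the ones from my run above, so substitute your own:

#!/bin/bash
# Curl each test pod IP from each host and report ok / FAILED.
# Pod IPs come from `kubectl get pods -owide`; adjust to match your cluster.
PODS="10.2.0.5 10.2.1.2"
for SERVER in c1 w1; do
  for POD in $PODS; do
    echo -n "host $SERVER -> pod $POD: "
    vagrant ssh $SERVER -- "curl -s -o /dev/null --max-time 5 http://$POD" && echo ok || echo FAILED
  done
done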

Now to resolve the issue:

Kill the kube-flannel pod on the failing “remote destination” side (in this case, worker)

kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system delete pod kube-flannel-abcde
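
The pod name above is a placeholder; to find the actual kube-flannel pod scheduled on the worker (172.17.4.201), list the kube-system pods with node info:

kubectl --kubeconfig=cluster/auth/kubeconfig --namespace=kube-system get pods -owide | grep kube-flannel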

Wait until the kube-flannel daemonset re-launches a replacement pod, then you should be able to re-do the tests above and all networking should work.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 26 (22 by maintainers)

Most upvoted comments

To add some more details.

I ran an audit script both while broken and after fixing it (by restarting the kube-flannel pod on the worker node):

#!/bin/bash
CMD="ip link show && ip addr && ip route && arp -an && bridge fdb show && ip route show table local && sudo iptables-save"
for SERVER in w1 c1; do echo "Server: $SERVER"; vagrant ssh $SERVER -- "${CMD}"; done
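
For example (assuming the script above is saved as audit.sh; the file names are just illustrative):

./audit.sh > broken.txt      # while host --> remote pod is failing
# delete the kube-flannel pod on w1, wait for the replacement, then:
./audit.sh > working.txt
diff -u broken.txt working.txt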

Then diff-ing the output I saw that:

Broken ip route showed a flannel.1 route with scope link and src 10.2.1.0:

10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0

Working ip route showed a flannel.1 route with global scope and no src:

10.2.0.0/16 dev flannel.1

Broken ip route list table local had two extra broadcast entries:

broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0

So here’s a manual resolution that worked.

On the worker machine:

sudo ip route del 10.2.0.0/16 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route add 10.2.0.0/16 dev flannel.1 scope global
sudo ip route del table local broadcast 10.2.0.0 dev flannel.1  proto kernel  scope link  src 10.2.1.0
sudo ip route del table local broadcast 10.2.255.255 dev flannel.1  proto kernel  scope link  src 10.2.1.0
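
To confirm the fix, re-check the flannel.1 route on the worker and retry the previously failing direction from the controller (expect an HTTP 200 from the nginx pod):

vagrant ssh w1 -- "ip route show dev flannel.1"   # should now show just: 10.2.0.0/16 dev flannel.1
vagrant ssh c1 -- "curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 10.2.1.2"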

Dug deeper, and my hunch now is that after a reboot, the MAC address of the vxlan device has changed. You can see that the MAC address was updated in the Kubernetes node annotations, but the change doesn’t seem to be reflected in bridge fdb show, and the stale MAC address also gets added to the ARP table on L2 misses.
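
One way to compare what the worker advertises against what the controller’s kernel has cached (the flannel.alpha.coreos.com/backend-data annotation is where the kube subnet manager publishes the VTEP MAC in current flannel; the key may differ on this branch, so treat it as illustrative):

# MAC the worker node currently advertises via its node annotation:
kubectl --kubeconfig=cluster/auth/kubeconfig get node 172.17.4.201 -o yaml | grep backend-data
# MAC/ARP entries the controller's kernel actually holds for the worker's vxlan endpoint:
vagrant ssh c1 -- "bridge fdb show dev flannel.1 && arp -an | grep flannel.1"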

So this makes me think flannel is caching the old MAC address and never updating it; but if you restart the flannel process on the node with the stale entries, everything starts working.

My hunch is the issue is here: https://github.com/coreos/flannel/blob/master/subnet/watch.go#L142

We are only comparing existing leases against the subnet, but not against a changed MAC address. We would likely need to extend this to also consider changes to the vxlan MAC address (which should be in the lease attrs).

However, I’m now wondering: if this is the issue, how has this ever worked? The vxlan MAC address will always change, regardless of whether the kube backend is used…

Anyway – enough digging for tonight, I’ll try and look some more tomorrow.