flannel: Flannel fails to communicate between pods after node reboot

No inter-pod communication works after nodes are restarted. Docker on each node has to be manually stopped and started to restore it.

Expected Behavior

Pods should be able to communicate with each other after a node restart.

Current Behavior

DNS and all other connections time out when trying to reach other pods.

Possible Solution

Not sure, that’s why I’m here!

Steps to Reproduce (for bugs)

Full steps from fresh Ubuntu install and details are here: https://github.com/kubernetes/kubernetes/issues/104645 but TL;DR:

1. Install Flannel.
2. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default (it works).
3. Restart the node.
4. Run kubectl exec -i -t dnsutils -- nslookup kubernetes.default in the pod on the node that restarted. It fails with: ;; connection timed out; no servers could be reached
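
The nslookup check in the steps above can be scripted so it is easy to run before and after a reboot. This is only a minimal sketch: the function name dns_lookup_ok is hypothetical (not part of kubectl or Flannel), and it assumes a pod named dnsutils as in the steps above. It treats the lookup as failed when the command exits non-zero or prints the timeout message quoted above.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the in-pod DNS lookup succeeds, 1 otherwise.
# Assumes a pod named "dnsutils" (as in the steps above) unless a name is given.
dns_lookup_ok() {
    pod="${1:-dnsutils}"
    # -i/-t are dropped because a script does not need an interactive terminal.
    if ! out=$(kubectl exec "$pod" -- nslookup kubernetes.default 2>&1); then
        return 1
    fi
    # nslookup prints this message when no DNS server answers.
    case "$out" in
        *"connection timed out"*) return 1 ;;
    esac
    return 0
}

# Usage (run once before and once after rebooting the node):
#   dns_lookup_ok dnsutils && echo "DNS OK" || echo "DNS broken"
```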

Context

New to Kubernetes and this was really annoying to figure out. Went down so many wrong paths. Took ages to figure out what was going on. Learned a lot though. I have tried these solutions with no success:

Flannel logs (See line entry: I0828 09:00:22.327495):

# (Comment: This log is from the flannel pod on the restarted node, where the dnsutils pod is running).
$ kubectl logs kube-flannel-ds-zv7nf -n kube-system
I0828 09:00:20.802275       1 main.go:520] Determining IP address of default interface
I0828 09:00:20.803003       1 main.go:533] Using interface with name enp3s0 and address 10.7.60.12
I0828 09:00:20.803045       1 main.go:550] Defaulting external address to interface address (10.7.60.12)
W0828 09:00:20.804272       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 09:00:21.236919       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 09:00:21.237044       1 kube.go:299] Starting kube subnet manager
I0828 09:00:22.237113       1 kube.go:123] Node controller sync successful
I0828 09:00:22.237173       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-002
I0828 09:00:22.237186       1 main.go:257] Installing signal handlers
I0828 09:00:22.237447       1 main.go:392] Found network config - Backend type: vxlan
I0828 09:00:22.237559       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 09:00:22.327495       1 main.go:357] Current network or subnet (10.244.0.0/16, 10.244.2.0/24) is not equal to previous one (0.0.0.0/0, 0.0.0.0/0), trying to recycle old iptables rules
I0828 09:00:22.804776       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 -d 0.0.0.0/0 -j RETURN
I0828 09:00:22.806656       1 iptables.go:172] Deleting iptables rule: -s 0.0.0.0/0 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.131186       1 main.go:307] Setting up masking rules
I0828 09:00:23.133140       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 09:00:23.133319       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 09:00:23.133340       1 main.go:327] Running backend.
I0828 09:00:23.133362       1 main.go:345] Waiting for all goroutines to exit
I0828 09:00:23.133393       1 vxlan_network.go:59] watching for new subnet leases
I0828 09:00:23.200217       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200246       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.200581       1 iptables.go:148] Some iptables rules are missing; deleting and recreating rules
I0828 09:00:23.200738       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.202487       1 iptables.go:172] Deleting iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.299811       1 iptables.go:172] Deleting iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.302004       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.302092       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.397339       1 iptables.go:160] Adding iptables rule: -d 10.244.0.0/16 -j ACCEPT
I0828 09:00:23.397598       1 iptables.go:172] Deleting iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully
I0828 09:00:23.399463       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 -d 10.244.0.0/16 -j RETURN
I0828 09:00:23.499174       1 iptables.go:160] Adding iptables rule: -s 10.244.0.0/16 ! -d 224.0.0.0/4 -j MASQUERADE --random-fully
I0828 09:00:23.502667       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.2.0/24 -j RETURN
I0828 09:00:23.599067       1 iptables.go:160] Adding iptables rule: ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully


# (Comment: This is from the node that wasn't restarted and is working fine. Restarting it breaks DNS queries though, and the cluster then needs to be reinstalled)
$ kubectl logs kube-flannel-ds-j5n42 -n kube-system
I0828 08:54:55.315700       1 main.go:520] Determining IP address of default interface
I0828 08:54:55.316066       1 main.go:533] Using interface with name eno1 and address 10.7.60.11
I0828 08:54:55.316084       1 main.go:550] Defaulting external address to interface address (10.7.60.11)
W0828 08:54:55.316103       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0828 08:54:55.416101       1 kube.go:116] Waiting 10m0s for node controller to sync
I0828 08:54:55.416149       1 kube.go:299] Starting kube subnet manager
I0828 08:54:56.416389       1 kube.go:123] Node controller sync successful
I0828 08:54:56.416436       1 main.go:254] Created subnet manager: Kubernetes Subnet Manager - k-w-001
I0828 08:54:56.416464       1 main.go:257] Installing signal handlers
I0828 08:54:56.416673       1 main.go:392] Found network config - Backend type: vxlan
I0828 08:54:56.416733       1 vxlan.go:123] VXLAN config: VNI=1 Port=0 GBP=false Learning=false DirectRouting=false
I0828 08:54:56.443901       1 main.go:307] Setting up masking rules
I0828 08:54:56.719917       1 main.go:315] Changing default FORWARD chain policy to ACCEPT
I0828 08:54:56.720021       1 main.go:323] Wrote subnet file to /run/flannel/subnet.env
I0828 08:54:56.720035       1 main.go:327] Running backend.
I0828 08:54:56.720047       1 main.go:345] Waiting for all goroutines to exit
I0828 08:54:56.720072       1 vxlan_network.go:59] watching for new subnet leases
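
The two logs differ mainly in the entry highlighted above: the restarted node reports a previous network of (0.0.0.0/0, 0.0.0.0/0) and recycles iptables rules, while the node that was not restarted goes straight to setting up masking rules. A rough sketch for spotting that entry in a flannel pod log follows. The function name is hypothetical, and the matched text is copied from the log above, so it may differ in other Flannel versions; finding it does not by itself prove that pod networking is broken.

```shell
# Hypothetical helper: reads a flannel log on stdin and returns 0 if it
# contains the "not equal to previous one" entry seen on the restarted node.
flannel_recycled_rules() {
    # -F matches the text literally, so the dots and parentheses are not patterns.
    grep -qF 'is not equal to previous one (0.0.0.0/0, 0.0.0.0/0)'
}

# Usage:
#   kubectl logs kube-flannel-ds-zv7nf -n kube-system | flannel_recycled_rules \
#       && echo "flannel recycled old iptables rules on this node"
```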

Your Environment

$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:44:22Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}

$ kubelet --version
Kubernetes v1.22.1

Flannel version: https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Backend used (e.g. vxlan or udp): vxlan is a word I see in the logs, so guessing that one.
Etcd version:
Operating System and version:

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="21.04 (Hirsute Hippo)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 21.04"
VERSION_ID="21.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=hirsute
UBUNTU_CODENAME=hirsute

$ uname -a
Linux k-m-001 5.11.0-31-generic #33-Ubuntu SMP Wed Aug 11 13:19:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

About this issue

State: closed
Created 3 years ago
Comments: 21

Most upvoted comments

@rajatchopra could you please push the v0.15.1 images to the repo?

manuelbuil on Oct 26, 2021

I tested out 0.15.1 and can confirm that this is now fixed. Thank you!

Slyke on Nov 15, 2021

I’m seeing the same issue here.

Deleting flannel pods fixes the issue for me but is super annoying.

$ kubectl delete pod -n kube-system -l app=flannel

Using Ubuntu 20.04.3 nodes (VMs), Kubernetes 1.22.1 and flannel:v0.14.0

I had been running Flannel successfully on older versions of Kubernetes (<=v1.19) for several quarters and never noticed this behavior. It all started after upgrading Kubernetes and Flannel; I'm not sure which one is the culprit.

jkowalski on Aug 30, 2021
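
The workaround above deletes every flannel pod in the cluster. If only one node was rebooted, the same idea can be narrowed to that node. This is a sketch under assumptions: the function name is hypothetical, and it relies on the app=flannel label and kube-system namespace from the command above, which other manifests may not use.

```shell
# Hypothetical helper: deletes the flannel pod(s) scheduled on one node so the
# DaemonSet recreates them. Assumes the app=flannel label and the kube-system
# namespace from the command above; other manifests may use different values.
restart_flannel_on_node() {
    node="$1"
    [ -n "$node" ] || { echo "usage: restart_flannel_on_node NODE" >&2; return 2; }
    kubectl delete pod -n kube-system -l app=flannel \
        --field-selector "spec.nodeName=$node"
}

# Usage (k-w-002 is the restarted node in the logs of the original report):
#   restart_flannel_on_node k-w-002
```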

It seems the images are not yet publicly available; the last version on Quay is flannel:v0.15.0.

Error response from daemon: manifest for quay.io/coreos/flannel:v0.15.1 not found: manifest unknown: manifest unknown

https://quay.io/repository/coreos/flannel?tab=tags

lossos on Oct 26, 2021

Looks like this PR may fix this issue: https://github.com/flannel-io/flannel/pull/1485

Slyke on Oct 19, 2021

Problem appears on Debian Bullseye (Debian 11) with Kernel 5.10.0-9-amd64 #1 SMP Debian 5.10.70-1 (2021-09-30) and Kubernetes 1.22.2 with flannel 0.14.0 or 0.15.0-rc1 as well.

It seems to be a problem with vxlan; see also k3s-io/k3s#3863

lossos on Oct 19, 2021

I am experiencing the same issue after upgrading nodes in a Kubernetes cluster from Debian buster to bullseye. The version of the flannel image is v0.13.0-rancher1.

ghost on Oct 1, 2021