flannel: kube-flannel: iptables segfault after upgrade to v0.13.0
The kube-flannel DaemonSet pod fails to check rule existence on some (?) rules; the checks exit with -1. See the log below.
When inspecting the node I find a segfault for each line with exit -1; see the second log.
Expected Behavior
No segfault.
I have no idea what it's checking here. There may be non-existent pods in my etcd database. Whatever it is, it should not segfault on it.
Current Behavior
It segfaults every 5 to 10 minutes, on all nodes in the cluster, including master nodes
(probably because they all run the same Docker image).
logs:
$ kubectl -n kube-system logs pod/kube-flannel-ds-vgsh5
...
E0201 17:13:29.425597 1 iptables.go:115] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t nat -C POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait]: exit status -1:
E0201 17:17:25.111358 1 iptables.go:115] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t filter -C FORWARD -d 10.244.0.0/16 -j ACCEPT --wait]: exit status -1:
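For what it's worth, the failing check can be reproduced by hand. This is only a sketch, using the pod name from the log above; a rule check returns 0 if the rule exists and 1 if it does not, and the shell reports the segfault as exit status 139 (which the Go runtime logs as -1):
# The exact check flannel runs, executed inside the kube-flannel container
$ kubectl -n kube-system exec kube-flannel-ds-vgsh5 -- /sbin/iptables -t nat -C POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait; echo "exit: $?"
# The same check with the host's own iptables binary, for comparison
$ iptables -t nat -C POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait; echo "exit: $?"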
the node:
[root@master-node1 ~]# coredumpctl info 2200900
PID: 2200900 (iptables)
UID: 0 (root)
GID: 0 (root)
Signal: 11 (SEGV)
Timestamp: Mon 2021-02-01 18:13:28 CET (2min 35s ago)
Command Line: /sbin/iptables -t nat -C POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait
Executable: /sbin/xtables-nft-multi
Control Group: /kubepods/burstable/podaecc8736-4000-4010-8a94-b49a73b56882/a726f110b905636614f762cc94114afa6810e9a2d5b1c65cdbbd4858943f4118
Slice: -.slice
Boot ID: a27b5949a5eb412ac95fb5c89a0871c3
Machine ID: af7e08e4fa8e42e382da98334f96a1c5
Hostname: master-host1
Storage: /var/lib/systemd/coredump/core.iptables.0.a27b5949a5eb412aa95fb5c89a0871c3.2200900.1612199608000000.lz4
Message: Process 2200900 (iptables) of user 0 dumped core.
Stack trace of thread 1535:
#0 0x00007f1392855e47 n/a (/usr/lib/libnftnl.so.11.3.0)
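The trace above has no symbols; if it helps, a fuller backtrace can usually be extracted with gdb once debug symbols are installed. A sketch, assuming the CentOS debuginfo repositories are enabled:
# Pull debug symbols for the crashing binary and the library in frame #0
$ dnf debuginfo-install iptables libnftnl
# Open the dump in gdb and print the full stack
$ coredumpctl gdb 2200900
(gdb) bt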
Possible Solution
Don't know.
Steps to Reproduce (for bugs)
Don't know yet; I don't see what it is trying to check.
Context
Cluster was upgraded to Kubernetes 1.20.2 and Flannel along with it.
I accidentally upgraded to v0.13.1-rc1 first, saw these segfaults, then downgraded to v0.13.0. That did not solve it.
Your Environment
- Flannel version:
v0.13.0
- Backend used (e.g. vxlan or udp):
vxlan
- Etcd version:
3.4.13-0
- Kubernetes version (if used):
v1.20.2
- Operating System and version:
CentOS Linux release 8.3.2011
- Kernel:
4.18.0-240.10.1.el8_3.x86_64
- Docker version:
docker-ce-19.03.7-3.el7.x86_64
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 19 (7 by maintainers)
Commits related to this issue
- Update to Alpine 3.13, update iptables to 1.8.6, fixes #1408 — committed to ciaby/flannel by ciaby 3 years ago
- Merge pull request #1449 from ciaby/master Update to Alpine 3.13, update iptables to 1.8.6, fixes #1408 — committed to flannel-io/flannel by rajatchopra 3 years ago
- Update to Alpine 3.13, update iptables to 1.8.6, fixes #1408 — committed to luthermonson/flannel by ciaby 3 years ago
- Bump flannel images to v0.15.1 across k8s versions The v0.15.1 release of Flannel updates the underlying alpine image used, which in turn updates the underlying iptables version used by the container... — committed to aiyengar2/kontainer-driver-metadata by aiyengar2 3 years ago
- Bump flannel image to v0.15.1 Fixes flannel-io/flannel#1408 Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com> — committed to mfranczy/kubeone by mfranczy 2 years ago
- Bump flannel image to v0.15.1 (#1986) Fixes flannel-io/flannel#1408 Signed-off-by: Marcin Franczyk <marcin0franczyk@gmail.com> — committed to kubermatic/kubeone by mfranczy 2 years ago
I massaged my system a bit and was able to generate a few coredumps.
It looks like this is happening on random calls of iptables.
Doing some digging on this, it seems that Alpine is not the only distribution hitting this issue; namely, Debian has seen it: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951477
This could potentially be caused by the fact that we’re using an Alpine userspace on top of a CentOS base system, but it’s interesting to see that Debian as a host is also seeing it.
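One quick way to triage that angle is to compare which iptables variant and version each side uses; a sketch (pod name as in the log above, and it assumes ldd is present in the Alpine image):
# Host-side variant and version; prints "(nf_tables)" or "(legacy)" after the version
$ iptables --version
# Same question for the binary shipped in the flannel container image
$ kubectl -n kube-system exec kube-flannel-ds-vgsh5 -- iptables --version
# Which libnftnl the container's nft backend is actually linked against
$ kubectl -n kube-system exec kube-flannel-ds-vgsh5 -- ldd /sbin/xtables-nft-multi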
More investigation needs to be performed, but at first glance it appears that this is an issue with iptables and libnftnl rather than a Flannel-specific issue. In the meantime, your system should be functional, as Flannel will simply retry (and eventually succeed) at making the call.
Same issue here:
- Flannel version: v0.13.0-rancher1
- Backend used: host-gw
- Etcd version: v3.4.13-rancher1
- Kubernetes version: v1.19.6-rancher1-1
- Operating System and version: CentOS Linux release 8.3.2011
- Kernel: Linux 4.18.0-240.1.1.el8_3.x86_64 and 4.18.0-240.10.1.el8_3.x86_64
- Docker version: docker-ce-20.10.0-3.el8.x86_64
I think I tracked down this issue. This bug looks very much like this one: https://bugzilla.redhat.com/show_bug.cgi?id=1812261
The bug got fixed by Red Hat first and then in iptables 1.8.5 upstream: https://www.netfilter.org/projects/iptables/files/changes-iptables-1.8.5.txt
The issue apparently started with Flannel 0.13.0, which is when Alpine 3.12 got introduced. That means that in order to fix it we need to move to Alpine 3.13, which ships with iptables 1.8.6. I'm opening a PR 😃
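Once an image with that change is rolled out, it should be easy to verify; a sketch (the pod name placeholder is intentional, it will differ after the rollout):
# Should report iptables v1.8.6 once the Alpine 3.13 image is running
$ kubectl -n kube-system exec <kube-flannel-pod> -- iptables --version
# And no new iptables dumps should accumulate on the nodes
$ coredumpctl list /sbin/xtables-nft-multi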
One thing which may be an issue on Debian is the 'nftables' library, which seems to be important in combination with iptables. This library was missing in my setup.
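For anyone checking a Debian host for the same gap, something along these lines should show whether the nftables userspace and libnftnl are present (the package names are my assumption for recent Debian releases):
# Check that the nftables package and the libnftnl runtime are installed
$ dpkg -l nftables 'libnftnl*'
# Confirm the nft-based iptables binary can resolve libnftnl
$ ldd /usr/sbin/xtables-nft-multi | grep nftnl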