flannel: kube-flannel: iptables segfault after upgrade to v0.13.0

The kube-flannel DaemonSet pod fails to check rule existence for some (?) rules; the check exits with status -1. See the log below.

When inspecting the node, I find a segfault for every log line with exit status -1; see the coredump output below.

Expected Behavior

No segfault.

I have no idea what it's checking here. There may be entries for non-existent pods in my etcd database. Whatever it is, it should not segfault on it.

Current Behavior

It segfaults every 5 to 10 minutes, on all nodes in the cluster, including the master nodes.

(probably because they all run the same Docker image)
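
To confirm how often this happens on a given node, the dumps can be listed on the host (the executable path is taken from the coredumpctl output below):

$ coredumpctl list /sbin/xtables-nft-multi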

logs:

$ kubectl -n kube-system logs pod/kube-flannel-ds-vgsh5
...
E0201 17:13:29.425597       1 iptables.go:115] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t nat -C POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait]: exit status -1: 
E0201 17:17:25.111358       1 iptables.go:115] Failed to ensure iptables rules: Error checking rule existence: failed to check rule existence: running [/sbin/iptables -t filter -C FORWARD -d 10.244.0.0/16 -j ACCEPT --wait]: exit status -1: 
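
For reference, the commands in the log are existence checks (iptables -C) for the masquerade and forward rules flannel maintains for the pod CIDR. Whether those rules are currently present on a node can be listed on the host with, for example:

$ iptables -t nat -S POSTROUTING | grep 10.244.0.0/16
$ iptables -t filter -S FORWARD | grep 10.244.0.0/16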

the node:

[root@master-node1 ~]# coredumpctl info 2200900
           PID: 2200900 (iptables)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 11 (SEGV)
     Timestamp: Mon 2021-02-01 18:13:28 CET (2min 35s ago)
  Command Line: /sbin/iptables -t nat -C POSTROUTING -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait
    Executable: /sbin/xtables-nft-multi
 Control Group: /kubepods/burstable/podaecc8736-4000-4010-8a94-b49a73b56882/a726f110b905636614f762cc94114afa6810e9a2d5b1c65cdbbd4858943f4118
         Slice: -.slice
       Boot ID: a27b5949a5eb412ac95fb5c89a0871c3
    Machine ID: af7e08e4fa8e42e382da98334f96a1c5
      Hostname: master-host1
       Storage: /var/lib/systemd/coredump/core.iptables.0.a27b5949a5eb412aa95fb5c89a0871c3.2200900.1612199608000000.lz4
       Message: Process 2200900 (iptables) of user 0 dumped core.
                
                Stack trace of thread 1535:
                #0  0x00007f1392855e47 n/a (/usr/lib/libnftnl.so.11.3.0)

Possible Solution

Don't know.

Steps to Reproduce (for bugs)

Don't know yet. I don't see what it is trying to check.
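
Since the crashing binary is the one inside the flannel container (note the Control Group in the coredump above), one manual starting point might be to re-run the same check from the log in that container and watch the exit status, which kubectl exec should propagate:

$ kubectl -n kube-system exec kube-flannel-ds-vgsh5 -- /sbin/iptables -t nat -C POSTROUTING ! -s 10.244.0.0/16 -d 10.244.0.0/16 -j MASQUERADE --random-fully --wait
$ echo $?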

Context

The cluster was upgraded to Kubernetes 1.20.2, and Flannel along with it.

I accidentally upgraded to v0.13.1-rc1 first, saw these segfaults, and then downgraded to v0.13.0. That did not solve it.
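
To double-check which image the DaemonSet actually runs after the downgrade (the DaemonSet name here is inferred from the pod name above):

$ kubectl -n kube-system get ds kube-flannel-ds -o jsonpath='{.spec.template.spec.containers[0].image}'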

Your Environment

  • Flannel version: v0.13.0
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.4.13-0
  • Kubernetes version (if used): v1.20.2
  • Operating System and version: CentOS Linux release 8.3.2011
  • Kernel: 4.18.0-240.10.1.el8_3.x86_64
  • Docker version: docker-ce-19.03.7-3.el7.x86_64

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 19 (7 by maintainers)

Most upvoted comments

I massaged my system a bit and was able to generate a few coredumps.

It looks like this is happening on random iptables calls:

Core was generated by `/sbin/iptables -t filter -C FORWARD -d 10.42.0.0/16 -j ACCEPT --wait'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f26f12b7e47 in nftnl_rule_lookup_byindex () from /usr/lib/libnftnl.so.11
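
For anyone who wants to pull the full backtrace out of one of these dumps, coredumpctl can hand the most recent core straight to gdb (installing debug symbols for iptables and libnftnl makes the trace more readable):

$ coredumpctl gdb
(gdb) bt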

Doing some digging on this, it seems that Alpine is not the only distribution hitting this issue; Debian has seen it as well: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951477

This could potentially be caused by the fact that we're using an Alpine userspace on top of a CentOS base system, but it's interesting that Debian hosts are seeing it too.
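
A quick way to compare the two iptables builds (the pod name is just the one from the log above; any flannel pod will do):

$ kubectl -n kube-system exec kube-flannel-ds-vgsh5 -- iptables --version
$ iptables --version    # the host's own build, run on the CentOS node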

More investigation needs to be done, but at first glance this appears to be an issue with iptables and libnftnl rather than a Flannel-specific one. In the meantime, your system should remain functional, as Flannel will simply retry the call (and eventually succeed).

Same issue here:

  • Flannel version: v0.13.0-rancher1
  • Backend used (e.g. vxlan or udp): host-gw
  • Etcd version: v3.4.13-rancher1
  • Kubernetes version (if used): v1.19.6-rancher1-1
  • Operating System and version: CentOS Linux release 8.3.2011
  • Kernel: Linux 4.18.0-240.1.1.el8_3.x86_64 and 4.18.0-240.10.1.el8_3.x86_64
  • Docker version: docker-ce-20.10.0-3.el8.x86_64

I think I have tracked down this issue. The bug looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1812261. It was fixed by Red Hat first and then upstream in iptables 1.8.5: https://www.netfilter.org/projects/iptables/files/changes-iptables-1.8.5.txt. The problem apparently started with Flannel v0.13.0, which is when Alpine 3.12 was introduced as the base image. That means that in order to fix it we need to move to Alpine 3.13, which ships iptables 1.8.6. I'm opening a PR 😃
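
To verify which iptables version a given flannel image bundles (the image path here is the usual quay.io one; adjust for your registry and tag):

$ docker run --rm --entrypoint /sbin/iptables quay.io/coreos/flannel:v0.13.0 --version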

One thing that may be an issue on Debian is the 'nftables' library, which seems to be important in combination with iptables. This library was missing in my setup.