flannel: requests to an HTTP service pod hang for 60+ seconds

After upgrading flanneld to v0.20.1, running curl against an HTTP service pod on a different node via its ClusterIP hangs for 60+ seconds.

Expected Behavior

The request completes without hanging.

Current Behavior

The request hangs for 60+ seconds.

Possible Solution

Possibly caused by double NAT, but I am not sure.

Steps to Reproduce (for bugs)

curl hangs when the nat table's POSTROUTING chain is ordered like this:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
FLANNEL-POSTRTG  all  --  anywhere             anywhere             /* flanneld masq */
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */

It works fine when the order is like this:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */
FLANNEL-POSTRTG  all  --  anywhere             anywhere             /* flanneld masq */
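
To compare the two orderings on a node, a small filter over the iptables listing can help. The following is only a sketch: the function name `postrouting_order` is made up for illustration, and it assumes the two jump targets are named exactly as in the listings above.

```shell
# Hypothetical helper: read an iptables listing of the nat POSTROUTING chain
# on stdin and print the flannel and kube-proxy jump targets in rule order.
postrouting_order() {
  awk '{
    # Scan every field so the filter also works when extra columns
    # (rule numbers, packet counters) precede the target name.
    for (i = 1; i <= NF; i++)
      if ($i == "FLANNEL-POSTRTG" || $i == "KUBE-POSTROUTING") { print $i; break }
  }'
}

# Usage (needs root):
#   sudo iptables -t nat -L POSTROUTING | postrouting_order
```

If `FLANNEL-POSTRTG` is printed first, the node has the ordering that hangs in this report.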

Context

This PR (https://github.com/kubernetes/kubernetes/pull/92035) looks like it should solve this issue, but I still have the problem with flanneld v0.20.1.

Your Environment

Flannel version: v0.20.1
Backend used (e.g. vxlan or udp): vxlan
Etcd version: 3.5.3
Kubernetes version (if used): v1.25.4
Operating System and version: Arch Linux (kernel version 6.0.8)
Link to your project (optional):

About this issue

State: closed
Created 2 years ago
Reactions: 1
Comments: 18 (9 by maintainers)

Most upvoted comments

You can increase the verbosity of iptables output if you use -vL.

rbrtbnfgl on Jan 3, 2023
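
As a sketch of how the verbose listing might be used here: with `-v`, each rule line starts with packet and byte counters, so the per-rule packet counts for the POSTROUTING chain can be pulled out with a short filter. The function name `postrouting_counters` is hypothetical, and the column layout (pkts, bytes, target) is assumed to be the usual `iptables -vL` layout.

```shell
# Hypothetical helper: read `iptables -t nat -vnL POSTROUTING` output on stdin
# and print "target packet-count" for each rule.
postrouting_counters() {
  # Skip the chain header and the column header (first two lines).
  awk 'NR > 2 && NF >= 3 { print $3, $1 }'
}

# Usage (needs root):
#   sudo iptables -t nat -vnL POSTROUTING | postrouting_counters
```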

@rkonfj we believe your kernel still has a vxlan bug, which is why you see this problem with double NAT. We can avoid it by not double-NATing, as @rbrtbnfgl suggests. But just to verify, with the original flannel iptables rules (and thus double NAT), could you run this on your nodes:

sudo ethtool -K flannel.1 tx-checksum-ip-generic off

Then try again. That should take the vxlan bug out of the equation, so it should work even with double NAT.

manuelbuil on Nov 26, 2022
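
To confirm whether the offload setting from the command above took effect, the current state can be read back with `ethtool -k` (lowercase, which shows features rather than setting them). This is a minimal sketch: the function name `tx_csum_state` is made up for illustration, and `flannel.1` is the interface name used in this thread.

```shell
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and print
# the state ("on" or "off") of the tx-checksum-ip-generic feature.
tx_csum_state() {
  awk -F': *' '$1 ~ /^[ \t]*tx-checksum-ip-generic$/ {
    # Drop any trailing annotation such as "[fixed]".
    split($2, a, " ")
    print a[1]
  }'
}

# Usage:
#   ethtool -k flannel.1 | tx_csum_state
```

The helper should print `off` once the feature has been disabled on that interface.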

Ok, now it’s clear. This bug only happens on some kernel versions, which is why it wasn’t happening on my setup. I’ll try to update the iptables rules.

rbrtbnfgl on Nov 23, 2022

@rbrtbnfgl pod-to-pod traffic via the service IP may work fine, but node-to-pod traffic via the service IP hangs.

rkonfj on Nov 22, 2022