flannel: 60+ second hang when calling an HTTP service pod

After upgrading flanneld to v0.20.1, a curl request to an HTTP service pod on a different node via its ClusterIP hangs for 60+ seconds.

Expected Behavior

The request completes without hanging.

Current Behavior

The request hangs for 60+ seconds.

Possible Solution

Possibly caused by double NAT, but I'm not sure.

Steps to Reproduce (for bugs)

curl hangs when the nat table's POSTROUTING chain is ordered like this:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
FLANNEL-POSTRTG  all  --  anywhere             anywhere             /* flanneld masq */
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */

It works fine when the chain is ordered like this:

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
KUBE-POSTROUTING  all  --  anywhere             anywhere             /* kubernetes postrouting rules */
FLANNEL-POSTRTG  all  --  anywhere             anywhere             /* flanneld masq */
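
To tell which of the two orderings a node currently has, a small helper like the one below may be handy. It is only a sketch: the function name postrouting_order is made up for illustration, and it assumes the listing comes from `iptables -t nat -L POSTROUTING` without `--line-numbers`, so that the jump target is the first column as in the listings above.

```shell
#!/bin/sh
# Hypothetical helper (not part of flannel or kube-proxy).
# Reads an `iptables -t nat -L POSTROUTING` listing on stdin and reports
# which of the two jump targets appears first in the chain.
postrouting_order() {
    awk '
        # Remember only the first matching target we encounter.
        $1 == "FLANNEL-POSTRTG"  && !first { first = "flannel" }
        $1 == "KUBE-POSTROUTING" && !first { first = "kube" }
        END {
            if (first == "flannel")   print "flannel-first"
            else if (first == "kube") print "kube-first"
            else                      print "unknown"
        }
    '
}

# Usage on a node (needs root):
#   iptables -t nat -L POSTROUTING -n | postrouting_order
```

With the two listings above, "flannel-first" corresponds to the ordering where curl hangs and "kube-first" to the ordering that works.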

Context

This PR (https://github.com/kubernetes/kubernetes/pull/92035) looks like it should solve this issue, but I still have the problem with flanneld v0.20.1.

Your Environment

  • Flannel version: v0.20.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: 3.5.3
  • Kubernetes version (if used): v1.25.4
  • Operating System and version: Arch Linux (kernel version 6.0.8)
  • Link to your project (optional):

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 18 (9 by maintainers)

Most upvoted comments

You can increase the verbosity of iptables output if you use -vL.

@rkonfj we believe your kernel still has a vxlan bug that causes this problem when traffic is double-NATed. We can avoid it by not double-NATing, as @rbrtbnfgl suggests. But just to verify, with the original flannel iptables rules (and thus double NAT), could you run this on your nodes:

sudo ethtool -K flannel.1 tx-checksum-ip-generic off

Then try again. That should take the vxlan bug out of the equation, so it should work even with double NAT.
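
To confirm the offload setting before and after running the command above, the feature list from `ethtool -k` (lowercase k) can be filtered. This is only a sketch: the function name tx_csum_state is hypothetical, and it assumes the feature appears in that output as a line of the form `tx-checksum-ip-generic: on` or `tx-checksum-ip-generic: off`, possibly indented and possibly followed by a tag such as `[fixed]`.

```shell
#!/bin/sh
# Hypothetical helper (not part of ethtool or flannel).
# Reads `ethtool -k <iface>` output on stdin and prints the state of the
# tx-checksum-ip-generic feature: "on", "off", or "unknown" if not listed.
tx_csum_state() {
    # Keep only the word after the feature name; drop tags like "[fixed]".
    state=$(sed -n 's/^[[:space:]]*tx-checksum-ip-generic:[[:space:]]*\([a-z]*\).*/\1/p')
    echo "${state:-unknown}"
}

# Usage on a node:
#   ethtool -k flannel.1 | tx_csum_state
```

If this prints "on" before the change and "off" afterwards, the setting was applied to flannel.1.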

OK, now it's clear. This bug only happens on some kernel versions, which is why it didn't happen on my setup. I'll try to update the iptables rules.

@rbrtbnfgl pod-to-pod traffic via the service IP may work fine, but node-to-pod traffic via the service IP hangs.