cilium: Jumbo payload failed in cilium v1.5.6 and after
Bug report
General Information
- Cilium version (run cilium version): 1.6 and above
- Kernel version (run uname -a): 4.19.106-coreos
- Orchestration system version in use (e.g. kubectl version, Mesos, …)
- Link to relevant artifacts (policies, deployments scripts, …)
- Upload a system dump (run curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip && python cilium-sysdump.zip and then attach the generated zip file)
How to reproduce the issue
- Our services send payloads that are much larger than the MTU of 8842 shown in the routes below. We don't set the mtu flag on cilium-agent (the cilium pods). In the cilium logs (cilium v1.6.6) we observed the following:
-> stack flow 0xb36bdae0 identity 37560->63143 state established ifindex 0: 10.120.4.78:46866 -> 10.120.1.1:8375 tcp ACK
-> endpoint 2483 flow 0x4c8bbffd identity 1->37560 state established ifindex lxcf9eb6e4589a7: 10.120.4.120 -> 10.120.4.78 DestinationUnreachable(FragmentationNeeded)
187:~# ip r s
default via 10.16.64.1 dev eth0 proto dhcp src 10.16.65.187 metric 1024
10.16.64.0/22 dev eth0 proto kernel scope link src 10.16.65.187
10.16.64.1 dev eth0 proto dhcp scope link src 10.16.65.187 metric 1024
10.120.0.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.1.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.2.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.4.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.5.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.6.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.7.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.8.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35 mtu 8842
10.120.8.35 dev cilium_host scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
We set the cilium_vxlan MTU to match the lxc_health MTU (9001), as suggested by @jrfastab, but that did not work (cilium v1.6.6). By default it is 1500.
~# ip a show cilium_vxlan
6: cilium_vxlan: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether aa:2b:7d:19:47:f3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a82b:7dff:fe19:47f3/64 scope link
       valid_lft forever preferred_lft forever
We know that if we revert to cilium 1.5, our services work fine. Alternatively, if we override the MTU with --mtu 30000 in the cilium-agent daemonset on cilium v1.6.6 (or v1.7.2), our services also return to normal. With the override in place, a route looks like this:
10.120.8.0/24 via 10.120.4.120 dev cilium_host src 10.120.4.120 mtu 29841
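The numbers in this report suggest that the MTU on the cilium_host routes sits a fixed distance below the base MTU. The following is a minimal arithmetic sketch, not a description of how Cilium computes the value. It assumes that 9001 (the lxc_health size mentioned above) is the base MTU in the default case; the report does not state this explicitly, and it does not say what the difference consists of.

```python
# (base MTU, MTU seen on the cilium_host routes), taken from this report.
# Assumption: 9001 is the base MTU in the default configuration.
observations = {
    "default (assumed base 9001)": (9001, 8842),
    "--mtu 30000": (30000, 29841),
}

# Difference between the base MTU and the route MTU in each case.
overheads = {label: base - route for label, (base, route) in observations.items()}

for label, value in overheads.items():
    print(f"{label}: route MTU is {value} bytes below the base MTU")
```

Under that assumption, both configurations show the same gap, which would be consistent with a constant per-packet overhead being subtracted from the base MTU. Packets larger than the route MTU that cannot be fragmented would then produce the DestinationUnreachable(FragmentationNeeded) message seen in the logs.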
Investigating further, we found that our services with jumbo payloads actually first started failing in cilium v1.5.6; they work fine with cilium v1.5.5.
Here is the ip r s output that we captured in cilium v1.5.6:
ip r s
default via 10.16.64.1 dev eth0 proto dhcp src 10.16.65.187 metric 1024
10.16.64.0/22 dev eth0 proto kernel scope link src 10.16.65.187
10.16.64.1 dev eth0 proto dhcp scope link src 10.16.65.187 metric 1024
10.120.0.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
10.120.1.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
10.120.2.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
10.120.4.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
10.120.5.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
ip r s in cilium v1.5.7, where it is also broken:
10.120.2.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.4.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.5.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.6.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.7.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.8.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
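The visible difference between the two captures is the explicit mtu attribute on the cilium_host routes. As a rough sketch for comparing such captures, the snippet below extracts the per-route MTU from ip r s text. The function name route_mtus and the sample strings are illustrative; the sample lines are copied from the two captures above, and the parsing only covers the fields that appear in this report.

```python
def route_mtus(text, dev="cilium_host"):
    """Map each route destination on `dev` to its explicit MTU (None if unset)."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if "dev" not in fields:
            continue
        dev_pos = fields.index("dev")
        if dev_pos + 1 >= len(fields) or fields[dev_pos + 1] != dev:
            continue  # route belongs to another device
        mtu = None
        if "mtu" in fields:
            rest = fields[fields.index("mtu") + 1:]
            # iproute2 prints "mtu lock N" when the route MTU is locked.
            if rest and rest[0] == "lock":
                rest = rest[1:]
            if rest and rest[0].isdigit():
                mtu = int(rest[0])
        result[fields[0]] = mtu
    return result


# Sample lines copied from the two captures in this report.
capture_v156 = """\
default via 10.16.64.1 dev eth0 proto dhcp src 10.16.65.187 metric 1024
10.120.4.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
10.120.5.0/24 via 10.120.8.35 dev cilium_host src 10.120.8.35
"""
capture_v157 = """\
10.120.4.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
10.120.5.0/24 via 10.120.1.53 dev cilium_host src 10.120.1.53 mtu 8842
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
"""

mtus_v156 = route_mtus(capture_v156)
mtus_v157 = route_mtus(capture_v157)
print(mtus_v156)
print(mtus_v157)
```

Routes that map to None carry no explicit MTU, so the kernel falls back to the MTU of the outgoing device for them.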
About this issue
- State: closed
- Created 4 years ago
- Comments: 20 (8 by maintainers)
This is the outcome of the bisect and test builds:
@joestringer we initially tried with v1.6.6, which is why the initial description has that version in it. Today, though, we rolled back versions until we noticed things working again. Starting with a hypothesis that something in https://github.com/cilium/cilium/pull/8949 broke our use case, we tried v1.5.7 and could reproduce the error. We then tried v1.5.6 and again were able to repro. v1.5.5 was the first version that tested clean.