k3s: k3s on RHEL 8, network/DNS problems and metrics not working
Hello, I am trying to get k3s working on Red Hat 8.4, but I am running into network or DNS problems. I have checked the loaded kernel modules (modprobe) as well as the sysctl settings, but nothing changes. Maybe it is a flannel problem?
firewalld and SELinux are disabled; nm-cloud-setup.service and nm-cloud-setup.timer are not present; k3s was installed via the script at https://get.k3s.io.
The same setup works fine on RHEL 7.9.
Environmental Info:
K3s Version: v1.22.5+k3s1 (405bf79d), go version go1.16.10
Node(s) CPU architecture, OS, and Version: Linux vldsocfg01 4.18.0-305.25.1.el8_4.x86_64 #1 SMP Mon Oct 18 14:34:11 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration: 2 masters, 3 worker nodes, and 2 front-only nodes (traefik / metallb / haproxy)
Describe the bug:
Pods crash with DNS resolution problems.
coredns:
[ERROR] plugin/errors: 2 7635134873774865456.7522827499224113179. HINFO: read udp 10.200.3.11:45684->XXXXXXX:53: i/o timeout
longhorn:
time="2022-01-24T19:50:55Z" level=info msg="CSI Driver: driver.longhorn.io version: v1.2.2, manager URL http://longhorn-backend:9500/v1"
2022/01/24 19:50:03 [emerg] 1#1: host not found in upstream "longhorn-backend" in /etc/nginx/nginx.conf:32
metrics:
E0124 20:17:27.096421 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg03:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg03"
E0124 20:17:27.100536 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg01:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg01"
E0124 20:18:27.049233 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg01:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg01"
E0124 20:18:27.056477 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg02:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg02"
E0124 20:18:27.068495 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg03:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg03"
E0124 20:18:27.076854 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg01:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg01"
E0124 20:18:27.084260 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg01:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg01"
E0124 20:18:27.090960 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg02:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg02"
E0124 20:18:27.104001 1 scraper.go:139] "Failed to scrape node" err="Get \"https://vldsocfg02:10250/stats/summary?only_cpu_and_memory=true\": dial tcp: i/o timeout" node="vldsocfg02"
The k3s logs also show metrics errors:
Jan 24 20:53:03 vldsocfg01 k3s[36279]: E0124 20:53:03.842079 36279 available_controller.go:524] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
Jan 24 20:53:06 vldsocfg01 k3s[36279]: E0124 20:53:06.068125 36279 cri_stats_provider.go:372] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs"
Jan 24 20:53:06 vldsocfg01 k3s[36279]: E0124 20:53:06.068150 36279 kubelet.go:1343] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Jan 24 20:53:06 vldsocfg01 k3s[36279]: E0124 20:53:06.097788 36279 kubelet.go:1991] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be successful]"
Jan 24 20:51:45 vldsocfg01 k3s[33811]: E0124 20:51:45.975122 33811 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.201.36.96:443/apis/metrics.k8s.io/v1beta1: Get "https://10.201.36.96:443/apis/metrics.k8s.io/v1beta1": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Jan 24 20:51:46 vldsocfg01 k3s[33811]: E0124 20:51:46.976471 33811 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
Jan 24 20:51:50 vldsocfg01 k3s[33811]: E0124 20:51:50.983597 33811 available_controller.go:524] v1beta1.metrics.k8s.io failed with: failing or missing response from https://10.201.36.96:443/apis/metrics.k8s.io/v1beta1: Get "https://10.201.36.96:443/apis/metrics.k8s.io/v1beta1": dial tcp 10.201.36.96:443: i/o timeout
Jan 24 20:51:51 vldsocfg01 k3s[33811]: E0124 20:51:51.984292 33811 controller.go:116] loading OpenAPI spec for "v1beta1.metrics.k8s.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: service unavailable
lsmod output:
Module Size Used by
xt_state 16384 0
veth 28672 0
nf_conntrack_netlink 49152 0
xt_recent 20480 6
xt_statistic 16384 21
xt_nat 16384 44
ip6t_MASQUERADE 16384 1
ip_vs_sh 16384 0
ip_vs_wrr 16384 0
ip_vs_rr 16384 0
ip_vs 172032 6 ip_vs_rr,ip_vs_sh,ip_vs_wrr
nft_chain_nat 16384 8
ipt_MASQUERADE 16384 5
vxlan 65536 0
ip6_udp_tunnel 16384 1 vxlan
udp_tunnel 20480 1 vxlan
nfnetlink_log 20480 1
nft_limit 16384 1
ipt_REJECT 16384 5
nf_reject_ipv4 16384 1 ipt_REJECT
xt_limit 16384 0
xt_NFLOG 16384 1
xt_physdev 16384 2
xt_conntrack 16384 21
xt_mark 16384 25
xt_multiport 16384 4
xt_addrtype 16384 7
nft_counter 16384 329
xt_comment 16384 296
nft_compat 20480 550
nf_tables 172032 884 nft_compat,nft_counter,nft_chain_nat,nft_limit
ip_set 49152 0
nfnetlink 16384 5 nft_compat,nf_conntrack_netlink,nf_tables,ip_set,nfnetlink_log
iptable_nat 16384 0
nf_nat 45056 5 ip6t_MASQUERADE,ipt_MASQUERADE,xt_nat,nft_chain_nat,iptable_nat
nf_conntrack 172032 8 xt_conntrack,nf_nat,ip6t_MASQUERADE,xt_state,ipt_MASQUERADE,xt_nat,nf_conntrack_netlink,ip_vs
nf_defrag_ipv6 20480 2 nf_conntrack,ip_vs
nf_defrag_ipv4 16384 1 nf_conntrack
cfg80211 835584 0
rfkill 28672 2 cfg80211
vsock_loopback 16384 0
vmw_vsock_virtio_transport_common 32768 1 vsock_loopback
vmw_vsock_vmci_transport 32768 1
vsock 45056 5 vmw_vsock_virtio_transport_common,vsock_loopback,vmw_vsock_vmci_transport
sunrpc 540672 1
intel_rapl_msr 16384 0
intel_rapl_common 24576 1 intel_rapl_msr
isst_if_mbox_msr 16384 0
isst_if_common 16384 1 isst_if_mbox_msr
nfit 65536 0
libnvdimm 192512 1 nfit
crct10dif_pclmul 16384 1
crc32_pclmul 16384 0
ghash_clmulni_intel 16384 0
rapl 20480 0
vmw_balloon 24576 0
joydev 24576 0
pcspkr 16384 0
vmw_vmci 86016 2 vmw_balloon,vmw_vsock_vmci_transport
i2c_piix4 24576 0
br_netfilter 24576 0
bridge 192512 1 br_netfilter
stp 16384 1 bridge
llc 16384 2 bridge,stp
overlay 135168 4
ip_tables 28672 1 iptable_nat
xfs 1515520 7
libcrc32c 16384 5 nf_conntrack,nf_nat,nf_tables,xfs,ip_vs
sr_mod 28672 0
cdrom 65536 1 sr_mod
sd_mod 53248 4
t10_pi 16384 1 sd_mod
sg 40960 0
ata_generic 16384 0
vmwgfx 368640 1
crc32c_intel 24576 1
drm_kms_helper 233472 1 vmwgfx
syscopyarea 16384 1 drm_kms_helper
sysfillrect 16384 1 drm_kms_helper
sysimgblt 16384 1 drm_kms_helper
fb_sys_fops 16384 1 drm_kms_helper
ata_piix 36864 0
ttm 114688 1 vmwgfx
serio_raw 16384 0
libata 270336 2 ata_piix,ata_generic
drm 569344 4 vmwgfx,drm_kms_helper,ttm
vmxnet3 65536 0
vmw_pvscsi 28672 8
dm_mod 151552 21
fuse 151552 1
iptables version: 1.8.4
sysctl settings:
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
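The prerequisite checks described above (loaded modules plus sysctl settings) can be scripted. A minimal sketch, reading /proc directly so it works without the lsmod/sysctl binaries; the module and sysctl names are the ones listed in this report:

```shell
#!/bin/sh
# Sketch: verify the kernel prerequisites flannel/k3s relies on.

check_module() {
    # /proc/modules holds the same data lsmod prints
    grep -q "^$1 " /proc/modules && echo "loaded" || echo "missing"
}

check_sysctl() {
    # a sysctl key's dots map to slashes under /proc/sys
    path="/proc/sys/$(echo "$1" | tr . /)"
    [ -r "$path" ] && cat "$path" || echo "unreadable"
}

for mod in br_netfilter overlay vxlan; do
    echo "module $mod: $(check_module "$mod")"
done

for key in net.bridge.bridge-nf-call-iptables \
           net.bridge.bridge-nf-call-ip6tables \
           net.ipv4.ip_forward; do
    echo "$key = $(check_sysctl "$key")"
done
```

Each of the three sysctl keys should print 1; the bridge keys read "unreadable" until br_netfilter is loaded.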
Steps To Reproduce:
- Installed K3s:
- RHEL 8.4
- multiple masters and worker nodes
I also tried removing the RHEL iptables package so that k3s would use its bundled iptables, but the result was the same.
UPDATE:
With the parameter --flannel-backend=host-gw it works, but is that a good fix? Ingress does not work with host-gw because the front nodes are not on the same network as the workers:
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655113 103276 route_network.go:54] Watching for new subnet leases
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655271 103276 route_network.go:93] Subnet added: 10.42.4.0/24 via x.y.6.8
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655414 103276 route_network.go:93] Subnet added: 10.42.0.0/24 via x.y.6.3
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: E0125 14:07:43.655508 103276 route_network.go:168] Error adding route to {Ifindex: 2 Dst: 10.42.0.0/24 Src: <nil> Gw: x.y.6.3 Flags: [] Table: 0}
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655532 103276 route_network.go:93] Subnet added: 10.42.1.0/24 via x.y.6.15
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: E0125 14:07:43.655599 103276 route_network.go:168] Error adding route to {Ifindex: 2 Dst: 10.42.1.0/24 Src: <nil> Gw: x.y.6.15 Flags: [] Table: 0}
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655607 103276 route_network.go:93] Subnet added: 10.42.2.0/24 via x.y.6.13
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: E0125 14:07:43.655662 103276 route_network.go:168] Error adding route to {Ifindex: 2 Dst: 10.42.2.0/24 Src: <nil> Gw: x.y.6.13 Flags: [] Table: 0}
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.655673 103276 route_network.go:93] Subnet added: 10.42.3.0/24 via x.y.6.8
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: E0125 14:07:43.655730 103276 route_network.go:168] Error adding route to {Ifindex: 2 Dst: 10.42.3.0/24 Src: <nil> Gw: x.y.6.8 Flags: [] Table: 0}
Jan 25 14:07:43 vldsocfg02-front k3s[103276]: I0125 14:07:43.661130 103276 iptables.go:216] Some iptables rules are missing; deleting and recreating rules
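For reference, the host-gw backend mentioned above can be selected at install time through the installer's INSTALL_K3S_EXEC variable. A sketch of a server install (not run here since it installs k3s on the host); the route errors in the log are expected with host-gw when a node's peers are not on the same L2 segment, because host-gw programs direct routes to each peer:

```shell
# Sketch: install a k3s server using the host-gw flannel backend
# instead of the default vxlan. host-gw requires all nodes to share
# an L2 network, which the front-only nodes above do not.
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --flannel-backend=host-gw" sh -
```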
About this issue
- State: closed
- Created 2 years ago
- Comments: 33 (17 by maintainers)
Oh ==> bad udp cksum 0xf1b1 -> 0x521c! You might be hitting a kernel bug that affects UDP + VXLAN when the kernel's checksum offloading feature is used. We saw it in Ubuntu but thought it was fixed in RHEL ==> https://github.com/rancher/rke2/issues/1541
Could you please try disabling the offloading on all nodes? Execute this command and try again:
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
The same issue is still happening with RHEL 8.6. Pod communication is entirely broken.
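To apply the workaround on each node and confirm it took effect, something like the following can be used (a sketch; flannel.1 must already exist, and the exact ethtool output format can vary by version):

```shell
# Run on every node: disable TX checksum offload on the flannel
# vxlan interface, then read the setting back to verify.
sudo ethtool -K flannel.1 tx-checksum-ip-generic off
ethtool -k flannel.1 | grep tx-checksum-ip-generic
# the line should now show: tx-checksum-ip-generic: off
```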
I added the above fix to crontab as a band-aid so it survives reboots:
@reboot ethtool -K flannel.1 tx-checksum-ip-generic off
This fix should be posted in the README to avoid headaches. It took a bit of digging to find this issue.
We encountered an issue where the flannel.1 interface was not accessible immediately after a reboot. To resolve this, we developed a bash script and set up a systemd service as a workaround:
sudo vi /usr/local/bin/flannel-fix.sh
sudo chmod +x /usr/local/bin/flannel-fix.sh
sudo vi /etc/systemd/system/flannel-fix.service
Note that there are known issues with RHEL 8 and VMware. There is one related to VXLAN which may be the root cause of our issue ==> https://docs.vmware.com/en/VMware-vSphere/6.7/rn/esxi670-202111001.html#esxi670-202111401-bg-resolved
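The contents of the script and unit file are not shown in the comment above; a minimal sketch of what they might look like, under the assumption that waiting for flannel.1 to appear and then disabling offload is sufficient (the loop timing and the After=k3s.service ordering are assumptions, not from the original):

```shell
#!/bin/sh
# Sketch of the described workaround: a wait-loop script plus a
# systemd oneshot unit that runs it at boot.

sudo tee /usr/local/bin/flannel-fix.sh >/dev/null <<'EOF'
#!/bin/sh
# Wait for flannel.1 to exist, then disable TX checksum offload
# (the vxlan checksum workaround from this thread).
for i in $(seq 1 60); do
    if ip link show flannel.1 >/dev/null 2>&1; then
        exec ethtool -K flannel.1 tx-checksum-ip-generic off
    fi
    sleep 5
done
echo "flannel.1 never appeared" >&2
exit 1
EOF
sudo chmod +x /usr/local/bin/flannel-fix.sh

sudo tee /etc/systemd/system/flannel-fix.service >/dev/null <<'EOF'
[Unit]
Description=Disable TX checksum offload on flannel.1 (k3s vxlan workaround)
After=k3s.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/flannel-fix.sh

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now flannel-fix.service
```

Compared with the crontab @reboot approach earlier in the thread, the systemd unit can be ordered after k3s.service, which addresses the race where flannel.1 does not yet exist at boot.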
Thanks for helping and for your quick response! This is something we need to fix in flannel upstream.