cilium: FIB lookup failed DROPPED on nodes that are in the same AZ as the nodes where the Service pods run
Bug report
Hello, there is an issue on AWS EKS clusters. The NodePort of Services (type NodePort and LoadBalancer) does not respond (the connection times out) on nodes that are in the same availability zone as the nodes where the Service pods run; NodePort does work on nodes in other AZs and on the nodes that host the Service pods themselves.
The latest versions of the components are used:
- Kubernetes 1.19
- Amazon VPC CNI plug-in 1.7.9
- kube-proxy deleted
- AMI v1.19.6-eks-49a6c0
- Kernel 5.4.91-41.139.amzn2.x86_64
- Cilium 1.9.4 (kubeProxyReplacement: strict)
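For reference, these versions can be confirmed on the cluster with standard commands (a sketch; the cilium-l2c7n pod name is the one used for the logs below):
kubectl version                                              # Kubernetes client/server versions
kubectl -n kube-system exec cilium-l2c7n -- cilium version   # Cilium agent version
uname -r                                                     # kernel version, run on a node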
There are no suspicious errors in the logs:
kubectl -n kube-system logs cilium-l2c7n --tail -1 | grep error
level=info msg="BPF system config check: NOT OK." error="Kernel Config file not found" subsys=linux-datapath
level=info msg="Auto-disabling \"enable-bpf-clock-probe\" feature since KERNEL_HZ cannot be determined" error="Cannot probe CONFIG_HZ" subsys=daemon
level=error msg="Command execution failed" cmd="[iptables -t mangle -n -L CILIUM_PRE_mangle]" error="exit status 1" subsys=iptables
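Beyond grepping for errors, the agent's own view of the kube-proxy replacement can be inspected in the same pod (a diagnostic sketch using standard cilium CLI subcommands):
# Which kube-proxy-replacement features the agent has enabled
kubectl -n kube-system exec cilium-l2c7n -- cilium status --verbose | grep -A 10 KubeProxyReplacement
# The service (including NodePort) frontends Cilium has programmed
kubectl -n kube-system exec cilium-l2c7n -- cilium service list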
Cilium configuration:
policyAuditMode: true
policyEnforcementMode: "always"
kubeProxyReplacement: strict
k8sServiceHost: xxx.eks.amazonaws.com
k8sServicePort: 443
svcSourceRangeCheck: false
cni:
  chainingMode: aws-cni
masquerade: false
tunnel: disabled
nodeinit:
  enabled: true
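For reference, a minimal sketch of how these values would be applied, assuming the standard cilium/cilium Helm chart and a values file (cilium-values.yaml is a hypothetical name) containing exactly the settings above:
helm upgrade cilium cilium/cilium --version 1.9.4 \
  --namespace kube-system \
  -f cilium-values.yaml    # the values listed above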
curl -I http://<NODE_IP>:30080
curl: (28) Connection timed out after 2000 milliseconds
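A quick check on the failing node is whether the NodePort frontend is programmed at all (a sketch; <CILIUM_POD_ON_NODE> stands for the cilium agent pod running on that node, and 30080 is the port from the curl above):
# The NodePort frontend should appear in the agent's service table ...
kubectl -n kube-system exec <CILIUM_POD_ON_NODE> -- cilium service list | grep 30080
# ... and in the BPF load-balancer map
kubectl -n kube-system exec <CILIUM_POD_ON_NODE> -- cilium bpf lb list | grep 30080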
# cilium monitor --type drop -v
level=info msg="Initializing dissection cache..." subsys=monitor
xx drop (FIB lookup failed) flow 0x849b849b to endpoint 0, identity 0->0: <NODE_IP>:39578 -> <POD_IP>:9090 tcp SYN
xx drop (FIB lookup failed) flow 0x849b849b to endpoint 0, identity 0->0: <NODE_IP>:39578 -> <POD_IP>:9090 tcp SYN
...
➭ hubble observe -f --type drop
Handling connection for 4245
Mar 4 10:07:18.457: <NODE_IP>:38580 -> kubernetes-dashboard/kubernetes-dashboard-6466669887-w75xt:9090 FIB lookup failed DROPPED (TCP Flags: SYN)
Mar 4 10:07:21.247: <NODE_IP>:38586 -> kubernetes-dashboard/kubernetes-dashboard-6466669887-w75xt:9090 FIB lookup failed DROPPED (TCP Flags: SYN)
...
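The "Handling connection for 4245" line above comes from a kubectl port-forward to Hubble Relay; a sketch of that setup, assuming the default hubble-relay service installed by the Helm chart (exposing port 80 in front of the relay's 4245 is an assumption from the standard chart):
kubectl -n kube-system port-forward svc/hubble-relay 4245:80 &   # local Hubble API endpoint
hubble observe -f --type drop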
# cilium monitor --type drop -vv
------------------------------------------------------------------------------
CPU 01: MARK 0x262e262e FROM 946 DROP: 74 bytes, reason FIB lookup failed
level=info msg="Initializing dissection cache..." subsys=monitor
Ethernet {Contents=[..14..] Payload=[..62..] SrcMAC=0a:a6:c1:92:6a:34 DstMAC=0a:3c:23:76:ec:6b EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=7239 Flags=DF FragOffset=0 TTL=255 Protocol=TCP Checksum=20911 SrcIP=<NODE_IP> DstIP=<POD_IP> Options=[] Padding=[]}
TCP {Contents=[..40..] Payload=[] SrcPort=39590 DstPort=9090(websm) Seq=4211881838 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=26883 Checksum=52559 Urgent=0 Options=[..5..] Padding=[]}
------------------------------------------------------------------------------
CPU 01: MARK 0x262e262e FROM 946 DROP: 74 bytes, reason FIB lookup failed
Ethernet {Contents=[..14..] Payload=[..62..] SrcMAC=0a:a6:c1:92:6a:34 DstMAC=0a:3c:23:76:ec:6b EthernetType=IPv4 Length=0}
IPv4 {Contents=[..20..] Payload=[..40..] Version=4 IHL=5 TOS=0 Length=60 Id=7240 Flags=DF FragOffset=0 TTL=255 Protocol=TCP Checksum=20910 SrcIP=<NODE_IP> DstIP=<POD_IP> Options=[] Padding=[]}
TCP {Contents=[..40..] Payload=[] SrcPort=39590 DstPort=9090(websm) Seq=4211881838 Ack=0 DataOffset=10 FIN=false SYN=true RST=false PSH=false ACK=false URG=false ECE=false CWR=false NS=false Window=26883 Checksum=51556 Urgent=0 Options=[..5..] Padding=[]}
------------------------------------------------------------------------------
...
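The drop reason suggests the BPF fib_lookup helper could not resolve a route or neighbor entry for the backend pod when forwarding the NodePort traffic. As a first check, the kernel's own view can be inspected on the failing node (a sketch; <POD_IP> is the backend address from the drops above):
ip route get <POD_IP>            # does the kernel have a route towards the pod's ENI subnet?
ip neigh show | grep <POD_IP>    # is there an ARP/neighbor entry for the pod's address?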
It seems there is no issue with accessibility from the cluster nodes.
If the Deployment is scaled to more than 1 replica, there are still ‘FIB lookup failed DROPPED’ drops, but the NodePort responds some of the time, which is enough for the target to be ‘InService’ in the AWS load balancer, so clients sometimes get timeouts.
Thanks!
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 19 (8 by maintainers)
For now I cannot check this and do not have a test cluster. I’ll try to check during the next upgrade and will write a comment. Thanks.
Could you set kubeProxyReplacement=disabled?
Here you are, it is with chaining mode:
Thanks for the confirmation. We are going to fix the issue in the v1.11 development cycle.
There is no <POD_IP> in the neighbor table before pinging, and after it:
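A sketch of the neighbor-table check described in that comment, run on the affected node with <POD_IP> standing in for the backend pod's address:
ip neigh show | grep <POD_IP>    # before the ping: no entry is listed
ping -c 1 <POD_IP>               # populates the ARP/neighbor entry
ip neigh show | grep <POD_IP>    # afterwards an entry for <POD_IP> is expected to appear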