cilium: pod-to-b-multi-node-nodeport connectivity test failing on EKS with 1.8.0-rc3

When following the EKS GSG instructions to validate Cilium 1.8.0-rc3 for #11903 (including the fix for #12078), the connectivity check for pod-to-b-multi-node-nodeport is failing:

% kubectl get po
NAME                                                     READY   STATUS    RESTARTS   AGE
echo-a-58dd59998d-fcfsc                                  1/1     Running   0          133m
echo-b-865969889d-7qgfg                                  1/1     Running   0          133m
echo-b-host-659c674bb6-tvzxm                             1/1     Running   0          133m
host-to-b-multi-node-clusterip-6fb94d9df6-v25v4          1/1     Running   0          133m
host-to-b-multi-node-headless-7c4ff79cd-2dgct            1/1     Running   0          133m
pod-to-a-5c8dcf69f7-zq2zj                                1/1     Running   0          133m
pod-to-a-allowed-cnp-75684d58cc-tb5jm                    1/1     Running   0          133m
pod-to-a-external-1111-669ccfb85f-8l2p2                  1/1     Running   0          133m
pod-to-a-l3-denied-cnp-7b8bfcb66c-qg2wc                  1/1     Running   0          133m
pod-to-b-intra-node-74997967f8-c88x9                     1/1     Running   0          133m
pod-to-b-intra-node-nodeport-775f967f47-t426f            1/1     Running   0          133m
pod-to-b-multi-node-clusterip-587678cbc4-xskt6           1/1     Running   0          133m
pod-to-b-multi-node-headless-574d9f5894-xd2jq            1/1     Running   0          133m
pod-to-b-multi-node-nodeport-7944d9f9fc-qpv5r            0/1     Running   0          133m
pod-to-external-fqdn-allow-google-cnp-6dd57bc859-bqhhq   1/1     Running   0          133m

% kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
echo-a                 ClusterIP   10.100.133.146   <none>        80/TCP         136m
echo-b                 NodePort    10.100.21.112    <none>        80:31313/TCP   136m
echo-b-headless        ClusterIP   None             <none>        80/TCP         136m
echo-b-host-headless   ClusterIP   None             <none>        <none>         136m
kubernetes             ClusterIP   10.100.0.1       <none>        443/TCP        147m

% kubectl get ep 
NAME                   ENDPOINTS                               AGE
echo-a                 192.168.108.20:80                       136m
echo-b                 192.168.98.169:80                       136m
echo-b-headless        192.168.98.169:80                       136m
echo-b-host-headless   192.168.16.155                          136m
kubernetes             192.168.148.188:443,192.168.97.61:443   147m
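
For reference, the checks above come from the connectivity-check manifest referenced by the GSG; it was deployed with something along these lines (the exact URL/branch is an assumption and may differ for 1.8.0-rc3):

% kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/v1.8/examples/kubernetes/connectivity-check/connectivity-check.yaml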

Follow-up for #12078

/cc @brb

Most upvoted comments

I could reproduce the issue by following the GSG for AWS-EKS. I could also apply the manual fix from Thomas, with an additional step for the return-path filter on eth0.

The ENI CNI in the aws-node daemonset sets the following rules and configuration on the node:

$ ip rule | grep -w 0x80
1024:	from all fwmark 0x80/0x80 lookup main 

$ iptables-save | grep -w 0x80
-A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
-A PREROUTING -i eni+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80

$ sysctl net.ipv4.conf.eth0.rp_filter
2

Since the GSG instructs users to remove the aws-node daemonset before deploying Cilium and creating the nodes, this configuration is not applied, and we have instead:

$ ip rule | grep -w 0x80
(no output)

$ iptables-save | grep -w 0x80
(no output)

$ ip rule | grep 'from 192.168.126.96'
110:	from 192.168.126.96 to 192.168.0.0/16 lookup 3
$ ip route show table 3
default via 192.168.64.1 dev eth1 
192.168.64.1 dev eth1 scope link

$ sysctl net.ipv4.conf.eth0.rp_filter
1

Cilium sets net.ipv4.conf.all.rp_filter to 0, but the kernel uses the maximum of conf/{all,interface}/rp_filter when doing source validation on an {interface}, so in our case rp_filter is effectively in strict mode on eth0. This prevents the packets received from the first node on eth0 from being (SNAT-ed and) forwarded to the pod: they are dropped by the host and no SYN/ACK is sent back. Disabling rp_filter on eth0, or setting it to loose mode, fixes that first issue, but the SYN/ACKs are then not sent to the correct destination.
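
The strict-mode drop can be confirmed by comparing the two sysctls (the kernel uses the larger of the two values for eth0) and by watching the reverse-path-filter drop counter increase while the check retries; the nstat counter name below is an assumption based on what recent kernels expose:

$ sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.eth0.rp_filter
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.eth0.rp_filter = 1
$ nstat -az TcpExtIPReversePathFilter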

This is due to the ip rule matched for those packets: it tells the host to do a FIB lookup in table 3 (associated with the interface at index 3, eth1 in my case) rather than in the main table, as should be the case. This is why we need to mark the packets and look up the main table when the mark is found.
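
To illustrate (a sketch only: 192.168.80.10 stands in for the remote node and lxc12345 for the pod's host-side interface, both hypothetical; the source is the pod address from the rule above), the two lookups can be compared with ip route get. Without a mark the reply resolves through table 3 via eth1; with the 0x80 mark, and the fwmark rule installed ahead of rule 110, it should resolve through the main table via eth0:

$ ip route get 192.168.80.10 from 192.168.126.96 iif lxc12345
$ ip route get 192.168.80.10 from 192.168.126.96 iif lxc12345 mark 0x80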

I used the following commands to restore the rules and get pod-to-b-multi-node-nodeport to become ready:

# sysctl -w net.ipv4.conf.eth0.rp_filter=2
# iptables -t mangle -A PREROUTING -i eth0 -m comment --comment "AWS, primary ENI" -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-xmark 0x80/0x80
# iptables -t mangle -A PREROUTING -i lxc+ -m comment --comment "AWS, primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
# ip rule add fwmark 0x80/0x80 lookup main
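
Once these are in place, readiness of the check pod can be re-checked with:

$ kubectl get po | grep pod-to-b-multi-node-nodeport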

I’m working on a fix to have Cilium reproduce this configuration on AWS.

We likely missed this when validating the GSG for 1.7 because pod-to-b-multi-node-nodeport (or pod-to-b-intra-node-nodeport, which fails with v1.7.5) did not exist at the time.