cilium: BPF masquerading does not masquerade traffic to remote node's ExternalIP
Bug report
We have a Cilium cluster with native routing (no encapsulation). From a pod, the node it runs on is reachable (e.g. via ping), but the external IPs of other nodes are not. So if pod1 runs on node1, ping external_ip_of_node1 works, but ping external_ip_of_node2 does not.
Each node has an internal interface (dummy0) and an external interface (external). All nodes are connected internally using a WireGuard interface (wg0) at the OS level (not managed by Cilium) so that all cluster traffic is encrypted. All external interfaces are reachable from each node.
General Information
- Cilium version (run cilium version)
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), wait-for-node-init (init), clean-cilium-state (init)
Client: 1.10.3 4145278 2021-07-15T16:11:03+02:00 go version go1.16.6 linux/amd64
Daemon: 1.10.3 4145278 2021-07-15T16:11:03+02:00 go version go1.16.6 linux/amd64
- Kernel version (run uname -a)
Linux dev-0001 5.8.0-63-generic #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Orchestration system version in use (e.g. kubectl version, …)
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3+k3s1", GitCommit:"1d1f220fbee9cdeb5416b76b707dde8c231121f2", GitTreeState:"clean", BuildDate:"2021-07-22T20:52:14Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3+k3s1", GitCommit:"1d1f220fbee9cdeb5416b76b707dde8c231121f2", GitTreeState:"clean", BuildDate:"2021-07-22T20:52:14Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
- Generate and upload a system zip:
curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip
How to reproduce the issue
- Start a network multitool pod on any node (https://github.com/Praqma/Network-MultiTool)
- Exec into a shell of the pod
- Ping external address: ping 1.1.1.1 works
- Ping own node’s external address: works
- Ping other node’s external address: does not work
pinging external system
tcpdump
17:42:40.538395 IP 10.100.2.123 > 1.1.1.1: ICMP echo request, id 51, seq 1, length 64
17:42:40.538453 IP NODE1 > 1.1.1.1: ICMP echo request, id 61439, seq 1, length 64
17:42:40.543845 IP 1.1.1.1 > NODE1: ICMP echo reply, id 61439, seq 1, length 64
17:42:40.543892 IP 1.1.1.1 > 10.100.2.123: ICMP echo reply, id 51, seq 1, length 64
cilium monitor
Policy verdict log: flow 0x0 local EP ID 333, remote ID world, proto 1, egress, action allow, match all, 10.100.2.123 -> 1.1.1.1 EchoRequest
-> stack flow 0x0 identity 237465->world state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> 1.1.1.1 EchoRequest
Some NAT seems to be happening here
pinging own node
tcpdump
17:38:29.191152 IP 10.100.2.123 > NODE1: ICMP echo request, id 47, seq 1, length 64
17:38:29.191219 IP NODE1 > 10.100.2.123: ICMP echo reply, id 47, seq 1, length 64
cilium monitor
-> stack flow 0x0 identity 237465->host state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> NODE1 EchoRequest
-> endpoint 333 flow 0x0 identity host->237465 state reply ifindex 0 orig-ip NODE1: NODE1 -> 10.100.2.123 EchoReply
works
pinging other node
tcpdump
17:39:02.437350 IP 10.100.2.123 > $NODE2: ICMP echo request, id 48, seq 1, length 64
cilium monitor
Policy verdict log: flow 0x0 local EP ID 333, remote ID remote-node, proto 1, egress, action allow, match all, 10.100.2.123 -> NODE2 EchoRequest
-> stack flow 0x0 identity 237465->remote-node state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> NODE2 EchoRequest
no response (and no NAT!)
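The three captures above differ in whether a second, rewritten copy of the echo request shows up with the node address as its source. A minimal Python sketch of that check follows, assuming tcpdump's one-line ICMP output in the form quoted in this report. The helper names parse_icmp and was_masqueraded are hypothetical and not part of any Cilium tooling.

```python
import re

# Matches the tcpdump ICMP lines quoted in this report, e.g.
# "17:42:40.538395 IP 10.100.2.123 > 1.1.1.1: ICMP echo request, id 51, seq 1, length 64"
ICMP_LINE = re.compile(
    r"IP (?P<src>\S+) > (?P<dst>\S+): ICMP echo (?P<kind>request|reply), "
    r"id (?P<id>\d+), seq (?P<seq>\d+)"
)


def parse_icmp(lines):
    """Return (src, dst, kind, seq) tuples for every ICMP echo line."""
    packets = []
    for line in lines:
        match = ICMP_LINE.search(line)
        if match:
            packets.append(
                (match["src"], match["dst"], match["kind"], int(match["seq"]))
            )
    return packets


def was_masqueraded(lines, pod_ip):
    """True if a request from pod_ip reappears with a different source
    address for the same destination and sequence number (i.e. SNAT)."""
    requests = [p for p in parse_icmp(lines) if p[2] == "request"]
    from_pod = {(dst, seq) for src, dst, _, seq in requests if src == pod_ip}
    return any(
        src != pod_ip and (dst, seq) in from_pod for src, dst, _, seq in requests
    )


# Request lines copied from the captures in this report.
external = [
    "17:42:40.538395 IP 10.100.2.123 > 1.1.1.1: ICMP echo request, id 51, seq 1, length 64",
    "17:42:40.538453 IP NODE1 > 1.1.1.1: ICMP echo request, id 61439, seq 1, length 64",
]
other_node = [
    "17:39:02.437350 IP 10.100.2.123 > $NODE2: ICMP echo request, id 48, seq 1, length 64",
]

print(was_masqueraded(external, "10.100.2.123"))    # True
print(was_masqueraded(other_node, "10.100.2.123"))  # False
```

The check only looks at echo requests, so it reports the own-node capture as not masqueraded as well, which matches the trace: the request reaches NODE1 with the pod IP as source.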
About this issue
- State: open
- Created 3 years ago
- Comments: 17 (11 by maintainers)
Summarizing discussion from Slack:
Root Cause
The cluster is using BPF-based masquerading.
In BPF-based masquerading, we rely on the destination security identity to skip masquerading for cluster nodes. In particular, we filter on the remote-node identity: https://github.com/cilium/cilium/blob/83edea82d215930c1ee714b322d6e59cbbfec766/bpf/lib/nodeport.h#L1203-L1204
However, all IP addresses of remote nodes are associated with this remote-node identity, regardless of whether they are internal or external.
Workaround
A simple workaround is to disable BPF masquerading and rely on our iptables-based masquerading instead. A more complex workaround would be to handle masquerading yourself.
Long-term Solution
It looks like we shouldn’t be masquerading traffic to external IP addresses. Differentiating between external IP addresses and internal IP addresses in the BPF logic will be a bit more involved than a simple if condition. We may need to have a map with all internal IP addresses. /cc @brb
I have a PR opened to extend this logic to the iptables-based masquerading: https://github.com/cilium/cilium/pull/16603. In the current version of the PR, we skip masquerading for external IP addresses as well. If we decide not to, it’s easy to change.
I think we should only skip masquerading for internal IPs. What would be the reasons for not masquerading external IPs?
Edit: Specifically for the BPF masquerading case, complexity is probably the reason, based on Paul’s comment.
Yes, that’s what I meant to write 🤦
Should we skip masquerading only for internal IPs of remote nodes?
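To make the two behaviours in this discussion concrete, here is a small Python model. It is not Cilium's BPF code. It assumes the identity-based check behaves as described under Root Cause, and that the alternative consults a set of internal node IPs (the "map with all internal IP addresses" suggested above). It only models destinations that are nodes or outside the cluster. All class, function, and variable names are hypothetical, and the node addresses are documentation-range placeholders.

```python
from dataclasses import dataclass, field

REMOTE_NODE = "remote-node"
WORLD = "world"


@dataclass
class Cluster:
    # Hypothetical stand-in for the IP -> security identity lookup.
    identities: dict = field(default_factory=dict)
    # Hypothetical stand-in for a map holding only the nodes' internal IPs.
    internal_node_ips: set = field(default_factory=set)

    def identity_of(self, dst_ip):
        # Anything unknown to the cluster is treated as "world".
        return self.identities.get(dst_ip, WORLD)


def masquerade_by_identity(cluster, dst_ip):
    """Behaviour described under Root Cause: skip SNAT for any
    destination that carries the remote-node identity."""
    return cluster.identity_of(dst_ip) != REMOTE_NODE


def masquerade_unless_internal(cluster, dst_ip):
    """Alternative raised in the thread: skip SNAT only for
    internal IPs of remote nodes."""
    return dst_ip not in cluster.internal_node_ips


# Placeholder addresses for illustration only (assumption, not from the report).
NODE2_INTERNAL = "192.0.2.12"
NODE2_EXTERNAL = "203.0.113.12"

cluster = Cluster(
    # Both addresses of node2 map to the same identity, as noted in the thread.
    identities={NODE2_INTERNAL: REMOTE_NODE, NODE2_EXTERNAL: REMOTE_NODE},
    internal_node_ips={NODE2_INTERNAL},
)

for dst in (NODE2_INTERNAL, NODE2_EXTERNAL, "1.1.1.1"):
    print(
        dst,
        masquerade_by_identity(cluster, dst),
        masquerade_unless_internal(cluster, dst),
    )
```

In this model the external address of node2 is the only destination on which the two checks disagree: the identity-based check leaves the pod IP as source, while the internal-only check would masquerade it.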