cilium: BPF masquerading does not masquerade traffic to remote node's ExternalIP

Bug report

We run a Cilium cluster with native routing (no encapsulation). From a pod, the external IP of the node it runs on is reachable (e.g. via ping), but the external IPs of other nodes are not. So if pod1 runs on node1, ping external_ip_of_node1 works, but ping external_ip_of_node2 does not.

Each node has an internal interface (dummy0) and an external interface (external). All nodes are connected internally using a WireGuard interface (wg0) at the OS level (not managed by Cilium) so that all cluster traffic is encrypted. All external interfaces are reachable from each node.

General Information

  • Cilium version (run cilium version)
Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), wait-for-node-init (init), clean-cilium-state (init)
Client: 1.10.3 4145278 2021-07-15T16:11:03+02:00 go version go1.16.6 linux/amd64
Daemon: 1.10.3 4145278 2021-07-15T16:11:03+02:00 go version go1.16.6 linux/amd64
  • Kernel version (run uname -a)
Linux dev-0001 5.8.0-63-generic #71-Ubuntu SMP Tue Jul 13 15:59:12 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, …)
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3+k3s1", GitCommit:"1d1f220fbee9cdeb5416b76b707dde8c231121f2", GitTreeState:"clean", BuildDate:"2021-07-22T20:52:14Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3+k3s1", GitCommit:"1d1f220fbee9cdeb5416b76b707dde8c231121f2", GitTreeState:"clean", BuildDate:"2021-07-22T20:52:14Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Generate and upload a system zip:
curl -sLO https://git.io/cilium-sysdump-latest.zip && python cilium-sysdump-latest.zip

How to reproduce the issue

  1. Start a network multitool pod on any node (https://github.com/Praqma/Network-MultiTool)
  2. Exec into a shell of the pod
  3. Ping external address: ping 1.1.1.1 works
  4. Ping own node’s external address: works
  5. Ping other node’s external address: does not work

pinging external system

tcpdump

17:42:40.538395 IP 10.100.2.123 > 1.1.1.1: ICMP echo request, id 51, seq 1, length 64
17:42:40.538453 IP NODE1 > 1.1.1.1: ICMP echo request, id 61439, seq 1, length 64
17:42:40.543845 IP 1.1.1.1 > NODE1: ICMP echo reply, id 61439, seq 1, length 64
17:42:40.543892 IP 1.1.1.1 > 10.100.2.123: ICMP echo reply, id 51, seq 1, length 64

cilium monitor

Policy verdict log: flow 0x0 local EP ID 333, remote ID world, proto 1, egress, action allow, match all, 10.100.2.123 -> 1.1.1.1 EchoRequest
-> stack flow 0x0 identity 237465->world state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> 1.1.1.1 EchoRequest

NAT is happening here: the source is rewritten from 10.100.2.123 to NODE1 on the way out, and the reply is translated back.

pinging own node

tcpdump

17:38:29.191152 IP 10.100.2.123 > NODE1: ICMP echo request, id 47, seq 1, length 64
17:38:29.191219 IP NODE1 > 10.100.2.123: ICMP echo reply, id 47, seq 1, length 64

cilium monitor

-> stack flow 0x0 identity 237465->host state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> NODE1 EchoRequest
-> endpoint 333 flow 0x0 identity host->237465 state reply ifindex 0 orig-ip NODE1: NODE1 -> 10.100.2.123 EchoReply

works

pinging other node

tcpdump

17:39:02.437350 IP 10.100.2.123 > NODE2: ICMP echo request, id 48, seq 1, length 64

cilium monitor

Policy verdict log: flow 0x0 local EP ID 333, remote ID remote-node, proto 1, egress, action allow, match all, 10.100.2.123 -> NODE2 EchoRequest
-> stack flow 0x0 identity 237465->remote-node state new ifindex 0 orig-ip 0.0.0.0: 10.100.2.123 -> NODE2 EchoRequest

No response (and no NAT!): the request leaves with the pod IP 10.100.2.123 as its source.

About this issue

  • State: open
  • Created 3 years ago
  • Comments: 17 (11 by maintainers)

Most upvoted comments

Summarizing discussion from Slack:

Root Cause

The cluster is using BPF-based masquerading:

# kubectl -n kube-system exec cilium-jf7bc -- cilium status --verbose
[...]
Masquerading:           BPF       [dummy0, external]   10.96.0.0/12 [IPv4: Enabled, IPv6: Disabled]

In BPF-based masquerading, we rely on the destination security identity to skip masquerading for cluster nodes. In particular, we filter on the remote-node identity: https://github.com/cilium/cilium/blob/83edea82d215930c1ee714b322d6e59cbbfec766/bpf/lib/nodeport.h#L1203-L1204

However, all IP addresses of remote nodes are associated with this remote-node identity, internal and external alike. Traffic from a pod to a remote node's external IP therefore skips masquerading too and leaves with the pod IP as its source.
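
The effect of that identity check can be modeled in a few lines. The following Python sketch is not Cilium's code: the `IPCACHE` table, the example addresses, and the functions `lookup_identity` and `should_masquerade` are hypothetical stand-ins for the real identity lookup and the check in nodeport.h. It only illustrates why a packet to a remote node's external IP is treated the same as one to its internal IP.

```python
# Toy model of identity-based masquerade skipping (NOT Cilium's actual code).
# Hypothetical identity table: every address of a remote node, internal or
# external, maps to the same "remote-node" identity.
IPCACHE = {
    "10.0.0.2": "remote-node",     # assumed internal (wg0) address of node2
    "203.0.113.2": "remote-node",  # assumed external address of node2
}


def lookup_identity(dst_ip: str) -> str:
    """Return the identity for dst_ip; unknown destinations count as "world"."""
    return IPCACHE.get(dst_ip, "world")


def should_masquerade(dst_ip: str) -> bool:
    """Skip SNAT whenever the destination carries the remote-node identity."""
    return lookup_identity(dst_ip) != "remote-node"


if __name__ == "__main__":
    for dst in ("1.1.1.1", "10.0.0.2", "203.0.113.2"):
        print(dst, lookup_identity(dst), "masquerade:", should_masquerade(dst))
```

In this model, 1.1.1.1 is masqueraded (matching the first tcpdump above), while both addresses of node2 are not, which matches the "no NAT" observation for NODE2.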

Workaround

A simple workaround is to disable BPF masquerading to rely on our iptables-based masquerading instead. A more complex workaround would be to handle masquerading yourself.

Long-term Solution

It looks like we shouldn’t be masquerading traffic to external IP addresses. Differentiating between external IP addresses and internal IP addresses in the BPF logic will be a bit more involved than a simple if condition. We may need to have a map with all internal IP addresses. /cc @brb
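
One possible shape for such a map, following the clarification further down the thread that only internal IPs should skip masquerading, is sketched below in Python. This is an assumption-laden illustration, not the planned implementation: `INTERNAL_NODE_IPS`, the example addresses, and `should_masquerade_with_map` are all hypothetical names, and the real logic would live in a BPF map rather than a Python set.

```python
import ipaddress

# Hypothetical "map" of internal node addresses; in this sketch only these
# destinations are exempt from masquerading.
INTERNAL_NODE_IPS = {
    ipaddress.ip_address("10.0.0.1"),  # assumed internal address of node1
    ipaddress.ip_address("10.0.0.2"),  # assumed internal address of node2
}


def should_masquerade_with_map(dst_ip: str) -> bool:
    """Masquerade unless the destination is a known internal node address."""
    return ipaddress.ip_address(dst_ip) not in INTERNAL_NODE_IPS


if __name__ == "__main__":
    # 203.0.113.2 stands in for node2's external address (assumed).
    for dst in ("10.0.0.2", "203.0.113.2", "1.1.1.1"):
        print(dst, "masquerade:", should_masquerade_with_map(dst))
```

The design choice here is that the exemption is keyed on the address itself rather than on the node's identity, so a node's external address is no longer exempted just because it belongs to a cluster node.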

I have an open PR to extend this logic to the iptables-based masquerading: https://github.com/cilium/cilium/pull/16603. In the current version of the PR, we skip masquerading for external IP addresses as well. If we decide not to, it’s easy to change.

should we skip masquerading for external IPs of remote nodes?

I think we should only skip masquerading for internal IPs. What would be the reasons for not masquerading external IPs?

Edit: Specifically for the BPF masquerading case, complexity is probably the reason, based on Paul’s comment:

Differentiating between external IP addresses and internal IP addresses in the BPF logic will be a bit more involved than a simple if condition.

should we skip masquerading for external IPs of remote nodes?

I think we should only skip masquerading for internal IPs. What would be the reasons for not masquerading external IPs?

Yes, that’s what I meant to write 🤦

Should we skip masquerading only for internal IPs of remote nodes?