cilium: Egress Gateway is not SNAT'ing the correct IP

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When routing a pod through the egress gateway, traffic doesn't leave with the IP address I configured. All of my nodes share the same L2 network and collectively have access to a full /24 block; each node is assigned a single /32 from that /24. Ideally, I'd like to use another, currently unused IP from that same /24 block as the egress IP.

For example: if I have 33.44.55.1 as the main IP address, and 33.44.55.25 as an additional IP on the same interface, and I set up my policy like so:

apiVersion: cilium.io/v2
kind: CiliumEgressGatewayPolicy
metadata:
  name: example-policy
spec:
  selectors:
  - podSelector:
      matchLabels:
        k8s.my-pod-label.com/id: abc123
        io.kubernetes.pod.namespace: the-namespace

  destinationCIDRs:
  - "0.0.0.0/0"

  egressGateway:
    nodeSelector:
      matchLabels:
        k8s.my-node-label.com/id: abc123

    egressIP: 33.44.55.25

Traffic actually egresses with 33.44.55.1 as the source IP, even though the policy above defines 33.44.55.25. Confusingly, when I run cilium bpf egress list on the Cilium pod running on the egress gateway node, it shows the expected 33.44.55.25 egress IP, which doesn't match what's actually used on the wire.

I add the secondary IP using the following command: ip addr add 33.44.55.25/32 brd 33.44.55.25 dev bond0. The behavior was the same when I permanently attached the IP to the NIC with nmcli.
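For reference, the nmcli equivalent I tried was roughly the following (a sketch; it assumes the NetworkManager connection is also named bond0, which may differ on your system):

nmcli connection modify bond0 +ipv4.addresses 33.44.55.25/32
nmcli connection up bond0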

Cilium Version

1.14.0-snapshot.2 (but also experienced this in 1.13.2)

Kernel Version

Linux rn6-k8s.va.internal.rocketnode.net 5.14.15-1.el8.elrepo.x86_64 #1 SMP Tue Oct 26 11:45:20 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.10", GitCommit:"5c1d2d4295f9b4eb12bfbf6429fdf989f2ca8a02", GitTreeState:"clean", BuildDate:"2023-01-18T19:08:10Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

I’ve had this fixed for a while, but figured I would follow up here in case anyone else is having this problem.

What I found is that when a MetalLB address block is announced from all nodes, you can't guarantee which node the return traffic hits. For example, if traffic from our pod leaves via Node 1, but Nodes 1, 2, and 3 are all announcing the same IP, there's no guarantee the reply will come back to Node 1, so the connection just hangs.

However, if MetalLB announces the IP from the same node that the egress gateway sends traffic from, return traffic is guaranteed to arrive on the node that originally sent it, and everything works as intended. (tl;dr: the MetalLB and egress gateway node selectors need to match; see the sketch below.)
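For anyone setting this up, here's a minimal sketch of what matching selectors look like, assuming MetalLB's CRD-based configuration (v0.13+) and a hypothetical address pool named example-pool; the nodeSelectors mirror the egressGateway.nodeSelector from the policy above:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example-l2adv
  namespace: metallb-system
spec:
  ipAddressPools:
  - example-pool
  nodeSelectors:
  - matchLabels:
      k8s.my-node-label.com/id: abc123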

My original approach didn't account for the latter, which is why it wasn't working for me. Everything is working solidly now, and we're using this setup in production successfully. It does, however, introduce a single point of failure for ingress, but with a bit of tooling you can dynamically update the MetalLB and egress gateway node mappings, along with the IPs on the node itself, so if a node goes down, transit can be moved with minimal downtime.
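As a sketch of that tooling (node names here are hypothetical; the label is the one from the policy above), failing over mostly amounts to moving the shared label to a healthy node:

kubectl label node old-gateway-node k8s.my-node-label.com/id-
kubectl label node new-gateway-node k8s.my-node-label.com/id=abc123

and then adding 33.44.55.25/32 to the new node's NIC as shown earlier.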