cilium: Don't require host reachable service for ebpf masquerading

Proposal / RFE

Is your feature request related to a problem? Yes

Describe the solution you’d like Today Cilium requires host-reachable services in order to enable BPF masquerading; the restriction was added in this commit.

The reason is that when a hostns pod talks to a ClusterIP, the kernel picks the node IP as the source IP rather than the cilium_host IP. The packets are still tunneled to the remote backend, but due to #12544 the return packet is masqueraded on the remote node. The real fix is to let the kernel pick the cilium_host IP for such traffic so that the datapath is symmetric.
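
As an illustration (hypothetical addresses, not taken from the issue): on a node whose primary IP is 192.168.4.2 and whose cilium_host IP is 192.168.4.111, a ClusterIP lookup from the host namespace falls back to the default route and therefore sources the node IP, roughly:

$ ip route get 10.96.0.1
10.96.0.1 via 192.168.4.1 dev eth0 src 192.168.4.2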

The feature is useful on kernels < 4.19, where people could then enable BPF NodePort and masquerading while keeping host-reachable services off, in kube-proxy partial mode.

Proposal: pass a --cilium-host-route-cidr flag to cilium-agent and install a route based on that flag:

[cilium-host-route-cidr] via 192.168.4.111 dev cilium_host src 192.168.4.111 mtu 1450

To keep the implementation simple, we could always install this route when --cilium-host-route-cidr is passed, regardless of other flags. Anyone who wants to enable BPF masquerading without host-reachable services would then need to pass --cilium-host-route-cidr.
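
As a concrete sketch (the CIDR 10.0.0.0/8 is only an illustrative cluster pod CIDR, and the flag itself is merely proposed here), the agent would effectively run the equivalent of:

ip route replace 10.0.0.0/8 via 192.168.4.111 dev cilium_host src 192.168.4.111 mtu 1450

so that host-namespace traffic destined for the cluster is sourced from the cilium_host IP.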

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 21 (17 by maintainers)

Most upvoted comments

@pchaigno @liuyuan10 @borkmann what is the current plan to support Kata Containers even in kube-proxy-free mode? We really need this feature, since we want to use a kube-proxy-free environment now with the 1.11 release and the Istio support.

Thanks for the pointers. I think both changes are fairly small and I’ll give it a try if that’s all needed.

Correction: bpf-lb-sock-hostns-only: "true" does not work for Kata on kernel v5.4 (Ubuntu 20.04). Updating the kernel to v5.13 (linux-image-generic-hwe-20.04) fixes the issue. Now Kata containers and hostNetwork pods can both reach k8s services at the same time on the same nodes with:

kubeProxyReplacement: "strict"
hostServices:
  hostNamespaceOnly: true
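
(For reference, and assuming a Helm-based install, these values should correspond to the cilium-config ConfigMap keys kube-proxy-replacement: "strict" and bpf-lb-sock-hostns-only: "true" mentioned above.)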

@pchaigno Thanks for the link. With that patch, I think that when endpoint routes are enabled, even pod-to-pod traffic has an asymmetric path, where vxlan->lxc passes through the kernel but the return traffic does not.

I think what you mean is that when !ENABLE_NODEPORT, all traffic between vxlan and lxc should pass through the kernel? Let me prepare the patch.

@liuyuan10 I think we just need to avoid the redirect() from cilium_vxlan to the lxc device on the forward path, no?

If we go that route, ideally only in a constrained setting (e.g. old kernels only), given this will likely introduce a performance regression (it’s also incompatible with BPF host routing, but that’s latest kernels only).

I think we only need to do that for !defined(ENABLE_NODEPORT), which is in line with what we discussed a couple of SIG-Datapath meetings ago: removing all path asymmetries when using kube-proxy.
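
A minimal sketch of what that could look like in the datapath C code (illustrative only; the helper name and structure are hypothetical and not taken from the actual Cilium sources):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical helper: decide how to hand a decapsulated packet from
 * cilium_vxlan to the destination endpoint's lxc device. */
static __always_inline int deliver_to_endpoint(struct __sk_buff *ctx,
                                               int lxc_ifindex)
{
#ifdef ENABLE_NODEPORT
	/* NodePort BPF enabled: keep the fast path and redirect the
	 * packet straight to the endpoint's lxc device. */
	return bpf_redirect(lxc_ifindex, 0);
#else
	/* kube-proxy mode: hand the packet back to the kernel stack so
	 * the forward path matches the return path (no asymmetry). */
	return TC_ACT_OK;
#endif
}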

One question: if all pod->node traffic goes through the tunnel, how can a pod talk to a NodePort?

Grep for nodeport_lb in bpf_overlay.c (NodePort BPF should be enabled on a tunnel device).
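
For instance, from the Cilium source tree (illustrative command only):

grep -n nodeport_lb bpf/bpf_overlay.c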