cilium: Don't require host reachable service for ebpf masquerading
Proposal / RFE
Is your feature request related to a problem? Yes
Describe the solution you’d like: Today, Cilium requires host-reachable services to enable BPF masquerading; the restriction was added in this commit.
The reason is that when host-namespace (hostns) pods talk to a ClusterIP, the kernel picks the node IP as the source IP rather than the cilium_host IP. The packets are still tunneled to the remote backend, and due to #12544 the return packet is masqueraded on the remote node. The real fix would be to let the kernel pick the cilium_host IP for such traffic so that we have a symmetric datapath.
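For concreteness, here is a minimal diagnostic sketch (in Go, using the github.com/vishvananda/netlink package; the ClusterIP value is purely illustrative and not from this issue) that mirrors ip route get and shows which source address the kernel selects for host-namespace traffic to a ClusterIP. On an affected node this prints the node IP rather than the cilium_host IP:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

func main() {
	// Illustrative ClusterIP; replace with a real Service IP when testing.
	clusterIP := net.ParseIP("10.96.0.10")

	// Equivalent of `ip route get 10.96.0.10` run from the host namespace.
	routes, err := netlink.RouteGet(clusterIP)
	if err != nil {
		log.Fatalf("route lookup failed: %v", err)
	}
	if len(routes) == 0 {
		log.Fatal("no route found")
	}

	// routes[0].Src is the preferred source address the kernel selected.
	// The asymmetry described above occurs when this is the node IP
	// instead of the cilium_host IP.
	fmt.Printf("dst=%s src=%s\n", clusterIP, routes[0].Src)
}
```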
The feature is useful for kernels < 4.19, where people can enable BPF NodePort and masquerading while keeping host-reachable services off, in kube-proxy partial mode.
Proposal: pass a --cilium-host-route-cidr flag to cilium-agent and install a route based on that flag:
[cilium-host-route-cidr] via 192.168.4.111 dev cilium_host src 192.168.4.111 mtu 1450
To make the implementation simpler, maybe we could always install this route whenever --cilium-host-route-cidr is passed, regardless of other flags. So whoever wants to enable BPF masquerading without host-reachable services needs to pass --cilium-host-route-cidr.
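As a rough sketch of what the agent side could look like (this is not actual cilium-agent code; the netlink library, helper name, and service CIDR below are illustrative assumptions), installing the proposed route with a src hint might look like this:

```go
package main

import (
	"fmt"
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// installCiliumHostRoute installs a route equivalent to
// "<cidr> via <cilium_host IP> dev cilium_host src <cilium_host IP>",
// so the kernel prefers the cilium_host IP as source for traffic to cidr.
// (The mtu 1450 metric from the example route above is omitted for brevity.)
func installCiliumHostRoute(cidr string) error {
	_, dst, err := net.ParseCIDR(cidr)
	if err != nil {
		return err
	}

	link, err := netlink.LinkByName("cilium_host")
	if err != nil {
		return err
	}

	addrs, err := netlink.AddrList(link, netlink.FAMILY_V4)
	if err != nil {
		return err
	}
	if len(addrs) == 0 {
		return fmt.Errorf("no IPv4 address configured on cilium_host")
	}
	hostIP := addrs[0].IP

	// RouteReplace is idempotent, so re-running the agent does not fail
	// if the route already exists.
	return netlink.RouteReplace(&netlink.Route{
		LinkIndex: link.Attrs().Index,
		Dst:       dst,
		Gw:        hostIP,
		Src:       hostIP, // the src hint that makes the kernel pick the cilium_host IP
	})
}

func main() {
	// Illustrative value that would come from --cilium-host-route-cidr.
	if err := installCiliumHostRoute("10.96.0.0/12"); err != nil {
		log.Fatal(err)
	}
}
```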
About this issue
- Original URL
- State: open
- Created 3 years ago
- Comments: 21 (17 by maintainers)
@pchaigno @liuyuan10 @borkmann what is the current plan to support Kata Containers even when kube-proxy free? We really need this feature since we want to use a kube-proxy-free environment now with the 1.11 release and the Istio support.
Thanks for the pointers. I think both changes are fairly small and I’ll give it a try if that’s all needed.
Correction:
bpf-lb-sock-hostns-only: "true" does not work for Kata on kernel v5.4 (Ubuntu 20.04). Updating the kernel to v5.13 (linux-image-generic-hwe-20.04) fixes the issue. Now we can have Kata containers able to reach k8s services, and hostNetwork pods able to reach k8s services at the same time on the same nodes with:
@pchaigno Thanks for the link. With that patch, I think when endpoint routes are enabled, even pod-to-pod traffic has an asymmetric path, where vxlan->lxc passes through the kernel while the return traffic does not.
I think what you mean is that when !ENABLE_NODEPORT, all traffic between vxlan and lxc should pass through the kernel? Let me prepare the patch.
I think we only need to do that for !defined(ENABLE_NODEPORT), which is in phase with what we discussed a couple SIG-Datapath meetings ago: removing all path asymmetries when using kube-proxy. Grep for nodeport_lb in bpf_overlay.c (NodePort BPF should be enabled on a tunnel device).