cilium: systemd 245 breaks cilium pod to out-of-node traffic

Bug report

General Information

Updating systemd 244.2-2 on Arch to systemd 245.2-1 and 245-3 breaks pod to out-of-node IPv4 traffic. Reverting to 244.2-2 and rebooting fixes the problem. (IPv6 keeps working on all versions.)

I diffed sysctl -a output between 244 and 245 with cilium running (ready):

< net.ipv4.conf.all.promote_secondaries = 1
> net.ipv4.conf.all.promote_secondaries = 0
< net.ipv4.conf.cilium_host.accept_source_route = 1
> net.ipv4.conf.cilium_host.accept_source_route = 0
< net.ipv4.conf.cilium_host.promote_secondaries = 0
> net.ipv4.conf.cilium_host.promote_secondaries = 1
< net.ipv4.conf.cilium_host.rp_filter = 0
> net.ipv4.conf.cilium_host.rp_filter = 2
< net.ipv4.conf.cilium_net.accept_source_route = 1
> net.ipv4.conf.cilium_net.accept_source_route = 0
< net.ipv4.conf.cilium_net.promote_secondaries = 0
> net.ipv4.conf.cilium_net.promote_secondaries = 1
< net.ipv4.conf.default.accept_source_route = 1
> net.ipv4.conf.default.accept_source_route = 0
< net.ipv4.conf.default.promote_secondaries = 0
> net.ipv4.conf.default.promote_secondaries = 1
< net.ipv4.conf.default.rp_filter = 0
> net.ipv4.conf.default.rp_filter = 2
< net.ipv4.conf.ens192.accept_source_route = 1
> net.ipv4.conf.ens192.accept_source_route = 0
< net.ipv4.conf.ens192.promote_secondaries = 0
> net.ipv4.conf.ens192.promote_secondaries = 1
< net.ipv4.conf.ens192.rp_filter = 0
> net.ipv4.conf.ens192.rp_filter = 2
< net.ipv4.conf.lo.accept_source_route = 1
> net.ipv4.conf.lo.accept_source_route = 0
< net.ipv4.conf.lo.promote_secondaries = 0
> net.ipv4.conf.lo.promote_secondaries = 1
< net.ipv4.conf.lo.rp_filter = 0
> net.ipv4.conf.lo.rp_filter = 2
  • Cilium version (run cilium version) 1.7.1
  • Kernel version (run uname -a) Linux k8s22 5.5.10-arch1-1 #1 SMP PREEMPT Wed, 18 Mar 2020 08:40:35 +0000 x86_64 GNU/Linux
  • Orchestration system version in use (e.g. kubectl version, Mesos, …) Kubernetes 1.17.4
  • Upload a system dump (run curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip && python cilium-sysdump.zip and then attach the generated zip file)

cilium-sysdump-20200319-221054.zip

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 41 (30 by maintainers)

Most upvoted comments

I’ve hit the same problem on Ubuntu 20.04.

For future googlers on hetzner systems: Check /etc/sysctl.d/99-hetzner.conf, they set net.ipv4.conf.all.rp_filter=1 there.
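More generally, a quick way to find which drop-in sets rp_filter on any distro (the directories are the standard sysctl.d(5) search path):

grep -rn rp_filter /etc/sysctl.d/ /run/sysctl.d/ /usr/lib/sysctl.d/ 2>/dev/null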

Same issue with systemd 248 (248.3-1ubuntu8) on Ubuntu 21.10 (Impish) with Cilium 1.10-rc0. It was very hard to debug and it would probably be wise to document the rp_filter setting in the official installation docs so new users don’t run into this issue until there is a proper fix in place.

The breaking change is in /usr/lib/sysctl.d/50-default.conf: https://github.com/systemd/systemd/commit/5d4fc0e665a3639f92ac880896c56f9533441307#diff-7816eed8ca6324f23a690cc5f58e6bf7

A minimal fix for 245 is:

echo 'net.ipv4.conf.lxc*.rp_filter = 0' | sudo tee -a /etc/sysctl.d/90-override.conf && sudo systemctl restart systemd-sysctl

When I added this to the 1.10 feature candidates, I meant that #14955 should resolve this. It’s actually clearer to track #14955 in the project, so I’ve removed this one from 1.10. This doesn’t change the release in which we intend to address this issue more generically.

I have encountered this on OpenShift, which uses CoreOS. I can confirm that the following two solutions worked well.

Either write /etc/sysctl.d/99-override_cilium_rp_filter.conf with the following contents:

net.ipv4.conf.lxc*.rp_filter = 0
net.ipv4.conf.cilium_*.rp_filter = 0

Or use enable-endpoint-routes: "true"; however, if you are using tunnelling mode, you will need either Cilium 1.8.5 or 1.9.0 (both unreleased at the time of writing but due soon; see https://github.com/cilium/cilium/pull/13346).
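For reference, a minimal sketch of turning that option on for an existing install, assuming the standard cilium-config ConfigMap and cilium DaemonSet in kube-system (adjust names for your deployment):

kubectl -n kube-system patch configmap cilium-config --type merge -p '{"data":{"enable-endpoint-routes":"true"}}'
kubectl -n kube-system rollout restart daemonset/cilium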

@kkourt I created cilium/cilium-cli#594 as you suggested. 😃

In case it helps anyone, I can confirm this is still happening with systemd 249 (249.3-1-arch), systemd-networkd enabled, and cilium 1.10.3. The sysctl override workaround fixed the issue for me (after recreating all pods).

Just to weigh in here for anyone else who might be searching for it, this affected my systems on NixOS 21.03 (and may indeed affect previous releases). The sysctl configuration mentioned above can be implemented using the boot.kernel.sysctl configuration option, as sketched below.
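A sketch of what that might look like in configuration.nix (the glob keys are passed through to the generated sysctl.d file; verify against your NixOS release):

boot.kernel.sysctl = {
  "net.ipv4.conf.lxc*.rp_filter" = 0;
  "net.ipv4.conf.cilium_*.rp_filter" = 0;
};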

We could detect the systemd version and write out configs for each interface; how would that sound to people?

Long-term, everything will trend towards newer systemd so unless we expect a significant portion of users to be running non-systemd hosts, I don’t think the “detect systemd version” piece makes any meaningful difference; I’d skip that to keep things simple.

Coordinating with systemd seems like the right solution. At the moment my understanding is that the initial lifecycle of the device, from creation to configuration by {cilium,systemd}, is not coordinated, which means that we end up arguing with each other and the last one to perform configuration wins. Seems like that’s systemd. Therefore, I think we either need to tell systemd how to configure the devices, or we need a mechanism to know when systemd will no longer touch the configuration, at which point we know it’s safe for us to do so. The former approach seems more viable.
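As one sketch of the "tell systemd" direction, a systemd-networkd drop-in can declare Cilium's interfaces unmanaged so networkd leaves their configuration alone (file name and match list are illustrative; Unmanaged= is documented in systemd.network(5); note this addresses networkd's link management, not the sysctl.d re-application discussed below):

/etc/systemd/network/50-cilium-unmanaged.network:

[Match]
Name=cilium_host cilium_net cilium_vxlan lxc*

[Link]
Unmanaged=yes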

If we were to just write one file with a wildcard as documented, there is a good chance it could get overwritten, deleted, or shadowed by another config file. Having a config file for each interface reduces the chances of collision.

As long as we’re happy with the notion of writing up to, say, ~100 of these on a given node, and rewriting them on the filesystem every time a new pod is deployed (perhaps dozens of times per minute). I suspect this is probably not noticeable from a performance perspective, given how small the configuration would be. The alternative is to write one configuration file and then have some background monitor that double-checks that the configuration is still OK (for instance, by validating that a random endpoint’s sysctl configuration matches what we expect), as sketched below. Just an idea; we do similar sorts of things via pkg/controller today.
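A minimal sketch of such a background check in shell (the real implementation would live in Go via pkg/controller; the lxc prefix matches Cilium's host-side endpoint interface names):

# warn if any endpoint interface has drifted from the expected rp_filter value
for conf in /proc/sys/net/ipv4/conf/lxc*/rp_filter; do
  [ -e "$conf" ] || continue   # glob did not match: no endpoints present
  val=$(cat "$conf")
  [ "$val" = "0" ] || echo "warning: $conf is $val, expected 0" >&2
done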

It looks like network interface sysctls are applied to new interfaces as they appear.

From the docs:

The settings configured with sysctl.d files will be applied early on boot. The network interface-specific options will also be applied individually for each network interface as it shows up in the system. (More specifically, net.ipv4.conf.*, net.ipv6.conf.*, net.ipv4.neigh.* and net.ipv6.neigh.*).

I had a look at the code as well; I’m not quite sure how exactly the configuration is applied to each new device, but per the above it’s clear that some part of systemd does it (albeit not systemd-sysctl itself).
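That behaviour is easy to check by hand: create a throwaway interface whose name matches the glob and read the setting back (assumes the lxc* override from earlier in the thread is installed):

sudo ip link add lxctest type dummy
cat /proc/sys/net/ipv4/conf/lxctest/rp_filter   # 0 if the override was applied on hotplug
sudo ip link del lxctest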

It looks like the best fix would be to have either node init or the CNI plugin installer drop a config file, and in addition, every place where Cilium writes a sysctl should check the value after writing and at least log an error/warning in case the intended value doesn’t persist.
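A shell sketch of the write-then-verify idea (the real change would be in Cilium's Go sysctl helpers; set_and_check is a made-up name, and it must run as root):

set_and_check() {
  # $1 = sysctl key in dotted form, $2 = desired value
  # note: naive dot-to-slash mapping breaks for interface names that contain dots
  path="/proc/sys/$(printf '%s' "$1" | tr . /)"
  printf '%s\n' "$2" > "$path"
  actual=$(cat "$path")
  [ "$actual" = "$2" ] || echo "warning: $1 reads back as $actual, expected $2" >&2
}

set_and_check net.ipv4.conf.all.rp_filter 0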

Aside from that, another question to ask is: is there a way to enforce sysctls from an eBPF program? It just seems like that could be a better way to continuously ensure the rp_filter setting is correct.

A good workaround for this is to enable endpoint routes (--enable-endpoint-routes). It enforces symmetric routing. https://github.com/cilium/cilium/pull/13346 is going to fix endpoint routes in combination with tunneling.

@mvisonneau OK great, would you mind filing a separate bug for that to help track fixing the regression? The output from your last couple of comments on this thread would be a great start for such a bug.