cilium: Host network broken after one of the underlying interfaces of a bond goes down

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

On Equinix Metal the network setup is a bond of two NICs using LACP. When Cilium is used as CNI for Kubernetes on Flatcar Container Linux, and one of the two NIC interfaces goes down, the host network is broken and remains broken even if the underlying interface goes up again.

The network looks normal, but ping 1.1 produces no outgoing packets visible in tcpdump -i bond0 (ping reports 100% packet loss), and ping 127.0.0.1 fails with ping: sendmsg: Operation not permitted.

We could not restore the network even after terminating the Pods and kubelet on the node, flushing nft, and deleting the Cilium interfaces (maybe BPF programs are still loaded and not cleaned up?).
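A rough sketch of that cleanup, for reference (a guess at the exact commands; the interface names are the ones a default Cilium install creates):

sudo systemctl stop kubelet
sudo nft flush ruleset
# delete the interfaces a default Cilium install creates (the vxlan device exists in the default tunnel mode)
sudo ip link delete cilium_host   # also removes its cilium_net veth peer
sudo ip link delete cilium_vxlan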

IPv6 is not affected, ping6 2606:4700:4700::1111 works.

Cilium Version

1.9, 1.10, 1.11

Kernel Version

from 5.10.52 to 5.10.84

Kubernetes Version

1.22

Sysdump

🔍 Collecting Kubernetes nodes
failed to create sysdump collector: failed to collect Kubernetes nodes: Get "https://136.144.49.47:6443/api/v1/nodes": dial tcp 136.144.49.47:6443: i/o timeout

Relevant log output

Kernel

Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): invalid new link 1 on slave
Feb 04 15:50:06 kernel: mlx5_core 0000:02:00.0: modify lag map port 1:2 port 2:2
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status definitely down, disabling slave
Feb 04 15:51:09 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Feb 04 15:51:10 kernel: lxc_health: Caught tx_queue_len zero misconfig
Feb 04 15:51:33 kernel: mlx5_core 0000:02:00.0 enp2s0f0np0: Link down
Feb 04 15:51:33 kernel: mlx5_core 0000:02:00.0 enp2s0f0np0: Link up
Feb 04 15:51:33 kernel: bond0: (slave enp2s0f0np0): link status up again after 200 ms
Feb 04 15:51:33 kernel: bond0: (slave enp2s0f0np0): link status definitely up, 10000 Mbps full duplex
Feb 04 15:51:35 kernel: mlx5_core 0000:02:00.0: modify lag map port 1:1 port 2:2

Anything else?

Flatcar releases 2905.x.y to 3033.x.y are affected, running systemd versions 247 to 249 (maybe relevant because systemd-networkd is used).

Flatcar releases 2764.x.y are not affected (kernel 5.10.43, systemd 247).

Reproduce it by provisioning an Equinix Metal machine with Flatcar Stable (we used c3.small.x86). Ensure it is on the latest version:

update_engine_client -update
sudo rm -f /etc/systemd/system/containerd.service.d/10-use-cgroupfs.conf
sudo sed -i 's/systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller//' /usr/share/oem/grub.cfg
sudo systemctl reboot

Set up a one-node Cilium cluster, using the script contents at the end:

sudo ./install.sh

Then run the following. This action is valid and should not cause any harm, but now it does:

sudo ip link set enp2s0f0np0 down

(and sudo ip link set enp2s0f0np0 up does not help)

The install.sh script used above:

#!/bin/bash

set -xe

systemctl enable --now docker
modprobe br_netfilter

cat <<EOF | tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF

cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system

CNI_VERSION="v0.8.2"
CRICTL_VERSION="v1.17.0"
RELEASE_VERSION="v0.4.0"
DOWNLOAD_DIR=/opt/bin
RELEASE="$(curl -sSL https://dl.k8s.io/release/stable.txt)"

mkdir -p /opt/cni/bin
mkdir -p /etc/systemd/system/kubelet.service.d

curl() {
	command curl -sSfL "$@"
}

curl "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz
curl "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" | tar -C $DOWNLOAD_DIR -xz
curl "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | tee /etc/systemd/system/kubelet.service
curl "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
curl --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-amd64.tar.gz /opt/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}

chmod +x {kubeadm,kubelet,kubectl}
mv {kubeadm,kubelet,kubectl} $DOWNLOAD_DIR/

systemctl enable --now kubelet
#systemctl status kubelet

cat <<EOF | tee kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
networking:
  podSubnet: "192.168.254.0/24"
EOF

# For explicit cgroupdriver selection
# ---
# kind: KubeletConfiguration
# apiVersion: kubelet.config.k8s.io/v1beta1
# cgroupDriver: systemd

# For containerd
# apiVersion: kubeadm.k8s.io/v1beta2
# kind: InitConfiguration
# nodeRegistration:
#  criSocket: "unix:///run/containerd/containerd.sock"

export PATH=$PATH:$DOWNLOAD_DIR

kubeadm config images pull
kubeadm init --config kubeadm-config.yaml

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

kubectl create -f https://raw.githubusercontent.com/cilium/cilium/v1.9.4/install/kubernetes/quick-install.yaml

kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl get pods -A
kubectl get nodes -o wide

kubectl apply -f https://k8s.io/examples/application/deployment.yaml
kubectl expose deployment.apps/nginx-deployment

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
tar xzvfC cilium-linux-amd64.tar.gz /opt/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 44 (15 by maintainers)

Most upvoted comments

I faced this today on a very simple Ubuntu 22.04 install - simply running sudo netplan apply manually was enough to break the host network after Cilium had been running. Setting

ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

in /etc/systemd/networkd.conf did indeed fix it, so I wonder if we should document that?
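For reference, the same settings as a networkd drop-in (the file name is just an example):

# /etc/systemd/networkd.conf.d/10-cilium.conf (example file name)
[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

followed by sudo systemctl restart systemd-networkd to apply it.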

We also hit the same issue with Flatcar 3033.2.1 and systemd 249.4. It seems that systemd-networkd removed routing policy rules, including the local rule, and the host could no longer recognize localhost because of it. Cilium moves the policy rule for local when L7Proxy is enabled, and systemd-networkd regards this local rule as a foreign routing policy rule and removes it.

$ ip rule list
32766:  from all lookup main
32767:  from all lookup default
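For comparison, on a healthy host the default list also contains the local rule at priority 0, which is the one that disappeared here:

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default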

network: drop unnecessary routing policy rules https://github.com/systemd/systemd/commit/0b81225e5791f660506f7db0ab88078cf296b771

For Flatcar Stable this works in /etc/systemd/networkd.conf:

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

Then the additional IgnoreCarrierLoss=yes and KeepConfiguration=yes workarounds mentioned in https://github.com/cilium/cilium/issues/18706#issuecomment-1031470456 are not needed.

Does anyone know when this will be released?

This is part of quay.io/cilium/cilium:v1.14.0-snapshot.3 for testing and will go into 1.14.0.

https://github.com/cilium/cilium/releases/tag/v1.14.0-snapshot.3

I just ran into this problem, and it broke both IPv4 and IPv6. The solution in https://github.com/cilium/cilium/issues/18706#issuecomment-1031572546 did fix IPv4. It would have saved me a lot of time if this had been documented in an obvious place. Please consider documenting it in the installation manual, or better yet, fix it automatically.

Yes, it seems to have. That's the default configuration in /etc/systemd/networkd.conf:

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it under the
#  terms of the GNU Lesser General Public License as published by the Free
#  Software Foundation; either version 2.1 of the License, or (at your option)
#  any later version.
#
# Entries in this file show the compile time defaults. Local configuration
# should be created by either modifying this file, or by creating "drop-ins" in
# the networkd.conf.d/ subdirectory. The latter is generally recommended.
# Defaults can be restored by simply deleting this file and all drop-ins.
#
# See networkd.conf(5) for details.

[Network]
#SpeedMeter=no
#SpeedMeterIntervalSec=10sec
#ManageForeignRoutingPolicyRules=yes
#ManageForeignRoutes=yes
#RouteTable=

[DHCPv4]
#DUIDType=vendor
#DUIDRawData=

[DHCPv6]
#DUIDType=vendor
#DUIDRawData=

I do the following:

sed -i /etc/systemd/networkd.conf -e 's/^#ManageForeignRoutingPolicyRules=yes$/ManageForeignRoutingPolicyRules=no/g'
sed -i /etc/systemd/networkd.conf -e 's/^#ManageForeignRoutes=yes$/ManageForeignRoutes=no/g'
systemctl restart systemd-networkd

Then set up Cilium again, and the host network no longer breaks during Cilium network device creation.

OK, this manually brought it back to a working state:

sudo ip rule add from all fwmark 0x200/0xf00 lookup 2004 pref 9
sudo ip rule add from all fwmark 0xa00/0xf00 lookup 2005 pref 10
sudo ip rule add from all lookup local pref 100

Not sure whether it's easy to create a networkd unit that does the same. (If it is possible, Cilium could write a networkd unit into the host's /run to play well with networkd by default, which would avoid requiring manual tweaking of global networkd settings.)

You can use ManageForeignRoutingPolicyRules=no to protect policies. https://github.com/systemd/systemd/commit/d94dfe7053d49fa62c4bfc07b7f3fc2227c10aff

However, things are a little bit complicated because of this bug. The fix is backported to systemd v249.5. With systemd v249.4, you need to set both IgnoreCarrierLoss=yes and KeepConfiguration=yes as well, and never use networkctl reconfigure or networkctl reload. Alternatively, you can create the policy rules before networkd starts; I think the latter is difficult in our case.

Only ManageForeignRoutingPolicyRules=no is sufficient if you can use systemd v249.5.
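For reference, a sketch of where those two options would go: the [Network] section of the .network unit that matches the affected interface (the unit name and match below are only an example):

# /etc/systemd/network/00-bond0.network (example unit name)
[Match]
Name=bond0

[Network]
IgnoreCarrierLoss=yes
KeepConfiguration=yes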

Having the same problem with:

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

on Ubuntu Server 22.04.2 LTS:

$ systemd --version
systemd 249 (249.11-0ubuntu3.9)
$ uname -a
Linux srv-magic-master 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

with 1.14.0-snapshot.2

Once I stopped k3s with k3s-killall.sh, network access was lost, but the host still replied to pings.

@borkmann would it make sense to have a preflight check for cilium around these settings, given that the defaults can vary from OS to OS and the combination of systemd and cilium configuration seems to be in conflict with each other?

Good question, yes, I think it would make sense to have some automation in this regard to avoid users running into it. I haven’t checked yet how reliably one could detect systemd versions and/or whether older versions of systemd would just ignore some of the not-yet-supported networkd.conf settings. A preflight check seems sensible. cc @aanm
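A rough sketch of what such a check could look like on a node (purely illustrative, not an existing cilium-cli feature; it only inspects the main networkd.conf and ignores drop-ins and systemd versions):

#!/bin/bash
# Warn if systemd-networkd is active and may still manage foreign routing policy rules.
if systemctl is-active --quiet systemd-networkd; then
  if ! grep -qE '^[[:space:]]*ManageForeignRoutingPolicyRules[[:space:]]*=[[:space:]]*no' /etc/systemd/networkd.conf 2>/dev/null; then
    echo "warning: systemd-networkd may remove Cilium's routing policy rules;"
    echo "         consider setting ManageForeignRoutingPolicyRules=no (and ManageForeignRoutes=no) in networkd.conf"
  fi
fi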

The preflight check is only intended for Cilium upgrades. I assume the issue can happen when the kernel upgrades but Cilium is not. Unless you were referring to a preflight inside the cilium-agent pod?

I was thinking of a start-up check. The issue we faced was during switch maintenance: we lost an entire Kubernetes cluster because every node had lost host networking, and restarting was the fastest fix to restore service. In effect, these systemd-networkd + cilium configurations are dangerous for a production environment where I anticipated having LACP up and working.

The networkd drop-ins do resolve my particular use case; however, I can’t be the only cilium user on a similar setup.

Thanks @ysksuzuki for the findings

Right, with Unmanaged=yes the rules are ignored and the only way for Cilium to tell networkd to preserve them is by putting them into an active .network unit (e.g., for a dummy interface)…

For reference, this is the translation into the networkd syntax:

[RoutingPolicyRule]
From=0.0.0.0/0
Table=local
Priority=100

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2004
FirewallMark=512/3840
Priority=9

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2005
FirewallMark=2560/3840
Priority=10
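A sketch of how these rules could live in an active unit for a dummy interface, as mentioned above (all names are examples; the .netdev creates the dummy device, the .network attaches the rules to it):

# /etc/systemd/network/10-cilium-rules.netdev (example name)
[NetDev]
Name=cilium-rules0
Kind=dummy

# /etc/systemd/network/10-cilium-rules.network (example name)
[Match]
Name=cilium-rules0

[RoutingPolicyRule]
From=0.0.0.0/0
Table=local
Priority=100

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2004
FirewallMark=512/3840
Priority=9

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2005
FirewallMark=2560/3840
Priority=10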

I think the networkd people had their reasons already from a “server” profile perspective. The way forward is that Flatcar, as a distro that expects Cilium to be running, would predefine the global setting like we do for similar cases. Maybe it makes sense for Cilium to add a check in the cilium install phase for whether networkd is used and then recommend changing the default (plus having Unmanaged=yes is also a good idea, even though it is only needed if people try to match too generically in their own networkd units, as happened with Flatcar’s default unit).

On the topic of the disappearing lo entry without ManageForeignRoutingPolicyRules=no (on Flatcar Alpha): without Cilium running, the down/up action of the underlying device has no impact and 0: from all lookup local stays in the list. With Cilium the rule is gone… Edit: Now I read “Cilium moves the policy rule for local when L7Proxy is enabled and systemd-networkd regards this local rule as a foreign routing policy and removes it” again; that explains it.

These are the local addresses; pinging them gives packet loss, while pinging 127.0.0.1 gives the permission-denied error mentioned above.

Sure, I realize now that the quick-install.yaml used a hardcoded older version; I will do it again.