cilium: Host network broken after one of the underlying interfaces of a bond goes down

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

On Equinix Metal the network setup is a bond of two NICs using LACP. When Cilium is used as CNI for Kubernetes on Flatcar Container Linux, and one of the two NIC interfaces goes down, the host network is broken and remains broken even if the underlying interface goes up again.

The network looks normal, but ping 1.1 produces no outgoing packets visible in tcpdump -i bond0 (ping reports 100% packet loss), and ping 127.0.0.1 fails with ping: sendmsg: Operation not permitted.

We could not restore the network even after terminating the Pods and kubelet on the node, flushing nft, and deleting the Cilium interfaces (maybe BPF programs are still loaded and not cleaned up?).
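A rough sketch of that cleanup, for reference (a guess at the exact commands; the interface names are the ones a default Cilium install creates):

sudo systemctl stop kubelet
sudo nft flush ruleset
# delete the interfaces a default Cilium install creates (the vxlan device exists in the default tunnel mode)
sudo ip link delete cilium_host   # also removes its cilium_net veth peer
sudo ip link delete cilium_vxlan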

IPv6 is not affected, ping6 2606:4700:4700::1111 works.

Cilium Version

1.9, 1.10, 1.11

Kernel Version

from 5.10.52 to 5.10.84

Kubernetes Version

1.22

Sysdump

🔍 Collecting Kubernetes nodes
failed to create sysdump collector: failed to collect Kubernetes nodes: Get "https://136.144.49.47:6443/api/v1/nodes": dial tcp 136.144.49.47:6443: i/o timeout

Relevant log output

Kernel

Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status down for interface, disabling it in 200 ms
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): invalid new link 1 on slave
Feb 04 15:50:06 kernel: mlx5_core 0000:02:00.0: modify lag map port 1:2 port 2:2
Feb 04 15:50:06 kernel: bond0: (slave enp2s0f0np0): link status definitely down, disabling slave
Feb 04 15:51:09 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): lxc_health: link becomes ready
Feb 04 15:51:10 kernel: lxc_health: Caught tx_queue_len zero misconfig
Feb 04 15:51:33 kernel: mlx5_core 0000:02:00.0 enp2s0f0np0: Link down
Feb 04 15:51:33 kernel: mlx5_core 0000:02:00.0 enp2s0f0np0: Link up
Feb 04 15:51:33 kernel: bond0: (slave enp2s0f0np0): link status up again after 200 ms
Feb 04 15:51:33 kernel: bond0: (slave enp2s0f0np0): link status definitely up, 10000 Mbps full duplex
Feb 04 15:51:35 kernel: mlx5_core 0000:02:00.0: modify lag map port 1:1 port 2:2

Anything else?

Flatcar releases 2905.x.y to 3033.x.y are affected, running systemd versions 247 to 249 (maybe relevant because systemd-networkd is used).

Flatcar releases 2764.x.y are not affected (kernel 5.10.43, systemd 247).

Reproduce it by provisioning an Equinix Metal machine with Flatcar Stable (we used c3.small.x86). Ensure it is on the latest version:

update_engine_client -update
sudo rm -f /etc/systemd/system/containerd.service.d/10-use-cgroupfs.conf
sudo sed -i 's/systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller//' /usr/share/oem/grub.cfg
sudo systemctl reboot

Set up a one-node Cilium cluster, using the script contents at the end:

sudo ./install.sh

Then run the following. This action is valid and should not cause any harm, but now it does:

sudo ip link set enp2s0f0np0 down

(and sudo ip link set enp2s0f0np0 up does not help)

The install.sh script used above:

#!/bin/bash

set -xe

systemctl enable --now docker
modprobe br_netfilter

cat <<EOF | tee /etc/modules-load.d/k8s.conf
br_netfilter
EOF

cat <<EOF | tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
sysctl --system

CNI_VERSION="v0.8.2"
CRICTL_VERSION="v1.17.0"
RELEASE_VERSION="v0.4.0"
DOWNLOAD_DIR=/opt/bin
RELEASE="$(curl -sSL https://dl.k8s.io/release/stable.txt)"

mkdir -p /opt/cni/bin
mkdir -p /etc/systemd/system/kubelet.service.d

curl() {
	command curl -sSfL "$@"
}

curl "https://github.com/containernetworking/plugins/releases/download/${CNI_VERSION}/cni-plugins-linux-amd64-${CNI_VERSION}.tgz" | tar -C /opt/cni/bin -xz
curl "https://github.com/kubernetes-sigs/cri-tools/releases/download/${CRICTL_VERSION}/crictl-${CRICTL_VERSION}-linux-amd64.tar.gz" | tar -C $DOWNLOAD_DIR -xz
curl "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubelet/lib/systemd/system/kubelet.service" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | tee /etc/systemd/system/kubelet.service
curl "https://raw.githubusercontent.com/kubernetes/release/${RELEASE_VERSION}/cmd/kubepkg/templates/latest/deb/kubeadm/10-kubeadm.conf" | sed "s:/usr/bin:${DOWNLOAD_DIR}:g" | tee /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
curl --remote-name-all https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/{kubeadm,kubelet,kubectl}

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-amd64.tar.gz /opt/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}

chmod +x {kubeadm,kubelet,kubectl}
mv {kubeadm,kubelet,kubectl} $DOWNLOAD_DIR/

systemctl enable --now kubelet
#systemctl status kubelet

cat <<EOF | tee kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    flex-volume-plugin-dir: "/opt/libexec/kubernetes/kubelet-plugins/volume/exec/"
networking:
  podSubnet: "192.168.254.0/24"
EOF

# For explicit cgroupdriver selection
# ---
# kind: KubeletConfiguration
# apiVersion: kubelet.config.k8s.io/v1beta1
# cgroupDriver: systemd

# For containerd
# apiVersion: kubeadm.k8s.io/v1beta2
# kind: InitConfiguration
# nodeRegistration:
#  criSocket: "unix:///run/containerd/containerd.sock"

export PATH=$PATH:$DOWNLOAD_DIR

kubeadm config images pull
kubeadm init --config kubeadm-config.yaml

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

kubectl create -f https://raw.githubusercontent.com/cilium/cilium/v1.9.4/install/kubernetes/quick-install.yaml

kubectl taint nodes --all node-role.kubernetes.io/master-
kubectl get pods -A
kubectl get nodes -o wide

kubectl apply -f https://k8s.io/examples/application/deployment.yaml
kubectl expose deployment.apps/nginx-deployment

curl -L --remote-name-all https://github.com/cilium/cilium-cli/releases/latest/download/cilium-linux-amd64.tar.gz{,.sha256sum}
tar xzvfC cilium-linux-amd64.tar.gz /opt/bin
rm cilium-linux-amd64.tar.gz{,.sha256sum}

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 44 (15 by maintainers)

Most upvoted comments

I faced this today on a very simple Ubuntu 22.04 install - simply running sudo netplan apply manually was enough to break the host network after Cilium had been running. Setting

ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

in /etc/systemd/networkd.conf did indeed fix it, so I wonder if we should document that?
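For reference, the same settings as a networkd drop-in (the file name is just an example):

# /etc/systemd/networkd.conf.d/10-cilium.conf (example file name)
[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

followed by sudo systemctl restart systemd-networkd to apply it.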

We also hit the same issue with Flatcar 3033.2.1 and systemd 249.4. It seems that systemd-networkd removed routing policy rules, including the local rule, and the host could no longer recognize localhost because of it. Cilium moves the policy rule for local when L7Proxy is enabled, and systemd-networkd regards this local rule as a foreign routing policy rule and removes it.

$ ip rule list
32766:  from all lookup main
32767:  from all lookup default
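For comparison, on a healthy host the default list also contains the local rule at priority 0, which is the one that disappeared here:

0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default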

network: drop unnecessary routing policy rules https://github.com/systemd/systemd/commit/0b81225e5791f660506f7db0ab88078cf296b771

For Flatcar Stable this works in /etc/systemd/networkd.conf:

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

Then the additional IgnoreCarrierLoss=yes and KeepConfiguration=yes workarounds mentioned in https://github.com/cilium/cilium/issues/18706#issuecomment-1031470456 are not needed.

Does anyone know when this will be released?

This is part of quay.io/cilium/cilium:v1.14.0-snapshot.3 for testing and will go into 1.14.0.

https://github.com/cilium/cilium/releases/tag/v1.14.0-snapshot.3

I just ran into this problem, and it broke both IPv4 and IPv6. The solution in https://github.com/cilium/cilium/issues/18706#issuecomment-1031572546 did fix IPv4. It would have saved me a lot of time if this had been documented in an obvious place. Please consider documenting it in the installation manual, or better yet, fix it automatically.

Yes, it seems to have. That's the default configuration in /etc/systemd/networkd.conf:

#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it under the
#  terms of the GNU Lesser General Public License as published by the Free
#  Software Foundation; either version 2.1 of the License, or (at your option)
#  any later version.
#
# Entries in this file show the compile time defaults. Local configuration
# should be created by either modifying this file, or by creating "drop-ins" in
# the networkd.conf.d/ subdirectory. The latter is generally recommended.
# Defaults can be restored by simply deleting this file and all drop-ins.
#
# See networkd.conf(5) for details.

[Network]
#SpeedMeter=no
#SpeedMeterIntervalSec=10sec
#ManageForeignRoutingPolicyRules=yes
#ManageForeignRoutes=yes
#RouteTable=

[DHCPv4]
#DUIDType=vendor
#DUIDRawData=

[DHCPv6]
#DUIDType=vendor
#DUIDRawData=

I do the following:

sed -i /etc/systemd/networkd.conf -e 's/^#ManageForeignRoutingPolicyRules=yes$/ManageForeignRoutingPolicyRules=no/g'
sed -i /etc/systemd/networkd.conf -e 's/^#ManageForeignRoutes=yes$/ManageForeignRoutes=no/g'
systemctl restart systemd-networkd

Then set up Cilium again, and the host network no longer breaks during Cilium network device creation.

OK, this manually brought it back to a working state:

sudo ip rule add from all fwmark 0x200/0xf00 lookup 2004 pref 9
sudo ip rule add from all fwmark 0xa00/0xf00 lookup 2005 pref 10
sudo ip rule add from all lookup local pref 100

Not sure whether it's easy to create a networkd unit that does the same. (If it is possible, Cilium could write a networkd unit into the host's /run to play well with networkd by default, which would avoid requiring manual tweaking of global networkd settings.)

You can use ManageForeignRoutingPolicyRules=no to protect policies. https://github.com/systemd/systemd/commit/d94dfe7053d49fa62c4bfc07b7f3fc2227c10aff

However, things are a little bit complicated because of this bug. The fix is backported to systemd v249.5. With systemd v249.4, you need to set both IgnoreCarrierLoss=yes and KeepConfiguration=yes as well, and never use networkctl reconfigure or networkctl reload. Alternatively, you can create the policy rules before networkd starts; I think the latter is difficult in our case.

Only ManageForeignRoutingPolicyRules=no is sufficient if you can use systemd v249.5.
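For reference, a sketch of where those two options would go: the [Network] section of the .network unit that matches the affected interface (the unit name and match below are only an example):

# /etc/systemd/network/00-bond0.network (example unit name)
[Match]
Name=bond0

[Network]
IgnoreCarrierLoss=yes
KeepConfiguration=yes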

Having the same problem with:

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

on Ubuntu Server 22.04.2 LTS:

$ systemd --version
systemd 249 (249.11-0ubuntu3.9)
$ uname -a
Linux srv-magic-master 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

with 1.14.0-snapshot.2

Once I stopped k3s with k3s-killall.sh, network access was lost, but the host still replied to pings.

@borkmann would it make sense to have a preflight check for cilium around these settings, given that the defaults can vary from OS to OS and the combination of systemd and cilium configuration seems to be in conflict with each other?

Good question, yes, I think it would make sense to have some automation in this regard to avoid users running into it. I haven’t checked yet how reliably one could detect systemd versions and/or whether older versions of systemd would just ignore some of the not-yet-supported networkd.conf settings. A preflight check seems sensible. cc @aanm
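A rough sketch of what such a check could look like on a node (purely illustrative, not an existing cilium-cli feature; it only inspects the main networkd.conf and ignores drop-ins and systemd versions):

#!/bin/bash
# Warn if systemd-networkd is active and may still manage foreign routing policy rules.
if systemctl is-active --quiet systemd-networkd; then
  if ! grep -qE '^[[:space:]]*ManageForeignRoutingPolicyRules[[:space:]]*=[[:space:]]*no' /etc/systemd/networkd.conf 2>/dev/null; then
    echo "warning: systemd-networkd may remove Cilium's routing policy rules;"
    echo "         consider setting ManageForeignRoutingPolicyRules=no (and ManageForeignRoutes=no) in networkd.conf"
  fi
fi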

The preflight check is only intended for Cilium upgrades. I assume the issue can happen when the kernel upgrades but Cilium is not. Unless you were referring to a preflight inside the cilium-agent pod?

I was thinking of a start-up check. The issue we faced was during switch maintenance: we lost an entire Kubernetes cluster because every node had lost host networking, and restarting was the fastest fix to restore service. In effect, these systemd-networkd + cilium configurations are dangerous for a production environment where I anticipated having LACP up and working.

The networkd drop-ins do resolve my particular use case; however, I can’t be the only cilium user on a similar setup.

Thanks @ysksuzuki for the findings

Right, with Unmanaged=yes the rules are ignored and the only way for Cilium to tell networkd to preserve them is by putting them into an active .network unit (e.g., for a dummy interface)…

For reference, this is the translation into the networkd syntax:

[RoutingPolicyRule]
From=0.0.0.0/0
Table=local
Priority=100

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2004
FirewallMark=512/3840
Priority=9

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2005
FirewallMark=2560/3840
Priority=10
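A sketch of how these rules could live in an active unit for a dummy interface, as mentioned above (all names are examples; the .netdev creates the dummy device, the .network attaches the rules to it):

# /etc/systemd/network/10-cilium-rules.netdev (example name)
[NetDev]
Name=cilium-rules0
Kind=dummy

# /etc/systemd/network/10-cilium-rules.network (example name)
[Match]
Name=cilium-rules0

[RoutingPolicyRule]
From=0.0.0.0/0
Table=local
Priority=100

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2004
FirewallMark=512/3840
Priority=9

[RoutingPolicyRule]
From=0.0.0.0/0
Table=2005
FirewallMark=2560/3840
Priority=10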

I think the networkd people had their reasons already from a “server” profile perspective. The way forward is that Flatcar, as a distro that expects Cilium to be running, would predefine the global setting like we do for similar cases. Maybe it makes sense for Cilium to add a check in the cilium install phase for whether networkd is used and then recommend changing the default (plus having Unmanaged=yes is also a good idea, even though it is only needed if people try to match too generically in their own networkd units, as happened with Flatcar’s default unit).

On the topic of the disappearing lo entry without ManageForeignRoutingPolicyRules=no (on Flatcar Alpha): without Cilium running, the down/up action of the underlying device has no impact and 0: from all lookup local stays in the list. With Cilium the rule is gone… Edit: Now I read “Cilium moves the policy rule for local when L7Proxy is enabled and systemd-networkd regards this local rule as a foreign routing policy and removes it” again; that explains it.

These are the local addresses; pinging them gives packet loss, while pinging 127.0.0.1 gives the permission-denied error mentioned above.

Sure, I realize now that the quick-install.yaml used a hardcoded older version; I will do it again.