cilium: Cilium fails to start with "RTNETLINK answers: Operation not supported"

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Architecture: x86_64

I installed cilium via helm chart into our k3s cluster.

helm values:

containerRuntime:
  integration: containerd
  socketPath: /var/run/k3s/containerd/containerd.sock

hubble:
  tls:
    enabled: false
  relay:
    enabled: true
  ui:
    enabled: true

ipam:
  mode: "kubernetes"

operator:
  replicas: 1

kubeProxyReplacement: disabled

Unfortunately, Cilium won't start and fails with the following logs:

level=warning msg="+ tc qdisc replace dev cilium_vxlan clsact" subsys=datapath-loader
level=warning msg="RTNETLINK answers: Operation not supported" subsys=datapath-loader
level=warning msg="+ true" subsys=datapath-loader
level=warning msg="++ grep -v 'pref 1 bpf chain 0 $\\|pref 1 bpf chain 0 handle 0x1'" subsys=datapath-loader
level=warning msg="++ tc filter show dev cilium_vxlan ingress" subsys=datapath-loader
level=warning msg="RTNETLINK answers: Operation not supported" subsys=datapath-loader
level=warning msg="Dump terminated" subsys=datapath-loader
level=warning msg="+ '[' -z '' ']'" subsys=datapath-loader
level=warning msg="+ cilium bpf migrate-maps -s bpf_overlay.o" subsys=datapath-loader
level=warning msg="+ tc filter replace dev cilium_vxlan ingress prio 1 handle 1 bpf da obj bpf_overlay.o sec from-overlay" subsys=datapath-loader
level=warning msg="RTNETLINK answers: Operation not supported" subsys=datapath-loader
level=warning msg="We have an error talking to the kernel" subsys=datapath-loader
level=warning msg="+ cilium bpf migrate-maps -e bpf_overlay.o -r 1" subsys=datapath-loader
level=warning msg="+ return 1" subsys=datapath-loader
level=fatal msg="Error while creating daemon" error="error while initializing daemon: failed while reinitializing datapath: exit status 1" subsys=daemon

Cilium Version

1.12.0

Kernel Version

5.15.0-1013-kvm #16-Ubuntu SMP Fri Jul 1 21:19:48 UTC 2022

Kubernetes Version

v1.24.3+k3s1

Sysdump

No response

Relevant log output

see above

Anything else?

When I upgrade the kernel (by hand) to version: 5.18.0-051800-generic, everything is working fine again.

Unfortunately, we are using the jammy cloud image and we don’t want to update the kernel ourselves.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (10 by maintainers)

Most upvoted comments

I think I can reproduce the issue by hacking the cilium agent's kernel config probe code to always return false, so that the agent thinks the system is missing the kernel config file in both /boot/config-xxxx and /proc/config.gz:

diff --git a/pkg/datapath/linux/probes/probes.go b/pkg/datapath/linux/probes/probes.go
index c0e01d69b8..28b0af79e5 100644
--- a/pkg/datapath/linux/probes/probes.go
+++ b/pkg/datapath/linux/probes/probes.go
@@ -420,5 +420,5 @@ func (p *ProbeManager) KernelConfigAvailable() bool {
                }
        }
 
-       return true
+       return false
 }

With this hack there is no error from the system config check, but the agent fails later because a required kernel config option is missing:

level=info msg="Cilium 1.12.90 d8909fbda4 2022-08-11T14:08:51+00:00 go version go1.19 linux/amd64" subsys=daemon
level=info msg="cilium-envoy  version: 5739e4be8ae7134fee683d920d25c3732ac6c819/1.21.5/Distribution/RELEASE/BoringSSL" subsys=daemon
level=info msg="clang (10.0.0) and kernel (5.19.0) versions: OK!" subsys=linux-datapath
level=info msg="linking environment: OK!" subsys=linux-datapath
level=info msg="Detected mounted BPF filesystem at /sys/fs/bpf" subsys=bpf
...SNIP..
level=warning msg="+ tc qdisc replace dev cilium_vxlan clsact" subsys=datapath-loader
level=warning msg="Error: Cannot find ingress queue for specified device." subsys=datapath-loader
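
The probe discussed above looks for the kernel config in two places. As a rough illustration only, the lookup could be sketched as below. This is not Cilium's actual probe code: the function names, the candidate order, and the injected `exists` callback are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
)

// kernelConfigCandidates lists the two locations mentioned in this thread.
// The order is an assumption, not taken from Cilium's source.
func kernelConfigCandidates(release string) []string {
	return []string{
		"/proc/config.gz",
		"/boot/config-" + release,
	}
}

// findKernelConfig returns the first candidate path for which exists
// reports true. The callback is injected so the lookup can be exercised
// without touching the real filesystem.
func findKernelConfig(release string, exists func(string) bool) (string, bool) {
	for _, path := range kernelConfigCandidates(release) {
		if exists(path) {
			return path, true
		}
	}
	return "", false
}

// fileExists is a thin wrapper around os.Stat.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Release string from the report; normally this comes from `uname -r`.
	release := "5.15.0-1013-kvm"
	if path, ok := findKernelConfig(release, fileExists); ok {
		fmt.Println("kernel config found at", path)
	} else {
		fmt.Println("kernel config not found; option checks cannot run")
	}
}
```

When neither file exists, nothing can be checked against the required options, which is the situation the hack above simulates.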

Then, in addition to the hack above, I made the code change below and rebuilt the image. Now I can see the correct log message:

diff --git a/pkg/datapath/linux/requirements.go b/pkg/datapath/linux/requirements.go
index 199c135716..b12452550c 100644
--- a/pkg/datapath/linux/requirements.go
+++ b/pkg/datapath/linux/requirements.go
@@ -146,7 +146,7 @@ func CheckMinRequirements() {
                if err := probeManager.SystemConfigProbes(); err != nil {
                        errMsg := "BPF system config check: NOT OK."
                        // TODO(brb) warn after GH#14314 has been resolved
-                       if !errors.Is(err, probes.ErrKernelConfigNotFound) {
+                       if errors.Is(err, probes.ErrKernelConfigNotFound) {
                                log.WithError(err).Warn(errMsg)
                        }
                }

correct log:

level=info msg="Cilium 1.12.90 d8909fbda4 2022-08-11T14:08:51+00:00 go version go1.19 linux/amd64" subsys=daemon
level=info msg="cilium-envoy  version: 5739e4be8ae7134fee683d920d25c3732ac6c819/1.21.5/Distribution/RELEASE/BoringSSL" subsys=daemon
level=info msg="clang (10.0.0) and kernel (5.19.0) versions: OK!" subsys=linux-datapath
level=info msg="linking environment: OK!" subsys=linux-datapath
level=warning msg="BPF system config check: NOT OK." error="Kernel Config file not found" subsys=linux-datapath
...SNIP...
level=warning msg="+ tc qdisc replace dev cilium_vxlan clsact" subsys=datapath-loader
level=warning msg="Error: Cannot find ingress queue for specified device." subsys=datapath-loader

looks to be a minor log bug that I can fix 😃

@borkmann:

config.txt

@dirien I don’t see CONFIG_NET_CLS_ACT=y, so the kernel config is probably missing it. Here are the kernel requirements:

https://docs.cilium.io/en/stable/operations/system_requirements/

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=y
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y

Many thanks @borkmann and @vincentmli for your feedback!

I switched my image from the jammy minimal to “normal” image and now it works as all kernel modules are available!

@dirien Thanks! Given that it’s a distro issue of not providing the needed modules, I think we can close this one here. Could you open a ticket at Canonical so they get this fixed?

@borkmann: config.txt

@dirien I don’t see CONFIG_NET_CLS_ACT=y, so the kernel config is probably missing it. Here are the kernel requirements:

Agree, could you open an issue at Canonical that this is missing in their jammy cloud image?

Also, did the agent log on startup warn you about missing config?

https://github.com/cilium/cilium/blob/b4b750002c229d559aa83b80f7aca30a8084a0bb/pkg/datapath/linux/probes/probes.go#L211

@borkmann, I will recreate the VM to find the information.

@dirien do you have the kernel config for the jammy cloud image?