cilium: Complexity issue with Linux 4.19.207

Bug report

General Information

  • Cilium version
09:19:59 # cilium version
cilium-cli: v0.9.1 compiled with go1.17.1 on linux/amd64
cilium image (default): v1.10.4
cilium image (stable): v1.10.5
cilium image (running): v1.10.4
  • Kernel version
Linux control-01 4.19.0-18-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
  • Orchestration system version in use
09:20:03 # kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.15", GitCommit:"58178e7f7aab455bc8de88d3bdd314b64141e7ee", GitTreeState:"clean", BuildDate:"2021-09-15T19:23:02Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.15", GitCommit:"58178e7f7aab455bc8de88d3bdd314b64141e7ee", GitTreeState:"clean", BuildDate:"2021-09-15T19:18:00Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • Link to relevant artifacts (policies, deployments scripts, …)
None
  • Generate and upload a system zip:
See the .zip attached to this issue

How to reproduce the issue

  1. Install Debian 10 from the latest available ISO
  2. Install it on a virtual machine (I’m using VMware 6.5.x).
  3. Create a total of 6 VMs like this (3 control plane nodes / 3 worker nodes).
  4. Run apt update; apt upgrade on all VMs.
  5. Configure Kubernetes apt repository
  6. Run apt install kubeadm=1.19.15-00 kubectl=1.19.15-00 kubelet=1.19.15-00 docker on all VMs.
  7. Run apt install iproute2 on all VMs.
  8. Configure a load balancer for the control plane endpoint (I’m using haproxy/keepalived); it must listen on port 6444
  9. Run kubeadm init --control-plane-endpoint k8s-apiserver:6444 --pod-network-cidr 10.217.0.0/16 --upload-certs
  10. Run the kubeadm join command for the control plane on the other two control plane nodes.
  11. Run the kubeadm join command for workers on the three remaining nodes.
  12. Install cilium-cli
  13. Run cilium install
  14. Run cilium status --verbose
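
For convenience, here is a condensed sketch of the commands from the steps above, run on each VM unless noted. Package versions and the k8s-apiserver:6444 endpoint are the ones from this report; note that the Debian package for Docker is usually docker.io.

apt update && apt upgrade -y
apt install -y kubeadm=1.19.15-00 kubectl=1.19.15-00 kubelet=1.19.15-00 docker.io iproute2
# on the first control plane node only:
kubeadm init --control-plane-endpoint k8s-apiserver:6444 --pod-network-cidr 10.217.0.0/16 --upload-certs
# on the remaining nodes: run the kubeadm join commands printed by kubeadm init
# then, from a machine with kubeconfig access:
cilium install
cilium status --verbose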

Result: you should see this error:

cilium-health-ep                      11m12s ago     4s ago       28      Get "http://10.0.0.35:4240/hello": dial tcp 10.0.0.35:4240: connect: no route to host

And cluster health output like this:

Cluster health:                          0/6 reachable   (2021-10-20T09:07:56Z) 
  Name                                   IP              Node        Endpoints  
  kubernetes/worker-02 (localhost)   100.121.22.21   reachable   unreachable
  kubernetes/control-01              100.121.22.10   reachable   unreachable
  kubernetes/control-02              100.121.22.11   reachable   unreachable
  kubernetes/control-03              100.121.22.12   reachable   unreachable
  kubernetes/worker-01               100.121.22.20   reachable   unreachable
  kubernetes/worker-03               100.121.22.22   reachable   unreachable
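
To confirm that the health endpoint itself is unreachable (and not just mis-reported), the probe can be repeated by hand from one of the nodes. A minimal check, using the health endpoint IP from the error above:

# same request cilium-health makes; expect "no route to host" when hitting this bug
curl -v http://10.0.0.35:4240/hello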

cilium-sysdump-20211020-092131.zip

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 5
  • Comments: 26 (11 by maintainers)

Most upvoted comments

Thank you for the report.

As @joamaki pointed out, this is a complexity failure in bpf_lxc.c.

I believe this may be fixed by a backport of https://github.com/cilium/cilium/pull/17573. The latter depends on a backport of test #17652, which we are currently working on. CC: @pchaigno. So I’m hoping we will have a potential fix for this soon.
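
For anyone unsure whether they are hitting the same verifier limit, the failure shows up in the agent logs and can be grepped for. A minimal sketch, assuming the default DaemonSet label k8s-app=cilium:

kubectl -n kube-system logs -l k8s-app=cilium --timestamps | grep -iE 'too large|too complex|complexity'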

Hello! Same problem here. We are using Debian 10, and Cilium no longer works with 4.19.0-18-amd64 but works with 4.19.0-17-amd64.

We are using k8s 1.21.1 with cri-o 1.21.3. The helm command used to install Cilium was:

helm install cilium cilium/cilium --version 1.10.5         \               
  --namespace kube-system                                  \
  --set encryption.enabled=true                            \
  --set encryption.type=ipsec                              \
  --set cluster.name=xxxxxxx                               \
  --set cluster.id=1                                       \
  --set ipam.operator.clusterPoolIPv4PodCIDR=10.244.0.0/16 \
  --set ipam.operator.clusterPoolIPv4MaskSize=24           \
  --set etcd.enabled=true                                  \
  --set nodeinit.restartPods=true                          \
  --set "etcd.endpoints[0]=https://xx.xx.xx.xx:2379"       \
  --set "etcd.endpoints[1]=https://xx.xx.xx.xx:2379"       \
  --set "etcd.endpoints[2]=https://xx.xx.xx.xx:2379"       \
  --set etcd.ssl=true                                      \
  --set identityAllocationMode=kvstore
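
Since the failure tracks the kernel package rather than the Cilium configuration, a quick way to see which kernel each node is actually running is a kubectl custom-columns query (the column names here are arbitrary):

kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion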

Same problem on Debian 10 (kernel versions 4.19.0-20-amd64 and 4.19.235).

I installed Cilium with cilium-cli (cilium install) on 4.19.0-20-amd64 and then got error logs like this:

level=error msg="Command execution failed" cmd="[tc filter replace dev lxc_health ingress prio 1 handle 1 bpf da obj 142_next/bpf_lxc.o sec from-container]" error="exit status 1" subsys=datapath-loader
level=warning msg="libbpf: Error loading BTF: Invalid argument(22)" subsys=datapath-loader
level=warning msg="libbpf: magic: 0xeb9f" subsys=datapath-loader
......
.....
level=warning msg="BPF program is too large. Processed 131073 insn" subsys=datapath-loader

I tried compiling kernel 4.19.235 with the required options:

CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=y
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=y
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y

but still got the same error. Finally I upgraded one node (I have two nodes in the cluster) to 5.10.120, which fixed the issue.
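
For reference, a quick way to confirm those options against the booted kernel; a minimal sketch assuming Debian's /boot/config-$(uname -r) layout (self-built kernels may expose /proc/config.gz instead):

# check each required option in the booted kernel's config
for opt in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_NET_CLS_BPF CONFIG_BPF_JIT \
           CONFIG_NET_CLS_ACT CONFIG_NET_SCH_INGRESS CONFIG_CRYPTO_SHA1 \
           CONFIG_CRYPTO_USER_API_HASH CONFIG_CGROUPS CONFIG_CGROUP_BPF; do
  grep "^${opt}=" "/boot/config-$(uname -r)" || echo "${opt} is not set"
done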

Here is my current environment:

Cilium Version

cilium-cli: v0.11.10 compiled with go1.18.3 on linux/amd64
cilium image (default): v1.11.6
cilium image (stable): v1.11.6
cilium image (running): v1.11.6

Nodes Info

NAME                 K8S VERSION   KERNEL VERSION     IP               AGENT STATE
192.192.100.38       v1.24.2       4.19.0-20-amd64    192.192.100.38   load bpf failed
master.cluster.k8s   v1.24.2       5.10.120           192.192.100.29   works fine

Cilium Config

apiVersion: v1
data:
  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  annotate-k8s-node: "true"
  arping-refresh-period: 30s
  auto-direct-node-routes: "false"
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: kubernetes
  cluster-pool-ipv4-cidr: 10.0.0.0/8
  cluster-pool-ipv4-mask-size: "24"
  custom-cni-conf: "false"
  debug: "false"
  disable-cnp-status-updates: "true"
  disable-endpoint-crd: "false"
  enable-auto-protect-node-port-range: "true"
  enable-bandwidth-manager: "false"
  enable-bpf-clock-probe: "true"
  enable-endpoint-health-checking: "true"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-hubble: "true"
  enable-ipv4: "true"
  enable-ipv4-masquerade: "true"
  enable-ipv6: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "true"
  enable-local-redirect-policy: "false"
  enable-policy: default
  enable-remote-node-identity: "true"
  enable-session-affinity: "true"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  hubble-disable-tls: "false"
  hubble-listen-address: :4244
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  install-iptables-rules: "true"
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  kube-proxy-replacement: disabled
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  preallocate-bpf-maps: "false"
  remove-cilium-node-taints: "true"
  set-cilium-is-up-condition: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  tunnel: vxlan
  unmanaged-pod-watcher-interval: "15"
kind: ConfigMap
metadata:
  creationTimestamp: "2022-07-06T01:24:24Z"
  name: cilium-config
  namespace: kube-system
  resourceVersion: "647857"
  uid: 6d8aef9a-0587-47a9-889c-015f9bdf0375
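
The ConfigMap above was dumped with a plain kubectl query; for comparison on another cluster:

kubectl -n kube-system get configmap cilium-config -o yaml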

Sysdump cilium-sysdump-20220706-093835.zip

Fixed this error by downgrading the kernel from 4.19.0-18-amd64 to 4.19.0-17-amd64.
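
For other Debian 10 users, a minimal sketch of that downgrade, assuming the 4.19.0-17 image package is still available from the configured apt sources (the exact package name may differ):

apt install linux-image-4.19.0-17-amd64
# reboot and pick the 4.19.0-17 entry from the GRUB menu (or make it the default)
reboot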

Yes

level=warning msg="2425: (61) r2 = *(u32 *)(r10 -48)" subsys=datapath-loader
level=warning msg="BPF program is too large. Processed 131073 insn" subsys=datapath-loader
level=warning subsys=datapath-loader
level=warning msg="Error filling program arrays!" subsys=datapath-loader
level=warning msg="Unable to load program" subsys=datapath-loader
level=warning msg="JoinEP: Failed to load program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=273 error="Failed to load prog with tc: exit status 1" file-path=273_next/bpf_lxc.o identity=4 ipv4= ipv6= k8sPodName=/ subsys=datapath-loader veth=lxc_health
level=error msg="Error while rewriting endpoint BPF program" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=273 error="Failed to load prog with tc: exit status 1" identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="generating BPF for endpoint failed, keeping stale directory." containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=273 file-path=273_next_fail identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint
level=warning msg="Regeneration of endpoint failed" bpfCompilation=2.617091097s bpfLoadProg=10.181964227s bpfWaitForELF=2.617231992s bpfWriteELF="270.213µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=273 error="Failed to load prog with tc: exit status 1" identity=4 ipv4= ipv6= k8sPodName=/ mapSync="84.707µs" policyCalculation="115.849µs" prepareBuild="722.483µs" proxyConfiguration="16.566µs" proxyPolicyCalculation="27.524µs" proxyWaitForAck=0s reason="updated security labels" subsys=endpoint total=12.806247856s waitingForCTClean=4.754015ms waitingForLock="2.192µs"
level=error msg="endpoint regeneration failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=273 error="Failed to load prog with tc: exit status 1" identity=4 ipv4= ipv6= k8sPodName=/ subsys=endpoint

Thanks!

You can also use quay.io/cilium/cilium-ci:v1.10 if you want to test it before the release.
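
If it helps, one way to point an existing Helm install at that CI image; the image.* value names below are assumptions about the 1.10 chart, so double-check them against the chart's values.yaml:

helm upgrade cilium cilium/cilium --version 1.10.5 \
  --namespace kube-system \
  --reuse-values \
  --set image.repository=quay.io/cilium/cilium-ci \
  --set image.tag=v1.10 \
  --set image.useDigest=false   # assumed value: skip digest pinning so the floating v1.10 tag is used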

The previously mentioned fix was backported to 1.10 and will be part of the next release. @Izual750, it would be great if you could test the latest 1.10 and check whether it solves your issue.