calico: CrashLoopBackoff with mount-bpffs init container, calico v3.23.2, operator v1.27.7

I’m currently spinning up fresh Talos Linux clusters. I’m installing the CNI layer from the manifests at https://projectcalico.docs.tigera.io/archive/v3.23/manifests/tigera-operator.yaml, which, as of a day or two ago, use operator version v1.27.7 and install Calico v3.23.2.

Along with the CNI manifest above, I’m also adding the following inlineManifests (replace the kubaapi_ip and pod_subnet variables):

apiVersion: crd.projectcalico.org/v1
kind: FelixConfiguration
metadata:
  name: default
spec:
  wireguardEnabled: true
---
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    linuxDataplane: BPF
    ipPools:
      - blockSize: 26
        cidr: {{ pod_subnet }}
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

Deploying this results in the following errors in the mount-bpffs init container:

kubectl logs --container=mount-bpffs -n calico-system calico-node-hmv2l
2022-06-27 03:20:48.241 [INFO][1] init/startup.go 425: Early log level set to info
2022-06-27 03:20:48.242 [INFO][1] init/calico-init_linux.go 58: Checking if BPF filesystem is mounted.
2022-06-27 03:20:48.242 [INFO][1] init/calico-init_linux.go 70: BPF filesystem is mounted.
2022-06-27 03:20:48.243 [INFO][1] init/calico-init_linux.go 95: Checking if Cgroup2 filesystem is mounted.
2022-06-27 03:20:48.243 [ERROR][1] init/calico-init_linux.go 49: Failed to mount cgroup2 filesystem. error=failed to open /initproc/mountinfo: open /initproc/mountinfo: no such file or directory

Looking deeper into this, it seems that the mount changes recently introduced by https://github.com/projectcalico/calico/pull/6240/files have not made it into the release-1.27 branch of the operator, but are on master via https://github.com/tigera/operator/pull/1957.

I then switched to installing the CNI via https://raw.githubusercontent.com/projectcalico/calico/master/manifests/tigera-operator.yaml, which uses the master version of the operator. This fails with a slightly different message:

2022-06-27 02:39:41.972 [INFO][1] init/startup.go 425: Early log level set to info
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 58: Checking if BPF filesystem is mounted.
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 70: BPF filesystem is mounted.
2022-06-27 02:39:41.973 [INFO][1] init/calico-init_linux.go 95: Checking if Cgroup2 filesystem is mounted.
2022-06-27 02:39:41.974 [INFO][1] init/calico-init_linux.go 123: Cgroup2 filesystem is not mounted. Trying to mount it...
2022-06-27 02:39:41.984 [ERROR][1] init/calico-init_linux.go 128: Mouting cgroup2 fs failed. output: [84 114 121 105 110 103 32 116 111 32 109 111 117 110 116 32 114 111 111 116 32 99 103 114 111 117 112 32 102 115 46 10 70 97 105 108 101 100 32 116 111 32 109 111 117 110 116 32 67 103 114 111 117 112 32 102 105 108 101 115 121 115 116 101 109 46 32 101 114 114 58 32 110 111 32 115 117 99 104 32 102 105 108 101 32 111 114 32 100 105 114 101 99 116 111 114 121 10]
2022-06-27 02:39:41.985 [ERROR][1] init/calico-init_linux.go 49: Failed to mount cgroup2 filesystem. error=failed to mount cgroup2 filesystem: exit status 1

The string of numbers appears to be decimal-encoded ASCII, which decodes to:

Trying to mount root cgroup fs.
Failed to mount Cgroup filesystem. err: no such file or directory

I’m not sure why it’s output in that form.
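
As an aside, the bracketed numbers are presumably the captured command output (a Go []byte) logged with the default %v verb, which prints each byte as its decimal value. A minimal sketch of the round trip, with the byte values copied from the log line above:

package main

import "fmt"

func main() {
    // The first bytes of the "output:" field from the mount-bpffs log above.
    raw := []byte{84, 114, 121, 105, 110, 103, 32, 116, 111, 32, 109, 111, 117, 110, 116,
        32, 114, 111, 111, 116, 32, 99, 103, 114, 111, 117, 112, 32, 102, 115, 46, 10}

    // Converting the slice back to a string recovers the readable message.
    fmt.Print(string(raw)) // Trying to mount root cgroup fs.

    // Formatting a []byte with %v is what produces the bracketed decimal list.
    fmt.Printf("%v\n", []byte("cgroup fs")) // [99 103 114 111 117 112 32 102 115]
}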

In summary, there may be two problems here: 1. the current operator version installs a version of Calico for which the manifests do not appear to be correctly generated, and 2. I can’t actually get the calico-node pods to start, as the mount-bpffs init container fails even on versions where I believe this should be supported. I’m not sure whether I should have opened this issue here or in the operator repo.

Expected Behavior

The calico node pods would start.

Current Behavior

The calico node pods do not start.

Possible Solution

Steps to Reproduce (for bugs)

  1. Create patch.yaml:
- op: add
  path: /cluster/network/cni
  value:
    name: custom
    urls:
      - https://raw.githubusercontent.com/projectcalico/calico/master/manifests/tigera-operator.yaml

- op: replace
  path: /cluster/inlineManifests
  value:
    - name: calico-configuration
      contents: |
        apiVersion: crd.projectcalico.org/v1
        kind: FelixConfiguration
        metadata:
          name: default
        spec:
          wireguardEnabled: true
        ---
        apiVersion: operator.tigera.io/v1
        kind: Installation
        metadata:
          name: default
        spec:
          calicoNetwork:
            linuxDataplane: BPF
            ipPools:
              - blockSize: 26
                cidr: 10.244.0.0/16
                encapsulation: VXLANCrossSubnet
                natOutgoing: Enabled
                nodeSelector: all()
        ---
        apiVersion: operator.tigera.io/v1
        kind: APIServer
        metadata:
          name: default
        spec: {}
  2. Create patch-cp.yaml:
- op: remove
  path: /cluster/apiServer/admissionControl
  3. wget https://github.com/siderolabs/talos/releases/download/v1.1.0/talosctl-linux-amd64
  4. chmod +x talosctl-linux-amd64
  5. ./talosctl-linux-amd64 cluster create --skip-kubeconfig --config-patch @patch.yaml --config-patch-control-plane @patch-cp.yaml
  6. The above command will never complete, as the cluster never becomes healthy due to the calico-node CrashLoopBackOff. Use the following to investigate further:
     ./talosctl-linux-amd64 --nodes 10.5.0.2 kubeconfig kc
     kubectl --kubeconfig kc logs --all-containers -n calico-system -l app.kubernetes.io/name=calico-node

Switching out the CNI URL in patch.yaml will switch between master and v1.27.7 of the operator.

This will create a KIND cluster, so Docker will be needed on the test machine. The following will clean up everything that was set up (Ctrl+C first to exit the cluster-creation process):

./talosctl-linux-amd64 cluster destroy

You may also wish to delete ~/.talos, as this directory is also created.

I don’t believe this is specific to Talos, but that is my current setup.

Reverting to operator version v1.27.5 makes everything “work” again.

Context

I’m currently spinning this up in a test cluster, but am concerned about what would happen if I did this in the production cluster.

Your Environment

  • Calico version: v3.23.2 (Tigera operator v1.27.7)
  • Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes
  • Operating System and version: Talos Linux (talosctl v1.1.0)
  • Link to your project (optional):


Most upvoted comments

Is it possible to put tigera-operator.yaml under each release as a separate file?

This is a separate enhancement, but will be available starting in v3.24.0. If you want to use the manifest from an earlier release, unfortunately you need to download the tgz I linked above.

+1, the fix didn’t help. I installed a Kubernetes cluster with kubeadm and then installed Calico in operator mode. After switching to eBPF mode, I get the same error.

@mazdakn merged it. Will cut a new operator release ASAP.