k3d: [BUG] Pod network failing to start when installing calico operator with k3d v5.2.1

What did you do

  • How was the cluster created?

    • k3d cluster create "k3s-default" --k3s-arg '--flannel-backend=none@server:*'
  • What did you do afterwards? I tried to install Calico via the Tigera operator onto the cluster with containerIPForwarding enabled (the resulting Installation spec is sketched after this list).

kubectl apply -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
curl -L https://docs.projectcalico.org/manifests/custom-resources.yaml > k3d-custom-res.yaml
yq e '.spec.calicoNetwork.containerIPForwarding="Enabled"' -i k3d-custom-res.yaml
kubectl apply -f k3d-custom-res.yaml
  • k3d commands?

  • docker commands? docker ps to check running containers; docker exec -ti <node> /bin/sh to get a shell inside a node container

  • OS operations (e.g. shutdown/reboot)? Ran Linux commands (ls, cat, etc.) inside pods and containers
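For reference, after the yq edit the Installation resource in k3d-custom-res.yaml should look roughly like this. This is only a sketch: the ipPools defaults are assumed from the stock custom-resources.yaml and may differ from the downloaded file.

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    containerIPForwarding: Enabled   # field added by the yq edit above
    ipPools:
    - blockSize: 26                  # values assumed from the stock manifest
      cidr: 192.168.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()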

What did you expect to happen

The pod network should come up successfully in all namespaces, with all pods reaching the Running state.

Screenshots or terminal output

The calico-node pods run without issue, but the other pods are stuck in the ContainerCreating state (coredns, metrics-server, calico-kube-controllers):

$ kubectl get pods -A
NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-7dc6bc5777-h5sp7           1/1     Running             0          106s
calico-system     calico-typha-9b59bcc69-w2ml8               1/1     Running             0          83s
calico-system     calico-kube-controllers-78cc777977-8xf5v   0/1     ContainerCreating   0          83s
kube-system       coredns-7448499f4d-8pwtf                   0/1     ContainerCreating   0          106s
kube-system       metrics-server-86cbb8457f-h26x4            0/1     ContainerCreating   0          106s
kube-system       helm-install-traefik-h6qhh                 0/1     ContainerCreating   0          106s
kube-system       helm-install-traefik-crd-8xsxm             0/1     ContainerCreating   0          106s
kube-system       local-path-provisioner-5ff76fc89d-ql55s    0/1     ContainerCreating   0          106s
calico-system     calico-node-6xbq7                          1/1     Running             0          83s

When describing the stuck pods, I see this in their events:

$ kubectl describe pod/calico-kube-controllers-78cc777977-8xf5v -n calico-system

  Warning  FailedCreatePodSandBox  3s                    kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b474a530f7b8727fc101404ebb551135059f5aa359beb50bae176fd05cf2c20d": netplugin failed with no error message: fork/exec /opt/cni/bin/calico: no such file or directory

Based on the error above, I checked whether the calico binary exists at /opt/cni/bin/calico inside the server container, and it does:

glen@glen-tigera: $ docker exec -ti k3d-k3s-default-server-0 /bin/sh
/ # ls
bin  dev  etc  k3d  lib  opt  output  proc  run  sbin  sys  tmp  usr  var
/ # cd /opt/cni/bin/
/opt/cni/bin # ls -a
.  ..  bandwidth  calico  calico-ipam  flannel  host-local  install  loopback  portmap  tags.txt  tuning

CNI config YAML (kubectl get cm cni-config -n calico-system -o yaml):

apiVersion: v1
data:
  config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": 0,
          "nodename_file_optional": false,
          "log_level": "Info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": { "type": "calico-ipam", "assign_ipv4" : "true", "assign_ipv6" : "false"},
          "container_settings": {
              "allow_ip_forwarding": true
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "k8s_api_root":"https://10.43.0.1:443",
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        },
        {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2021-12-17T18:02:24Z"
  name: cni-config
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: c53d18b5-efc6-4155-879b-6097a8c2c14c
  resourceVersion: "675"
  uid: 003c9cdc-0ef5-4d63-8d30-d6e1ed79d4c0

Which OS & Architecture

OS: GNU/Linux
Kernel Version: 20.04.2-Ubuntu SMP
Kernel Release: 5.11.0-40-generic
Processor/HW Platform/Machine Architecture: x86_64

Which version of k3d

k3d version v5.2.1
k3s version v1.21.7-k3s1 (default)

Which version of docker

docker version:

Client: Docker Engine - Community
 Version:           20.10.11
 API version:       1.41
 Go version:        go1.16.9
 Git commit:        dea9396
 Built:             Thu Nov 18 00:37:06 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.11
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.9
  Git commit:       847da18
  Built:            Thu Nov 18 00:35:15 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.3-docker)
  scan: Docker Scan (Docker Inc., v0.9.0)

Server:
 Containers: 20
  Running: 0
  Paused: 0
  Stopped: 20
 Images: 22
 Server Version: 20.10.11
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.11.0-40-generic
 Operating System: Ubuntu 20.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.09GiB
 Name: glen-tigera
 ID: 6EZ7:QGFF:Z2KK:Q7K3:YKGI:6FIS:X2UP:JX5W:UGXA:FIZW:CYV6:RDDU
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Comments: 20 (7 by maintainers)

Most upvoted comments

Do you have any idea what rancher/rke2-charts#20 (files) actually does?

Sorry, just understood how you got there. This is the script that’s being executed: https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker/flexvol.sh

I’m now checking both installation variants (the one from the k3d docs and yours) with regard to the uds binary:

Via Operator:

/ # ls -lah /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
total 4.8M
drwxr-xr-x 2 0 0 4.0K Dec 21 06:49 .
drwxr-xr-x 3 0 0 4.0K Dec 21 06:49 ..
-r-xr-x--- 1 0 0 4.8M Dec 21 06:49 uds

/ # stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
  File: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
  Size: 4987070   	Blocks: 9744       IO Block: 4096   regular file
Device: 37h/55d	Inode: 43271409    Links: 1
Access: (0550/-r-xr-x---)  Uid: (    0/ UNKNOWN)   Gid: (    0/ UNKNOWN)
Access: 2021-12-21 06:49:35.595982143 +0000
Modify: 2021-12-21 06:49:35.531982019 +0000
Change: 2021-12-21 06:49:35.531982019 +0000
 Birth: -

/ # /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
sh: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: not found

Without Operator:

/ # ls -lah /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
total 5.4M
drwxr-xr-x 2 0 0 4.0K Dec 21 06:52 .
drwxr-xr-x 3 0 0 4.0K Dec 21 06:52 ..
-r-xr-x--- 1 0 0 5.4M Dec 21 06:52 uds
/ # stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds 
  File: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
  Size: 5602363   	Blocks: 10944      IO Block: 4096   regular file
Device: 37h/55d	Inode: 42735669    Links: 1
Access: (0550/-r-xr-x---)  Uid: (    0/ UNKNOWN)   Gid: (    0/ UNKNOWN)
Access: 2021-12-21 06:52:46.752353250 +0000
Modify: 2021-12-21 06:52:46.092351969 +0000
Change: 2021-12-21 06:52:46.100351984 +0000
 Birth: -
/ # /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds 
Usage:
  flexvoldrv [command]

Available Commands:
  help        Help about any command
  init        Flex volume init command.
  mount       Flex volume unmount command.
  unmount     Flex volume unmount command.
  version     Print version

Flags:
  -h, --help   help for flexvoldrv

Use "flexvoldrv [command] --help" for more information about a command.

Googling for that error message, this issue in rke2 popped up: https://github.com/rancher/rke2/issues/234

Ah at least you could track it down to a specific version already 👍 Fingers crossed you’ll figure out the root cause.

Upon further testing, our v3.21 (latest release) operator install no longer seems to be compatible with k3d clusters. I tested the operator starting from v3.15, and every version worked until v3.21. I’ve followed up with the larger team to discuss further.

k3d-calico-operator-install-findings.txt

Ah - the way the tigera-operator works, each version of the operator maps to a version of Calico (since the manifests are baked into it). For v3.15, you’ll want to apply: https://docs.projectcalico.org/archive/v3.15/manifests/tigera-operator.yaml

(The intent is to make the upgrade experience better: in an operator-managed cluster, you upgrade Calico simply by applying the uplevel tigera-operator.yaml, and it takes care of everything.) With the old manifest install, you’d have customised your install by editing the YAML directly, so to upgrade you have to get the new YAML, make the same edits as before, apply, and hope you got it right. In an operator setup, all your customisations live in the Installation resource; the new operator reads that and does “the right thing” to apply them.
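A minimal sketch of what a version-pinned install looks like, assuming the archived manifests follow the same path layout as the latest ones:

# Pin the operator and its custom resources to Calico v3.15
# (the custom-resources archive URL is assumed to mirror the operator one)
kubectl apply -f https://docs.projectcalico.org/archive/v3.15/manifests/tigera-operator.yaml
kubectl apply -f https://docs.projectcalico.org/archive/v3.15/manifests/custom-resources.yaml

# Upgrading later is just applying the uplevel operator manifest; the operator
# re-reads the Installation resource and re-applies your customisations
kubectl apply -f https://docs.projectcalico.org/manifests/tigera-operator.yaml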

@iwilltry42 When I ran the command you posted earlier, there was no such file or directory on my setup:

$ docker exec -it k3d-k3s-default-server-0 stat /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds
stat: cannot stat '/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds': No such file or directory

There is no nodeagent~uds directory when I try to look inside the container:

$ docker exec -ti k3d-k3s-default-server-0 ls -a /usr/libexec/kubernetes/kubelet-plugins/volume/exec
.  ..

From https://projectcalico.docs.tigera.io/reference/installation/api, I think this all means we need to set spec.flexVolumePath: "/usr/local/bin/" in the Installation resource in custom-resources.yaml (sketched below).
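If that works, the change is a one-line addition to the Installation spec in k3d-custom-res.yaml, something like the following sketch (the rest of the spec stays as before):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Assumed fix: point the operator's flexvol driver install at a path
  # that exists inside the k3d/k3s node image
  flexVolumePath: /usr/local/bin/
  calicoNetwork:
    containerIPForwarding: Enabled

Or, mirroring the earlier yq edit: yq e '.spec.flexVolumePath="/usr/local/bin/"' -i k3d-custom-res.yaml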

Awesome, thank you, that gives us a thread to pull on.