calico: pod calico-node on worker nodes with 'CrashLoopBackOff'

Expected Behavior

The calico-node pods on the worker nodes should have status ‘Running’

Current Behavior

The calico-node pods on the worker nodes are stuck in ‘CrashLoopBackOff’:

vagrant@k8s-master:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS             RESTARTS   AGE
kube-system   calico-kube-controllers-59f54d6bbc-w4w6f   1/1     Running            1          29h
kube-system   calico-node-86mdg                          0/1     CrashLoopBackOff   25         29h
kube-system   calico-node-hzcsh                          0/1     CrashLoopBackOff   26         29h
kube-system   calico-node-q267c                          1/1     Running            1          29h
kube-system   coredns-5c98db65d4-m8wls                   1/1     Running            1          29h
kube-system   coredns-5c98db65d4-vdp4f                   1/1     Running            1          29h
kube-system   etcd-k8s-master                            1/1     Running            1          29h
kube-system   kube-apiserver-k8s-master                  1/1     Running            1          29h
kube-system   kube-controller-manager-k8s-master         1/1     Running            1          29h
kube-system   kube-proxy-6f6q9                           1/1     Running            1          29h
kube-system   kube-proxy-prpqv                           1/1     Running            1          29h
kube-system   kube-proxy-qds8x                           1/1     Running            1          29h
kube-system   kube-scheduler-k8s-master                  1/1     Running            1          29h

Steps to Reproduce (for bugs)

Basically, I followed the article from a k8s blog, Kubernetes Setup Using Ansible and Vagrant, with minor modifications (replaced the bento/ubuntu-16.04 box with generic/ubuntu1604 and used Calico v3.8 instead of v3.4). See gists here. Note that I also tried the manual setup from Installing with the Kubernetes API datastore—50 nodes or less and got the same error. In short, I run the following on the master node:

  1. kubeadm init --apiserver-advertise-address="192.168.50.10" --apiserver-cert-extra-sans="192.168.50.10" --node-name k8s-master --pod-network-cidr=192.168.0.0/16
  2. populate ~/.kube/config with /etc/kubernetes/admin.conf
  3. curl https://docs.projectcalico.org/v3.8/manifests/calico.yaml -O
  4. kubectl apply -f calico.yaml

and then join the worker nodes with kubeadm join 192.168.50.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
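
A quick way to confirm which worker nodes the failing pods are scheduled on (the maintainers suggest the same thing further down) is the wide listing; a minimal check using only standard kubectl:

kubectl get nodes -o wide                  # confirm all three nodes joined and their InternalIPs
kubectl get pods -n kube-system -o wide    # the NODE column shows where each calico-node pod runs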

Context

This issue seems to prevent me from scheduling workloads on the worker nodes. Running kubectl describe pod -n kube-system calico-node-86mdg shows

Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused

Here is the actual output

vagrant@k8s-master:~$ kubectl describe pod -n kube-system calico-node-86mdg
Name:                 calico-node-86mdg
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 node-2/192.168.50.12
Start Time:           Wed, 10 Jul 2019 15:58:22 -0700
Labels:               controller-revision-hash=844ddd97c6
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   192.168.50.12
Controlled By:        DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://b53dcaaf8a7cd71b242573c35ab654c83dc5daf5d7a10de1cb42623fe3fca567
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:28 -0700
      Finished:     Thu, 11 Jul 2019 20:31:28 -0700
    Ready:          True
    Restart Count:  1
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
  install-cni:
    Container ID:  docker://e374bb79296a23e83062d9d62cf8ea684e24aa3b634d1ec4948528672a9d18c7
    Image:         calico/cni:v3.8.0
    Image ID:      docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:29 -0700
      Finished:     Thu, 11 Jul 2019 20:31:29 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
  flexvol-driver:
    Container ID:   docker://27fb9711cc45fa2fc07ce9f53a9619329e2fd544df502320c1f11b36f0a9a0e0
    Image:          calico/pod2daemon-flexvol:v3.8.0
    Image ID:       docker-pullable://calico/pod2daemon-flexvol@sha256:6ec8b823e5ce3440318edfcdd2ab8b6660110782713f24f53dac5a3c227afb11
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 11 Jul 2019 20:31:30 -0700
      Finished:     Thu, 11 Jul 2019 20:31:30 -0700
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Containers:
  calico-node:
    Container ID:   docker://01ba662b148242659d4e2f0e43098efc70aefbdd1c0cdaf0619e7410853e2d88
    Image:          calico/node:v3.8.0
    Image ID:       docker-pullable://calico/node@sha256:6679ccc9f19dba3eb084db991c788dc9661ad3b5d5bafaa3379644229dca6b05
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 11 Jul 2019 21:47:46 -0700
      Finished:     Thu, 11 Jul 2019 21:48:56 -0700
    Ready:          False
    Restart Count:  29
    Requests:
      cpu:      250m
    Liveness:   http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Always
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               192.168.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-2t8lm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-2t8lm
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  beta.kubernetes.io/os=linux
Tolerations:     :NoSchedule
                 :NoExecute
                 CriticalAddonsOnly
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age                   From             Message
  ----     ------     ----                  ----             -------
  Warning  BackOff    13m (x207 over 73m)   kubelet, node-2  Back-off restarting failed container
  Warning  Unhealthy  8m4s (x132 over 77m)  kubelet, node-2  Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused
  Normal   Pulled     3m (x23 over 78m)     kubelet, node-2  Container image "calico/node:v3.8.0" already present on machine
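
A minimal sketch of how to dig further into a pod in this state, using the pod name from the output above (the curl has to run on the affected node itself, since the liveness probe targets localhost:9099):

kubectl -n kube-system logs calico-node-86mdg -c calico-node --previous   # logs from the last crashed run
curl -v http://localhost:9099/liveness                                    # run on node-2; checks Felix's health endpoint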

Your Environment

  • Calico version: 3.8
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.15.0
  • Operating System and version: Ubuntu 16.04.6 LTS (Xenial Xerus)
  • Link to your project (optional): none (see the Vagrant/Ansible gist sources)

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 19 (7 by maintainers)

Most upvoted comments

I had the exact same issue as @ekc, with quite similar logging and environment versions. I was running Ubuntu boxes on VirtualBox with 2 interfaces each; my issue was that etcd by default only allows connections from the CIDR of the first interface with a default gateway (the NAT one), which is why it was giving a “connection refused”. I ended up doing the following, which solved it:

  1. resetting my cluster and adding --api-advertise-addresses=<IP of the host-only interface (enp0s8)> to the kubeadm init command, as explained in this issue,
  2. setting IP_AUTODETECTION_METHOD to interface=enp0s8 in calico.yaml, as explained by @tmjd here (both steps are sketched below).
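
A rough sketch of step 1, reusing the addresses and flag spelling from the original report (kubeadm reset wipes the existing cluster, so only do this if you can afford to rebuild); step 2 is sketched after the next comment:

sudo kubeadm reset -f
sudo kubeadm init --apiserver-advertise-address=192.168.50.10 \
  --apiserver-cert-extra-sans=192.168.50.10 \
  --node-name k8s-master \
  --pod-network-cidr=192.168.0.0/16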

Because you’re using Vagrant I’m wondering if you maybe need to change the IP Autodetection method. I’ve seen it before with Vagrant where the hosts have multiple interfaces and Calico is choosing the wrong one. Here is a link to the reference docs for autodetection https://docs.projectcalico.org/v3.8/reference/node/configuration#interfaceinterface-regex.

I’m not sure if you will be able to download the Calico manifest, update it, and then provide it to Ansible, so you could probably just do the installation like you have been, then do a kubectl edit -n kube-system ds calico-node and add the env var IP_AUTODETECTION_METHOD with the value interface=eth.* (with the proper interface prefix).
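
A non-interactive equivalent of that edit, assuming kubectl set env fits your workflow (the interface values come from the comments above; adjust to your own NIC names):

kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD=interface=enp0s8
# or, matching by prefix as suggested above:
kubectl -n kube-system set env daemonset/calico-node IP_AUTODETECTION_METHOD='interface=eth.*'

Either way the DaemonSet rolls out new calico-node pods with the environment variable set, which ends up in the same state as the kubectl edit.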

I would imagine that the calico-node on your master is your one healthy calico-node pod, and that would be why you see it listening. The logs you included from calico-node show what looks to be the problem: they are unable to reach 10.96.0.1 (dial tcp 10.96.0.1:443: i/o timeout). That indicates that kube-proxy on the node is not working correctly. Kube-proxy is responsible for setting up the kubernetes service API address, which I believe 10.96.0.1 to be. You should check the kube-proxy logs on one of the nodes where calico-node is in CrashLoopBackOff. To see which nodes pods are running on, you can use kubectl get pods --all-namespaces -o wide. Another thing you can try is running that same curl command on one of the nodes; I’m guessing it would fail currently.
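
A sketch of those checks; the kube-proxy pod name below is just one from the listing above, so substitute the one running on the affected node, and the curl is only a generic connectivity test against the service address (the original command isn't quoted here):

kubectl get pods --all-namespaces -o wide        # map calico-node and kube-proxy pods to nodes
kubectl -n kube-system logs kube-proxy-6f6q9     # substitute the kube-proxy pod on the failing node
curl -vk https://10.96.0.1:443/                  # run on the failing node; a timeout points at kube-proxy/iptables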

@tmjd I had the same issue as in original message, your solution with CIDR resolved it. Thanks!

From that it looks like the proper service DNAT rules are present, assuming your master is at 192.168.50.10. Does a curl to 192.168.50.10:6443 (using the same path previously used) work?

If that works I’m a little at a loss as to what is wrong here. I expect the curl above will work because the kubelet running on that same node is reaching 192.168.50.10.

I noticed that the pod IP CIDR 192.168.0.0/16 overlaps with the addresses of your nodes; while I don’t think that should cause a problem, maybe it is? If the curl above works, then I’d consider changing the pod IP CIDR and seeing if that is the fix.
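
The exact path from that earlier curl isn't quoted here, so as a stand-in, a simple connectivity test from one of the crash-looping worker nodes could look like:

curl -vk https://192.168.50.10:6443/version

If the TCP connection and TLS handshake succeed (even if the response is an authorization error), the node can reach the API server directly, and the problem is more likely in the 10.96.0.1 service path handled by kube-proxy.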

@ekc We always recommend that the pod CIDR and the host CIDR should not overlap. I would not be surprised if that is the problem here. As for the size of the pod CIDR, feel free to change that as you see fit. The ‘block’ size that is handed out to each node is /26, so you should try to make sure the CIDR you use has enough /26 blocks for the number of nodes you will have. This isn’t a hard requirement though: if one node runs out, it can get another block or ‘borrow’ IPs from other blocks if no more blocks exist.
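
A small illustration of the /26 block sizing above (plain shell arithmetic, nothing Calico-specific), showing how many /26 IPAM blocks fit in a /16 pod CIDR such as 10.0.0.0/16:

echo $(( 1 << (26 - 16) ))   # 1024 blocks, far more than a 3-node cluster needs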

Hi, I had the same problem and the IP_AUTODETECTION_METHOD suggestion above solved it for me too. Thanks!

To resolve this problem, reboot all the nodes in your cluster, check the ifconfig output first, and make sure no CNI plugin interfaces (e.g. weave, calico) are left behind.

Then remove the CNI directories:

rm -rf /opt/cni/bin
rm -rf /etc/cni/net.d

If you init your cluster with kubeadm and Calico, try kubeadm init --pod-network-cidr=10.0.0.0/16, then kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml (and change the CIDR in calico.yaml to 10.0.0.0/16 to match).

Try again; it should work now.
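
If you go this route, the CIDR change in the manifest can be scripted instead of edited by hand. A rough sketch, assuming the downloaded v3.8 manifest still carries the default 192.168.0.0/16 value for CALICO_IPV4POOL_CIDR (as the pod description above suggests):

curl https://docs.projectcalico.org/v3.8/manifests/calico.yaml -O
sed -i 's#192.168.0.0/16#10.0.0.0/16#g' calico.yaml   # keep this in sync with --pod-network-cidr
kubectl apply -f calico.yaml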

That works even more easily with Calico 3.17.0 (and perhaps earlier versions too, though apparently only after your 3.8), as there is no need to edit the manifest anymore; the CIDR is autodetected, according to the documentation:

If you are using a different pod CIDR with kubeadm, no changes are required - Calico will automatically detect the CIDR based on the running configuration.

Also, @holosix's approach may require starting with kubeadm reset, if you can afford it.
