calico: pod calico-node on worker nodes with 'CrashLoopBackOff'
Expected Behavior
The calico-node pods on the worker nodes should have status ‘Running’.
Current Behavior
The calico-node pods on the worker nodes are in state ‘CrashLoopBackOff’.
vagrant@k8s-master:~$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system calico-kube-controllers-59f54d6bbc-w4w6f 1/1 Running 1 29h
kube-system calico-node-86mdg 0/1 CrashLoopBackOff 25 29h
kube-system calico-node-hzcsh 0/1 CrashLoopBackOff 26 29h
kube-system calico-node-q267c 1/1 Running 1 29h
kube-system coredns-5c98db65d4-m8wls 1/1 Running 1 29h
kube-system coredns-5c98db65d4-vdp4f 1/1 Running 1 29h
kube-system etcd-k8s-master 1/1 Running 1 29h
kube-system kube-apiserver-k8s-master 1/1 Running 1 29h
kube-system kube-controller-manager-k8s-master 1/1 Running 1 29h
kube-system kube-proxy-6f6q9 1/1 Running 1 29h
kube-system kube-proxy-prpqv 1/1 Running 1 29h
kube-system kube-proxy-qds8x 1/1 Running 1 29h
kube-system kube-scheduler-k8s-master 1/1 Running 1 29h
Steps to Reproduce (for bugs)
Basically, I followed the article from the Kubernetes blog, Kubernetes Setup Using Ansible and Vagrant, with minor modifications (generic/ubuntu1604 instead of bento/ubuntu-16.04, and Calico v3.8 instead of v3.4). See gists here. Note that I also tried the manual setup described in “Installing with the Kubernetes API datastore—50 nodes or less” and got the same error.
In short, I ran the following on the master node:
1. kubeadm init --apiserver-advertise-address="192.168.50.10" --apiserver-cert-extra-sans="192.168.50.10" --node-name k8s-master --pod-network-cidr=192.168.0.0/16
2. populate ~/.kube/config with /etc/kubernetes/admin.conf (see the commands sketched just after this list)
3. curl https://docs.projectcalico.org/v3.8/manifests/calico.yaml -O
4. kubectl apply -f calico.yaml
and then joined the worker nodes with kubeadm join 192.168.50.10:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
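Step 2, expanded as commands (a minimal sketch of the usual kubeadm post-init step; paths are the kubeadm defaults):
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config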
Context
This issue seems to prevent me from scheduling workloads to the worker nodes.
Running kubectl describe pod -n kube-system calico-node-86mdg shows
Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused
Here is the actual output
vagrant@k8s-master:~$ kubectl describe pod -n kube-system calico-node-86mdg
Name: calico-node-86mdg
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: node-2/192.168.50.12
Start Time: Wed, 10 Jul 2019 15:58:22 -0700
Labels: controller-revision-hash=844ddd97c6
k8s-app=calico-node
pod-template-generation=1
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Running
IP: 192.168.50.12
Controlled By: DaemonSet/calico-node
Init Containers:
upgrade-ipam:
Container ID: docker://b53dcaaf8a7cd71b242573c35ab654c83dc5daf5d7a10de1cb42623fe3fca567
Image: calico/cni:v3.8.0
Image ID: docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
Port: <none>
Host Port: <none>
Command:
/opt/cni/bin/calico-ipam
-upgrade
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jul 2019 20:31:28 -0700
Finished: Thu, 11 Jul 2019 20:31:28 -0700
Ready: True
Restart Count: 1
Environment:
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/lib/cni/networks from host-local-net-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
install-cni:
Container ID: docker://e374bb79296a23e83062d9d62cf8ea684e24aa3b634d1ec4948528672a9d18c7
Image: calico/cni:v3.8.0
Image ID: docker-pullable://calico/cni@sha256:decba0501ab0658e6e7da2f5625f1eabb8aba5690f9206caba3bf98caca5094c
Port: <none>
Host Port: <none>
Command:
/install-cni.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jul 2019 20:31:29 -0700
Finished: Thu, 11 Jul 2019 20:31:29 -0700
Ready: True
Restart Count: 0
Environment:
CNI_CONF_NAME: 10-calico.conflist
CNI_NETWORK_CONFIG: <set to the key 'cni_network_config' of config map 'calico-config'> Optional: false
KUBERNETES_NODE_NAME: (v1:spec.nodeName)
CNI_MTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
SLEEP: false
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
flexvol-driver:
Container ID: docker://27fb9711cc45fa2fc07ce9f53a9619329e2fd544df502320c1f11b36f0a9a0e0
Image: calico/pod2daemon-flexvol:v3.8.0
Image ID: docker-pullable://calico/pod2daemon-flexvol@sha256:6ec8b823e5ce3440318edfcdd2ab8b6660110782713f24f53dac5a3c227afb11
Port: <none>
Host Port: <none>
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 11 Jul 2019 20:31:30 -0700
Finished: Thu, 11 Jul 2019 20:31:30 -0700
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/host/driver from flexvol-driver-host (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Containers:
calico-node:
Container ID: docker://01ba662b148242659d4e2f0e43098efc70aefbdd1c0cdaf0619e7410853e2d88
Image: calico/node:v3.8.0
Image ID: docker-pullable://calico/node@sha256:6679ccc9f19dba3eb084db991c788dc9661ad3b5d5bafaa3379644229dca6b05
Port: <none>
Host Port: <none>
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 11 Jul 2019 21:47:46 -0700
Finished: Thu, 11 Jul 2019 21:48:56 -0700
Ready: False
Restart Count: 29
Requests:
cpu: 250m
Liveness: http-get http://localhost:9099/liveness delay=10s timeout=1s period=10s #success=1 #failure=6
Readiness: exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
DATASTORE_TYPE: kubernetes
WAIT_FOR_DATASTORE: true
NODENAME: (v1:spec.nodeName)
CALICO_NETWORKING_BACKEND: <set to the key 'calico_backend' of config map 'calico-config'> Optional: false
CLUSTER_TYPE: k8s,bgp
IP: autodetect
CALICO_IPV4POOL_IPIP: Always
FELIX_IPINIPMTU: <set to the key 'veth_mtu' of config map 'calico-config'> Optional: false
CALICO_IPV4POOL_CIDR: 192.168.0.0/16
CALICO_DISABLE_FILE_LOGGING: true
FELIX_DEFAULTENDPOINTTOHOSTACTION: ACCEPT
FELIX_IPV6SUPPORT: false
FELIX_LOGSEVERITYSCREEN: info
FELIX_HEALTHENABLED: true
Mounts:
/lib/modules from lib-modules (ro)
/run/xtables.lock from xtables-lock (rw)
/var/lib/calico from var-lib-calico (rw)
/var/run/calico from var-run-calico (rw)
/var/run/nodeagent from policysync (rw)
/var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-2t8lm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType:
var-run-calico:
Type: HostPath (bare host directory volume)
Path: /var/run/calico
HostPathType:
var-lib-calico:
Type: HostPath (bare host directory volume)
Path: /var/lib/calico
HostPathType:
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType: FileOrCreate
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
host-local-net-dir:
Type: HostPath (bare host directory volume)
Path: /var/lib/cni/networks
HostPathType:
policysync:
Type: HostPath (bare host directory volume)
Path: /var/run/nodeagent
HostPathType: DirectoryOrCreate
flexvol-driver-host:
Type: HostPath (bare host directory volume)
Path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
HostPathType: DirectoryOrCreate
calico-node-token-2t8lm:
Type: Secret (a volume populated by a Secret)
SecretName: calico-node-token-2t8lm
Optional: false
QoS Class: Burstable
Node-Selectors: beta.kubernetes.io/os=linux
Tolerations: :NoSchedule
:NoExecute
CriticalAddonsOnly
node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/pid-pressure:NoSchedule
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 13m (x207 over 73m) kubelet, node-2 Back-off restarting failed container
Warning Unhealthy 8m4s (x132 over 77m) kubelet, node-2 Liveness probe failed: Get http://localhost:9099/liveness: dial tcp 127.0.0.1:9099: connect: connection refused
Normal Pulled 3m (x23 over 78m) kubelet, node-2 Container image "calico/node:v3.8.0" already present on machine
Your Environment
- Calico version: 3.8
- Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.15.0
- Operating System and version: Ubuntu 16.04.6 LTS (Xenial Xerus)
- Link to your project (optional): none (see the gist with the Vagrant/Ansible sources)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 12
- Comments: 19 (7 by maintainers)
I had the exact same issue as @ekc, with quite similar logging and environment versions. I was running Ubuntu boxes on VirtualBox with two interfaces each; my issue was that etcd by default only allows connections from the CIDR of the first interface with a default gateway (the NAT one), so it was giving a “connection refused”. I ended up doing the following, which solved it:
- adding --api-advertise-addresses=<IP of the host-only interface (enp0s8)> to the kubeadm init command, as explained in this issue, and
- setting IP_AUTODETECTION_METHOD to interface=enp0s8 in the calico.yaml, as explained by @tmjd here.
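For reference, a minimal sketch of that fix, assuming the host-only interface is enp0s8 and the master’s host-only IP is 192.168.50.10 (as in the original report); recent kubeadm releases spell the flag --apiserver-advertise-address:
# On the master, advertise the API server on the host-only interface:
kubeadm init --apiserver-advertise-address=192.168.50.10 --pod-network-cidr=192.168.0.0/16
# In calico.yaml, add this env var to the calico-node container before applying it,
# so Calico picks the host-only interface instead of the NAT one:
#   - name: IP_AUTODETECTION_METHOD
#     value: "interface=enp0s8"
kubectl apply -f calico.yaml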
Because you’re using Vagrant, I’m wondering if you maybe need to change the IP autodetection method. I’ve seen it before with Vagrant, where the hosts have multiple interfaces and Calico chooses the wrong one. Here is a link to the reference docs for autodetection: https://docs.projectcalico.org/v3.8/reference/node/configuration#interfaceinterface-regex.
I’m not sure if you will be able to download the Calico manifest, update it, and then provide it to Ansible, so you could probably just do the installation like you have been, then do a
kubectl edit -n kube-system ds calico-node
and add the env var IP_AUTODETECTION_METHOD with value interface=eth.* (with the proper interface prefix).
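A minimal sketch of that change, assuming kubectl set env is acceptable as a shortcut for the same DaemonSet edit:
# Add the autodetection env var to the running DaemonSet (same effect as kubectl edit):
kubectl -n kube-system set env daemonset/calico-node "IP_AUTODETECTION_METHOD=interface=eth.*"
# Watch the calico-node pods restart with the new setting:
kubectl -n kube-system get pods -l k8s-app=calico-node -w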
I would imagine that the calico-node on your master is your one healthy calico-node pod, and that would be why you see it listening. The logs you included from calico-node show what looks to be the problem: it is unable to reach 10.96.0.1 (dial tcp 10.96.0.1:443: i/o timeout). That indicates that kube-proxy on the node is not working correctly; kube-proxy is responsible for setting up the Kubernetes service API address, which I believe 10.96.0.1 to be. You should check the kube-proxy logs on one of the nodes where calico-node is in CrashLoopBackOff. To see which nodes pods are running on, you can use kubectl get pods --all-namespaces -o wide. Another thing you can try is running that same curl command on one of the nodes; I’m guessing it would fail currently.
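A sketch of those checks (the pod name and URL path are illustrative; pick the kube-proxy pod that runs on the affected node):
# Find which node each pod runs on, then read that node's kube-proxy logs:
kubectl get pods --all-namespaces -o wide
kubectl -n kube-system logs kube-proxy-6f6q9
# From the affected node, check whether the API server's service VIP answers at all
# (-k skips certificate verification; a timeout here reproduces the calico-node error):
curl -k https://10.96.0.1:443/version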
@tmjd I had the same issue as in the original message; your solution with the CIDR resolved it. Thanks!
From that it looks like the proper service DNAT rules are present, assuming your master is at 192.168.50.10. Does a curl to 192.168.50.10:6443 (using the same path used previously) work?
If that works I’m a little at a loss as to what is wrong here. I expect the curl above will work, because the kubelet running on that same node is reaching 192.168.50.10.
I noticed that the pod IP CIDR 192.168.0.0/16 overlaps with the addresses of your nodes. While I don’t think that should cause a problem, maybe it does? If the curl above works, then I’d consider changing the pod IP CIDR and seeing if that is the fix.
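A sketch of that check from a worker node (the exact path used earlier in the thread isn’t shown here, so /version is a stand-in):
# Bypass the service VIP and talk to the API server address directly:
curl -k https://192.168.50.10:6443/version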
@ekc We always recommend that the pod CIDR and the host CIDR not overlap; I would not be surprised if that is the problem here. As for the size of the pod CIDR, feel free to change it as you see fit. The ‘block’ size that is handed out to each node is /26, so you should try to make sure the CIDR you use has enough /26 blocks for the number of nodes you will have. This isn’t a hard requirement, though: if one node runs out, it can get another block, or ‘borrow’ IPs from other blocks if no more blocks exist.
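As a rough worked example of that sizing guidance (assuming the default /26 block size):
# Number of /26 allocation blocks in a pod CIDR of a given prefix length:
echo $(( 2 ** (26 - 16) ))   # /16 pool -> 1024 blocks
echo $(( 2 ** (26 - 22) ))   # /22 pool ->   16 blocks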
Hi, I had the same problem and this solved it. Thanks!
That works even more easily with Calico 3.17.0 (and perhaps earlier versions, but apparently later than your 3.8), as there is no need to edit the manifest anymore; the interface is autodetected according to the documentation.
Also, the @holosix approach may require starting with kubeadm reset, if you can afford it.
To resolve this problem, reboot all of your cluster nodes and first check the ifconfig output to make sure no CNI plugin interfaces (e.g. weave, calico) are left. Then remove:
rm -rf /opt/cni/bin
rm -rf /etc/cni/net.d
If you init your kubeadm cluster with Calico, try this:
kubeadm init --pod-network-cidr=10.0.0.0/16
kubectl apply -f https://docs.projectcalico.org/v3.8/manifests/calico.yaml
-> please change the CIDR in the Calico YAML to 10.0.0.0/16. Try again; it works now.
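A rough consolidated sketch of that procedure (destructive: it wipes the existing cluster state; the sed assumes the manifest still carries the default 192.168.0.0/16 value under CALICO_IPV4POOL_CIDR):
# On every node, tear down kubeadm state and remove leftover CNI files, then reboot:
sudo kubeadm reset
sudo rm -rf /opt/cni/bin /etc/cni/net.d
sudo reboot
# On the master, re-initialise with a pod CIDR that does not overlap the host network:
sudo kubeadm init --pod-network-cidr=10.0.0.0/16
# Change CALICO_IPV4POOL_CIDR in the manifest to match, then apply it:
curl -O https://docs.projectcalico.org/v3.8/manifests/calico.yaml
sed -i 's#192.168.0.0/16#10.0.0.0/16#' calico.yaml
kubectl apply -f calico.yaml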