calico: CrashLoopBackOff on calico pods caused by loopback interface being down

We run a Kubernetes cluster with calico, and sporadically (roughly once a month) the cluster goes down. The calico pod on one of the nodes goes into a perpetual CrashLoopBackOff, and that node’s loopback interface is down (the node can’t even ping itself).

One thing to note is that we run around 200 pods on that node.

Expected Behavior

The cluster should be stable.

Current Behavior

Output of ip address show lo on the node:

lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state DOWN group default qlen 1000

Output of calicoctl node status on the node:

$ sudo calicoctl node status
Calico process is not running.

Possible Solution

What we do to fix this is run ip link set dev lo up to manually bring the loopback interface back up, and then restart the calico pod.
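
For reference, a minimal sketch of the manual recovery (the pod name is a placeholder; we assume calico-node runs as a DaemonSet in the kube-system namespace, as in the default manifests, so deleting the pod causes it to be recreated):

$ sudo ip link set dev lo up                            # bring the loopback interface back up
$ ip address show lo                                    # verify the interface is up again
$ kubectl -n kube-system delete pod calico-node-xxxxx   # placeholder pod name; the DaemonSet recreates it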

Steps to Reproduce (for bugs)

The issue pops up sporadically.

Context

Our guess is that this is related to us running more than the default Kubernetes limit of 110 pods on a node.
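
As a rough way to check this guess, the following commands show how many pods are scheduled on the node and what pod limit the kubelet reports (the node name is a placeholder, and the kubelet config path assumes a kubeadm-style setup):

$ kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o name | wc -l   # pods currently on the node
$ kubectl describe node <node-name> | grep -A 5 -i capacity                                      # reported pod capacity
$ grep maxPods /var/lib/kubelet/config.yaml                                                      # kubelet maxPods setting, if configured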

Your Environment

  • Calico version 3.17.0
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes 1.19.4
  • Operating System and version: Ubuntu 18.04.4 LTS (GNU/Linux 4.15.0-126-generic x86_64)
  • Link to your project (optional): https://github.com/gesiscss/orc

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 6
  • Comments: 17 (10 by maintainers)

Most upvoted comments

@fasaxc a similar problem might exist for containerd: https://github.com/containerd/containerd/issues/5630