calico: calico-node unexplained readiness/liveness probe failures

Hi,

Our calico-node pods on Kubernetes continuously report liveness/readiness probe failures. We can reproduce the issue by scaling deployments up or down.

If we set the liveness/readiness probe timeouts to 60 seconds, the failures stop, but we are trying to find the root cause of the probe failures. With the timeout set to 1 second we get a large number of failures, and from time to time the failed checks cause calico-node pods to restart.

We have tried Debian 9, Ubuntu 20.04, and CentOS 7 as the cluster OS, so we do not believe the problem is OS related. We also see the same failures, though rarely, on clusters running on VMware infrastructure; those clusters also use a 1 second probe timeout.

Example events output showing the failures:

# kubectl get events --sort-by='.lastTimestamp' -n kube-system
LAST SEEN   TYPE      REASON         OBJECT                                           MESSAGE
2m5s        Warning   Unhealthy      pod/calico-node-j7v9v                            Readiness probe failed:
2m4s        Warning   Unhealthy      pod/calico-node-6hfj4                            Readiness probe failed:
2m4s        Warning   Unhealthy      pod/calico-node-6hfj4                            Liveness probe failed:
98s         Warning   Unhealthy      pod/calico-node-2bkx2                            Liveness probe failed:
97s         Warning   Unhealthy      pod/calico-node-2bkx2                            Readiness probe failed:
11s         Warning   Unhealthy      pod/calico-node-7qhrh                            Readiness probe failed:

Expected Behavior

There should not be this many liveness/readiness probe failures.

Current Behavior

There are frequent liveness/readiness probe failures.

Steps to Reproduce (for bugs)

  1. Set up a Kubernetes cluster with kubespray on an OpenStack environment.
  2. Apply some deployments.
  3. Monitor the kubectl events.
  4. Observe frequent liveness/readiness probe failures.

Context

We see random connection errors from our pods and suspect they are related to this issue.

Your Environment

  • Infrastructure: OpenStack Victoria Release
  • Calico version: v3.20.2
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes v1.22.3 (installed via kubespray 2.17.1)
  • Operating System and version: Ubuntu 20.04.3 LTS (Kernel: 5.4.0-90-generic)
  • Container RunTime: containerd://1.4.11

Most upvoted comments

In v3.20 we increased probe timeouts from 1s to 10s. Part of the reason is that in k8s 1.20, exec probe timeouts were fixed to actually respect the timeout values.

A 1s timeout is too low for loaded clusters: https://github.com/projectcalico/calico/pull/4788

What is the load like on your nodes? Can you try a value between 1s and 60s?
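
As a rough sketch (assuming a manifest-based install where the DaemonSet is named calico-node in kube-system and calico-node is the first container in the pod spec), the timeouts can be bumped with a JSON patch:

# kubectl -n kube-system patch daemonset calico-node --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'

Note that kubespray templates the calico manifests itself, so a patched value may be overwritten the next time the playbook runs.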

My question is: how do I tell whether the lack of Typha is causing issues in my cluster right now?

You can look at load on the API server pods, which would be the primary indicator of needing Typha.
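
A quick way to eyeball that (assuming metrics-server is installed and the API server pods carry the usual component=kube-apiserver label):

# kubectl -n kube-system top pods -l component=kube-apiserver
# kubectl top nodes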

You said that if even a single peer is established, the probe should succeed. Why do you think so?

Because that’s how the code works - we wait 30 seconds for all peers to establish, after which we reduce the requirement to a single established peer. It’s explained a bit in the comment in the code here: https://github.com/projectcalico/calico/blob/master/node/pkg/health/health.go#L198-L201
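
One way to see which peers BIRD has actually established is calicoctl node status (it has to run on the node itself, as root, because it reads the local BIRD socket):

# calicoctl node status

Peers listed in any state other than Established are the ones the probe is still waiting on during that initial window.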

That said, the probes are failing with no output, which suggests it’s not an explicit failure but rather an implicit one - failing to respond to the probe in time, or something like that.

Any thoughts on how I can debug further?

Looks like both readiness and liveness are failing - could you share what those probes look like in your cluster?
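
For reference, this is how you could dump the probe definitions and run the readiness command by hand (the flags below are the ones the stock calico-node manifests use for the exec probes, and the pod name is taken from the events above; adjust for your install):

# kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
# kubectl -n kube-system exec calico-node-j7v9v -- /bin/calico-node -bird-ready -felix-ready; echo $?

If the manual run prints a reason and exits non-zero, the failure is explicit; if it succeeds quickly by hand but the probe still flakes, that points back at the probe timing out.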

My best guess is that you just need to increase the CPU / memory limits on your calico/node pods for a cluster of your size: the fact that increasing the timeout reduces the flakes suggests resource starvation of the calico/node pods (but not necessarily of the entire node). Also, given you’re running a cluster of 50+ nodes, you should make sure that Typha is installed and that you’re using either VXLAN or BGP with route reflectors or peering to your top-of-rack switches - anything but a full mesh, which performs poorly at around the 50-100 node mark.
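
Both points are easy to check (a sketch; the resource block location and the calico-typha deployment name assume the standard manifests):

# kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'
# kubectl -n kube-system get deployment calico-typha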

@caseydavenport @yusufgungor I am observing something similar. Due to rapid autoscaling, my calico-node pods face probe failures because BIRD is not able to connect to all of its BGP peers.
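
One way to confirm that correlation is to watch the probe-failure events while the autoscaler is acting (standard kubectl; the field selector simply filters for Unhealthy events):

# kubectl -n kube-system get events --field-selector reason=Unhealthy -w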