calico: calico-node unexplained readiness/liveness probe failures
Hi,
Our calico-node pods on Kubernetes continuously suffer from liveness/readiness probe failures. We can reproduce the issue by scaling deployments up or down.
If we set the liveness/readiness probe timeouts to 60 seconds we can prevent the failures, but we are trying to find the root cause of the failed probe checks. With the timeout set to 1 second we get far too many failures, and from time to time these failed checks cause the calico-node pods to restart.
We have tried Debian 9, Ubuntu 20.04, and CentOS 7 as the cluster OS, so we do not think this is OS-related. We also encounter the same failures, though rarely, on clusters installed on VMware infrastructure; those clusters also use 1-second probe timeouts.
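For reference, a timeout change like this can be applied with a patch along the following lines (a sketch only, assuming calico-node is the first container in the DaemonSet pod spec; adjust the index and values for your manifest):
# kubectl -n kube-system patch daemonset calico-node --type=json -p '[
   {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 60},
   {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 60}]'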
Example failed events output:
# kubectl get events --sort-by='.lastTimestamp' -n kube-system
LAST SEEN TYPE REASON OBJECT MESSAGE
2m5s Warning Unhealthy pod/calico-node-j7v9v Readiness probe failed:
2m4s Warning Unhealthy pod/calico-node-6hfj4 Readiness probe failed:
2m4s Warning Unhealthy pod/calico-node-6hfj4 Liveness probe failed:
98s Warning Unhealthy pod/calico-node-2bkx2 Liveness probe failed:
97s Warning Unhealthy pod/calico-node-2bkx2 Readiness probe failed:
11s Warning Unhealthy pod/calico-node-7qhrh Readiness probe failed:
Expected Behavior
There should not be so many liveness/readiness probe failures.
Current Behavior
There are too many liveness/readiness probe failures.
Steps to Reproduce (for bugs)
- Set up a Kubernetes cluster with Kubespray on an OpenStack environment.
- Apply some deployments.
- Monitor the kubectl events (see the example commands after this list).
- Observe too many liveness/readiness probe failures.
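For example, the failures can be triggered and watched with commands along these lines (the deployment name is just a placeholder):
# kubectl scale deployment <some-deployment> --replicas=20
# kubectl get events -n kube-system --field-selector reason=Unhealthy -w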
Context
We were getting random connection errors from our pods and thought they might be related to this.
Your Environment
- Infrastructure: OpenStack Victoria Release
- Calico version: v3.20.2
- Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes v1.22.3 (installed via kubespray 2.17.1)
- Operating System and version: Ubuntu 20.04.3 LTS (Kernel: 5.4.0-90-generic)
- Container RunTime: containerd://1.4.11
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (10 by maintainers)
In v3.20 we increased probe timeouts from 1s to 10s. Part of the reason is that in k8s 1.20, exec probe timeouts were fixed to actually respect the timeout values.
A 1s timeout is too low for loaded clusters: https://github.com/projectcalico/calico/pull/4788
What is the load like on your nodes? Can you try a value between 1s and 60s?
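For example, assuming metrics-server is deployed, overall node load can be checked with:
# kubectl top nodes
and an intermediate timeout (e.g. 10s) can be tried with the same kind of DaemonSet patch shown above.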
You can look at load on the API server pods, which would be the primary indicator of needing Typha.
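On a kubeadm-based cluster (which is what kubespray sets up), something like the following should list the API server pods and their usage, again assuming metrics-server is available:
# kubectl -n kube-system top pods -l component=kube-apiserver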
Because that’s how the code works: we wait 30 seconds for all peers to establish, after which we reduce the requirement to a single node. It’s explained a bit in the comment in the code here: https://github.com/projectcalico/calico/blob/master/node/pkg/health/health.go#L198-L201
That said, the probe is failing with no output, which suggests it’s not an explicit failure but rather an implicit one due to failing to respond to the probe in time, or something like that.
Looks like both readiness and liveness are failing - could you share what those probes look like in your cluster?
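One way to dump the configured probes, assuming the default manifest where calico-node is the only container in the DaemonSet pod spec:
# kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'
# kubectl -n kube-system get daemonset calico-node -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}'
If the default exec probes are in use, the readiness check can also be run by hand inside a pod to see what it prints:
# kubectl -n kube-system exec <calico-node-pod> -- /bin/calico-node -bird-ready -felix-ready; echo $?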
My best guess is that you just need to increase the CPU/memory limits on your calico/node pods for a cluster of your size, because increasing the timeout reduces the flakes, which suggests resource starvation for the calico/node pods (but not necessarily the entire node). Also, given you’re running a cluster of 50+ nodes, you should make sure that you have Typha installed and that you’re using either VXLAN or BGP with route reflectors or peering to your top-of-rack switches (anything but full mesh, which performs poorly at around the 50-100 node mark).
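As a rough sketch (the numbers are illustrative only, not a recommendation), requests and limits on the calico-node container can be raised, and the presence of Typha checked, with something like:
# kubectl -n kube-system set resources daemonset calico-node -c calico-node --requests=cpu=250m,memory=256Mi --limits=cpu=1,memory=512Mi
# kubectl -n kube-system get deployment calico-typha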
@caseydavenport @yusufgungor I am observing something similar. Due to rapid autoscaling, my calico pods face probe failures because Bird is not able to connect to all its BGP peers.
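For anyone hitting the same thing, the BGP session state that feeds the bird readiness check can be inspected on an affected node (calicoctl node status has to run on the node itself, since it reads the local BIRD socket):
# calicoctl node status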