calico: Calico-node randomly fails with "Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout"
We run Kubernetes v1.15.3 on bare metal (installed via kubespray on CentOS 7 with Docker 19.03.3) and Calico v3.10. Unfortunately, we have a problem with Calico and can't figure out what is happening.
After a node reboot, the calico-node pods work correctly for some time (a couple of days or so), but then they suddenly fail with Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout. The same happens with calicoctl node status: out of 10 calls, 7-8 fail with this error and the rest succeed.
I can't see anything suspicious in the logs apart from this error, and I have no idea what is happening or how to troubleshoot it. Server I/O is near zero and network connectivity is OK.
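For anyone trying to quantify the intermittent failure, here is a minimal Python sketch that runs a command repeatedly and counts how many calls fail. The helper name count_failures and the 10-second per-call timeout are illustrative assumptions, not part of Calico. The only real command used is calicoctl node status from the report above, which typically needs to run as root on the affected node.

```python
import subprocess


def count_failures(cmd, attempts=10, timeout=10.0):
    """Run `cmd` `attempts` times and return (failures, last_error_text).

    A call counts as failed if it exits non-zero, cannot be started,
    or does not finish within `timeout` seconds.
    """
    failures = 0
    last_error = ""
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
        except (subprocess.TimeoutExpired, OSError) as exc:
            failures += 1
            last_error = str(exc)
            continue
        if result.returncode != 0:
            failures += 1
            # Keep whatever the tool printed, e.g. the bird.ctl i/o timeout.
            last_error = (result.stderr or result.stdout).strip()
    return failures, last_error


if __name__ == "__main__":
    failed, err = count_failures(["calicoctl", "node", "status"])
    print(f"{failed}/10 calls failed; last error: {err}")
```

Running this on both a healthy and an affected node gives a comparable failure count instead of an impression from manual retries.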
About this issue
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 18 (10 by maintainers)
Commits related to this issue
- Merge pull request #2992 from neiljerram/service-loop-flake Fix service loop prevention flake — committed to projectcalico/calico by deleted user 3 years ago
We hit the same problem with Calico 3.9.0: the calico-node pod shows 0/1 ready, but local routes are still updated as expected.
Describe calico-node pod event shows
The BIRD logs show no errors.
No, networking seems to be working correctly.
Just to check, is there actually a problem with networking on these clusters when in this state? It sounds like routes are being programmed correctly, just that the Calico readiness checks start failing with this error message?
Another interesting thing: I can easily connect to the bird6 socket, but cannot connect to the bird4 one (via nc or birdcl).
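The bird4 versus bird6 comparison can be scripted. The sketch below connects to a Unix stream socket and waits briefly for the first bytes from the daemon, so a socket that accepts connections but never answers is reported as a timeout rather than a hang. The function name probe_control_socket and the 2-second timeout are illustrative. The path /var/run/calico/bird.ctl comes from the error message in this issue, while /var/run/calico/bird6.ctl is an assumption about where the IPv6 daemon's socket lives. Run it wherever those paths are visible, for example inside the calico-node container.

```python
import socket


def probe_control_socket(path, timeout=2.0):
    """Connect to a Unix stream socket and wait for the first reply bytes.

    Returns (ok, detail). ok is True only if the peer sent some data
    before `timeout` seconds elapsed.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        data = sock.recv(1024)
    except socket.timeout:
        # Connected (or still connecting) but nothing arrived in time.
        return False, "timeout"
    except OSError as exc:
        # Missing socket file, permission denied, connection refused, ...
        return False, f"error: {exc}"
    finally:
        sock.close()
    if not data:
        return False, "closed without reply"
    return True, data.decode(errors="replace").strip()


if __name__ == "__main__":
    # bird6.ctl is an assumed path; adjust to match your deployment.
    for ctl in ("/var/run/calico/bird.ctl", "/var/run/calico/bird6.ctl"):
        ok, detail = probe_control_socket(ctl)
        print(f"{ctl}: {'ok' if ok else 'FAILED'} ({detail})")
```

If bird.ctl reports a timeout while bird6.ctl answers, that matches the observation above and suggests the IPv4 BIRD process is not servicing its control socket, even though routes are still being programmed.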