calico: Calico-node randomly fails with "Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout"

We use Kubernetes v1.15.3 on bare metal (installed via kubespray on CentOS 7 with Docker 19.03.3) and Calico v3.10. Unfortunately, we have a problem with Calico and can't figure out what is happening.

After a node reboot, the calico-node pods work correctly for some time (a couple of days or so), but then they suddenly start failing with Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout. The same happens with calicoctl node status: out of 10 calls, 7-8 fail with this error and the rest succeed.

I can't see anything suspicious in the logs apart from this error, and I have no idea what is happening or how to troubleshoot it. Server I/O is near zero and network connectivity is fine.
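For what it's worth, the failure rate is easy to quantify with a small loop like this (a rough sketch; it assumes calicoctl is on the PATH of the node you run it on):

ok=0; fail=0
for i in $(seq 1 10); do
  # each call talks to bird.ctl; failures show the i/o timeout above
  if calicoctl node status >/dev/null 2>&1; then ok=$((ok+1)); else fail=$((fail+1)); fi
done
echo "ok=$ok fail=$fail"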

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 18 (10 by maintainers)

Most upvoted comments

We hit the same problem here with Calico 3.9.0: the calico-node pod status shows 0/1, but local routes still change as desired (see the route check after the listing below).

[root@master01 ~]# kubectl -n kube-system get po -o wide|grep calico-node
calico-node-2hjzz                          1/1     Running   0          18m    192.168.16.218   node08          <none>           <none>
calico-node-5pn2k                          1/1     Running   0          18m    192.168.16.219   node09          <none>           <none>
calico-node-84x27                          0/1     Running   0          18m    192.168.16.208   master02        <none>           <none>
calico-node-cpgnd                          0/1     Running   0          18m    192.168.16.216   node06          <none>           <none>
calico-node-f4sbx                          0/1     Running   0          18m    192.168.16.207   master01        <none>           <none>
calico-node-lwpmk                          0/1     Running   0          18m    192.168.16.212   node02          <none>           <none>
calico-node-m4z9z                          1/1     Running   0          5m1s   192.168.16.181   devops-node02   <none>           <none>
calico-node-nnklk                          1/1     Running   0          18m    192.168.16.217   node07          <none>           <none>
calico-node-qgflg                          0/1     Running   0          17m    192.168.16.211   node01          <none>           <none>
calico-node-qlrgd                          1/1     Running   0          18m    192.168.16.209   master03        <none>           <none>
calico-node-rq9nx                          0/1     Running   0          18m    192.168.16.214   node04          <none>           <none>
calico-node-vjbh6                          1/1     Running   0          5m1s   192.168.16.180   devops-node01   <none>           <none>
calico-node-wkprd                          1/1     Running   0          18m    192.168.16.215   node05          <none>           <none>
calico-node-wzv6m                          1/1     Running   0          18m    192.168.16.220   node10          <none>           <none>
calico-node-x2svs                          1/1     Running   0          18m    192.168.16.213   node03          <none>           <none>

[root@master01 ~]# kubectl get po --all-namespaces |wc -l
9937
# most of these pods are on shallow (virtual) nodes; only about 800 pods run on the real nodes, and calico-node runs only on the real nodes.
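To confirm that routes really are still being programmed while the pods report 0/1, you can list the kernel routes installed by BIRD on an affected node (routes it programs are tagged with proto bird):

# Routes installed by BIRD carry "proto bird" in the kernel table.
ip route show proto bird | head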

[root@master01 ~]# ll /var/run/calico/bird.ctl
srw-rw---- 1 root root 0 Dec 25 14:47 /var/run/calico/bird.ctl
[root@master01 ~]# nc -U /var/run/calico/bird.ctl
(hangs with no response)
[root@master01 ~]# nc -U /var/run/calico/bird6.ctl
0001 BIRD v0.3.3+birdv1.6.3 ready.
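The same hang reproduces with BIRD's own client instead of nc (assuming birdcl is available in the calico-node image, and using the master01 pod name from the listing above):

# birdcl -s points the client at an explicit control socket.
kubectl -n kube-system exec calico-node-f4sbx -- birdcl -s /var/run/calico/bird.ctl show protocols
# The equivalent command against bird6.ctl returns immediately.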

[root@master01 ~]# ./calicoctl node status
Calico process is running.

IPv4 BGP status
Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout

Describing the calico-node pod shows these events:

  Warning  Unhealthy  3m28s (x3 over 3m48s)  kubelet, node10    Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  106s (x10 over 3m16s)  kubelet, node10    Readiness probe failed: calico/node is not ready: BIRD is not ready: Error executing command: read unix @->/var/run/calico/bird.ctl: i/o timeout
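For reference, the check the probe runs can also be executed by hand inside an affected pod; this is the same exec probe the standard calico-node manifests of this era use:

kubectl -n kube-system exec calico-node-f4sbx -- /bin/calico-node -bird-ready -felix-ready
echo $?   # non-zero while the bird.ctl reads are timing out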

The BIRD logs show no errors:

# tail -f current
2019-12-25_07:01:15.22783 bird: Mesh_192_168_16_180: State changed to start
2019-12-25_07:01:15.22792 bird: Mesh_192_168_16_181: State changed to start
2019-12-25_07:01:20.05016 bird: Mesh_192_168_16_180: State changed to feed
2019-12-25_07:01:20.05022 bird: Mesh_192_168_16_180: State changed to up
2019-12-25_07:01:38.41523 bird: I/O loop cycle took 16893 ms for 21 events
2019-12-25_07:01:38.41939 bird: Mesh_192_168_16_181: State changed to feed
2019-12-25_07:01:38.41942 bird: Mesh_192_168_16_181: State changed to up
2019-12-25_07:01:52.63615 bird: I/O loop cycle took 13216 ms for 11 events
2019-12-25_07:02:37.56742 bird: I/O loop cycle took 43929 ms for 47 events
2019-12-25_07:02:49.55021 bird: I/O loop cycle took 10606 ms for 10 events
2019-12-25_07:03:13.16625 bird: I/O loop cycle took 22633 ms for 25 events
2019-12-25_07:03:29.74717 bird: Reconfiguration requested by SIGHUP
2019-12-25_07:03:29.74893 bird: Reconfiguring
2019-12-25_07:03:29.74895 bird: Reloading protocol kernel1
2019-12-25_07:03:29.74895 bird: device1: Reconfigured
2019-12-25_07:03:29.74895 bird: direct1: Reconfigured
2019-12-25_07:03:29.74895 bird: Mesh_192_168_16_180: Reconfigured
2019-12-25_07:03:29.74896 bird: Mesh_192_168_16_181: Reconfigured
2019-12-25_07:03:29.74896 bird: Mesh_192_168_16_207: Reconfigured
2019-12-25_07:03:29.74896 bird: Mesh_192_168_16_208: Reconfigured
2019-12-25_07:03:29.74896 bird: Mesh_192_168_16_209: Reconfigured
2019-12-25_07:03:29.74897 bird: Mesh_192_168_16_211: Reconfigured
2019-12-25_07:03:29.74897 bird: Mesh_192_168_16_212: Reconfigured
2019-12-25_07:03:29.74897 bird: Mesh_192_168_16_213: Reconfigured
2019-12-25_07:03:29.74897 bird: Mesh_192_168_16_214: Reconfigured
2019-12-25_07:03:29.74897 bird: Mesh_192_168_16_215: Reconfigured
2019-12-25_07:03:29.74898 bird: Mesh_192_168_16_216: Reconfigured
2019-12-25_07:03:29.74898 bird: Mesh_192_168_16_217: Reconfigured
2019-12-25_07:03:29.74898 bird: Mesh_192_168_16_218: Reconfigured
2019-12-25_07:03:29.74898 bird: Mesh_192_168_16_219: Reconfigured
2019-12-25_07:03:29.74899 bird: Reconfigured
2019-12-25_07:03:29.95653 bird: I/O loop cycle took 15774 ms for 20 events
2019-12-25_07:03:45.86606 bird: I/O loop cycle took 15634 ms for 17 events
2019-12-25_07:03:53.82893 bird: I/O loop cycle took 6969 ms for 7 events
2019-12-25_07:04:21.46117 bird: I/O loop cycle took 26629 ms for 27 events
2019-12-25_07:04:42.19941 bird: I/O loop cycle took 19734 ms for 21 events
2019-12-25_07:04:49.57519 bird: I/O loop cycle took 6371 ms for 7 events
2019-12-25_07:05:09.49816 bird: I/O loop cycle took 19028 ms for 23 events
2019-12-25_07:05:41.92265 bird: I/O loop cycle took 31668 ms for 31 events
2019-12-25_07:05:57.49262 bird: I/O loop cycle took 14564 ms for 16 events
2019-12-25_07:06:16.14394 bird: I/O loop cycle took 17648 ms for 18 events
2019-12-25_07:06:32.22051 bird: I/O loop cycle took 15075 ms for 14 events
2019-12-25_07:06:55.98058 bird: I/O loop cycle took 22757 ms for 25 events
2019-12-25_07:07:07.71316 bird: I/O loop cycle took 10731 ms for 11 events
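Those "I/O loop cycle took N ms" lines look like the real clue: BIRD 1.x runs a single-threaded event loop, so cycles of 10-40 seconds mean the control socket can't be serviced before calicoctl's read timeout fires, even though nothing is logged as an error. A quick way to pull out the worst stalls from the same log (field 7 is the millisecond value in these lines):

grep 'I/O loop cycle took' current | awk '{print $7}' | sort -n | tail -5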

[root@master01 ~]# netstat -anlp|grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.217:35214    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:179      192.168.16.218:39852    ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.212:38152    ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.208:45940    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:179      192.168.16.220:48805    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:45392    192.168.16.180:179      ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.216:34553    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:179      192.168.16.213:52409    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:179      192.168.16.219:47170    ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.211:52727    ESTABLISHED 31107/bird
tcp        0      0 192.168.16.207:179      192.168.16.214:45433    ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.209:36964    ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:37388    192.168.16.181:179      ESTABLISHED 31107/bird
tcp       19      0 192.168.16.207:179      192.168.16.215:39181    ESTABLISHED 31107/bird
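Also note the Recv-Q of 19 bytes sitting on many of the BGP sessions: unread data in the receive queue points at BIRD not reading from its sockets, not at a network problem. Watching it for a minute shows whether it ever drains:

watch -n1 "netstat -ant | grep ':179 '"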

No, the network interactions seem to be working correctly.

Just to check, is there actually a problem with networking on these clusters when in this state? It sounds like routes are being programmed correctly, just that the Calico readiness checks start failing with this error message?

Another interesting thing: I can easily connect to the bird6 socket, but cannot connect to the IPv4 bird socket (via nc or birdcl):

$ nc -U /var/run/calico/bird.ctl
Ncat: Invalid argument.

$ nc -U /var/run/calico/bird6.ctl
0001 BIRD v0.3.3+birdv1.6.7 ready.
^C
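When it's in this state, it may also be worth looking at the listening unix sockets themselves (Ncat's "Invalid argument" above suggests the connect itself failed rather than timing out); ss can show how bird.ctl compares with bird6.ctl:

# Listening unix sockets, with owning process; compare bird.ctl vs bird6.ctl.
ss -xlp | grep bird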