calico: calico-node pods are failing after upgrade from 3.22 to 3.23: felix is not ready: readiness probe reporting 503

Expected Behavior

Calico is working after upgrade to version 3.23.

Current Behavior

calico-node pods are failing after upgrade from 3.22 to 3.23:

$ kubectl describe po calico-node-pg52g
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  2m35s                  default-scheduler  Successfully assigned kube-system/calico-node-pg52g to dev-k8s-worker-r1
  Normal   Pulling    2m35s                  kubelet            Pulling image "docker.io/calico/cni:v3.23.3"
  Normal   Pulled     2m31s                  kubelet            Successfully pulled image "docker.io/calico/cni:v3.23.3" in 3.507665044s
  Normal   Created    2m31s                  kubelet            Created container install-cni
  Normal   Started    2m31s                  kubelet            Started container install-cni
  Normal   Pulling    2m28s                  kubelet            Pulling image "docker.io/calico/node:v3.23.3"
  Normal   Pulled     2m24s                  kubelet            Successfully pulled image "docker.io/calico/node:v3.23.3" in 3.680538035s
  Normal   Created    2m24s                  kubelet            Created container mount-bpffs
  Normal   Started    2m24s                  kubelet            Started container mount-bpffs
  Normal   Pulled     2m23s                  kubelet            Container image "docker.io/calico/node:v3.23.3" already present on machine
  Normal   Created    2m23s                  kubelet            Created container calico-node
  Normal   Started    2m23s                  kubelet            Started container calico-node
  Warning  Unhealthy  2m21s (x2 over 2m22s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  2m15s                  kubelet            Readiness probe failed: 2022-07-27 18:18:21.875 [INFO][438] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  2m5s  kubelet  Readiness probe failed: 2022-07-27 18:18:31.882 [INFO][665] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  115s  kubelet  Readiness probe failed: 2022-07-27 18:18:41.881 [INFO][975] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  105s  kubelet  Readiness probe failed: 2022-07-27 18:18:51.904 [INFO][1252] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  95s  kubelet  Readiness probe failed: 2022-07-27 18:19:01.907 [INFO][1565] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  85s  kubelet  Readiness probe failed: 2022-07-27 18:19:11.894 [INFO][1810] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  79s  kubelet  Readiness probe failed: 2022-07-27 18:19:17.203 [INFO][2018] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  75s  kubelet  Readiness probe failed: 2022-07-27 18:19:21.868 [INFO][2146] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  43s (x4 over 65s)  kubelet  (combined from similar events): Readiness probe failed: 2022-07-27 18:19:51.889 [INFO][3017] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503

The only logs that are not INFO:

2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.64
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.78
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.79
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.80
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.151.192
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.161.192
2022-07-27 18:19:21.026 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.182.192
2022-07-27 18:19:21.026 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.183.192
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.187.64
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.24.192
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.25.0
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.50.64
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.0
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.4
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.5
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.64.64
2022-07-27 18:19:21.066 [WARNING][2094] felix/daemon.go 1209: IPIP and/or VXLAN encapsulation changed, need to restart.
2022-07-27 18:19:21.066 [WARNING][2094] felix/daemon.go 715: Felix is shutting down reason="encapsulation changed"
2022-07-27 18:19:21.868 [WARNING][2094] felix/health.go 211: Reporter is not ready. name="int_dataplane"
2022-07-27 18:19:21.869 [WARNING][2094] felix/health.go 173: Health: not ready
2022-07-27 18:19:21.898 [WARNING][2094] felix/health.go 211: Reporter is not ready. name="int_dataplane"

Full log: https://gist.github.com/r0bj/1df72959f5f992efba3544fa5eb89d47

Calico manifest: https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml

Steps to Reproduce (for bugs)

  1. Run Calico version 3.22.1
  2. Upgrade Calico to version 3.23.3 using the manifest https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml
  3. The calico-node pods fail their readiness probes

Your Environment

  • Calico version: 3.23.3
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes (Client Version: v1.24.3, Kustomize Version: v4.5.4, Server Version: v1.24.3)
  • Operating System and version: Ubuntu 18.04

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 22 (13 by maintainers)

Most upvoted comments

Sounds like this might have been introduced by this PR? https://github.com/projectcalico/calico/pull/5576

Fits the code area as well as the time frame. Perhaps we’re not properly handling the difference between vxlanMode: "" and vxlanMode: Never for older clusters that don’t have that field set.

We ran into this as well. Modifying the default-ipv4-ippool ippools.crd.projectcalico.org object to add vxlanMode: Never fixed it for us. We’re probably going to roll back to 3.22 for now, though, until this is fixed.

This did not happen in a recent cluster, but it did happen when we rolled our code back about 6 months and then upgraded a cluster to our current configs.

Perfect, so that supports the theory that these pools were created prior to VXLANMode being an option, and the newest release is just not properly handling that case. I think @coutinhop’s fix for this in #6494 is probably good.

@coutinhop it occurs to me that we should look at doing read-time defaulting of that field in case there is any other code that might be hit by the same issue. We should be able to handle that in the Calico client code so any users of the client see “VXLANMode: Never” even if the underlying data doesn’t include the field.
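To illustrate the read-time defaulting idea, here is a minimal Go sketch. The VXLANMode and IPPoolSpec types and the defaultOnRead helper are simplified stand-ins invented for this example, not the real Calico client API, and the CIDR is hypothetical. The sketch only shows the general pattern: a pool stored without the field is presented to callers as "Never".

```go
package main

import "fmt"

// VXLANMode and IPPoolSpec are simplified stand-ins for illustration only;
// they are NOT the real Calico API types.
type VXLANMode string

const (
	VXLANModeNever       VXLANMode = "Never"
	VXLANModeAlways      VXLANMode = "Always"
	VXLANModeCrossSubnet VXLANMode = "CrossSubnet"
)

type IPPoolSpec struct {
	CIDR      string
	VXLANMode VXLANMode
}

// defaultOnRead (hypothetical helper) returns a copy of the spec in which an
// unset VXLANMode is replaced by "Never". Explicitly set values are untouched.
func defaultOnRead(spec IPPoolSpec) IPPoolSpec {
	if spec.VXLANMode == "" {
		spec.VXLANMode = VXLANModeNever
	}
	return spec
}

func main() {
	// A pool created before the field existed: VXLANMode was never written.
	legacy := IPPoolSpec{CIDR: "10.205.0.0/16"} // hypothetical CIDR

	// After defaulting, code that compares modes sees "Never" instead of "".
	fmt.Printf("stored=%q read=%q\n", legacy.VXLANMode, defaultOnRead(legacy).VXLANMode)
}
```

Applying the default at the point where data is read means every consumer sees the same value, so an old pool and a new pool with vxlanMode: Never would compare as equal.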

Thanks @caseydavenport, so it seems like the issue is indeed that. I’ll work on the fix!

One thing I’m not sure about - does the cluster use etcd mode or k8s CRD mode? Or are we talking about two different clusters here?

@r0bj is using etcd mode and used calicoctl to get the felixconfig and ip pools, @mikesplain is using kdd and used kubectl, so yeah 2 different clusters with the same issue, I think…

@coutinhop Sure, here is the calico-node log with LogSeverityScreen:"Debug": calico-node.log