calico: calico-node pods are failing after upgrade from 3.22 to 3.23: felix is not ready: readiness probe reporting 503
Expected Behavior
Calico is working after upgrade to version 3.23.
Current Behavior
calico-node pods are failing after upgrade from 3.22 to 3.23:
$ kubectl describe po calico-node-pg52g
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m35s default-scheduler Successfully assigned kube-system/calico-node-pg52g to dev-k8s-worker-r1
Normal Pulling 2m35s kubelet Pulling image "docker.io/calico/cni:v3.23.3"
Normal Pulled 2m31s kubelet Successfully pulled image "docker.io/calico/cni:v3.23.3" in 3.507665044s
Normal Created 2m31s kubelet Created container install-cni
Normal Started 2m31s kubelet Started container install-cni
Normal Pulling 2m28s kubelet Pulling image "docker.io/calico/node:v3.23.3"
Normal Pulled 2m24s kubelet Successfully pulled image "docker.io/calico/node:v3.23.3" in 3.680538035s
Normal Created 2m24s kubelet Created container mount-bpffs
Normal Started 2m24s kubelet Started container mount-bpffs
Normal Pulled 2m23s kubelet Container image "docker.io/calico/node:v3.23.3" already present on machine
Normal Created 2m23s kubelet Created container calico-node
Normal Started 2m23s kubelet Started container calico-node
Warning Unhealthy 2m21s (x2 over 2m22s) kubelet Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning Unhealthy 2m15s kubelet Readiness probe failed: 2022-07-27 18:18:21.875 [INFO][438] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 2m5s kubelet Readiness probe failed: 2022-07-27 18:18:31.882 [INFO][665] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 115s kubelet Readiness probe failed: 2022-07-27 18:18:41.881 [INFO][975] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 105s kubelet Readiness probe failed: 2022-07-27 18:18:51.904 [INFO][1252] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 95s kubelet Readiness probe failed: 2022-07-27 18:19:01.907 [INFO][1565] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 85s kubelet Readiness probe failed: 2022-07-27 18:19:11.894 [INFO][1810] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 79s kubelet Readiness probe failed: 2022-07-27 18:19:17.203 [INFO][2018] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 75s kubelet Readiness probe failed: 2022-07-27 18:19:21.868 [INFO][2146] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 43s (x4 over 65s) kubelet (combined from similar events): Readiness probe failed: 2022-07-27 18:19:51.889 [INFO][3017] confd/health.go 180: Number of node(s) with BGP peering established = 10
calico/node is not ready: felix is not ready: readiness probe reporting 503
The only logs that are not INFO:
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.64
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.78
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.79
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.150.80
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.151.192
2022-07-27 18:19:21.025 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.161.192
2022-07-27 18:19:21.026 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.182.192
2022-07-27 18:19:21.026 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.183.192
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.187.64
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.24.192
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.25.0
2022-07-27 18:19:21.027 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.50.64
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.0
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.4
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.59.5
2022-07-27 18:19:21.028 [WARNING][2094] felix/l3_route_resolver.go 645: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=10.205.64.64
2022-07-27 18:19:21.066 [WARNING][2094] felix/daemon.go 1209: IPIP and/or VXLAN encapsulation changed, need to restart.
2022-07-27 18:19:21.066 [WARNING][2094] felix/daemon.go 715: Felix is shutting down reason="encapsulation changed"
2022-07-27 18:19:21.868 [WARNING][2094] felix/health.go 211: Reporter is not ready. name="int_dataplane"
2022-07-27 18:19:21.869 [WARNING][2094] felix/health.go 173: Health: not ready
2022-07-27 18:19:21.898 [WARNING][2094] felix/health.go 211: Reporter is not ready. name="int_dataplane"
Full log: https://gist.github.com/r0bj/1df72959f5f992efba3544fa5eb89d47
Calico manifest: https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml
Steps to Reproduce (for bugs)
- Running calico in version 3.22.1
- Upgrade calico to version 3.23.3 from manifest https://projectcalico.docs.tigera.io/manifests/calico-etcd.yaml
- calico-node pods are failing
Your Environment
- Calico version: 3.23.3
- Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes (Client Version: v1.24.3, Kustomize Version: v4.5.4, Server Version: v1.24.3)
- Operating System and version: Ubuntu 18.04
About this issue
- State: closed
- Created 2 years ago
- Comments: 22 (13 by maintainers)
Sounds like this might have been introduced by this PR? https://github.com/projectcalico/calico/pull/5576
Fits the code area as well as the time frame. Perhaps we’re not properly handling the difference between vxlanMode: "" and vxlanMode: Never for older clusters that don’t have that field set.
We ran into this as well; modifying the default-ipv4-ippool ippools.crd.projectcalico.org object to add vxlanMode: Never fixed it for us. We’re probably going to roll back to 3.22 for now, though, until this is fixed.
This did not happen on a recent cluster, but it did happen when we rolled our code back about 6 months and then upgraded a cluster to our current configs.
@coutinhop looks like we already have a good hook to do this in: https://github.com/projectcalico/calico/blob/0bfeb0f2c7e77b3cf21eaebd5996e514e28d242c/libcalico-go/lib/clientv3/ippool.go#L243-L244
Perfect, so that supports the theory that these pools were created prior to VXLANMode being an option and the newest release is just not properly handling that case, so I think @coutinhop’s fix for this in #6494 is probably good.
@coutinhop it occurs to me that we should look at doing read-time defaulting of that field in case there is any other code that might be hit by the same issue. We should be able to handle that in the Calico client code so any users of the client see “VXLANMode: Never” even if the underlying data doesn’t include the field.
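To make the read-time defaulting idea concrete, here is a minimal sketch in Go. It is not the actual change from #6494 or the real libcalico-go code: the types VXLANMode and IPPoolSpec are simplified stand-ins, the helper names defaultVXLANMode and usesVXLAN are hypothetical, and the CIDR is an arbitrary example value. It only illustrates why a pool stored without the field can be misread by a caller that compares against Never, and how defaulting on read would hide that difference from such callers.

```go
package main

import "fmt"

// VXLANMode and IPPoolSpec are simplified stand-ins for illustration only;
// they are not the real Calico API types.
type VXLANMode string

const (
	VXLANModeNever       VXLANMode = "Never"
	VXLANModeAlways      VXLANMode = "Always"
	VXLANModeCrossSubnet VXLANMode = "CrossSubnet"
)

type IPPoolSpec struct {
	CIDR      string
	VXLANMode VXLANMode
}

// defaultVXLANMode (hypothetical helper) returns a copy of the spec in which
// an unset VXLANMode is reported as Never. The caller's value is not modified,
// which mirrors defaulting at read time rather than rewriting stored data.
func defaultVXLANMode(spec IPPoolSpec) IPPoolSpec {
	if spec.VXLANMode == "" {
		spec.VXLANMode = VXLANModeNever
	}
	return spec
}

// usesVXLAN (hypothetical caller) only compares against Never, so an unset
// field is treated as if VXLAN were enabled.
func usesVXLAN(spec IPPoolSpec) bool {
	return spec.VXLANMode != VXLANModeNever
}

func main() {
	// A pool created before the field existed: VXLANMode is empty.
	legacy := IPPoolSpec{CIDR: "192.168.0.0/16"} // example CIDR, not from the issue

	fmt.Println(usesVXLAN(legacy))                   // prints true
	fmt.Println(usesVXLAN(defaultVXLANMode(legacy))) // prints false
}
```

With defaulting applied wherever pools are read, any consumer of the client would see VXLANMode: Never for legacy pools, so each caller would not need its own empty-string check.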
Thanks @caseydavenport, so it seems like the issue is indeed that. I’ll work on the fix!
@r0bj is using etcd mode and used calicoctl to get the felixconfig and IP pools; @mikesplain is using kdd and used kubectl. So yes, two different clusters with the same issue, I think…
@coutinhop Sure, here is the calico-node log with LogSeverityScreen: "Debug": calico-node.log