rancher: Canal readiness probe failed with statuscode 503 for k8s 1.15

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (least amount of steps as possible): Create an AWS cluster with the default options in Rancher v2.3.0

Result: Once the cluster has been created, navigate to the System project, then click on Canal under the kube-system namespace. Click on one of the canal pods and open the Events tab. Observe the Unhealthy warning for the readiness probe. A screenshot of this behavior is pasted below.

[screenshot: Unhealthy readiness probe warning on the canal pod's Events tab]
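The same warning can also be seen from the CLI instead of the UI. As a rough sketch (the label selector and the pod name placeholder are assumptions based on the usual canal DaemonSet labels, not taken from this issue):

kubectl -n kube-system get pods -l k8s-app=canal
kubectl -n kube-system get events --field-selector reason=Unhealthy
kubectl -n kube-system describe pod <canal-pod-name>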

Logs from the calico-node container:

2019-10-14 18:44:22.732 [INFO][8] startup.go 256: Early log level set to info
2019-10-14 18:44:22.732 [INFO][8] startup.go 272: Using NODENAME environment for node name
2019-10-14 18:44:22.732 [INFO][8] startup.go 284: Determined node name: aws3
2019-10-14 18:44:22.734 [INFO][8] k8s.go 228: Using Calico IPAM
2019-10-14 18:44:22.734 [INFO][8] startup.go 316: Checking datastore connection
2019-10-14 18:44:22.744 [INFO][8] startup.go 340: Datastore connection verified
2019-10-14 18:44:22.744 [INFO][8] startup.go 95: Datastore is ready
2019-10-14 18:44:22.772 [INFO][8] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019-10-14 18:44:22.778 [INFO][8] startup.go 181: Using node name: aws3
2019-10-14 18:44:22.809 [INFO][15] k8s.go 228: Using Calico IPAM
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2019-10-14 18:44:24.038 [WARNING][33] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-10-14 18:44:24.074 [WARNING][33] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-563577510): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-563577510
Error: failed to open object file
2019-10-14 18:44:44.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614e671242a664, ext:20426272735, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:45:10.016 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614e6d698916e9, ext:45816765016, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:47:04.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614e8a10a0555d, ext:160398857927, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:47:24.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614e8f04b5ad88, ext:180198930154, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:47:50.016 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614e9568781150, ext:205798872324, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:49:44.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614eb2118552ef, ext:320413865069, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:50:04.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614eb704b6187f, ext:340198957640, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:50:30.017 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614ebd687a5ed0, ext:365799023174, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:52:24.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614eda118b4aee, ext:480414256219, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:52:44.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614edf04b5d5b4, ext:500198940515, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:53:10.016 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614ee568774cab, ext:525798821911, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:55:04.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614f0210ad7f44, ext:640399720623, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:55:24.432 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614f0705adeec1, ext:660215199869, loc:(*time.Location)(0x2b08080)}}
2019-10-14 18:55:50.017 [WARNING][33] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf614f0d687ee33d, ext:685799319221, loc:(*time.Location)(0x2b08080)}}
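For context, the 503 in the probe events comes from Felix's HTTP health endpoint, which reports not-ready while a reporter such as async_calc_graph is failing (as in the log above). Assuming the default Felix health port of 9099, you can query it directly on an affected node to see the same result:

curl -i http://localhost:9099/readiness
curl -i http://localhost:9099/liveness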

Other details that may be helpful: This does not happen in Rancher v2.2.8

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): Rancher v2.3.0
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): AWS
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): VM, t2.large
  • Kubernetes version (use kubectl version): 1.14.6

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Hi Guys,

I just installed a new cluster with Canal via RKE and ran into this exact same bug using the “latest” Helm chart.

I just tried the “solution” that @superseb kindly shared and can confirm it resolves the issue.

I have pasted the events log below for confirmation.

Namespace | Type | Reason | Object | Message | Age
-- | -- | -- | -- | -- | --
kube-system | Normal | Created | canal-wg6pv | Created container calico-node | 5 minutes ago
kube-system | Normal | Started | canal-wg6pv | Started container calico-node | 5 minutes ago
kube-system | Normal | Pulled | canal-wg6pv | Container image "rancher/coreos-flannel:v0.11.0" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-wg6pv | Created container kube-flannel | 5 minutes ago
kube-system | Normal | Started | canal-wg6pv | Started container kube-flannel | 5 minutes ago
kube-system | Normal | Started | canal-wg6pv | Started container install-cni | 5 minutes ago
kube-system | Normal | Pulled | canal-wg6pv | Container image "rancher/calico-node:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Pulled | canal-wg6pv | Container image "rancher/calico-cni:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-wg6pv | Created container install-cni | 5 minutes ago
kube-system | Normal | Scheduled | canal-wg6pv | Successfully assigned kube-system/canal-wg6pv to rmgr03 | 5 minutes ago
kube-system | Normal | Killing | canal-x2npx | Stopping container calico-node | 5 minutes ago
kube-system | Normal | Killing | canal-x2npx | Stopping container kube-flannel | 5 minutes ago
kube-system | Normal | SuccessfulDelete | canal | Deleted pod: canal-x2npx | 5 minutes ago
kube-system | Normal | SuccessfulCreate | canal | Created pod: canal-wg6pv | 5 minutes ago
kube-system | Normal | Created | canal-rw55m | Created container calico-node | 5 minutes ago
kube-system | Normal | Started | canal-rw55m | Started container calico-node | 5 minutes ago
kube-system | Normal | Pulled | canal-rw55m | Container image "rancher/coreos-flannel:v0.11.0" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-rw55m | Created container kube-flannel | 5 minutes ago
kube-system | Normal | Started | canal-rw55m | Started container kube-flannel | 5 minutes ago
kube-system | Normal | Pulled | canal-rw55m | Container image "rancher/calico-node:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Pulled | canal-rw55m | Container image "rancher/calico-cni:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-rw55m | Created container install-cni | 5 minutes ago
kube-system | Normal | Started | canal-rw55m | Started container install-cni | 5 minutes ago
kube-system | Normal | Killing | canal-6xkx6 | Stopping container calico-node | 5 minutes ago
kube-system | Normal | Scheduled | canal-rw55m | Successfully assigned kube-system/canal-rw55m to rmgr01 | 5 minutes ago
kube-system | Normal | SuccessfulDelete | canal | Deleted pod: canal-6xkx6 | 5 minutes ago
kube-system | Normal | SuccessfulCreate | canal | Created pod: canal-rw55m | 5 minutes ago
kube-system | Normal | Started | canal-gsvht | Started container calico-node | 5 minutes ago
kube-system | Normal | Pulled | canal-gsvht | Container image "rancher/coreos-flannel:v0.11.0" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-gsvht | Created container kube-flannel | 5 minutes ago
kube-system | Normal | Started | canal-gsvht | Started container kube-flannel | 5 minutes ago
kube-system | Normal | Started | canal-gsvht | Started container install-cni | 5 minutes ago
kube-system | Normal | Pulled | canal-gsvht | Container image "rancher/calico-node:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-gsvht | Created container calico-node | 5 minutes ago
kube-system | Normal | Pulled | canal-gsvht | Container image "rancher/calico-cni:v3.7.4" already present on machine | 5 minutes ago
kube-system | Normal | Created | canal-gsvht | Created container install-cni | 5 minutes ago
kube-system | Normal | Scheduled | canal-gsvht | Successfully assigned kube-system/canal-gsvht to rmgr02 | 5 minutes ago
kube-system | Normal | Killing | canal-lrzpb | Stopping container kube-flannel | 5 minutes ago
kube-system | Normal | SuccessfulDelete | canal | Deleted pod: canal-lrzpb | 5 minutes ago
kube-system | Normal | SuccessfulCreate | canal | Created pod: canal-gsvht | 5 minutes ago
kube-system | Warning | Unhealthy | canal-6xkx6 | Readiness probe failed: HTTP probe failed with statuscode: 503 | 6 minutes ago
kube-system | Warning | Unhealthy | canal-x2npx | Readiness probe failed: HTTP probe failed with statuscode: 503 | 6 minutes ago
kube-system | Warning | Unhealthy | canal-lrzpb | Readiness probe failed: HTTP probe failed with statuscode: 503 | 6 minutes ago
cattle-system | Normal | Pulled | cattle-cluster-agent-5768f46d46-l8v5w | Successfully pulled image "rancher/rancher-agent:v2.3.1" | 12 minutes ago
cattle-system | Normal | Created | cattle-cluster-agent-5768f46d46-l8v5w | Created container cluster-register | 12 minutes ago
cattle-system | Normal | Started | cattle-cluster-agent-5768f46d46-l8v5w | Started container cluster-register | 12 minutes ago
cattle-system | Normal | Created | cattle-node-agent-wf2w8 | Created container agent | 12 minutes ago
cattle-system | Normal | Started | cattle-node-agent-wf2w8 | Started container agent | 12 minutes ago
cattle-system | Normal | Created | cattle-node-agent-kljlt | Created container agent | 12 minutes ago
cattle-system | Normal | Started | cattle-node-agent-kljlt | Started container agent | 12 minutes ago

The fix for this is patching the canal DaemonSet and creating the NetworkSet CRD (this will recreate the canal pods). Use at your own risk until it gets verified and released (test envs only):

kubectl -n kube-system patch daemonset/canal -p '{"spec": {"template": {"spec": {"containers": [{"name": "calico-node", "env": [{"name": "USE_POD_CIDR", "value": "true"}]}]}}}}'
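A quick way to confirm the patch was applied (my own check, not part of the original instructions) is to read the env var back from the DaemonSet; it should show USE_POD_CIDR set to true:

kubectl -n kube-system describe daemonset canal | grep USE_POD_CIDR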

Save the following as crd.yml and run kubectl create -f crd.yml in the cluster:

---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networksets.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkSet
    plural: networksets
    singular: networkset
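Once the CRD exists and the pods have been recreated, the following checks (again, just a suggestion; the k8s-app=canal label is an assumption) should show the NetworkSet CRD present and all canal pods Ready:

kubectl get crd networksets.crd.projectcalico.org
kubectl -n kube-system rollout status daemonset/canal
kubectl -n kube-system get pods -l k8s-app=canal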

Please let me know if this solves the issue while we investigate further:

Save the following as crd.yml and run kubectl create -f crd.yml in the cluster:

---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: ipamblocks.crd.projectcalico.org
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: IPAMBlock
    plural: ipamblocks
    singular: ipamblock
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: networksets.crd.projectcalico.org
spec:
  scope: Namespaced
  group: crd.projectcalico.org
  version: v1
  names:
    kind: NetworkSet
    plural: networksets
    singular: networkset
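If it helps with verification (my addition, not part of the original comment), the two CRDs from the snippet above can be checked with:

kubectl get crd ipamblocks.crd.projectcalico.org networksets.crd.projectcalico.org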