rke: Calico node networking errors
RKE version:
v0.2.8
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
CentOS 7.6 Kernel 3.10.0-957.1.3.el7.x86_64 and CentOS 7.6 Kernel 3.10.0-957.27.2.el7.x86_64
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack
cluster.yml file:
# Nodes: this is the only required configuration. Everything else is optional.
nodes:
  # Controlplane & etcd nodes
  - address: 10.253.10.7
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  - address: 10.253.10.8
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  - address: 10.253.10.9
    user: ansible
    role:
      - controlplane
      - etcd
    hostname_override: xxxxxxx
  # Worker nodes
  - address: 10.253.10.6
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx
  - address: 10.253.10.4
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx
  - address: 10.253.10.5
    user: ansible
    role:
      - worker
    hostname_override: xxxxxxx
# Enable use of SSH agent to use SSH private keys with passphrase
# This requires the environment `SSH_AUTH_SOCK` configured pointing to your SSH agent which has the private key added
ssh_agent_auth: true
# Set the name of the Kubernetes cluster
cluster_name: xxxxxxxxxxxx
# Check out the Kubernetes version support on the rancher/rke GitHub page: https://github.com/rancher/rke/releases/
kubernetes_version: v1.15.3-rancher1-1
services:
  etcd:
    backup_config:
      interval_hours: 12
      retention: 6
  kube-api:
    # IP range for any services created on Kubernetes
    # This must match the service_cluster_ip_range in kube-controller
    service_cluster_ip_range: 10.21.0.0/16
    # Expose a different port range for NodePort services
    service_node_port_range: 30000-32767
    pod_security_policy: false
    extra_args:
      oidc-client-id: "spn:xxxxxxxxxx"
      oidc-issuer-url: "https://sts.windows.net/xxxxxxxxxx/"
      oidc-username-claim: "upn"
      oidc-groups-claim: "groups"
      v: 2
  kube-controller:
    # CIDR pool used to assign IP addresses to pods in the cluster
    cluster_cidr: 10.20.0.0/16
    # IP range for any services created on Kubernetes
    # This must match the service_cluster_ip_range in kube-api
    service_cluster_ip_range: 10.21.0.0/16
    extra_args:
      v: 2
  kubelet:
    # Base domain for the cluster
    cluster_domain: xxxxxxxxxxx
    # IP address for the DNS service endpoint
    cluster_dns_server: 10.21.0.10
    # Fail if swap is on
    fail_swap_on: true
    extra_args:
      v: 2
# Currently, the only supported authentication strategy is x509.
# You can optionally create additional SANs (hostnames or IPs) to add to
# the API server PKI certificate.
# This is useful if you want to use a load balancer for the control plane servers.
authentication:
  strategy: x509 # Use x509 for cluster administrator credentials and keep them very safe after you've created them
  sans:
    - "xxx.xxx.xxx.xxx"
cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: xxxxxxxx
      password: xxxxxxxx
      auth-url: xxxxxxx
      tenant-id: xxxxxxx
      domain-id: default
    load_balancer:
      subnet-id: 88a8968f-2d6d-494e-a67e-dab207d068f0
    block_storage:
      bs-version: v3
      trust-device-path: false
      ignore-volume-az: false
# There are several network plug-ins that work, but we default to canal
network:
  plugin: canal
# Specify DNS provider (coredns or kube-dns)
dns:
  provider: coredns
# We disable the ingress controller deployment because we are going to run multiple ingress controllers with our own configuration
ingress:
  provider: none
# All add-on manifests MUST specify a namespace
# addons: ''
# addons_include: []
Steps to Reproduce:
Deploy an empty cluster with RKE
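(With the cluster.yml above in the working directory, that amounts to:)

rke up --config cluster.yml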
Results:
2019-08-29 14:26:48.610 [INFO][9] startup.go 256: Early log level set to info
2019-08-29 14:26:48.610 [INFO][9] startup.go 272: Using NODENAME environment for node name
2019-08-29 14:26:48.610 [INFO][9] startup.go 284: Determined node name: nlsvpkubec01
2019-08-29 14:26:48.614 [INFO][9] k8s.go 228: Using Calico IPAM
2019-08-29 14:26:48.614 [INFO][9] startup.go 316: Checking datastore connection
2019-08-29 14:26:48.630 [INFO][9] startup.go 340: Datastore connection verified
2019-08-29 14:26:48.630 [INFO][9] startup.go 95: Datastore is ready
2019-08-29 14:26:48.655 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2019-08-29 14:26:48.661 [INFO][9] startup.go 181: Using node name: nlsvpkubec01
2019-08-29 14:26:48.693 [INFO][18] k8s.go 228: Using Calico IPAM
CALICO_NETWORKING_BACKEND is none - no BGP daemon running
Calico node started successfully
2019-08-29 14:26:49.845 [WARNING][38] int_dataplane.go 354: Failed to query VXLAN device error=Link not found
2019-08-29 14:26:49.881 [WARNING][38] int_dataplane.go 384: Failed to cleanup preexisting XDP state error=failed to load XDP program (/tmp/felix-xdp-942558251): stat /sys/fs/bpf/calico/xdp/prefilter_v1_calico_tmp_A: no such file or directory
libbpf: failed to get EHDR from /tmp/felix-xdp-942558251
Error: failed to open object file
2019-08-29 14:27:03.250 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf52160db4d62435, ext:13105494327, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:28:26.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf521622a8ce8c8a, ext:96903670157, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:29:36.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf5216341ce3e9fd, ext:166703743746, loc:(*time.Location)(0x2b08080)}}
2019-08-29 14:31:06.819 [WARNING][38] health.go 190: Reporter failed readiness checks name="async_calc_graph" reporter-state=&health.reporterState{name:"async_calc_graph", reports:health.HealthReport{Live:true, Ready:true}, timeout:20000000000, latest:health.HealthReport{Live:true, Ready:false}, timestamp:time.Time{wall:0xbf52164aa8e3ca35, ext:256905062112, loc:(*time.Location)(0x2b08080)}}
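Diagnostics commonly run against the two warnings above (a sketch, not part of the original report; under canal the VXLAN device belongs to flannel):

ip -d link show type vxlan     # list VXLAN devices; "Link not found" means Felix found no device to query
mount | grep /sys/fs/bpf       # the XDP cleanup warning expects the BPF filesystem to be mounted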
About this issue
- State: closed
- Created 5 years ago
- Reactions: 5
- Comments: 25 (2 by maintainers)
Any resolution to this? I’m seeing this in one of our test clusters we just upgraded to 1.15.5 using Rancher 2.2.9.
Since upgrading to Rancher v2.3.4 and Kubernetes v1.17.0-rancher1-2 I’m getting Calico errors on some of my nodes—the ones that happen to be virtual machines (Hyper-V). Bare metal ones are fine.
Pod:
canal-xyzabc, container calico-node (image rancher/calico-node:v3.10.2). This seems to be this issue: https://github.com/coreos/flannel/issues/1321
Adding a file /etc/systemd/network/50-flannel.link with the content shown below should fix the issue; it can be deployed e.g. with Ignition. For more context, see the linked flannel issue.
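The referenced file content, reproduced from the linked flannel issue (verify against upstream), can be written as follows:

# Keep systemd-udevd (v242+) from rewriting flannel's MAC address,
# which breaks VXLAN; this is the fix from coreos/flannel#1321.
cat > /etc/systemd/network/50-flannel.link <<'EOF'
[Match]
OriginalName=flannel*

[Link]
MACAddressPolicy=none
EOF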
Hello, we hit the same error when deploying 1.15.3 with canal. We haven’t seen this error with older k8s versions and canal, nor with 1.15.3 and Calico.
I think this is related to https://github.com/projectcalico/calico/issues/2191. I fixed it by disabling IPv6 on the node.
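For reference, a common way to disable IPv6 on a CentOS 7 node is via sysctl (a sketch; the commenter's exact steps are requested below):

sysctl -w net.ipv6.conf.all.disable_ipv6=1        # runtime
sysctl -w net.ipv6.conf.default.disable_ipv6=1
echo 'net.ipv6.conf.all.disable_ipv6 = 1' > /etc/sysctl.d/99-disable-ipv6.conf       # persist across reboots
echo 'net.ipv6.conf.default.disable_ipv6 = 1' >> /etc/sysctl.d/99-disable-ipv6.conf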
@imle Could you please provide the exact steps you took?
I had this issue as well. I generated an empty config and copied over the new container versions (roughly as sketched below), and that seems to have resolved everything for me.
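A sketch of that approach (flag names per rke config --help; treat the exact invocation as an assumption):

rke config --empty --name cluster-fresh.yml   # write an empty template with this rke build's defaults
rke config --system-images                    # print the default system images (container versions)
# copy the updated image versions into the existing cluster.yml, then:
rke up --config cluster.yml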
This problem seems to be present with Rancher 2.3.0 and Kubernetes 1.15.4.
@olivierlemasle Thank you! This appears to solve our issues!
On a sandbox cluster that had this problem, I was able to recover by doing the following (just fishing, as nothing else worked). I advise against trying this unless you are quite sure you can live with a failed cluster, but it worked for me.
Please see https://github.com/rancher/rancher/issues/23430#issuecomment-542611269 and let me know if it resolves the issue.