rke: Job [rke-network-plugin-deploy-job] deletion timed out.

Job [rke-network-plugin-deploy-job] deletion times out while restoring a snapshot. However, this only happens with one particular snapshot; restoring any other snapshot works without errors.

RKE version: I’m using RKE v1.1.3

Docker version: Docker version 19.03.15, build 99e3ed8919

Operating system and kernel: PRETTY_NAME="Debian GNU/Linux 10 (buster)" NAME="Debian GNU/Linux" VERSION_ID="10" VERSION="10 (buster)" VERSION_CODENAME=buster ID=debian

Type/provider of hosts: ESXi Virtual Machines

cluster.yml file:

```yaml
nodes:
- address: 172.31.21.18
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 172.31.21.19
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 172.31.21.20
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.58
  nginx_proxy: rancher/rke-tools:v0.1.58
  cert_downloader: rancher/rke-tools:v0.1.58
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.58
  kubedns: rancher/k8s-dns-kube-dns:1.15.2
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.2
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.2
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.9
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.3-rancher2
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.32.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: /root/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 5000
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
restore:
  restore: false
  snapshot_name: ""
dns: null
```

Steps to Reproduce: The snapshot is stored locally on the etcd node in /opt/rke/etcd-snapshots/. Running rke etcd snapshot-restore --name <NAME> fails with FATA[0346] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value. This only happens with one particular snapshot; restoring another snapshot works fine.
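
For reference, the restore was invoked roughly as follows (a sketch assuming cluster.yml sits in the working directory; <NAME> is the snapshot-name placeholder from the report):

```sh
# The snapshot file lives on the etcd node under /opt/rke/etcd-snapshots/.
rke etcd snapshot-restore --config cluster.yml --name <NAME>
# ...which fails with:
# FATA[0346] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value
```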

Results:

```
INFO[0375] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0376] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0377] Removing container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0377] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0377] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.18]
INFO[0378] Removing container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0378] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0378] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.19]
INFO[0379] Starting container [kube-proxy] on host [172.31.21.18], try #1
INFO[0380] Starting container [kube-proxy] on host [172.31.21.19], try #1
INFO[0380] [worker] Successfully started [kube-proxy] container on host [172.31.21.18]
INFO[0380] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.18]
INFO[0381] [worker] Successfully started [kube-proxy] container on host [172.31.21.19]
INFO[0381] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.19]
INFO[0383] [healthcheck] service [kubelet] on host [172.31.21.20] is healthy
INFO[0383] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0386] Starting container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0386] [healthcheck] service [kube-proxy] on host [172.31.21.18] is healthy
INFO[0386] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0387] [healthcheck] service [kube-proxy] on host [172.31.21.19] is healthy
INFO[0387] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0387] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0389] Starting container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0390] Removing container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0390] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0390] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.20]
INFO[0391] Starting container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0391] Starting container [kube-proxy] on host [172.31.21.20], try #1
INFO[0392] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0393] [worker] Successfully started [kube-proxy] container on host [172.31.21.20]
INFO[0393] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.20]
INFO[0393] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0393] Removing container [rke-log-linker] on host [172.31.21.18], try #1
INFO[0394] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0395] Removing container [rke-log-linker] on host [172.31.21.19], try #1
INFO[0395] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0398] [healthcheck] service [kube-proxy] on host [172.31.21.20] is healthy
INFO[0398] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0400] Starting container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0402] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0402] Removing container [rke-log-linker] on host [172.31.21.20], try #1
INFO[0403] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0403] [worker] Successfully started Worker Plane…
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0406] Starting container [rke-log-cleaner] on host [172.31.21.20], try #1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.19], try #1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.18], try #1
INFO[0408] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.20]
INFO[0408] Removing container [rke-log-cleaner] on host [172.31.21.20], try #1
INFO[0409] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.20]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.18]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.19]
INFO[0410] Removing container [rke-log-cleaner] on host [172.31.21.18], try #1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.18]
INFO[0411] Removing container [rke-log-cleaner] on host [172.31.21.19], try #1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.19]
INFO[0411] [sync] Syncing nodes Labels and Taints
INFO[0411] [sync] Successfully synced nodes Labels and Taints
INFO[0411] [network] Setting up network plugin: canal
INFO[0411] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Executing deploy job rke-network-plugin
FATA[0500] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value
```

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

It’s more about the type of disks, in this case SAS. Based on the fio output, the disks are not sufficient, and there is not really a bug here. It seems etcd is just slow, causing the cluster not to be ready within the default timeout. Raising this timeout can help, or put the machine on a hypervisor with SSD disks, which will have better access time/IOPS. Let me know if there is anything else you want to know.

Yeah, this doesn’t look sufficient. What storage is attached to the VMware cluster that is used for the machine disks?

It seems the machines are not fast enough for the pods to come up before the timeout. Raising the timeout or running rke up again gives the pods more time to become ready. What Rancher version have you been running on this cluster?

Also, please post full specifications of the machines (CPU/memory/disk type and IOPS) to determine if the machines are adequate. See https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd for more info.
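
For reference, the etcd-oriented fio test from the linked article is roughly the following (the directory and job name are placeholders); the number to look at is the fdatasync 99th-percentile latency, for which the commonly cited etcd target is under roughly 10 ms:

```sh
# Rough etcd disk-latency test (per the linked article); run it on the disk backing etcd's data dir.
mkdir -p test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data \
    --size=22m --bs=2300 --name=etcd-disk-test
# Check the fsync/fdatasync latency percentiles in the output, especially the 99th percentile.
rm -rf test-data
```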

The screenshot does not show the exact commands that I posted earlier; it will not work if you don’t use the exact commands. The same goes for the two commands regarding endpoint: please use the exact commands from the earlier comment and from the troubleshooting page, and if a command does not work, share the error along with the command you used.

Based on the logs, the storage seems not fast enough (or at least the log shows long execution times for a lot of requests). What kind/type of disks are in the server?

The part where the etcd snapshot gets restored does not show any errors. What is the output when you run the etcdctl commands, and what are the logs from the etcd container? If the snapshot can’t be restored successfully, RKE should log that and error out, but I need to know what the state of etcd is.
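
For reference, a minimal sketch of pulling that information on the etcd node (the container name etcd matches the `docker exec etcd ...` commands used elsewhere in this issue; the tail length is arbitrary):

```sh
# Inspect the restored etcd on the etcd node.
docker logs --tail 200 etcd            # recent etcd container logs
docker exec etcd etcdctl member list   # confirm the member list after the restore
```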

Please include the full log of the snapshot-restore action with --debug included so we get debug logging.
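
As a rough sketch of capturing that (--debug is RKE’s global debug-logging flag; the log file name is only an example):

```sh
# Re-run the restore with debug logging and keep a full copy of the output.
rke --debug etcd snapshot-restore --config cluster.yml --name <NAME> 2>&1 | tee rke-restore-debug.log
```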

Please share the output of:

  • ls -la /opt/rke/etcd-snapshots/NAME (for the size of the snapshot)
  • ls -la of the snapshot file itself inside the archive (if possible)
  • When restore is done using the snapshot, please run the following to get a resource count overview of what is in the snapshot: docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -n
  • Also the output of the commands from that page that involve endpoint, i.e. endpoint health and endpoint status (see the sketch after this list): https://rancher.com/docs/rancher/v2.x/en/troubleshooting/kubernetes-components/etcd/
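
For reference, on an etcd node those endpoint checks boil down to something like this single-member form (the troubleshooting page also documents a multi-member variant that passes all client URLs via --endpoints):

```sh
# Basic etcd health/status checks run inside RKE's etcd container.
docker exec etcd etcdctl endpoint health
docker exec etcd etcdctl endpoint status --write-out=table
```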

Apparently that snapshot has a lot of resources that take time or resources to load/process, causing the timeout to be hit. In the v1.2 line we already raised the timeout in https://github.com/rancher/rke/pull/2289; for now you can manually raise the timeout. If you want to know why this happens with this snapshot, I either need the snapshot, or I can give you commands to get an idea of why it might happen.
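
As a rough sketch of the manual workaround (the value 120 is only an example; pick something higher than the current setting in your cluster.yml):

```sh
# In cluster.yml, raise addon_job_timeout (the value is in seconds), e.g.
#
#   addon_job_timeout: 120
#
# then re-run the restore (or rke up) so RKE waits longer for the addon deploy job:
rke etcd snapshot-restore --config cluster.yml --name <NAME>
```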