rke: Job [rke-network-plugin-deploy-job] deletion timed out.
Job [rke-network-plugin-deploy-job] deletion times out during a snapshot restore. This only happens with one particular snapshot; restoring other snapshots works without any errors.
RKE version: v1.1.3
Docker version: Docker version 19.03.15, build 99e3ed8919
Operating system and kernel: PRETTY_NAME="Debian GNU/Linux 10 (buster)" NAME="Debian GNU/Linux" VERSION_ID="10" VERSION="10 (buster)" VERSION_CODENAME=buster ID=debian
Type/provider of hosts: VMware ESXi virtual machines
cluster.yml file:
nodes:
- address: 172.31.21.18
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 172.31.21.19
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: 172.31.21.20
  port: "22"
  internal_address: ""
  role:
  - worker
  hostname_override: ""
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /root/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.3-rancher1
  alpine: rancher/rke-tools:v0.1.58
  nginx_proxy: rancher/rke-tools:v0.1.58
  cert_downloader: rancher/rke-tools:v0.1.58
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.58
  kubedns: rancher/k8s-dns-kube-dns:1.15.2
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.2
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.2
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  coredns: rancher/coredns-coredns:1.6.9
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.7.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.3-rancher2
  flannel: rancher/coreos-flannel:v0.12.0
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.13.4
  calico_cni: rancher/calico-cni:v3.13.4
  calico_controllers: rancher/calico-kube-controllers:v3.13.4
  calico_ctl: rancher/calico-ctl:v3.13.4
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  canal_node: rancher/calico-node:v3.13.4
  canal_cni: rancher/calico-cni:v3.13.4
  canal_flannel: rancher/coreos-flannel:v0.12.0
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.13.4
  weave_node: weaveworks/weave-kube:2.6.4
  weave_cni: weaveworks/weave-npc:2.6.4
  pod_infra_container: rancher/pause:3.1
  ingress: rancher/nginx-ingress-controller:nginx-0.32.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.3.6
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.4
ssh_key_path: /root/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
kubernetes_version: ""
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
addon_job_timeout: 5000
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
restore:
  restore: false
  snapshot_name: ""
dns: null
Steps to Reproduce:
- The snapshot is stored locally on the etcd node in /opt/rke/etcd-snapshots/.
- Running rke etcd snapshot-restore --name <NAME> fails with FATA[0346] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value. This happens only with this one particular snapshot; restoring another snapshot works fine.
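For reference, a sketch of the same restore re-run with RKE's debug logging turned on (the snapshot name <NAME> and the cluster.yml path are placeholders; --debug is RKE's global verbose flag):

  rke --debug etcd snapshot-restore --name <NAME> --config cluster.yml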
Results:
INFO[0375] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0376] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0377] Removing container [rke-log-linker] on host [172.31.21.18], try rancher/rke#1
INFO[0377] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0377] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.18]
INFO[0378] Removing container [rke-log-linker] on host [172.31.21.19], try rancher/rke#1
INFO[0378] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0378] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.19]
INFO[0379] Starting container [kube-proxy] on host [172.31.21.18], try rancher/rke#1
INFO[0380] Starting container [kube-proxy] on host [172.31.21.19], try rancher/rke#1
INFO[0380] [worker] Successfully started [kube-proxy] container on host [172.31.21.18]
INFO[0380] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.18]
INFO[0381] [worker] Successfully started [kube-proxy] container on host [172.31.21.19]
INFO[0381] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.19]
INFO[0383] [healthcheck] service [kubelet] on host [172.31.21.20] is healthy
INFO[0383] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0386] Starting container [rke-log-linker] on host [172.31.21.20], try rancher/rke#1
INFO[0386] [healthcheck] service [kube-proxy] on host [172.31.21.18] is healthy
INFO[0386] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0387] [healthcheck] service [kube-proxy] on host [172.31.21.19] is healthy
INFO[0387] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0387] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0389] Starting container [rke-log-linker] on host [172.31.21.18], try rancher/rke#1
INFO[0390] Removing container [rke-log-linker] on host [172.31.21.20], try rancher/rke#1
INFO[0390] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0390] Image [rancher/hyperkube:v1.18.3-rancher2] exists on host [172.31.21.20]
INFO[0391] Starting container [rke-log-linker] on host [172.31.21.19], try rancher/rke#1
INFO[0391] Starting container [kube-proxy] on host [172.31.21.20], try rancher/rke#1
INFO[0392] [worker] Successfully started [rke-log-linker] container on host [172.31.21.18]
INFO[0393] [worker] Successfully started [kube-proxy] container on host [172.31.21.20]
INFO[0393] [healthcheck] Start Healthcheck on service [kube-proxy] on host [172.31.21.20]
INFO[0393] [worker] Successfully started [rke-log-linker] container on host [172.31.21.19]
INFO[0393] Removing container [rke-log-linker] on host [172.31.21.18], try rancher/rke#1
INFO[0394] [remove/rke-log-linker] Successfully removed container on host [172.31.21.18]
INFO[0395] Removing container [rke-log-linker] on host [172.31.21.19], try rancher/rke#1
INFO[0395] [remove/rke-log-linker] Successfully removed container on host [172.31.21.19]
INFO[0398] [healthcheck] service [kube-proxy] on host [172.31.21.20] is healthy
INFO[0398] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0400] Starting container [rke-log-linker] on host [172.31.21.20], try rancher/rke#1
INFO[0402] [worker] Successfully started [rke-log-linker] container on host [172.31.21.20]
INFO[0402] Removing container [rke-log-linker] on host [172.31.21.20], try rancher/rke#1
INFO[0403] [remove/rke-log-linker] Successfully removed container on host [172.31.21.20]
INFO[0403] [worker] Successfully started Worker Plane…
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.20]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.18]
INFO[0403] Image [rancher/rke-tools:v0.1.58] exists on host [172.31.21.19]
INFO[0406] Starting container [rke-log-cleaner] on host [172.31.21.20], try rancher/rke#1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.19], try rancher/rke#1
INFO[0407] Starting container [rke-log-cleaner] on host [172.31.21.18], try rancher/rke#1
INFO[0408] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.20]
INFO[0408] Removing container [rke-log-cleaner] on host [172.31.21.20], try rancher/rke#1
INFO[0409] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.20]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.18]
INFO[0409] [cleanup] Successfully started [rke-log-cleaner] container on host [172.31.21.19]
INFO[0410] Removing container [rke-log-cleaner] on host [172.31.21.18], try rancher/rke#1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.18]
INFO[0411] Removing container [rke-log-cleaner] on host [172.31.21.19], try rancher/rke#1
INFO[0411] [remove/rke-log-cleaner] Successfully removed container on host [172.31.21.19]
INFO[0411] [sync] Syncing nodes Labels and Taints
INFO[0411] [sync] Successfully synced nodes Labels and Taints
INFO[0411] [network] Setting up network plugin: canal
INFO[0411] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0412] [addons] Executing deploy job rke-network-plugin
FATA[0500] [k8s] Job [rke-network-plugin-deploy-job] deletion timed out. Consider increasing addon_job_timeout value
About this issue
- State: closed
- Created 3 years ago
- Comments: 16 (8 by maintainers)
It's more about the type of disks, in this case SAS. Based on the fio output, the disks are not sufficient and there is not really a bug here. It seems etcd is just slow, causing the cluster not to be ready within the default timeout. Raising this timeout can help, or put the machine on a hypervisor with SSD disks, which will have better access times/IOPS. Let me know if there is anything else you want to know.
Yeah, this doesn't look sufficient. What storage is attached to the VMware cluster that backs the machine disks?
It seems the machines are not fast enough to make sure the pods come up in time before the timeout. Raising the timeout or running rke up again gives more time for the pods to become ready. What Rancher version have you been running on this cluster? Also, please post the full specifications of the machines (CPU/memory/disk type and IOPS) to determine whether the machines are adequate. See https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd for more info.
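For reference, a minimal sketch of the fio test that article describes for checking whether a disk is fast enough for etcd (the target directory is illustrative; point it at the disk that backs /var/lib/etcd on the etcd nodes and check the reported fdatasync percentiles):

  mkdir -p /var/lib/etcd/fio-test
  fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/fio-test --size=22m --bs=2300 --name=etcd-disk-check
  rm -rf /var/lib/etcd/fio-test

As a rough guide from that article, the 99th percentile of fdatasync latency should stay around or below 10 ms for etcd to keep up.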
The screenshot does not show the exact commands that I posted earlier; it will not work if you don't use the exact commands. As with the other two commands regarding the endpoint, please use the exact commands from the earlier comment and from the troubleshooting page, and share the error if a command does not work (including the command you used).
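The exact commands referred to here are not quoted in this thread; as an illustration only, these are the kind of etcd checks the Rancher etcd troubleshooting page describes, assuming the etcd container on the node is named etcd:

  docker exec etcd etcdctl member list
  docker exec etcd etcdctl endpoint health
  docker exec etcd etcdctl endpoint status --write-out table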
Based on the logs, the storage seems to be not fast enough (or at least the log shows a long execute time for a lot of requests). What kind/type of disks are in the server?
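A sketch of one way to spot those slow requests in the etcd container log ("took too long" is etcd's standard slow-request warning; adjust --tail as needed):

  docker logs --tail 1000 etcd 2>&1 | grep -i "took too long"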
The part where the etcd snapshot gets restored does not show any errors. What is the output when you run the etcdctl commands? And what are the logs from the etcd container? If it can't be restored successfully, RKE should log that and error out, but I need to know what the state of etcd is. Please include the full log of the snapshot-restore action with --debug included so we get debug logging. Please share the output of:
- ls -la /opt/rke/etcd-snapshots/NAME (for the size of the snapshot)
- ls -la of the snapshot file itself inside the archive (if possible)
- docker exec etcd etcdctl get /registry --prefix=true --keys-only | grep -v ^$ | awk -F'/' '{ if ($3 ~ /cattle.io/) {h[$3"/"$4]++} else { h[$3]++ }} END { for(k in h) print h[k], k }' | sort -n

Apparently that snapshot has a lot of resources that take time or resources to load/process, causing the timeout to be hit. In the v1.2 line we already raised the timeout in https://github.com/rancher/rke/pull/2289; for now you can manually raise the timeout. If you want to know why this happens with this snapshot, I either need the snapshot, or I can give you commands to get an idea of why it might happen.
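A minimal sketch of manually raising that timeout and re-running the restore, assuming the restore is driven from the same cluster.yml shown above (the value 7200 seconds is illustrative; the config above already uses 5000):

  # in cluster.yml
  addon_job_timeout: 7200

  rke etcd snapshot-restore --name <NAME> --config cluster.yml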