rke: Kubelet timeout generating ImagePullBackOff error
TL;DR
Where is the kubelet config file on Rancher 2.6.9 - RKE1, like the one described in https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/? Can I manage it? Does this file even exist? I didn’t find it in /var/lib/kubelet:
# pwd
/var/lib/kubelet
# ls -lha
total 16K
drwxr-xr-x 9 root root 185 Sep 5 13:20 .
drwxr-xr-x. 42 root root 4.0K Sep 22 15:50 ..
-rw------- 1 root root 62 Sep 5 13:20 cpu_manager_state
drwxr-xr-x 2 root root 45 Nov 1 11:27 device-plugins
-rw------- 1 root root 61 Sep 5 13:20 memory_manager_state
drwxr-xr-x 2 root root 44 Sep 5 13:20 pki
drwxr-x--- 2 root root 6 Sep 5 13:20 plugins
drwxr-x--- 2 root root 6 Sep 5 13:20 plugins_registry
drwxr-x--- 2 root root 26 Nov 1 11:27 pod-resources
drwxr-x--- 11 root root 4.0K Oct 24 23:57 pods
drwxr-xr-x 2 root root 6 Sep 5 13:20 volumeplugins
Explain
Recently we’ve upgraded the Kubernetes version to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine at first, but we’ve since noticed a new behavior: if an image is too big, or takes more than 2 minutes to download, Kubernetes raises an ErrImagePull. To work around the error, I have to log in to the cluster node and run docker pull <image> manually.
Error: ImagePullBackOff
~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME READY STATUS RESTARTS AGE
test-imagepullback-c7fc59d86-gwtc7 0/1 ContainerCreating 0 2m
~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME READY STATUS RESTARTS AGE
test-imagepullback-c7fc59d86-gwtc7 0/1 ErrImagePull 0 2m1s
~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7
NAME READY STATUS RESTARTS AGE
test-imagepullback-c7fc59d86-gwtc7 0/1 ImagePullBackOff 0 2m12s
Investigating the problem, we discovered that the error is caused by a timeout on the kubelet’s runtime request (2 minutes by default, according to https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/), which should be adjustable with the --runtime-request-timeout duration flag. However, after changing cluster.yml with the parameters below, nothing happens:
[...]
  kubelet:
    extra_args:
      runtime-request-timeout: 10m
    fail_swap_on: false
[...]
The running process shows that the parameter does reach the kubelet command line:
# ps -ef | grep runtime-request-timeout
root 7286 7267 0 Nov01 ? 00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet {...} --runtime-request-timeout=10m {...}
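One way to confirm what the kubelet actually applied, as opposed to what was merely passed on its command line, is the kubelet's configz debug endpoint (enabled by default). A sketch, with the node name as a placeholder:
# dump the kubelet's effective configuration for a node (replace <node-name>)
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | grep -o '"runtimeRequestTimeout":"[^"]*"'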
According to the official page, this flag is deprecated, which might explain this behavior; to change the value I apparently need to set a parameter named runtimeRequestTimeout inside a kubelet config file (see the sketch after these questions for what such a file looks like). So I have some doubts:
- Where do I change it?
- Does this file exist in Rancher, or do I need to create it?
- Is there a way to work around this with another parameter in extra_args?
- Why is this happening only now? Is it because of the deprecation of dockershim?
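For reference, the kubelet config file format described in the upstream docs looks roughly like this; only an illustrative sketch, where the path and the 10m value are examples and not something RKE ships:
# e.g. /etc/kubernetes/kubelet-custom.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
runtimeRequestTimeout: "10m"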
I read these docs too, but with no success:
- https://rancher.com/docs/rke/latest/en/example-yamls/
- https://rancher.com/docs/rke/latest/en/config-options/services/services-extras/
- https://docs.ranchermanager.rancher.io/v2.5/faq/technical-items
Configs and current versions
K8s version:
# kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.4
RKE version:
# rke --version
rke version v1.3.15
Docker version: (docker version, docker info preferred)
# docker --version
Docker version 20.10.7, build f0df350
# docker info
Client:
Context: default
Debug Mode: false
Plugins:
app: Docker App (Docker Inc., v0.9.1-beta3)
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
scan: Docker Scan (Docker Inc., v0.12.0)
Server:
Containers: 20
Running: 20
Paused: 0
Stopped: 0
Images: 8
Server Version: 20.10.21
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1160.76.1.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 11.58GiB
Name: anchieta
ID: OZNJ:RKES:NTOH:G37X:4NYO:IJ3U:SKHO:FFG3:RJ7B:GCCJ:XOZN:NRHE
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
# uname -r
3.10.0-1160.76.1.el7.x86_64
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VMware
cluster.yml file:
nodes:
- address: host1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
[...]
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args:
      runtime-request-timeout: 30m
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
# Versions defined by kubernetes_version
#system_images:
# etcd: rancher/mirrored-coreos-etcd:v3.5.0
# alpine: rancher/rke-tools:v0.1.78
# nginx_proxy: rancher/rke-tools:v0.1.78
# cert_downloader: rancher/rke-tools:v0.1.78
# kubernetes_services_sidecar: rancher/rke-tools:v0.1.78
# kubedns: rancher/mirrored-k8s-dns-kube-dns:1.17.4
# dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.17.4
# kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.17.4
# kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
# coredns: rancher/mirrored-coredns-coredns:1.8.6
# coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.5
# nodelocal: rancher/mirrored-k8s-dns-node-cache:1.21.1
# #kubernetes: rancher/hyperkube:v1.24.4-rancher1
# flannel: rancher/mirrored-coreos-flannel:v0.15.1
# flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
# calico_node: rancher/mirrored-calico-node:v3.21.1
# calico_cni: rancher/mirrored-calico-cni:v3.21.1
# calico_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
# calico_ctl: rancher/mirrored-calico-ctl:v3.21.1
# calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
# canal_node: rancher/mirrored-calico-node:v3.21.1
# canal_cni: rancher/mirrored-calico-cni:v3.21.1
# canal_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
# canal_flannel: rancher/mirrored-coreos-flannel:v0.15.1
# canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
# weave_node: weaveworks/weave-kube:2.8.1
# weave_cni: weaveworks/weave-npc:2.8.1
# pod_infra_container: rancher/mirrored-pause:3.5
# ingress: rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
# ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
# ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v1.1.1
# metrics_server: rancher/mirrored-metrics-server:v0.5.1
# windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
# aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
# aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
# aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
# aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
# aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
# aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
# aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
# aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
enable_cri_dockerd: null
kubernetes_version: "v1.24.4-rancher1-1"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null
I would be grateful if this helps me and others solve this annoying issue.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 5
- Comments: 21 (6 by maintainers)
Hi @likku123 @gmanera @iTaybb @vinibodruch, sorry for the late reply, I was out sick for the past two weeks and just returned today. I am glad to see that you figured out the root cause and the fix! I can definitely update the cri-dockerd version used in rancher/rke-tools to v0.2.6. I will fit it into the team’s schedule and try to get the fix out ASAP, but sorry that I cannot guarantee a date. Thank you for your understanding.
And a caveat: AFAIK, the kubelet process does not auto-restart when changes are made in the config file, which means you need to restart the kubelet container after changes are made to the “external” config file.
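For example (a sketch, assuming the standard container name that RKE gives the kubelet):
# on the affected node
docker restart kubelet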
Rancher/RKE does not use the kubelet config file to configure kubelet, which sadly means you cannot find it anywhere. But it does not mean you cannot use it, and actually, you are very close to the final solution:
You need to set both extra_args and extra_binds to make it work. In your cluster.yml it will look like the sketch below. And of course, you need to create/put such a config file on the control plane node beforehand.
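A minimal sketch of what that could look like, assuming the config file is placed at /etc/kubernetes/kubelet-custom.yaml on the node (the path is only an example):
services:
  kubelet:
    extra_args:
      # point the kubelet at the external config file
      config: /etc/kubernetes/kubelet-custom.yaml
    extra_binds:
      # bind-mount the file from the host into the kubelet container
      - "/etc/kubernetes/kubelet-custom.yaml:/etc/kubernetes/kubelet-custom.yaml"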
I hope this is helpful.
This is definitely an issue with the cri-dockerd version that comes along with rke-tools. Right now the version is:
/opt/rke-tools/bin# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)
As per this link https://github.com/kubernetes/minikube/issues/14789#issuecomment-1233138069, cri-dockerd 0.2.6 is the patch which solves the timeout issue.
Any suggestions on how to deploy cri-dockerd 0.2.6 in my present setup?
@likku123 can you do the following checks on the kube-apiserver container in the control plane node:
- --config is set
If all the above look right, it means RKE has configured the kube-apiserver properly; then I would suspect an upstream issue or something wrong outside of RKE.
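One quick way to run that kind of check (a sketch) is to look at the container's command line on the node, for example:
# on the control plane node; "kube-apiserver" is the container name RKE creates
docker inspect kube-apiserver | grep -- '--config'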