rke: Kubelet timeout generating ImagePullBackOff error

TL;DR

Where is the kubelet config file on Rancher 2.6.9 - RKE1, like the one described at https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/? Can I manage it? Does this file exist? I didn't find it in /var/lib/kubelet:

# pwd
/var/lib/kubelet
# ls -lha
total 16K
drwxr-xr-x   9 root root  185 Sep  5 13:20 .
drwxr-xr-x. 42 root root 4.0K Sep 22 15:50 ..
-rw-------   1 root root   62 Sep  5 13:20 cpu_manager_state
drwxr-xr-x   2 root root   45 Nov  1 11:27 device-plugins
-rw-------   1 root root   61 Sep  5 13:20 memory_manager_state
drwxr-xr-x   2 root root   44 Sep  5 13:20 pki
drwxr-x---   2 root root    6 Sep  5 13:20 plugins
drwxr-x---   2 root root    6 Sep  5 13:20 plugins_registry
drwxr-x---   2 root root   26 Nov  1 11:27 pod-resources
drwxr-x---  11 root root 4.0K Oct 24 23:57 pods
drwxr-xr-x   2 root root    6 Sep  5 13:20 volumeplugins

Explain

Recently we upgraded the Kubernetes version to v1.24.4-rancher1-1 and Rancher to 2.6.9. Everything worked fine at first, but we have since noticed a new behavior: if an image is too big, or the download takes more than 2 minutes to complete, Kubernetes raises an ErrImagePull. To work around this error, I need to log in to the cluster node and run docker pull <image> manually.

Error: ImagePullBackOff

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                           
NAME                                 READY   STATUS              RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ContainerCreating   0          2m

~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                            
NAME                                 READY   STATUS         RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ErrImagePull   0          2m1s
                                                                                                                                                      
~ ❯ kubectl get pods -n mobile test-imagepullback-c7fc59d86-gwtc7                                                                                                           
NAME                                 READY   STATUS             RESTARTS   AGE
test-imagepullback-c7fc59d86-gwtc7   0/1     ImagePullBackOff   0          2m12s

Searching for the problem, we discovered that the error is caused by a timeout in the kubelet's runtime request (2 minutes by default, according to the docs at https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/), which can be increased with the --runtime-request-timeout duration flag. However, changing cluster.yml with the parameters below has no effect:

[...]
    kubelet:
      extra_args:
        runtime-request-timeout: 10m
      fail_swap_on: false
[...]

The running process shows that the parameter is reflected in the kubelet's command line:

# ps -ef | grep runtime-request-timeout
root      7286  7267  0 Nov01 ?        00:00:00 /bin/bash /opt/rke-tools/entrypoint.sh kubelet {...} --runtime-request-timeout=10m {...}

According to the official page, this flag is deprecated, which explains the behavior; to change the timeout I need to set a parameter named runtimeRequestTimeout inside a config file. So I have some questions:

  • Where do I change it?
  • Does this file exist in Rancher, or do I need to create it?
  • Is there a way to work around this with another parameter in extra_args?
  • Why is this happening only now? Is it because of the dockershim deprecation?

I read these docs too, but with no success:

Configs and current versions

K8s version:

#  kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.4

RKE version:

# rke --version
rke version v1.3.15

Docker version: (docker version, docker info preferred)

# docker --version
Docker version 20.10.7, build f0df350

# docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 20
  Running: 20
  Paused: 0
  Stopped: 0
 Images: 8
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 1c90a442489720eec95342e1789ee8a5e1b9536f
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
 Kernel Version: 3.10.0-1160.76.1.el7.x86_64
 Operating System: CentOS Linux 7 (Core)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 11.58GiB
 Name: anchieta
 ID: OZNJ:RKES:NTOH:G37X:4NYO:IJ3U:SKHO:FFG3:RJ7B:GCCJ:XOZN:NRHE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

# uname -r
3.10.0-1160.76.1.el7.x86_64

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) VMware

cluster.yml file:

nodes:
- address: host1
  port: "22"
  internal_address: ""
  role:
  - controlplane
  - etcd
  hostname_override: ""
  user: rancher
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: ~/.ssh/id_rsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []

[...]

services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 0
    gid: 0
    snapshot: null
    retention: ""
    creation: ""
    backup_config: null
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config: null
    audit_log: null
    admission_configuration: null
    event_rate_limit: null
  kube-controller:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
  kubelet:
    image: ""
    extra_args: 
      runtime-request-timeout: 30m
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: false
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    win_extra_args: {}
    win_extra_binds: []
    win_extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
  update_strategy: null
  tolerations: []
authentication:
  strategy: x509
  sans: []
  webhook: null
addons: ""
addons_include: []
# Versions defined by kubernetes_version
#system_images:
#  etcd: rancher/mirrored-coreos-etcd:v3.5.0
#  alpine: rancher/rke-tools:v0.1.78
#  nginx_proxy: rancher/rke-tools:v0.1.78
#  cert_downloader: rancher/rke-tools:v0.1.78
#  kubernetes_services_sidecar: rancher/rke-tools:v0.1.78
#  kubedns: rancher/mirrored-k8s-dns-kube-dns:1.17.4
#  dnsmasq: rancher/mirrored-k8s-dns-dnsmasq-nanny:1.17.4
#  kubedns_sidecar: rancher/mirrored-k8s-dns-sidecar:1.17.4
#  kubedns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.3
#  coredns: rancher/mirrored-coredns-coredns:1.8.6
#  coredns_autoscaler: rancher/mirrored-cluster-proportional-autoscaler:1.8.5
#  nodelocal: rancher/mirrored-k8s-dns-node-cache:1.21.1
#  #kubernetes: rancher/hyperkube:v1.24.4-rancher1
#  flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
#  calico_node: rancher/mirrored-calico-node:v3.21.1
#  calico_cni: rancher/mirrored-calico-cni:v3.21.1
#  calico_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  calico_ctl: rancher/mirrored-calico-ctl:v3.21.1
#  calico_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  canal_node: rancher/mirrored-calico-node:v3.21.1
#  canal_cni: rancher/mirrored-calico-cni:v3.21.1
#  canal_controllers: rancher/mirrored-calico-kube-controllers:v3.21.1
#  canal_flannel: rancher/mirrored-coreos-flannel:v0.15.1
#  canal_flexvol: rancher/mirrored-calico-pod2daemon-flexvol:v3.21.1
#  weave_node: weaveworks/weave-kube:2.8.1
#  weave_cni: weaveworks/weave-npc:2.8.1
#  pod_infra_container: rancher/mirrored-pause:3.5
#  ingress: rancher/nginx-ingress-controller:nginx-1.1.0-rancher1
#  ingress_backend: rancher/mirrored-nginx-ingress-controller-defaultbackend:1.5-rancher1
#  ingress_webhook: rancher/mirrored-ingress-nginx-kube-webhook-certgen:v1.1.1
#  metrics_server: rancher/mirrored-metrics-server:v0.5.1
#  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
#  aci_cni_deploy_container: noiro/cnideploy:5.1.1.0.1ae238a
#  aci_host_container: noiro/aci-containers-host:5.1.1.0.1ae238a
#  aci_opflex_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_mcast_container: noiro/opflex:5.1.1.0.1ae238a
#  aci_ovs_container: noiro/openvswitch:5.1.1.0.1ae238a
#  aci_controller_container: noiro/aci-containers-controller:5.1.1.0.1ae238a
#  aci_gbp_server_container: noiro/gbp-server:5.1.1.0.1ae238a
#  aci_opflex_server_container: noiro/opflex-server:5.1.1.0.1ae238a
ssh_key_path: ~/.ssh/id_rsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: null
enable_cri_dockerd: null
kubernetes_version: "v1.24.4-rancher1-1"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
  update_strategy: null
  http_port: 0
  https_port: 0
  network_mode: ""
  tolerations: []
  default_backend: null
  default_http_backend_priority_class_name: ""
  nginx_ingress_controller_priority_class_name: ""
  default_ingress_class: null
cluster_name: ""
cloud_provider:
  name: ""
prefix_path: ""
win_prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: ""
  ssh_cert: ""
  ssh_cert_path: ""
  ignore_proxy_env_vars: false
monitoring:
  provider: ""
  options: {}
  node_selector: {}
  update_strategy: null
  replicas: null
  tolerations: []
  metrics_server_priority_class_name: ""
restore:
  restore: false
  snapshot_name: ""
rotate_encryption_key: false
dns: null

I would be grateful for anything that helps me and others solve this annoying issue.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 5
  • Comments: 21 (6 by maintainers)

Most upvoted comments

Hi @likku123 @gmanera @iTaybb @vinibodruch Sorry for the late reply, I was out sick for the past two weeks and just returned today. I am glad to see that you guys figured out the root cause and the fix! I can definitely update the cri-dockerd version used in rancher/rke-tools to v0.2.6. I will fit it into the team’s schedule and try to get the fix out ASAP, but sorry that I cannot guarantee a date. Thank you for your understanding.

And a caveat: AFAIK, the kubelet does not automatically reload when changes are made to the config file, which means you need to restart the kubelet container after changing the "external" config file.
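For example, on an RKE1 node the kubelet runs as a Docker container named kubelet, so the restart is a one-liner (a sketch, assuming that default container name):

# docker restart kubelet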

Where is the kubelet config file on Rancher 2.6.9 - RKE1, like the one at kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file? Can I manage it? Does this file exist?

Rancher/RKE does not use a kubelet config file to configure the kubelet, which sadly means you cannot find it anywhere. But that does not mean you cannot use one, and actually, you are very close to the final solution:

You need to set both the extra_args and extra_binds to make it work. In your cluster.yml it will look like the following:

services:
  kubelet:
    extra_args:
      config: path-to-the-config-file-in-the-container
    extra_binds:
      - "path-to-file-on-host:path-to-the-config-file-in-the-container"

And of course, you need to create/put such a config file on the control plane node beforehand.
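For illustration, a minimal sketch of such a config file, assuming a hypothetical path /etc/kubernetes/kubelet-custom.yaml used both on the host and inside the container. The file is a standard KubeletConfiguration object, and runtimeRequestTimeout is the config-file counterpart of the deprecated flag:

# /etc/kubernetes/kubelet-custom.yaml (hypothetical path)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# duration string; replaces the deprecated --runtime-request-timeout flag
runtimeRequestTimeout: "10m"

with the matching cluster.yml entries:

services:
  kubelet:
    extra_args:
      config: /etc/kubernetes/kubelet-custom.yaml
    extra_binds:
      - "/etc/kubernetes/kubelet-custom.yaml:/etc/kubernetes/kubelet-custom.yaml"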

I hope this is helpful.

This is definitely an issue with the cri-dockerd version that ships with rke-tools. Right now the version is:

# cd /opt/rke-tools/bin
# ./cri-dockerd --version
cri-dockerd 0.2.4 (4b57f30)

As per https://github.com/kubernetes/minikube/issues/14789#issuecomment-1233138069, cri-dockerd 0.2.6 is the patch that solves the timeout issue.

Any suggestions on how to deploy cri-dockerd 0.2.6 in my present setup?

@likku123 can you do the following checks on the container kube-apiserver on the control plane node:

  • docker logs to check if there is any error message.
  • docker exec into the container to see if the config file exists and contains the proper content.
  • docker inspect to check if --config is set.

If all of the above look right, it means RKE has configured the kube-apiserver properly, and then I would suspect an upstream issue or something wrong outside of RKE.
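Concretely, those three checks might look like the following sketch, using the container named in the comment above and a placeholder config path (swap in the kubelet container and the config path from the earlier example when verifying a kubelet config file):

# docker logs kube-apiserver 2>&1 | tail -n 50
# docker exec kube-apiserver cat /path/to/the/config/file
# docker inspect kube-apiserver | grep -- '--config'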