rke: Pods do not start on upgrade from 1.24.8-rancher1-1 to 1.24.9-rancher1-1

RKE version: 1.4.2

Docker version: 20.10.21 (full docker version and docker info output below)

> docker --version
Docker version 20.10.21, build baeda1f
> docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.15.1)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 51
  Running: 7
  Paused: 0
  Stopped: 44
 Images: 40
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-58-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.7GiB
 Name: k8-internal-000001
 ID: 6ATJ:2X3N:RNNU:MF66:TSKE:T5CU:4KCU:SOQJ:66VS:RPNQ:WI3V:FEUV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Operating system and kernel: Ubuntu 22.04.1 LTS, kernel 5.15.0-58-generic (output of cat /etc/os-release and uname -r below)

~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

~$ uname -r
5.15.0-58-generic

Type/provider of hosts: Hyper-V (Ubuntu 22.04 nodes)

cluster.yml file:

kubernetes_version: v1.24.8-rancher1-1
nodes:
    # k8-internal-000001 
    - address: 192.168.1.xx
      user: nodeuser
      role:
        - worker
        - etcd
        - controlplane
    # k8-internal-000002
    - address: 192.168.1.xx
      user: nodeuser
      role:
        - worker
        - etcd
    # k8-internal-000003
    - address: 192.168.1.xx
      user: nodeuser
      role:
        - worker
        - etcd
    # k8-internal-000004
    - address: 192.168.1.xx
      user: nodeuser
      role:
        - worker

# Cluster level SSH private key
# Used if no ssh information is set for the node
ssh_key_path: ~/.ssh/id_ed25519

services:
  kube-api:
    secrets_encryption_config:
      enabled: true
      
# Set the name of the Kubernetes cluster  
cluster_name: internal

# Specify network plugin-in (canal, calico, flannel, weave, or none)
network:
    plugin: flannel

ingress:
  provider: none

authentication:
  strategy: x509
  sans:
    - 'cp-internal.local.net'
    - 'cp-internal'
    - '127.0.0.1'
    - 'localhost'
    - 'kubernetes'
    - 'kubernetes.default'
    - 'kubernetes.default.svc'
    - 'kubernetes.default.svc.cluster.local'

dns:
  provider: coredns
  upstreamnameservers:
   - 192.168.1.xx

Steps to Reproduce: I continue to see issues when trying to upgrade clusters, even for small version jumps. Example: I have two clusters running 1.24.8-rancher1-1 and I attempted to upgrade them to 1.24.9-rancher1-1 using RKE 1.4.2. The rke up command runs successfully, but essentially every pod then goes into an error state with errors such as this:

unable to ensure pod container exists: failed to create container for [kubepods besteffort pod3544c64a-54a2-472a-9d70-39328bde7a0a] : unable to start unit "kubepods-besteffort-pod3544c64a_54a2_472a_9d70_39328bde7a0a.slice" (properties [{Name:Description Value:"libcontainer container kubepods-besteffort-pod3544c64a_54a2_472a_9d70_39328bde7a0a.slice"} {Name:Wants Value:["kubepods-besteffort.slice"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]): Unit kubepods-besteffort.slice not found.
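
When a node is in this state, the missing systemd slices can be confirmed directly on the host. The commands below are only a suggested check (the slice names come from the error above); on an affected node the kubepods slices are typically absent:

~$ systemctl list-units --type=slice | grep kubepods   # should list kubepods.slice and its children
~$ systemctl status kubepods.slice                     # reports that the unit could not be found when it is missing
~$ systemd-cgls --no-pager | head -n 40                # shows where the kubelet-managed slices should sit in the cgroup tree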

Results:

The only success I have had in the past is to provision new nodes and then move over to them, which I usually do with multiple rke up commands. If I sneak a Kubernetes upgrade in while I'm provisioning the new nodes and then get rid of the old ones, the cordon/drain process seems to restart everything…
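
For reference, that node-replacement workaround is just editing the nodes list in cluster.yml and running rke up more than once. A rough sketch (the new node's name and address are placeholders; everything else follows the cluster.yml above):

# 1) Add the replacement node alongside the existing entries, then run rke up:
    # k8-internal-000005 (new worker, placeholder address)
    - address: 192.168.1.yy
      user: nodeuser
      role:
        - worker
# 2) Once the upgrade has gone through and workloads have moved, delete the old
#    node's entry and run rke up again; RKE removes nodes that are no longer
#    listed in cluster.yml.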

This is really getting worrisome, as I’m unable to perform a simple upgrade of the cluster.

About this issue

  • State: open
  • Created a year ago
  • Reactions: 6
  • Comments: 27 (5 by maintainers)

Most upvoted comments

My basic attempts to reproduce this have not led to success. If anyone can reproduce this with a stock cloud image, can you please share which cloud and which image so I can use it? I have tried Ubuntu 22.04 on AWS and Debian 11 on DigitalOcean without success. Since not everyone is hitting this issue, there must be some specific software version, configuration, or deployment detail that triggers it.

If you don’t have a stock cloud image to reproduce, please share the following outputs:

  • cat /proc/self/mountinfo | grep cgroup
  • systemd --version
  • containerd --version
  • runc --version
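
For anyone affected, one possible way to collect all four of those outputs into a single file to attach here (the file name is arbitrary):

~$ { grep cgroup /proc/self/mountinfo; systemd --version; containerd --version; runc --version; } > cgroup-debug.txt 2>&1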

If you can only reproduce on your own infra, maybe you can add verbose logging to the kubelet (-v 9) to extract more info on what is happening exactly.
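
With RKE, that kubelet verbosity can be set from cluster.yml and applied with rke up; a minimal sketch, assuming the standard services.kubelet.extra_args mechanism:

services:
  kubelet:
    extra_args:
      v: "9"     # equivalent to passing --v=9 to the kubelet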

Same issue with v1.24.10-rancher4-1: any kubelet restart causes errors in the kubelet log:

E0426 12:38:36.926506 360191 qos_container_manager_linux.go:374] "Failed to update QoS cgroup configuration" err="unable to set unit properties: Unit kubepods.slice not found."
I0426 12:38:36.926541 360191 kubelet.go:1658] "Failed to update QoS cgroups while syncing pod" pod="argocd/argocd-redis-68bc48958b-sdvn7" err="unable to set unit properties: Unit kubepods.slice not found."
E0426 12:38:36.930385 360191 pod_workers.go:965] "Error syncing pod, skipping" err="failed to ensure that the pod: 828ad3df-a0df-46a0-bee8-53576030ff50 cgroups exist and are correctly applied: failed to create container for [kubepods besteffort pod828ad3df-a0df-46a0-bee8-53576030ff50] : unable to start unit \"kubepods-besteffort-pod828ad3df_a0df_46a0_bee8_53576030ff50.slice\" (properties [{Name:Description Value:\"libcontainer container kubepods-besteffort-pod828ad3df_a0df_46a0_bee8_53576030ff50.slice\"} {Name:Wants Value:[\"kubepods-besteffort.slice\"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]): Unit kubepods-besteffort.slice not found." pod="argocd/argocd-redis-68bc48958b-sdvn7" podUID=828ad3df-a0df-46a0-bee8-53576030ff50
E0426 12:38:37.025963 360191 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/kubepods.slice\": failed to get container info for \"/kubepods.slice\": unknown container \"/kubepods.slice\"" containerName="/kubepods.slice"

The issue is present with both docker-ce 20.10.24 and docker-ce 23.0.4. systemctl restart docker is a workaround.

This also happens if I change anything at all in the cluster config, say the vSphere CSI password. docker restart kubelet has fixed it reliably so far, but I have to do that on each node.
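
Since that has to be done per node, something like the following could be run from the RKE workstation; the node list and SSH user are taken from the cluster.yml above (addresses are placeholders), and kubelet is the name of the kubelet container that RKE creates:

~$ NODES="192.168.1.xx 192.168.1.xx 192.168.1.xx 192.168.1.xx"   # every node address from cluster.yml
~$ for h in $NODES; do ssh nodeuser@"$h" 'docker restart kubelet'; done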

@kinarashah https://gist.github.com/electrical/76f8567b1243320829704729e7b40da7

@kinarashah any update after supplying the logs? I’m a bit worried that the thread has been so quiet while this seems to be a major issue.