rke: Pods do not start on upgrade from 1.24.8-rancher1-1 to 1.24.9-rancher1-1
RKE version:
1.4.2
Docker version: (docker version,docker info preferred)
> docker --version
Docker version 20.10.21, build baeda1f
> docker info
Client:
 Context: default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.9.1-docker)
  compose: Docker Compose (Docker Inc., v2.15.1)
  scan: Docker Scan (Docker Inc., v0.23.0)

Server:
 Containers: 51
  Running: 7
  Paused: 0
  Stopped: 44
 Images: 40
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runtime.v1.linux runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 5b842e528e99d4d4c1686467debf2bd4b88ecd86
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-58-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.7GiB
 Name: k8-internal-000001
 ID: 6ATJ:2X3N:RNNU:MF66:TSKE:T5CU:4KCU:SOQJ:66VS:RPNQ:WI3V:FEUV
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
~$ uname -r
5.15.0-58-generic
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) Hyper-V (Ubuntu 22.04 nodes)
cluster.yml file:
kubernetes_version: v1.24.8-rancher1-1
nodes:
  # k8-internal-000001
  - address: 192.168.1.xx
    user: nodeuser
    role:
      - worker
      - etcd
      - controlplane
  # k8-internal-000002
  - address: 192.168.1.xx
    user: nodeuser
    role:
      - worker
      - etcd
  # k8-internal-000003
  - address: 192.168.1.xx
    user: nodeuser
    role:
      - worker
      - etcd
  # k8-internal-000004
  - address: 192.168.1.xx
    user: nodeuser
    role:
      - worker
# Cluster level SSH private key
# Used if no ssh information is set for the node
ssh_key_path: ~/.ssh/id_ed25519
services:
  kube-api:
    secrets_encryption_config:
      enabled: true
# Set the name of the Kubernetes cluster
cluster_name: internal
# Specify network plug-in (canal, calico, flannel, weave, or none)
network:
  plugin: flannel
ingress:
  provider: none
authentication:
  strategy: x509
  sans:
    - 'cp-internal.local.net'
    - 'cp-internal'
    - '127.0.0.1'
    - 'localhost'
    - 'kubernetes'
    - 'kubernetes.default'
    - 'kubernetes.default.svc'
    - 'kubernetes.default.svc.cluster.local'
dns:
  provider: coredns
  upstreamnameservers:
    - 192.168.1.xx
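For the upgrade attempt described under Steps to Reproduce, the change to this file is roughly just the version line (a sketch, assuming the usual RKE flow of editing cluster.yml and rerunning rke up):

kubernetes_version: v1.24.9-rancher1-1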
Steps to Reproduce:
I continue to see issues when trying to upgrade clusters, even for small version jumps.
Example:
I have two clusters running 1.24.8-rancher1-1 and I attempted to upgrade to 1.24.9-rancher1-1 using RKE 1.4.2. The rke up command runs successfully, but essentially every pod goes into an error state with errors such as this:
unable to ensure pod container exists: failed to create container for [kubepods besteffort pod3544c64a-54a2-472a-9d70-39328bde7a0a] : unable to start unit "kubepods-besteffort-pod3544c64a_54a2_472a_9d70_39328bde7a0a.slice" (properties [{Name:Description Value:"libcontainer container kubepods-besteffort-pod3544c64a_54a2_472a_9d70_39328bde7a0a.slice"} {Name:Wants Value:["kubepods-besteffort.slice"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]): Unit kubepods-besteffort.slice not found.
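The failing call is kubelet asking systemd to create the pod's cgroup under kubepods-besteffort.slice, which systemd reports as not found. A quick sketch for checking the slice state on an affected node (assuming cgroup v2 with the systemd cgroup driver, as shown in the docker info output above):

# Sketch: run on an affected node to see whether the kubepods parent slices exist.
systemctl status kubepods.slice kubepods-besteffort.slice
# Check the cgroup hierarchy on disk (cgroup v2 unified mount):
ls /sys/fs/cgroup/kubepods.slice
# Confirm how cgroups are mounted:
grep cgroup /proc/self/mountinfo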
Results:
The only success I have had in the past is to provision new nodes and swap them in, which I usually do across multiple rke up commands. If I sneak a Kubernetes upgrade in while I'm provisioning the new nodes and then get rid of the old ones, the cordon/drain process seems to restart everything…
This is really getting worrisome, as I’m unable to perform a simple upgrade of the cluster.
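Roughly, the node-replacement sequence described above looks like this (a sketch only; the cluster.yml edits are done by hand):

# 1. Add freshly provisioned nodes to cluster.yml and bump kubernetes_version.
# 2. Apply, letting RKE join the new nodes at the new version:
rke up --config cluster.yml
# 3. Remove the old node entries from cluster.yml.
# 4. Apply again; the cordon/drain of the old nodes is what seems to restart everything:
rke up --config cluster.yml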
About this issue
- State: open
- Created a year ago
- Reactions: 6
- Comments: 27 (5 by maintainers)
My basic attempts to reproduce this have not led to success. If anyone can reproduce this with a stock cloud image, can you please share which cloud and which image so I can use it? I have used Ubuntu 22.04 on AWS and Debian 11 on DigitalOcean without success. As not everyone is hitting this issue, there must be some specific software version/configuration/deployment present that causes it.
If you don't have a stock cloud image to reproduce with, please share the following outputs:
- cat /proc/self/mountinfo | grep cgroup
- systemd --version
- containerd --version
- runc --version

If you can only reproduce on your own infra, maybe you can add verbose logging to the kubelet (-v 9) to extract more info on what is happening exactly.
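A minimal cluster.yml sketch for raising kubelet verbosity via RKE's extra_args (assuming extra_args keys map straight onto kubelet flags; v: "9" is very noisy, so remove it after collecting logs):

services:
  kubelet:
    extra_args:
      v: "9"   # verbose kubelet logging for debugging only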
Same issue with v1.24.10-rancher4-1; any kubelet restart causes errors in the kubelet log:
E0426 12:38:36.926506 360191 qos_container_manager_linux.go:374] "Failed to update QoS cgroup configuration" err="unable to set unit properties: Unit kubepods.slice not found."
I0426 12:38:36.926541 360191 kubelet.go:1658] "Failed to update QoS cgroups while syncing pod" pod="argocd/argocd-redis-68bc48958b-sdvn7" err="unable to set unit properties: Unit kubepods.slice not found."
E0426 12:38:36.930385 360191 pod_workers.go:965] "Error syncing pod, skipping" err="failed to ensure that the pod: 828ad3df-a0df-46a0-bee8-53576030ff50 cgroups exist and are correctly applied: failed to create container for [kubepods besteffort pod828ad3df-a0df-46a0-bee8-53576030ff50] : unable to start unit \"kubepods-besteffort-pod828ad3df_a0df_46a0_bee8_53576030ff50.slice\" (properties [{Name:Description Value:\"libcontainer container kubepods-besteffort-pod828ad3df_a0df_46a0_bee8_53576030ff50.slice\"} {Name:Wants Value:[\"kubepods-besteffort.slice\"]} {Name:MemoryAccounting Value:true} {Name:CPUAccounting Value:true} {Name:IOAccounting Value:true} {Name:TasksAccounting Value:true} {Name:DefaultDependencies Value:false}]): Unit kubepods-besteffort.slice not found." pod="argocd/argocd-redis-68bc48958b-sdvn7" podUID=828ad3df-a0df-46a0-bee8-53576030ff50
E0426 12:38:37.025963 360191 summary_sys_containers.go:48] "Failed to get system container stats" err="failed to get cgroup stats for \"/kubepods.slice\": failed to get container info for \"/kubepods.slice\": unknown container \"/kubepods.slice\"" containerName="/kubepods.slice"

Issue is present using both docker-ce 20.10.24 and docker-ce 23.0.4. systemctl restart docker is a workaround. This also happens if I just change anything in the cluster config, say the vSphere CSI password.
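A sketch of applying that restart workaround to every node over SSH (addresses are placeholders in the same xx style as above; assumes the nodeuser account from cluster.yml can sudo):

# Placeholder addresses; substitute the real node IPs from cluster.yml.
for node in 192.168.1.xx 192.168.1.yy 192.168.1.zz 192.168.1.ww; do
  ssh nodeuser@"$node" 'sudo systemctl restart docker'
done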
docker restart kubelet fixes it reliably, so far, but I have to do that on each node.

@kinarashah any update after supplying the logs? I'm a bit worried that the thread has been so quiet while this seems to be a major issue.