rancher: PLEG is not healthy K8 1.20.4/Ubuntu 20.04
What kind of request is this (question/bug/enhancement/feature request): bug
Steps to reproduce (least amount of steps as possible):
- Install Rancher (2.5.7)
- Launch an RKE cluster with Kubernetes 1.20.4
- Use Ubuntu 20.04 on the nodes
- Deploy many StatefulSets (20+) and some CronJobs that run every minute
Result: Nodes start to fall into PLEG error.
Other details that may be helpful: After a downgrade (full reinstall) to K8 1.18.16 and Ubuntu 18.04, everything works correctly. The issue seems reproducible: I have done a full reinstall twice with K8 1.20.4 and Ubuntu 20.04 and the issue still persists. I also see the issue on all 3 of my nodes (all nodes run Ubuntu 20.04).
Node flapping seems to come from StatefulSets and CronJobs.
This command is useful to find the container that hangs on docker inspect:
docker ps -a | tr -s " " | cut -d " " -f1 | xargs -Iarg sh -c 'echo arg; docker inspect arg > /dev/null'
Environment information
- Rancher version: 2.5.7
- Installation option (single install/HA): HA, Digital Ocean Kubernetes
Cluster information
- Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom (Launched by Rancher)
- Machine type (cloud/VM/metal) and specifications (CPU/memory): metal, 8 CPU, 32 GB RAM on each node, 3 nodes, OVH VPS provider
- Kubernetes version (use kubectl version): 1.20.4
- Docker version (use docker version): Issue is reproducible on 19.03 and 20.10 (installed via Rancher scripts). I think that Docker is not involved in this issue.
Related comment on kubernetes repo: https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-803358917
Note: I’m sorry but I no longer have access to the mentioned cluster so i can’t provide more information/logs.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 32
It is easy to solve:
do not use runc-1.0.0-rc93.
Upgrade or downgrade to any other version, just not 1.0.0-rc93. See: deadlock of read from closed pip
We worked around the PLEG issue by keeping the versions as seen in the quote above, but we then downgraded containerd from 1.4.4 to 1.4.3.
From “System information” in Rancher on one of our nodes:
From terminal on the same node:
The nodes have been running for several days now without any issues. Before we used to run into the PLEG issue after only a few minutes.
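To check a node against the version advice in the comments above, a minimal sketch follows. It assumes the runc and containerd binaries are on the PATH of the node; the function name is_affected_runc is made up for this example and is not part of any tool.

```shell
# Hypothetical helper: returns 0 (true) when the text passed in, for example
# the output of "runc --version", names the release that the comment above
# says to avoid (1.0.0-rc93).
is_affected_runc() {
  case "$1" in
    *1.0.0-rc93*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a node (not executed here; assumes runc and containerd are on PATH):
#   if is_affected_runc "$(runc --version)"; then
#     echo "runc 1.0.0-rc93 is installed on this node"
#   fi
#   containerd --version   # compare with the 1.4.3 / 1.4.4 versions mentioned above
```

The check only matches the version string; it does not prove that a node is hitting the deadlock.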
@rdxmb We upgraded to docker 20.10.6 which stabilized K8s and allowed us to stay on Ubuntu 20.04.
Same issue here: docker inspect hangs and we get PLEG errors on the nodes. Reproducible on:
- Docker Server Version: 19.03.14
- OS: CentOS Linux release 7.6.1810 (Core) and 7.9
- Rancher: 2.5.5 and 2.5.7 (tried both)
PLEG appears most often on Kubernetes 1.20 and when the max number of pods per node is increased.
It’s a critical issue and affects the stability of production systems. Is there any suggested workaround?
Looks related to this one here
We tried that command earlier today and in all cases it was the Rancher/pause container.
@herrenP did you try to deploy new nodes with Ubuntu 18?
Also, since it seems reproducible in your env, could you run
docker ps -a | tr -s " " | cut -d " " -f1 | xargs -Iarg sh -c 'echo arg; docker inspect arg > /dev/null'
on the stuck nodes to identify the container that hangs? The command will hang on the container ID that blocks the PLEG process. Then look up the container name with docker ps.
If I remember correctly, in my case it was a Rancher/pause container.
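Because the one-liner above stops at the first container that hangs, a variant with a time limit may be easier to use when several containers are stuck. The following is only a sketch: the function name find_hung_containers and the variables INSPECT_TIMEOUT and INSPECT_CMD are made up for this example, and it assumes GNU coreutils timeout is available on the node.

```shell
# Hypothetical helper: reads container IDs from stdin (one per line) and
# prints every ID whose inspect call does not finish within the time limit.
# INSPECT_TIMEOUT (seconds, default 10) and INSPECT_CMD (default
# "docker inspect") are knobs invented for this sketch.
find_hung_containers() {
  while read -r id; do
    [ -n "$id" ] || continue
    status=0
    # stdin is redirected so the inspected command cannot consume the ID list.
    # INSPECT_CMD is left unquoted on purpose so "docker inspect" splits in two words.
    timeout "${INSPECT_TIMEOUT:-10}" ${INSPECT_CMD:-docker inspect} "$id" \
      >/dev/null 2>&1 </dev/null || status=$?
    # timeout(1) exits with status 124 when the command ran past the limit.
    if [ "$status" -eq 124 ]; then
      echo "$id"
    fi
  done
}

# Usage on a node (not executed here):
#   docker ps -aq | find_hung_containers
#   docker ps -a --filter "id=<ID>" --format '{{.Names}}'   # name of a reported container
```

A container that is merely slow to inspect would also be reported, so the limit may need adjusting on a loaded node.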
Happens to us too. Rancher 2.5.5, Ubuntu 20.04. Seen the issue on nodes running ~100 to ~200 pods with maxPods set to 250. Restarting docker or rebooting the nodes does not help. They return to this state after a short while.
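Since several reports here mention pod density, it may help to see how many pods each node is actually running. A minimal sketch, assuming kubectl is configured for the cluster; the function name count_pods_per_node is made up for this example.

```shell
# Hypothetical helper: reads node names from stdin (one per line, one line
# per pod) and prints "<count> <node>" sorted with the busiest node first.
count_pods_per_node() {
  sort | uniq -c | sort -rn
}

# Usage (not executed here; pods that are not scheduled yet show up as <none>):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | count_pods_per_node
```

The counts can then be compared with the maxPods value configured for the nodes.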