rancher: PLEG is not healthy K8 1.20.4/Ubuntu 20.04

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • Install Rancher (2.5.7)
  • Launch a RKE cluster with Kubernetes 1.20.4
  • Use Ubuntu 20.04 on Nodes
  • Deploy many StatefulSets (20+) and some CronJobs that run every minute

Result: Nodes start reporting the "PLEG is not healthy" error.
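To confirm which nodes are affected, one option is to filter the Ready condition message of each node. This is a minimal sketch, not part of the original report: it assumes the kubelet surfaces the "PLEG is not healthy" text in the Ready condition message, and `KUBECTL` and `pleg_unhealthy_nodes` are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# KUBECTL is an illustrative override so a different kubectl binary can be used.
KUBECTL="${KUBECTL:-kubectl}"

# Print the name of every node whose Ready condition message mentions PLEG.
# Assumption: the kubelet reports the PLEG error in that message.
pleg_unhealthy_nodes() {
    "$KUBECTL" get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].message}{"\n"}{end}' \
        | grep 'PLEG is not healthy' \
        | cut -f1
}
```

Calling `pleg_unhealthy_nodes` prints one node name per line, or nothing if no node currently reports the error.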

Other details that may be helpful: After a downgrade (full reinstall) to Kubernetes 1.18.16 and Ubuntu 18.04, everything works correctly. The issue seems reproducible: I have done a full reinstall twice with Kubernetes 1.20.4 and Ubuntu 20.04, and the issue persists. I also see the issue on all 3 of my nodes (all of them run Ubuntu 20.04).

Node flapping seems to come from StatefulSets and CronJobs. This command is useful for finding the container on which docker inspect hangs: docker ps -a | tr -s " " | cut -d " " -f1 | xargs -Iarg sh -c 'echo arg; docker inspect arg> /dev/null'
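The one-liner above blocks forever on the first stuck container. A variant that keeps going can be sketched with timeout(1) from coreutils, which exits with status 124 when it has to kill the command. The names `DOCKER`, `INSPECT_TIMEOUT` and `find_hung_containers` are hypothetical, and the 10-second default is an arbitrary assumption rather than a value from this thread.

```shell
#!/bin/sh
# Illustrative settings, not taken from the original report.
DOCKER="${DOCKER:-docker}"
INSPECT_TIMEOUT="${INSPECT_TIMEOUT:-10}"

# Print the ID of every container whose `docker inspect` does not return in time.
find_hung_containers() {
    # `docker ps -aq` prints one container ID per line, stopped containers included.
    for id in $("$DOCKER" ps -aq); do
        timeout "$INSPECT_TIMEOUT" "$DOCKER" inspect "$id" > /dev/null 2>&1
        # Status 124 means timeout(1) gave up waiting and killed the command.
        if [ "$?" -eq 124 ]; then
            echo "$id"
        fi
    done
}
```

Running `find_hung_containers` on an affected node should list all stuck container IDs instead of stopping at the first one.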

Environment information

  • Rancher version: 2.5.7
  • Installation option (single install/HA): HA, Digital Ocean Kubernetes

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): Custom (Launched by Rancher)

  • Machine type (cloud/VM/metal) and specifications (CPU/memory): metal, 8 CPU, 32 GB RAM on each node, 3 nodes, OVH VPS provider

  • Kubernetes version (use kubectl version): 1.20.4

  • Docker version (use docker version): The issue is reproducible on 19.03 and 20.10 (installed via Rancher scripts). I think Docker is not involved in the issue.

Related comment on kubernetes repo: https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-803358917

Note: I’m sorry, but I no longer have access to the mentioned cluster, so I can’t provide more information/logs.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 32

Most upvoted comments

Easy to solve: do not use runc 1.0.0-rc93. Upgrade or downgrade to any other version, just not 1.0.0-rc93. See: deadlock on read from a closed pipe
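A quick way to see whether a node runs the version named above is to look at the output of `runc --version`. This is a sketch under assumptions: the binary is taken to be on the PATH as `runc`, and `RUNC` and `runc_is_rc93` are hypothetical names used only here.

```shell
#!/bin/sh
# RUNC is an illustrative override for nodes where the binary has another name or path.
RUNC="${RUNC:-runc}"

# Succeed (exit status 0) only if the reported version string contains 1.0.0-rc93.
runc_is_rc93() {
    "$RUNC" --version 2>/dev/null | grep -q '1\.0\.0-rc93'
}

if runc_is_rc93; then
    echo "runc 1.0.0-rc93 detected"
else
    echo "runc is not 1.0.0-rc93 (or runc was not found)"
fi
```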

  • Kubernetes: 1.19.8-rancher1-1
  • Node OS: Ubuntu 20.04
  • Docker: 20.10.3 (also tried 20.10.5 at some point)
  • Provider: EC2 on AWS

We worked around the PLEG issue by keeping the versions shown in the quote above and downgrading containerd from 1.4.4 to 1.4.3.

From “System information” in Rancher on one of our nodes:

Architecture | amd64
Docker Version | 20.10.3
Kernel Version | 5.4.0-1029-aws
Kubelet Version | v1.20.5
Kube Proxy Version | v1.20.5
Operating System Image | Ubuntu 20.04.1 LTS
Operating System | linux

From terminal on the same node:

someuser@somenode:~$ containerd --version
containerd containerd.io 1.4.3 269548fa27e0089a8b8278fc4fc781d7f65a939b

The nodes have been running for several days now without any issues. Before we used to run into the PLEG issue after only a few minutes.

@rdxmb We upgraded to docker 20.10.6 which stabilized K8s and allowed us to stay on Ubuntu 20.04.

Same issue here: docker inspect hangs and the nodes report PLEG errors. Reproducible on:

  • Docker Server Version: 19.03.14
  • OS: CentOS Linux release 7.6.1810 (Core) and 7.9
  • Rancher: 2.5.5 and 2.5.7 (tried both)

The PLEG error appears most often on Kubernetes 1.20 and when the maximum number of pods per node is increased.

It’s a critical issue and affects the stability of production systems. Is there a suggested workaround?

This looks related to this one here

@herrenP did you try to deploy new nodes with Ubuntu 18?

Also, since it seems reproducible in your environment, could you run docker ps -a | tr -s " " | cut -d " " -f1 | xargs -Iarg sh -c 'echo arg; docker inspect arg> /dev/null' on the stuck nodes to identify the container that hangs?

The command will hang on the ID of the container that blocks the PLEG process. Then look up the container name with docker ps

If I remember correctly, in my case it was a Rancher/pause container.
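Once a stuck container ID is known, the lookup step can be sketched with the `id` filter of `docker ps`. The names `DOCKER` and `container_summary` are hypothetical and only serve this example.

```shell
#!/bin/sh
# DOCKER is an illustrative override for the docker CLI.
DOCKER="${DOCKER:-docker}"

# Print "<id> <image> <name>" for the container whose ID matches the first argument.
container_summary() {
    "$DOCKER" ps -a --filter "id=$1" --format '{{.ID}} {{.Image}} {{.Names}}'
}
```

For example, `container_summary <container-id>` prints the image and name, which shows whether the stuck container is a pause container.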

We tried that command earlier today and in all cases it was the Rancher/pause container.

Happens to us too. Rancher 2.5.5, Ubuntu 20.04. Seen the issue on nodes running ~100 to ~200 pods with maxPods set to 250. Restarting docker or rebooting the nodes does not help. They return to this state after a short while.
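To compare a node's current pod count with its configured limit, something like the following could be used. It is a sketch only: `KUBECTL`, `pods_on_node` and `node_pod_capacity` are hypothetical names, and the node name is passed in by the caller.

```shell
#!/bin/sh
# KUBECTL is an illustrative override so a different kubectl binary can be used.
KUBECTL="${KUBECTL:-kubectl}"

# Count the pods currently scheduled on the node given as first argument.
pods_on_node() {
    "$KUBECTL" get pods --all-namespaces --no-headers \
        --field-selector "spec.nodeName=$1" | wc -l
}

# Print the pod capacity the node reports (this reflects the kubelet's maxPods setting).
node_pod_capacity() {
    "$KUBECTL" get node "$1" -o jsonpath='{.status.capacity.pods}'
}
```

Running `pods_on_node <node-name>` and `node_pod_capacity <node-name>` side by side shows how close a node is to its limit when the error appears.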