amazon-eks-ami: Node NotReady because "PLEG is not healthy"

What happened: One of the nodes using the latest AMI version started becoming NotReady.

What you expected to happen: The node stays Ready.

How to reproduce it (as minimally and precisely as possible): Not sure how; the node never shows high CPU or memory usage.

Anything else we need to know?: We SSHed into the node and found that PLEG is not healthy:

Feb 20 14:23:09 ip-10-0-13-15.eu-west-1.compute.internal kubelet[3694]: I0220 14:23:09.120100    3694 kubelet.go:1775] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 4h19m47.369998188s ago; threshold is 3m0s]
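
For reference, a message like this can be surfaced with commands along the following lines (a sketch, assuming the kubelet runs under systemd on the EKS-optimized AMI; the node name and time window are examples):

# Grep the kubelet journal for the PLEG health message.
journalctl -u kubelet --since "1 hour ago" | grep -i "PLEG is not healthy"
# Check the node conditions reported to the API server.
kubectl describe node ip-10-0-13-15.eu-west-1.compute.internal | grep -A 10 "Conditions:"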

When we try to check the containers with docker ps, we get this error:

Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: http: multiple response.WriteHeader calls
Feb 20 14:28:54 ip-10-0-13-15.eu-west-1.compute.internal dockerd[3188]: time="2019-02-20T14:28:54.455014979Z" level=error msg="Handler for GET /v1.25/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
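
When the daemon is in this state, a quick check of whether the Docker API still answers at all can look like this (a sketch; assumes the default /var/run/docker.sock):

# Time-box docker ps so a hung daemon does not block the shell indefinitely.
timeout 10 docker ps > /dev/null && echo "docker API responding" || echo "docker API hung or erroring"
# Hit the same endpoint the error above refers to, directly over the unix socket.
curl --max-time 10 --unix-socket /var/run/docker.sock http://localhost/v1.25/containers/json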

This is the Docker version (and docker info) from the node:

[root@ip-10-0-13-15 ~]# docker version
Client:
 Version:      17.06.2-ce
 API version:  1.30
 Go version:   go1.9.6
 Git commit:   3dfb8343b139d6342acfd9975d7f1068b5b1c3d3
 Built:        Mon Jan 28 22:06:48 2019
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.2-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.9.6
 Git commit:   402dd4a/17.06.2-ce
 Built:        Mon Jan 28 22:07:35 2019
 OS/Arch:      linux/amd64
 Experimental: false
[root@ip-10-0-13-15 ~]# docker info
Containers: 13
 Running: 12
 Paused: 0
 Stopped: 1
Images: 37
Server Version: 17.06.2-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.94-89.73.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.503GiB
Name: ip-10-0-13-15.eu-west-1.compute.internal
ID: PVWT:EV6L:L543:5IU4:WIAB:IZPK:FIAE:3LLA:WV7F:GG5V:XRKW:JA4S
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: true

We forked the repo and made some changes (https://github.com/tiqets/amazon-eks-ami): we updated the Docker version and it seems to be working. Do you think this is related to Docker?
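
(For anyone wanting to check or bump Docker on a running Amazon Linux 2 node before rebuilding the AMI, a rough sketch; which package version you get depends on the configured repos:)

# Show the currently running Docker server version.
docker version --format '{{.Server.Version}}'
# Update the docker package and restart the daemon; the docker info output above shows
# live-restore enabled, so running containers should survive the restart.
sudo yum update -y docker && sudo systemctl restart docker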

Environment:

  • AWS Region: eu-west-1
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.11
  • AMI Version: amazon-eks-node-1.11-v20190211 (ami-0b469c0fef0445d29)
  • Kernel (e.g. uname -a): Linux ip-10-0-13-15.eu-west-1.compute.internal 4.14.94-89.73.amzn2.x86_64 #1 SMP Fri Jan 18 22:36:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /tmp/release on a node):
empty

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 11
  • Comments: 29 (2 by maintainers)

Most upvoted comments

Agreed, but this is present in most Kubernetes versions now, and even though there is no fix yet, I want to know what workarounds, if any, EKS has in place to solve this problem. This definitely affects Kubernetes deployments, and EKS as a product is directly affected.

Note: In our cluster, I debugged this a lot before finding these issues online, and I can say definitively that it is not related to CPU, memory, network, or disk. It might be related to too many events arriving too quickly.

I was experiencing the same: many nodes flapping NotReady in my cluster after upgrading to EKS 1.19, running m5.4xlarge with AMI v20210329 and an average of 50 pods per node. Confirmed this AMI shipped runc 1.0.0-rc93.

The problem seems resolved since updating the worker nodes to AMI release v20210414!

These settings should have default values; not protecting the kubelet is just asking for outages.

The following configuration (passed to the worker node user-data bootstrap) worked for me. I used to face this often but haven't had the issue in months now:

--kube-reserved cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi \
--system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
--eviction-hard memory.available<300Mi,nodefs.available<10%
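
(A minimal user-data sketch for passing these flags through the EKS-optimized AMI's bootstrap script; "my-cluster" is a placeholder and the quoting assumes the standard --kubelet-extra-args option of /etc/eks/bootstrap.sh:)

#!/bin/bash
# Hypothetical worker-node user data applying the reservations above.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--kube-reserved=cpu=250m,memory=0.5Gi,ephemeral-storage=1Gi --system-reserved=cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi --eviction-hard=memory.available<300Mi,nodefs.available<10%'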

@PDRMAN Hopefully it will do the same for you and others.

Hi all, be aware that the runc component (1.0.0-rc93) of containerd.io, which is used by Docker, will give you PLEG issues and nodes flapping between Ready and NotReady. I hope no one else loses a ton of hours tracking down the problem 🙂 Use another version, for example 1.0.0-rc92.
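
(A quick way to check which runc a node is actually running, assuming the Docker runtime used on these AMIs; the binary path may differ:)

# Docker reports the runc commit it was built against...
docker info 2>/dev/null | grep -i runc
# ...and the binary itself prints the release, e.g. "runc version 1.0.0-rc93".
runc --version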

@yuzujoe your issue could be related to #648, in which case reverting to a 1.18 node or an older 1.19 AMI may help.

There are 200+ issues in the Kubernetes project about PLEG problems, mostly attributed to resource exhaustion (out of memory, or too many events too quickly) and some deadlock situations. I have seen it once and ended up restarting the node, since there was no other option I could find.

It seems to be related to https://github.com/kubernetes/kubernetes/issues/45419 if your nodes are flapping between Ready/NotReady.
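
(To confirm whether nodes are really flapping, watching node status and recent NodeNotReady events is usually enough; a plain-kubectl sketch:)

# Watch node conditions change in real time.
kubectl get nodes -w
# List recent NotReady transitions recorded as events.
kubectl get events --all-namespaces --field-selector reason=NodeNotReady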