amazon-eks-ami: PLEG errors with v20221027, kernel 5.4.217-126.408.amzn2
What happened: We noticed an increased rate of PLEG errors while running our workload with the v20221027 AMI release. The same issue doesn't appear with v20220926.
What you expected to happen: PLEG errors shouldn't appear with the latest AMI.
How to reproduce it (as minimally and precisely as possible): This issue can be easily reproduced with high CPU/memory workloads.
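A hedged sketch of the kind of high CPU/memory workload that might surface this; the pod name, image, and stress flags are illustrative and not from the original report:

```bash
# Hypothetical reproduction: schedule a CPU/memory-heavy pod onto nodes running the
# v20221027 AMI (image and flags are illustrative placeholders)
kubectl run pleg-repro --image=polinux/stress --restart=Never -- \
  stress --cpu 4 --vm 2 --vm-bytes 1G --timeout 600s
```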
Anything else we need to know?:
So far it looks like an issue associated with the latest kernel version 5.4.217-126.408.amzn2.
Environment:
- AWS Region:
- Instance Type(s):
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`):
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`):
- AMI Version:
- Kernel (e.g. `uname -a`):
- Release information (run `cat /etc/eks/release` on a node):
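For reference, a sketch of gathering the requested environment details, assuming a cluster named `my-cluster` (placeholder) and shell access to an affected node:

```bash
# Cluster-level details
aws eks describe-cluster --name my-cluster --query cluster.platformVersion
aws eks describe-cluster --name my-cluster --query cluster.version

# Node-level details, run on an affected node
uname -a
cat /etc/eks/release
```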
The problem seems so widespread that AWS should issue an official statement. Can we expect something like that? I learned about this issue only from a support ticket, and only after some production clusters had already gone down.
We have been looking into this issue as well in our clusters and have been seeing singular containers on some nodes enter a state where no docker commands complete when interacting with the container. As mentioned, this was first seen in the `kernel-5.4.214-120.368.amzn2` version. I'll post some of our investigation here in case it is useful.

Node events show flapping between NodeReady and NodeNotReady. That NodeNotReady status seems to come from the K8s PLEG failing. All docker commands involving the affected pod hang, but others complete just fine.
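Roughly how one might confirm those first symptoms; the node name and container ID below are hypothetical placeholders:

```bash
# Placeholders: substitute your own node name and container ID
NODE=ip-10-0-1-23.ec2.internal
CID=0123456789ab

# Node events show flapping between NodeReady and NodeNotReady
kubectl get events -A --field-selector involvedObject.name="$NODE" --sort-by=.lastTimestamp

# On the node: docker commands against the affected container hang,
# while the same command against other containers returns promptly
timeout 10 docker inspect "$CID" || echo "docker inspect appears hung"
```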
The affected container seems to have an additional `runc` process compared to other working containers. We captured an `strace` of that `runc` process, and we then see that after running `strace` on it, the container is automatically cleaned up and the node seems to go back to a Ready state. At this point I hopped on another broken node, since that one was cleaned up.
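A sketch of how the extra `runc` process might be located and traced; the container ID and PID are placeholders:

```bash
CID=0123456789ab   # placeholder: ID of the affected container
RUNC_PID=12345     # placeholder: PID of the lingering runc process

# Compare runc processes between the affected container and healthy ones;
# the affected one shows an extra runc process
ps -ef | grep "[r]unc" | grep "$CID"

# Attach strace to that runc process; in our case this was enough to get
# the container cleaned up and the node back to Ready
sudo strace -f -p "$RUNC_PID"
```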
Running `strace` on the containerd-shim process, we see it loop through `FUTEX_WAIT_PRIVATE` calls indefinitely. Generating a stack trace of docker produces an extremely large file.
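For reference, a sketch of those two steps, assuming dockerd's usual SIGUSR1 stack-dump behavior; the PID and the dump path are assumptions that should be checked on the node:

```bash
SHIM_PID=23456   # placeholder: PID of the containerd-shim for the affected container

# The shim spins on FUTEX_WAIT_PRIVATE indefinitely
sudo strace -f -p "$SHIM_PID"

# Ask dockerd for a goroutine stack dump (this is what produced the very large file);
# on Amazon Linux 2 the dump commonly lands under /var/run/docker/
sudo kill -SIGUSR1 "$(pidof dockerd)"
ls -lh /var/run/docker/goroutine-stacks-*.log
```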
Inside the stack trace we see what look like thousands of similar goroutines. The oldest `semacquire` among these was 834 minutes, so we grabbed the other goroutines in that dump from the same time. The two `syscall` routines there were blocked in `syscall.Open`. As far as I could find, there was no way to display which file was actually being opened by that `syscall.Open`, and this is where we discovered the kernel bump in our AMI, which eventually led us to this issue and the corresponding comment in the aws repo pinning the kernel version: https://github.com/awslabs/amazon-eks-ami/blob/1e89a4483c5bd2c34fcdf89d625fbd56085d83cf/scripts/upgrade_kernel.sh#L16

In order to trigger this issue we've been executing a `kubectl rollout restart ...` of one of our daemonsets to tear down a pod on every node. Hopefully some of this may be useful in tracking down the issue.
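A sketch of the stack-dump inspection and the reproduction step described above; the dump path, namespace, and daemonset name are placeholders:

```bash
# Count goroutines stuck in semacquire inside the docker stack dump
grep -c 'semacquire' /var/run/docker/goroutine-stacks-*.log

# Tear down one pod per node by restarting a daemonset
kubectl -n kube-system rollout restart daemonset aws-node
```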
AmazonLinux released kernel `5.4.219-126.411.amzn2` yesterday, which we expect to solve the issues described here. We've tested it with our reproduction of the issue, and it seems to work. To solve this issue, you can do one of the following:
- Switch to `v20221101`, which does not have the problematic kernel patch, so it should not have the same issue reported here. It's just running an older kernel version.
- Wait for `v20221103` or later. We will update this repo as soon as those AMIs are released. At the moment, it's looking like 11/7/22 is when they'll be available.
- If you want the latest kernel on `v20221101`, you'll need to unpin the kernel. I don't recommend this approach unless it's very important that you get on the latest kernel.

@ravisinha0506 since you're tracking this, we're seeing the same issues on AMIs built on `kernel-5.4.214-120.368.amzn2` as well.

https://alas.aws.amazon.com/AL2/ALASKERNEL-5.4-2022-036.html
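For the unpinning option, a minimal sketch on a running node, assuming the AMI pins the kernel with the yum versionlock plugin as the linked `upgrade_kernel.sh` suggests; verify the pin on your node before running this:

```bash
# Show the current kernel pin (assumes the yum versionlock plugin is in use)
sudo yum versionlock list | grep kernel

# Remove the pin, pull the fixed kernel once it is in the repos, and reboot
sudo yum versionlock delete 'kernel*'
sudo yum update -y kernel
sudo reboot
```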
@kevinsimons-wf I have been seeing similar issues and followed a similar debug process, but I have also seen the issue happen without any container in a bad state.

In all cases though (maybe you can check this on your side), before we start seeing the PLEG error we see a bunch of log lines containing the string `fsHandler.go:119] failed to collect filesystem stats - rootDiskErr: could not stat`, and right after those the PLEG errors appear.

Finally (I have not been able to verify it is the cause, but it could make sense), this error seems to happen within under 10 minutes of a new network interface being added, visible in `dmesg -T`. It could be that there was a blip on the network that caused kubelet to fail reading the disk and enter this bad state. At the same time, the same blip could be causing other processes to enter a bad state (I actually had a bunch of aws-node subprocesses in `Z` state).

Same as you, hopefully this might help solve the issue…
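A sketch of the checks described here, assuming kubelet runs as a systemd unit on the node; the grep patterns are taken from this comment, not verified output:

```bash
# kubelet log lines that precede the PLEG errors
journalctl -u kubelet | grep 'failed to collect filesystem stats'

# Recently attached network interfaces (interface naming varies by instance type)
dmesg -T | grep -iE 'eth|ens'

# Processes stuck in the Z (zombie) state, e.g. aws-node subprocesses
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/'
```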
Sorry, I was OOTO, so I wasn't able to update here. v20221104 was released with the latest kernel, `5.4.219-126.411.amzn2`, which resolves this issue.

When issues like this happen, it is our standard practice to root-cause the issue, mitigate, and review our processes to avoid similar issues in the future. In this case, that process is ongoing. We have developed a test that reproduces this particular issue and have added it to our AMI release process. As for more details on the root cause, I'll post here if AmazonLinux releases anything with more details. In a nutshell, a commit on the kernel was causing some workloads to hang, and it was reverted in the latest kernel.
Same here…
Prior to this latest release we attempted to upgrade the kernel on top of the AMI Release v20220926. We experienced instability when doing this. So we decided to wait for this latest release with the kernel upgrades that patch the latest CVEs. Unfortunately we experienced the same issues. We are reasonably confident this is due to the kernel upgrade.
The issues we are seeing now are mostly related to readiness probes using exec commands: they are failing at a much higher rate. It seems to be either a latency issue or some sort of race condition causing the failures. It is also noteworthy that we see these failures on high-volume pods/nodes.
Thanks for the quick reply. I didn't see those processes on my nodes. Today I installed an efs-csi driver update that was itself quickly superseded by a new release the same day; perhaps that was causing my issues. I spun up all new nodes in my clusters and things seem stable now.
Is it written anywhere what the root cause was and what the fix was? I'm interested.
`5.4.219-126.411.amzn2` seems available with release v20221104.

The latest AMI doesn't have this kernel 5.4.219-126.411.amzn2.