amazon-eks-ami: Nodes become unresponsive and don't recover, with soft lockup error
Status Summary
This section added by @mmerkes
The AWS EKS team has a consistent repro and has engaged with the AmazonLinux team to root cause the issue and fix it. AmazonLinux has merged a patch that solves this issue in our repro and should work for customers, and that patch is now available via `yum update`. Once you've updated your kernel and rebooted your instance, you should be running kernel.x86_64 0:4.14.203-156.332.amzn2 (or greater). All EKS optimized AMIs built from Packer version v20201112 or later will include this patch. Users have two options for fixing this issue:
- Upgrade nodes to use the latest EKS optimized AMI
- Patch your nodes with `yum update`
Here are the commands you need to patch your instances:
sudo yum update kernel
sudo reboot
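After the reboot, you can confirm the node picked up the patched kernel; the expected version is the one quoted in the summary above:

```bash
# Should print 4.14.203-156.332.amzn2.x86_64 or newer
uname -r
```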
Original Issue
This original content is from @cshivashankar
What happened: A node in the cluster becomes unresponsive, and the pods running on it also become unresponsive. As per the analysis and logs provided in AWS Case 6940959821, this is observed when IOPS are high and a soft lockup occurs, which causes the node to become unresponsive. Further investigation might be required.
What you expected to happen: The node should not crash or become unresponsive; if it does, the control plane should identify it and mark it as not ready. The state should be either that the node is ready and working properly, or that the node is unresponsive, marked not ready, and eventually removed from the cluster.
How to reproduce it (as minimally and precisely as possible): As per the analysis in AWS Case 6940959821, the issue can be reproduced by driving higher IOPS than the EBS volume's capacity for a sustained amount of time.
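For illustration only (this is not the reproducer from the AWS case), one generic way to drive sustained write IOPS well past a gp2 volume's baseline is a small fio job; the parameters below are arbitrary examples:

```bash
# fio is not preinstalled; on the EKS AMI: sudo yum install -y fio
# Random 4k writes against the root volume for 10 minutes.
fio --name=ebs-pressure --directory=/var/tmp \
    --rw=randwrite --bs=4k --size=1G --numjobs=4 \
    --ioengine=libaio --iodepth=32 \
    --time_based --runtime=600 --group_reporting
```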
Anything else we need to know?:
This issue has only been observed recently, and I want to rule out whether it is due to using the 1.14 AMI, as we never observed this issue on 1.13. Is there a kernel bug that I am hitting? To build the AMI, I cloned the “amazon/aws-eks-ami” repo and made the following changes:
1. Installed Zabbix agent
2. Ran the kubelet with the `--allow-privileged=true` flag, as I was getting issues with cAdvisor.
So the AMI being used is practically the same as the AWS EKS AMI.
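For reference, on the EKS optimized AMI the kubelet flag from step 2 would typically be passed through the AMI's bootstrap script rather than by editing the kubelet unit directly; a minimal sketch, with a placeholder cluster name:

```bash
# /etc/eks/bootstrap.sh ships with the EKS optimized AMI;
# "my-cluster" is a placeholder for your cluster name.
sudo /etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--allow-privileged=true'
```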
Changes mentioned in the following comment
Logs can be accessed in the AWS Case mentioned above.
Environment:
- AWS Region: us-east-1
- Instance Type(s): r5, c5 types
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): "eks.9"
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): "1.14"
- AMI Version: 1.14
- Kernel (e.g. `uname -a`): Linux <IP ABSTRACTED> 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run `cat /etc/eks/release` on a node):
$ cat /etc/eks/release
BASE_AMI_ID="ami-08abb3d74e734d551"
BUILD_TIME="Mon Mar 2 17:21:42 UTC 2020"
BUILD_KERNEL="4.14.165-131.185.amzn2.x86_64"
ARCH="x86_64"
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 10
- Comments: 83 (26 by maintainers)
We’ve been able to root cause the issue with the AmazonLinux team.
When containers with a write-heavy workload run on IOPS-constrained EBS volumes, EBS starts to throttle IOPS. The kernel is unable to flush dirty pages to disk because of this throttling. The dirty page limit is even more constrained for each cgroup/container and is directly proportional to the memory requested by the container.
When the number of dirty pages for a container increases, the kernel tries to flush the pages to disk. In the 4.14 kernel, the code which flushes these pages to disk does wasteful work building up the queue of work items for pages to flush instead of actually flushing them. This causes the soft lockup errors and explains why we don't see any I/O going to disk during this event. We have found the patch in kernel 4.15.x and onwards which fixes this issue. We are working on backporting this patch to the 4.14 kernel so it can be released with an EKS optimized AMI.
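For anyone who wants to watch this pressure build on a node, the kernel's dirty-page accounting can be inspected directly. A rough sketch, assuming the cgroup v1 memory controller used on these 4.14-based AMIs; the kubepods path is an assumption and depends on your cgroup driver and the pod's QoS class:

```bash
# Global dirty/writeback page counts and thresholds
grep -E 'nr_dirty|nr_writeback' /proc/vmstat
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Per-cgroup (per-container) dirty accounting; adjust the path to the
# container's actual memory cgroup
grep -E 'dirty|writeback' /sys/fs/cgroup/memory/kubepods/memory.stat
```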
Hey!
A small update: we have been running the solution in production for several weeks now and the node freeze problem has not appeared.
We made several improvements and learned from mistakes:
We are using the Terraform AWS EKS module to create a cluster. Here is how our final solution looks now:
lib/cluster-custom-userdata.tpl:
cluster.tf:
Huge kudos to AWS engineer @raonitimo who was assisting us with this issue, working with him was a pleasure and the most enjoyable experience with AWS support by far.
Regards, Deniss Rostkovskis
I’ve shared my team’s reproducer with the AWS team. I cannot share the gzipped tarball itself here, but it’s essentially a few hundred MB of small files, in many directories. Running 8 replicas of `tar` extraction in a tight loop reliably produces the lockup on m5.2xl instances running 4.14, but not 5.4.
We had the same problem upgrading our worker nodes to the Amazon EKS 1.15 AMI. We tried:
`amazon-eks-node-1.15-v20200507` and `amazon-eks-node-1.15-v20200423`, and both had the same problem.
We have pods with `initContainers` copying about 1 GB of small files (a WP install), and during the copy, in the `Init` phase, the worker nodes hang, becoming completely unresponsive. Syslog on the worker node reports:
As a workaround, we took the official Ubuntu EKS 1.15 image `ami-0f54d80ab3d460266`, added the `nfs-common` package to it to manage the EFS, and rebuilt a new custom AMI from it. Note: since we use `kiam`, we had to change the CA certificate path, because its location in Ubuntu is different from the one in the AmazonLinux image.
@rphillips @cshivashankar Here is a thread that discusses the issue, and I believe this is the upstream patch that resolves the issue.
The reproducer runs 8 replicas of a container which downloads and extracts a 22 MB (277 MB uncompressed) gzipped tarball in a tight loop. At first I figured this might be related to the use of an emptyDir, as most of the pods which seemed to trigger this issue used emptyDirs, but the reproducer was able to reproduce the node failures both with and without emptyDirs. I was also able to reproduce the lockup behavior on m5.2xls and m4.2xls, although the log messages were different between the two.
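For readers who want to try something similar, here is a minimal sketch of that kind of reproducer, not the original team's manifest; the image tag and the tarball URL are placeholders you would need to replace with an archive containing many small files:

```bash
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tar-lockup-repro
spec:
  replicas: 8
  selector:
    matchLabels:
      app: tar-lockup-repro
  template:
    metadata:
      labels:
        app: tar-lockup-repro
    spec:
      containers:
      - name: extractor
        image: busybox:1.32
        command: ["sh", "-c"]
        args:
        - |
          # Download and extract the tarball in a tight loop
          while true; do
            wget -q -O /tmp/files.tar.gz http://example.com/files.tar.gz
            rm -rf /scratch/*
            tar -xzf /tmp/files.tar.gz -C /scratch
          done
        volumeMounts:
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: scratch
        emptyDir: {}   # the lockup reproduced both with and without emptyDir
EOF
```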
Upgrading the kernel from 4.14 to 5.4 seems to resolve the issue for customers. The one risk of the `amazon-linux-extras install kernel-ng` method is that `kernel-ng` refers to the next generation kernel, so it's not going to guarantee you'll always be on the 5.4 kernel. At some point in the future, it could get bumped to 5.9. Also, EKS doesn't run conformance tests on the 5.4 kernel, so we can't provide the same guarantees, though it's officially supported by AmazonLinux.
We've been struggling to get a repro of the issue, but we've been working with the AmazonLinux team and have found a few kernel patches that we suspect may fix the issue. We're working on a plan to get those out and figuring out how to test whether they resolve the issue. I will post an update here when I have more information.
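If you do choose the `kernel-ng` route mentioned above, the upgrade itself is short; a sketch for an Amazon Linux 2 node, with the caveat that the kernel version you land on depends on what `kernel-ng` currently points to:

```bash
# Installs the "next generation" kernel (5.4 at the time of this thread)
sudo amazon-linux-extras install kernel-ng
sudo reboot
# After reboot, confirm the running kernel
uname -r
```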
@deniss-rostkovskis We think we're hitting this issue too in our prod cluster. Are there any telltale metrics or logs that we could look for to confirm it's almost certainly the same thing? (There's quite a bit on this thread now, so I wanted to confirm.)
This seems to work after testing it a few times.
@cshivashankar for cost reasons [or our perceptions of the costs], we didn’t pursue IOPS-provisioned EBS volumes, so I don’t have any data to provide. I.e. we were previously just using vanilla ‘gp2’ drives.
I think that one could use IOPS-provisioned (i.e. “io1”) drives instead of following our present solution. That would be a bit simpler.
Ephemeral SSDs associated with ephemeral instances, as in our use case, are a nice match. It seems likely that our (new) SSD drives are overprovisioned with respect to size, in light of the available CPU/RAM/SSD size ratios. Other users might use more or less disk than we do.
BTW we down-sized the root EBS partition to 20GB but even 10GB would probably be sufficient.
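For anyone wanting to experiment with a similar instance-store setup, a hypothetical userdata fragment might look like the following; the NVMe device name and mount point are assumptions, not the poster's actual configuration, and would need to match your instance type and container runtime:

```bash
#!/bin/bash
# Hypothetical example: format the local NVMe instance-store disk and
# mount it where the container runtime keeps its writable layers, so
# heavy container I/O hits the local SSD instead of the gp2 root volume.
DEV=/dev/nvme1n1        # assumption: first instance-store device
MNT=/var/lib/docker     # assumption: docker is the container runtime

mkfs -t xfs "$DEV"
mkdir -p "$MNT"
mount "$DEV" "$MNT"
echo "$DEV $MNT xfs defaults,noatime 0 2" >> /etc/fstab
```

This must run before the container runtime and kubelet start, e.g. early in the node's userdata, ahead of the bootstrap script.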