amazon-eks-ami: Nodes become unresponsive and don't recover, with a soft lockup error

Status Summary

This section was added by @mmerkes

The AWS EKS team has a consistent repro and has engaged with the AmazonLinux team to root cause the issue and fix it. AmazonLinux has merged a patch that solves this issue in our repro and should work for customers, and it is now available via yum update. Once you’ve updated your kernel and rebooted your instance, you should be running kernel.x86_64 0:4.14.203-156.332.amzn2 (or greater). All EKS optimized AMIs built from Packer version v20201112 or later will include this patch. Users have two options for fixing this issue:

  1. Upgrade your nodes to use the latest EKS optimized AMI
  2. Patch your nodes with yum update

Here are the commands you need to patch your instances:

sudo yum update kernel
sudo reboot
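
After the reboot, you can confirm that the node picked up the patched kernel; the exact version may be newer than the minimum listed above:

uname -r
# should print 4.14.203-156.332.amzn2.x86_64 or a later version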

Original Issue

This is the original content from @cshivashankar

What happened: A node in the cluster becomes unresponsive, and the pods running on it also become unresponsive. As per the analysis and logs provided in AWS Case 6940959821, this is observed when sustained high IOPS triggers a soft lockup, which causes the node to become unresponsive. Further investigation might be required.

What you expected to happen: The node should not crash or become unresponsive; if it does, the control plane should identify it and mark it as not ready. The state should be either that the node is ready and working properly, or that the node is unresponsive, marked not ready, and eventually removed from the cluster.

How to reproduce it (as minimally and precisely as possible): As per the analysis in AWS Case 6940959821, the issue can be reproduced by sustaining IOPS higher than the EBS volume's capacity for an extended period of time.
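
For reference, one way to generate that kind of sustained small-write load from a pod is a tight buffered-write job with fio; this is only a sketch of the general approach, assuming fio is available in the image, and the job size, runtime, and target path below are illustrative rather than the exact values from the case:

# buffered 4k random writes, enough to exceed a gp2 volume's baseline IOPS for 30 minutes
fio --name=dirty-pages --ioengine=libaio --direct=0 --rw=randwrite \
    --bs=4k --size=1G --numjobs=4 --time_based --runtime=1800 \
    --directory=/var/tmp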

Anything else we need to know?: This issue has only been observed recently, and I want to rule out whether it is due to using the 1.14 AMI, as we never observed this issue on 1.13. Is there a kernel bug that I am hitting? For building the AMI, I cloned the “amazon/aws-eks-ami” repo and made the following changes: 1. Installed the Zabbix agent 2. Ran the kubelet with the "--allow-privileged=true" flag, as I was getting issues with cAdvisor. So the AMI being used is practically the same as the AWS EKS AMI. The changes are described in the following comment.

Logs can be accessed in the AWS case mentioned above.

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): r5 and c5 types
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): “eks.9”
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): “1.14”
  • AMI Version: 1.14
  • Kernel (e.g. uname -a): Linux <IP ABSTRACTED> 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
$ cat /etc/eks/release
BASE_AMI_ID="ami-08abb3d74e734d551"
BUILD_TIME="Mon Mar  2 17:21:42 UTC 2020"
BUILD_KERNEL="4.14.165-131.185.amzn2.x86_64"
ARCH="x86_64"

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 83 (26 by maintainers)

Most upvoted comments

We’ve been able to root cause the issue with the AmazonLinux team.

When containers with write-heavy workloads run on IOPS-constrained EBS volumes, EBS starts to throttle IOPS. Because of this throttling, the kernel is unable to flush dirty pages to disk. The dirty-page limit is even more constrained for each cgroup/container and is directly proportional to the memory requested by the container.

When the number of dirty pages for a container increases, the kernel tries to flush the pages to disk. In the 4.14 kernel, the code that flushes these pages does wasteful work building up the queue of work items of pages to flush instead of actually flushing them. This causes the soft lockup errors and explains why we don’t see any I/O going to disk during this event. We have found the patch in kernel 4.15.x and onwards that fixes this issue.
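
One rough way to spot a node heading into this state is to watch the kernel log for the lockup signature and the per-cgroup dirty/writeback counters; the cgroup path below is an example (it depends on your kubelet cgroup driver and the pod in question), so adjust it as needed:

# kernel log: the soft lockup shows up in the writeback path (wb_workfn / move_expired_inodes)
dmesg | grep -i "soft lockup"

# cgroup v1 memory controller: dirty and writeback bytes for the kubepods hierarchy
# (replace the path with the cgroup of the suspect pod)
grep -E "dirty|writeback" /sys/fs/cgroup/memory/kubepods/memory.stat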

We are working on backporting this patch to the 4.14 kernel so it can be released with an EKS optimized AMI.

Hey!

A small update: we have been running the solution in production for several weeks now and the node freeze problem has not reappeared.

We made several improvements and learned from mistakes:

  • although both docker and kubelet support overriding their working directory (--data-root and --root-dir), this doesn’t work well: other tools might assume the default locations, which can lead to unexpected issues, for example
  • using the cloud-init tool, we can instruct AWS to execute bootstrapping scripts earlier in the startup sequence, making sure the NVMe disk is prepared and the directories are mounted before docker and kubelet even start

We are using the terraform AWS EKS module to create the cluster. Here is what our final solution looks like now:

lib/cluster-custom-userdata.tpl:

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

cloud-init-per once installreqs yum -y install lvm2

cloud-init-per once pvcreate pvcreate /dev/nvme1n1
cloud-init-per once vgcreate vgcreate eks /dev/nvme1n1

cloud-init-per once lvdocker lvcreate --name docker --size 200GB eks
cloud-init-per once lvkubelet lvcreate --name kubelet -l 100%FREE eks

cloud-init-per once format_docker mkfs.xfs /dev/eks/docker
cloud-init-per once format_kubelet mkfs.xfs /dev/eks/kubelet

cloud-init-per once create_pods_dir mkdir -p /var/lib/kubelet/pods

cloud-init-per once mount_docker mount -t xfs -o noatime,inode64,noquota /dev/eks/docker /var/lib/docker
cloud-init-per once mount_kubelet mount -t xfs -o noatime,inode64,noquota /dev/eks/kubelet /var/lib/kubelet/pods

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash -xe

# Allow user supplied pre userdata code
${pre_userdata}

# Bootstrap and join the cluster
/etc/eks/bootstrap.sh --b64-cluster-ca '${cluster_auth_base64}' --apiserver-endpoint '${endpoint}' ${bootstrap_extra_args} --kubelet-extra-args "${kubelet_extra_args}" '${cluster_name}'

# Allow user supplied userdata code
${additional_userdata}

--==BOUNDARY==--

cluster.tf:

module "cluster" {
  source  = "terraform-aws-modules/eks/aws"
  version = "12.2.0"

  cluster_version = "1.17"
  
  [...]
  
  worker_groups = [
    {
      userdata_template_file = file("${path.module}/lib/cluster-custom-userdata.tpl")
              
      [...]
    }
  ]
}

Huge kudos to AWS engineer @raonitimo, who assisted us with this issue; working with him was a pleasure and by far the most enjoyable experience we have had with AWS support.

Regards, Deniss Rostkovskis

I’ve shared my team’s reproducer with the AWS team. I cannot share the gzipped tarball itself here, but it’s essentially a few hundred MB of small files in many directories. Running 8 replicas of tar extraction in a tight loop reliably produces the lockup on m5.2xl instances running 4.14, but not 5.4.

We had the same problem when upgrading our worker nodes to the Amazon EKS 1.15 AMI. We tried:

  • amazon-eks-node-1.15-v20200507
  • amazon-eks-node-1.15-v20200423

and both had the same problem.

We have pods with initContainers copying about 1 GB of small files (a WP install), and during the copy, in the Init phase, the worker nodes hang, becoming completely unresponsive.

Syslog on the worker node reports:

[  288.053638] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [kworker/u16:2:62]
[  288.059141] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache veth iptable_mangle xt_connmark nf_conntrack_netlink nfnetlink xt_statistic xt_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay sunrpc crc32_pclmul ghash_clmulni_intel pcbc mousedev aesni_intel aes_x86_64 crypto_simd evdev glue_helper psmouse cryptd button ena ip_tables x_tables xfs libcrc32c nvme crc32c_intel nvme_core ipv6 crc_ccitt autofs4
[  288.185290] CPU: 5 PID: 62 Comm: kworker/u16:2 Tainted: G             L  4.14.177-139.253.amzn2.x86_64 #1
[  288.191527] Hardware name: Amazon EC2 m5.2xlarge/, BIOS 1.0 10/16/2017
[  288.195344] Workqueue: writeback wb_workfn (flush-259:0)
[  288.198708] task: ffff888184670000 task.stack: ffffc90003360000
[  288.202280] RIP: 0010:move_expired_inodes+0xff/0x230
[  288.205542] RSP: 0018:ffffc90003363cc8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff10
[  288.211042] RAX: 00000000ffffa056 RBX: ffffc90003363ce8 RCX: dead000000000200
[  288.215031] RDX: 0000000000000000 RSI: ffffc90003363ce8 RDI: ffff8887068963c8
[  288.219040] RBP: ffff888802273c70 R08: ffff888706896008 R09: 0000000100400010
[  288.223047] R10: ffffc90003363e10 R11: 0000000000025400 R12: ffff8888227f6800
[  288.227062] R13: ffff888706896788 R14: ffffc90003363d78 R15: ffff888706896008
[  288.231071] FS:  0000000000000000(0000) GS:ffff888822740000(0000) knlGS:0000000000000000
[  288.236761] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  288.240282] CR2: 00007f5b703af570 CR3: 000000000200a005 CR4: 00000000007606e0
[  288.244306] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  288.248328] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  288.252351] PKRU: 55555554
[  288.254821] Call Trace:
[  288.257178]  queue_io+0x61/0xf0
[  288.259798]  wb_writeback+0x258/0x300
[  288.262600]  ? wb_workfn+0xdf/0x370
[  288.265323]  ? __local_bh_enable_ip+0x6c/0x70
[  288.268370]  wb_workfn+0xdf/0x370
[  288.271040]  ? __switch_to_asm+0x41/0x70
[  288.273944]  ? __switch_to_asm+0x35/0x70
[  288.276845]  process_one_work+0x17b/0x380
[  288.279728]  worker_thread+0x2e/0x390
[  288.282509]  ? process_one_work+0x380/0x380
[  288.285482]  kthread+0x11a/0x130
[  288.288134]  ? kthread_create_on_node+0x70/0x70
[  288.291201]  ret_from_fork+0x35/0x40
[  288.293964] Code: b9 01 00 00 00 0f 44 4c 24 04 89 4c 24 04 49 89 c4 48 8b 45 00 48 39 c5 74 1a 4d 85 f6 4c 8b 6d 08 0f 84 67 ff ff ff 49 8b 45 e0 <49> 39 06 0f 89 5a ff ff ff 8b 44 24 04 85 c0 75 51 48 8b 44 24 
[  293.673669] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 219.
[  293.679790] ena 0000:00:05.0 eth0: Found a Tx that wasn't completed on time, qid 0, index 220.

As a workaround, we took the official Ubuntu EKS 1.15 image ami-0f54d80ab3d460266, added the nfs-common package to it (needed to manage EFS), and rebuilt a new custom AMI from it.

Note: because we use kiam, we had to change the CA certificate path, since the location in Ubuntu is different from the one in the AmazonLinux image.

The reproducer runs 8 replicas of a container which downloads and extracts a 22 MB (277 MB uncompressed) gzipped tarball in a tight loop. At first I figured this might be related to the use of an emptyDir, as most of the pods which seemed to trigger this issue used emptyDirs, but the reproducer was able to reproduce the node failures both with and without emptyDirs. I was also able to reproduce the lockup behavior on m5.2xls and m4.2xls, although the log messages differed between the two.
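
For anyone trying to build a similar reproducer, the container's loop is conceptually as simple as the sketch below; the archive path is a placeholder and the exact file mix matters (many small files across many directories), so treat this as an illustration rather than the shared reproducer itself:

#!/bin/bash
# tight extract loop: many small buffered writes build up dirty pages faster than
# a throttled gp2 volume can flush them
while true; do
  rm -rf /work/out && mkdir -p /work/out
  tar -xzf /work/files.tar.gz -C /work/out   # ~22 MB archive, ~277 MB of small files
done

Running 8 replicas of something like this (for example, as a Deployment with replicas: 8) on m5.2xlarge instances with the 4.14 kernel was enough to trigger the lockup.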

Upgrading the kernel from 4.14 to 5.4 seems to resolve the issue for customers. The one risk of the amazon-linux-extras install kernel-ng method is that kernel-ng refers to the next-generation kernel, so it’s not guaranteed you’ll always be on the 5.4 kernel; at some point in the future, it could be bumped to 5.9. Also, EKS doesn’t run conformance tests on the 5.4 kernel, so we can’t provide the same guarantees, though it is officially supported by AmazonLinux.
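
If you go the amazon-linux-extras route, you can first check which kernel topics your node's Amazon Linux 2 release offers; which topics show up depends on the release, so treat this as a quick sanity check:

amazon-linux-extras list | grep -i kernel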

We’ve been struggling to get a repro of the issue, but we’ve been working with the AmazonLinux team and have found a few kernel patches that we suspect may fix the issue. We’re working on a plan to get those out and figuring out how to test if they resolve the issue or not. I will post an update here when I have more information.

@deniss-rostkovskis We think we’re hitting this issue too in our prod cluster. Are there any telltale metrics or logs that we could look for to confirm it’s almost certainly the same thing? (there’s quite a bit on this thread now so wanted to confirm)

Can those of you experiencing this issue please try the following on your nodes, reboot, and see if you’re still experiencing the issue?

sudo amazon-linux-extras install kernel-ng
sudo reboot

This seems to work after testing it a few times.
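
After the reboot, you can check what you actually ended up on; the exact 5.x version depends on what kernel-ng maps to at the time:

uname -r          # running kernel
rpm -qa kernel    # kernel packages installed on the node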

@cshivashankar For cost reasons [or our perception of the costs], we didn’t pursue IOPS-provisioned EBS volumes, so I don’t have any data to provide. In other words, we were previously just using vanilla ‘gp2’ drives.

I think one could use IOPS-provisioned (i.e. “io1”) drives instead of following our present solution. That would be a bit simpler.
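
If you did want to try that on an existing node, one way to experiment is to convert the root volume in place with the AWS CLI; this is only a sketch of the idea, not something we validated, and the volume ID and IOPS value are placeholders:

# vol-0123456789abcdef0 is a placeholder; use the node's actual root volume ID
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --volume-type io1 --iops 3000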

Ephemeral SSDs are a nice match for ephemeral instances like the ones in our use case. It seems likely that our (new) SSD drives are overprovisioned with respect to size, given the available CPU/RAM/SSD size ratios. Other users might use more or less disk than we do.

BTW, we down-sized the root EBS volume to 20GB, but even 10GB would probably be sufficient.