amazon-eks-ami: Pods stuck in terminating state after AMI amazon-eks-node-1.16.15-20201112

What happened: Since upgrading to AMI 1.16.15-20201112 (from 1.16.13-20201007), we see many Pods getting stuck in the Terminating state. All of the affected Pods have readiness/liveness probes of type exec.

What you expected to happen: The Pods should be deleted.

How to reproduce it (as minimally and precisely as possible): Apply the following YAML to create a deployment with exec type probes for readiness/liveness:

$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
EOF

and once all Pods become ready, delete the Deployment:

$ kubectl delete deployment nginx-deployment
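
On the affected AMI, the deleted Pods then linger in Terminating, which can be watched with:

$ kubectl get pods -l app=nginx --watch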

Anything else we need to know?: We also tried the above with a 1.17 EKS cluster (AMI release version 1.17.12-20201112) and it exhibits the same behavior.

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): m5d.xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
  • AMI Version: 1.16.15-20201112

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 45
  • Comments: 47 (15 by maintainers)

Most upvoted comments

All managed nodegroups on release version 20201112 can now be upgraded to 20201117. If you create new nodegroups, they will automatically get the 20201117 release version. Please let us know if you see any issues.
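
For anyone doing this by hand, a managed nodegroup can be moved to the latest release with the EKS API; a minimal sketch (cluster and nodegroup names are placeholders, and omitting --release-version should pick up the latest release for the cluster version):

aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup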

@rtripat I can also confirm that the issue has been resolved for us since upgrading to version 20201117. Thanks for fixing this. I guess this issue can be closed now.

However, given the magnitude of this, I think you should increase the priority of aws/containers-roadmap#810. It became apparent that users couldn’t follow your proposed workaround of rolling back to version 20201007 (#563 (comment)) because there is no way to choose the version of the AMI to deploy in managed nodegroups.

We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this specific case, which creates a container with a HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating EKS Managed Nodegroups at any AMI release version and to mark them as Degraded if they are on recalled AMI release versions.
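
For illustration, a rough sketch of that kind of check run against the Docker engine directly (container name, image, and timings here are illustrative, not the actual EKS test):

#!/bin/bash
set -euo pipefail

# start a container whose engine-level health check mirrors an exec probe
docker run -d --name probe-check \
  --health-cmd true --health-interval 1s --health-timeout 1s \
  nginx:1.14.2

# monitor its health status for a period of time
for _ in $(seq 1 30); do
  docker inspect -f '{{.State.Health.Status}}' probe-check
  sleep 1
done

# ensure termination and cleanup actually complete (no stuck exec processes)
timeout 60 docker stop probe-check
docker rm probe-check
echo "terminated and cleaned up"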

@rtripat Curious to know how this got past your QA and testing cycles? It doesn't seem like such a deep-rooted corner case that it couldn't be caught. This impacted my production deployments big time today. 😦

This seems to be related to https://github.com/moby/moby/issues/41352#issuecomment-728746859. Can someone run one of the scripts below on their node (if it's not a production cluster) and let me know whether it fixes the issue? I tried on a couple of my worker nodes, and both upgrading and downgrading containerd seem to fix the issue. I'm just trying to narrow down what might have caused this.

cat << 'EOF' > upgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
# stop the container runtime stack before swapping binaries
systemctl stop docker
systemctl stop containerd
# fetch containerd 1.4.1 and overwrite the installed binaries
wget https://github.com/containerd/containerd/releases/download/v1.4.1/containerd-1.4.1-linux-amd64.tar.gz
tar xvf containerd-1.4.1-linux-amd64.tar.gz
cp -f bin/c* /bin/
# bring everything back up and verify it is healthy
systemctl start containerd
systemctl start docker
systemctl restart kubelet
systemctl status containerd
systemctl status docker
systemctl status kubelet
docker version
docker ps
EOF
chmod +x upgrade-containerd.sh
sudo ./upgrade-containerd.sh

or

cat << 'EOF' > downgrade-containerd.sh
#!/bin/bash
set -eo pipefail
docker ps
# pin containerd back to the version shipped with the 20201007 AMI
yum downgrade -y containerd-1.3.2-1.amzn2.x86_64
systemctl restart docker
systemctl restart kubelet
docker ps
EOF
chmod +x downgrade-containerd.sh
sudo ./downgrade-containerd.sh

Last Saturday, we upgraded our clusters in 4 production regions (AP, AU, EU, US) from v1.14 to v1.18, and nightmares followed. The issue caused many pods in our ZooKeeper clusters to get stuck in the "Terminating" state and affected other clusters as well (Kafka clusters, SolrCloud clusters). Running "kubectl delete pod --force --grace-period=0 xxx" sometimes caused filesystem corruption. We did our best to keep our systems up and running, but it was a bad EKS cluster upgrade experience. Positive things:

  1. The issue is fixed
  2. With version 1.18, we get one more happy year without needing to upgrade our EKS clusters 😃

Not sure if it's similar or not, but we are experiencing an issue on EKS 1.15 (eks.4) with AMI version 1.15.12-20201112 where the aws-node pods repeatedly produce Kubernetes events with the following message. We do not see this on the v20201007 AMI:

Message:             Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

We were seeing the same things mentioned above. The new AMI (20201117), with containerd pinned at 1.3.2, solved it for us. Thanks all.

We also tested side-by-side deployments, one with liveness and readiness probes as above, and one without. The one without was able to terminate correctly; the one with the probes was stuck in the Terminating state.
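
For anyone repeating that comparison, one way to strip the probes from the repro Deployment above and reproduce the probe-free case (the JSON-patch paths assume its single-container pod spec):

kubectl patch deployment nginx-deployment --type=json -p='[
  {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"},
  {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}
]'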

@rtripat is your reply because something has changed for managed node groups since this issue was active and resolved?

Right. A corrective action item that came out of this AMI release was to allow customers to roll back to a previous AMI release version. So I wanted to share that the EKS Managed Nodegroup API now allows customers to create or upgrade a nodegroup to any AMI release version.
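
For example, something along these lines (cluster and nodegroup names, subnets, role ARN, and the release version shown are placeholders):

aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name pinned-ng \
  --subnets subnet-0123456789abcdef0 \
  --node-role arn:aws:iam::123456789012:role/eksNodeRole \
  --release-version 1.16.15-20201117

aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name existing-ng \
  --release-version 1.16.15-20201117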

Same feature request as in https://github.com/aws/containers-roadmap/issues/810

@dgarbus

It’s not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups

Indeed. We have opened an issue for that (see https://github.com/awslabs/amazon-eks-ami/issues/435), which resulted in an open request in containers-roadmap (see https://github.com/aws/containers-roadmap/issues/810). Given the magnitude of the current problem, this missing feature becomes even more relevant now.

Even on nodes with the old AMI, we are seeing this happen because our user-data script runs yum update -y, which pulls in containerd 1.4.0. We will try 1.4.1 to see if that helps.
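
In case it helps others as a stopgap, the user-data can exclude containerd from the blanket update (the package glob is an assumption on our side):

yum update -y --exclude='containerd*'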

We are working on releasing a new AMI with containerd 1.3.2. Until then, please roll back your worker nodes to the previous AMI, v20201007.
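
For self-managed nodes, one way to look up the v20201007 AMI ID in a given region (assuming the usual EKS-optimized AMI naming convention):

aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=amazon-eks-node-1.16-v20201007" \
  --query 'Images[0].ImageId' --output text --region eu-central-1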

It’s not possible to roll back to a previous AMI (or create a new nodegroup with an AMI that is not the latest) when using managed node groups. Do you have an ETA for the new AMI?

We are rolling back Managed Nodegroups as well; the rollback should complete today. We will try to release the new AMI today as well, and I will keep this issue updated. We appreciate your patience.

Thanks for the quick response. As a stopgap measure, is it possible to update the “latest marker” so that new managed nodegroups get created using the previous, working AMI?

Same issue on EKS 1.18 platform version eks.1