amazon-eks-ami: Pods stuck in terminating state after AMI amazon-eks-node-1.16.15-20201112
What happened:
Since upgrading to AMI 1.16.15-20201112 (from 1.16.13-20201007), we see many Pods getting stuck in the Terminating state. We have noticed that all of these Pods have readiness/liveness probes of type exec.
What you expected to happen: The Pods should be deleted.
How to reproduce it (as minimally and precisely as possible):
Apply the following YAML to create a deployment with exec type probes for readiness/liveness:
$ cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "true"
          failureThreshold: 5
          initialDelaySeconds: 1
          periodSeconds: 1
          successThreshold: 1
          timeoutSeconds: 1
EOF
and once all Pods become ready, delete the Deployment:
$ kubectl delete deployment nginx-deployment
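To confirm the symptom, you can watch the Pods after the delete (a minimal check, using the app=nginx label from the manifest above); the affected Pods remain in the Terminating state indefinitely instead of being removed:
$ kubectl get pods -l app=nginx --watch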
Anything else we need to know?:
We also tried the above with a 1.17 EKS cluster (AMI release version 1.17.12-20201112) and it exhibits the same behavior.
Environment:
- AWS Region: eu-central-1
- Instance Type(s): m5d.xlarge
- EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
- Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.16
- AMI Version: 1.16.15-20201112
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 45
- Comments: 47 (15 by maintainers)
Commits related to this issue
- Downgrades containerd to containerd-1.3.2-1.amzn2 to fix issue #563 — committed to mmerkes/amazon-eks-ami by mmerkes 4 years ago
- Downgrades containerd to containerd-1.3.2-1.amzn2 to fix issue #563 (#564) — committed to awslabs/amazon-eks-ami by mmerkes 4 years ago
We are taking multiple steps to prevent a recurrence of this issue. Specifically, we have added a regression test for this specific case, which creates a container with a HEALTHCHECK, monitors its liveness for a period of time, and ensures cleanup on termination. We are also working on changes to allow creating an EKS Managed Nodegroup at any AMI version and to mark nodegroups as Degraded if they are on recalled AMI release versions.
@rtripat Curious to know how this got past your QA or testing cycles? It doesn't seem like such a deep-rooted corner case that it couldn't be caught. This impacted my production deployments big time today. 😦
Seems to be related to https://github.com/moby/moby/issues/41352#issuecomment-728746859. Can someone run this on their node (if it's not a production cluster) and let me know if this fixes the issue? I tried it on a couple of my worker nodes, and both upgrading and downgrading containerd seem to fix the issue. I'm just trying to narrow down what might have caused this.
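For anyone trying this on a non-production node, a rough sketch of the downgrade (assumes SSH access to the worker node and the standard Amazon Linux 2 repositories; the exact services to restart are an assumption):
$ containerd --version
$ sudo yum downgrade -y containerd-1.3.2-1.amzn2
$ sudo systemctl restart containerd docker    # restart the runtime so the downgrade takes effect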
Last Saturday, we upgraded our clusters in 4 production regions (AP, AU, EU, US) from v1.14 to v1.18, and nightmares happened. The issue caused many Pods in our Zookeeper clusters to get stuck in the Terminating state and affected other clusters (Kafka clusters, SolrCloud clusters). Running "kubectl delete pod --force --grace-period=0 xxx" sometimes caused filesystem corruption. We tried our best to keep our systems up and running, but it was a bad experience upgrading EKS clusters. Positive things:
Not sure if it's similar or not, but we are experiencing an issue on EKS 1.15 (eks.4) with AMI version 1.15.12-20201112, where the aws-node Pods are repeatedly producing Kubernetes events with the following message. We do not see this on the v20201007 AMI.
We were seeing the same things mentioned above. The new AMI (20201117), with the containerd version pinned at 1.3.2, solved it for us. Thanks all.
We also tested side-by-side Deployments, one with the liveness and readiness probes as above and one without. The one without was able to terminate correctly; the one with the probes was stuck in the Terminating state.
Right. A corrective action item that came out of this AMI release was to allow customers to roll back to a previous AMI release version. So I wanted to share that the EKS Managed Nodegroup API allows customers to create or upgrade a nodegroup to any AMI release version.
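For illustration, pinning an existing managed nodegroup to a specific AMI release looks like this (a sketch; the cluster and nodegroup names are placeholders, and the release version shown is the earlier, unaffected one from this issue):
$ aws eks update-nodegroup-version \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --release-version 1.16.13-20201007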
Same feature request as in https://github.com/aws/containers-roadmap/issues/810
@dgarbus
Indeed. We have opened an issue for that (see https://github.com/awslabs/amazon-eks-ami/issues/435), which resulted in an open request in containers-roadmap (see https://github.com/aws/containers-roadmap/issues/810). Given the magnitude of the current problem, this missing feature is even more relevant now.
Even on nodes with the old AMI, we are seeing this happen because our userdata script runs yum update -y, which brings in containerd 1.4.0. We will try 1.4.1 to see if that helps.
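One possible workaround for that (a sketch, not an official recommendation) is to exclude containerd from the update in the userdata script; the yum versionlock plugin, if installed, is an alternative way to pin the package:
$ sudo yum update -y --exclude=containerd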
Thanks for the quick response. As a stopgap measure, is it possible to update the "latest" marker so that new managed nodegroups get created using the previous, working AMI?
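Until the marker is updated, one workaround (a sketch; the names, role ARN, and subnet IDs below are placeholders) is to pass the previous release version explicitly when creating the nodegroup:
$ aws eks create-nodegroup \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --node-role arn:aws:iam::111122223333:role/my-node-role \
    --subnets subnet-aaaa1111 subnet-bbbb2222 \
    --release-version 1.16.13-20201007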
Same issue on EKS 1.18, platform version eks.1.