kubernetes: Pods stuck terminating (MountVolume.SetUp failed)
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
While updating ingress, I deleted the service, daemon set, service account, cluster role binding, and config map, in that order, then deployed the ingress pods into a new namespace. The new pods could never start because the old pods got stuck in Terminating and have been stuck for hours now.
What you expected to happen:
Old pods to die and new pods to take their place, with some downtime.
How to reproduce it (as minimally and precisely as possible):
I can provide YAML if valuable, but I'm using juju to deploy, so I just did the following:
juju deploy canonical-kubernetes
juju config kubernetes-worker ingress=false
kubectl apply -f k8s_1.12_ingress.yaml
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version): 1.12.1
- Cloud provider or hardware configuration: GCE and AWS
- OS (e.g. from /etc/os-release): Ubuntu 18.04.1 LTS
- Kernel (e.g. uname -a): 4.15.0-1021-gcp
- Install tools: juju
- Others:
$ kubectl get po --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default microbot-66cb4987b7-h4wkd 1/1 Running 0 8m7s
default microbot-66cb4987b7-sn6dm 1/1 Running 0 8m7s
default microbot-66cb4987b7-tgrmc 1/1 Running 0 8m7s
default nginx-ingress-kubernetes-worker-controller-9rp7h 0/1 Terminating 0 45m
default nginx-ingress-kubernetes-worker-controller-b2qn2 0/1 Terminating 0 44m
default nginx-ingress-kubernetes-worker-controller-mnhgq 0/1 Terminating 0 42m
ingress-nginx-kubernetes-worker default-http-backend-kubernetes-worker-5d9bb77bc5-g8lcz 1/1 Running 0 22m
ingress-nginx-kubernetes-worker nginx-ingress-controller-kubernetes-worker-6gzw8 0/1 Pending 0 22m
ingress-nginx-kubernetes-worker nginx-ingress-controller-kubernetes-worker-sksm6 0/1 Pending 0 22m
ingress-nginx-kubernetes-worker nginx-ingress-controller-kubernetes-worker-tmjzv 0/1 Pending 0 22m
kube-system heapster-v1.6.0-beta.1-6db4b87d-c5cws 4/4 Running 0 42m
kube-system kube-dns-596fbb8fbd-dr692 3/3 Running 0 46m
kube-system kubernetes-dashboard-67d4c89764-bgt6k 1/1 Running 0 46m
kube-system metrics-server-v0.3.1-67bb5c8d7-t7lh7 2/2 Running 0 44m
kube-system monitoring-influxdb-grafana-v4-65cc9bb8c8-pfsxn 2/2 Running 0 46m
$ kubectl describe po/nginx-ingress-kubernetes-worker-controller-9rp7h
Name: nginx-ingress-kubernetes-worker-controller-9rp7h
Namespace: default
Node: juju-fdbd8c-7/252.0.112.1
Start Time: Fri, 19 Oct 2018 17:16:17 -0400
Labels: controller-revision-hash=5dc6fb876
name=nginx-ingress-kubernetes-worker
pod-template-generation=1
Annotations: <none>
Status: Terminating (lasts <invalid>)
Termination Grace Period: 60s
IP: 252.0.112.1
Controlled By: DaemonSet/nginx-ingress-kubernetes-worker-controller
Containers:
nginx-ingress-kubernetes-worker:
Container ID: docker://42a947b16172b00f53a8655e03e4d979ef810c851ddc146e8855583632330ea3
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.16.1
Image ID: docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:2fa84bfa338fbc84672521c443074f0b2ab30ad0b6bea50c4c29ee2d012fbcba
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--default-backend-service=$(POD_NAMESPACE)/default-http-backend
--configmap=$(POD_NAMESPACE)/nginx-load-balancer-conf
--enable-ssl-chain-completion=False
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Liveness: http-get http://:10254/healthz delay=30s timeout=5s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-kubernetes-worker-controller-9rp7h (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw
Optional: false
QoS Class: BestEffort
Node-Selectors: juju-application=kubernetes-worker
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 45m default-scheduler Successfully assigned default/nginx-ingress-kubernetes-worker-controller-9rp7h to juju-fdbd8c-7
Normal Pulling 45m kubelet, juju-fdbd8c-7 pulling image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.16.1"
Normal Pulled 45m kubelet, juju-fdbd8c-7 Successfully pulled image "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.16.1"
Normal Created 45m kubelet, juju-fdbd8c-7 Created container
Normal Started 45m kubelet, juju-fdbd8c-7 Started container
Warning FailedMount 44m (x4 over 45m) kubelet, juju-fdbd8c-7 MountVolume.SetUp failed for volume "nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw" : couldn't propagate object cache: timed out waiting for the condition
Warning FailedMount 22m (x4 over 22m) kubelet, juju-fdbd8c-7 MountVolume.SetUp failed for volume "nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw" : secret "nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw" not found
Normal Killing 22m kubelet, juju-fdbd8c-7 Killing container with id docker://nginx-ingress-kubernetes-worker:Need to kill Pod
I waited a few hours and then:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
microbot-66cb4987b7-h4wkd 1/1 Running 0 4h3m
microbot-66cb4987b7-sn6dm 1/1 Running 0 4h3m
microbot-66cb4987b7-tgrmc 1/1 Running 0 4h3m
nginx-ingress-kubernetes-worker-controller-9rp7h 0/1 Terminating 0 4h40m
nginx-ingress-kubernetes-worker-controller-b2qn2 0/1 Terminating 0 4h39m
nginx-ingress-kubernetes-worker-controller-mnhgq 0/1 Terminating 0 4h37m
...
$ kubectl describe po/nginx-ingress-kubernetes-worker-controller-9rp7h
Name: nginx-ingress-kubernetes-worker-controller-9rp7h
Namespace: default
Node: juju-fdbd8c-7/252.0.112.1
Start Time: Fri, 19 Oct 2018 17:16:17 -0400
Labels: controller-revision-hash=5dc6fb876
name=nginx-ingress-kubernetes-worker
pod-template-generation=1
Annotations: <none>
Status: Terminating (lasts <invalid>)
Termination Grace Period: 60s
IP: 252.0.112.1
Controlled By: DaemonSet/nginx-ingress-kubernetes-worker-controller
Containers:
nginx-ingress-kubernetes-worker:
Container ID: docker://42a947b16172b00f53a8655e03e4d979ef810c851ddc146e8855583632330ea3
Image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.16.1
Image ID: docker-pullable://quay.io/kubernetes-ingress-controller/nginx-ingress-controller@sha256:2fa84bfa338fbc84672521c443074f0b2ab30ad0b6bea50c4c29ee2d012fbcba
Ports: 80/TCP, 443/TCP
Host Ports: 80/TCP, 443/TCP
Args:
/nginx-ingress-controller
--default-backend-service=$(POD_NAMESPACE)/default-http-backend
--configmap=$(POD_NAMESPACE)/nginx-load-balancer-conf
--enable-ssl-chain-completion=False
State: Terminated
Exit Code: 0
Started: Mon, 01 Jan 0001 00:00:00 +0000
Finished: Mon, 01 Jan 0001 00:00:00 +0000
Ready: False
Restart Count: 0
Liveness: http-get http://:10254/healthz delay=30s timeout=5s period=10s #success=1 #failure=3
Environment:
POD_NAME: nginx-ingress-kubernetes-worker-controller-9rp7h (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw:
Type: Secret (a volume populated by a Secret)
SecretName: nginx-ingress-kubernetes-worker-serviceaccount-token-w7kcw
Optional: false
QoS Class: BestEffort
Node-Selectors: juju-application=kubernetes-worker
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/network-unavailable:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unschedulable:NoSchedule
Events: <none>
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 21
- Comments: 56 (13 by maintainers)
Commits related to this issue
- Reverting nginx 1.12 changes until https://github.com/kubernetes/kubernetes/issues/70044 is fixed — committed to juju-solutions/kubernetes by hyperbolic2346 6 years ago
- Revert "Reverting nginx 1.12 changes until https://github.com/kubernetes/kubernetes/issues/70044 is fixed" This reverts commit aa84314266d7e499dd306bf35e92f2301b6acd2b. — committed to juju-solutions/kubernetes by hyperbolic2346 6 years ago
- Reverting nginx 1.12 changes until https://github.com/kubernetes/kubernetes/issues/70044 is fixed — committed to charmed-kubernetes/charm-kubernetes-worker by hyperbolic2346 6 years ago
- Revert "Reverting nginx 1.12 changes until https://github.com/kubernetes/kubernetes/issues/70044 is fixed" This reverts commit aa84314266d7e499dd306bf35e92f2301b6acd2b. — committed to charmed-kubernetes/charm-kubernetes-worker by hyperbolic2346 6 years ago
I’m facing this issue. Is there any progress on this?
/remove-triage needs-information
/triage accepted
(assuming the repro works)
/remove-lifecycle rotten
I noticed this can happen when the apiserver is under high load, but the kubelet should retry and eventually mount the volume successfully.
In my case, the pod was stuck terminating because the docker daemon reported the wrong status. The kubelet tries to stop, then kill and delete the container, but docker reports the container as live even after the stop or kill. The kubelet only calls docker rm container_id, but in this situation the container can only be deleted with docker rm -f container_id (and yes, this is a docker bug).
In my experience, the kubelet logs on the node are always helpful for explaining why a pod is stuck terminating.
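If you suspect this docker-side failure mode, a rough way to confirm and work around it on the affected node might look like the following sketch; the shortened container ID is the one from the describe output above, direct access to the docker daemon on the node is assumed, and the kubelet unit name may differ by install:
# Check whether docker still reports the container as running after the kubelet tried to kill it.
docker ps -a --no-trunc | grep 42a947b16172
docker inspect -f '{{.State.Status}}' 42a947b16172
# What the kubelet effectively does (fails against the misbehaving daemon):
docker rm 42a947b16172
# Force removal, as described in the comment above:
docker rm -f 42a947b16172
# The kubelet logs usually explain why the pod is stuck terminating.
journalctl -u kubelet --since "1 hour ago" | grep nginx-ingress-kubernetes-worker-controller-9rp7h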
I’ve been spending quite a large amount of time attempting to debug/understand why pods can get stuck in a terminating state like this (https://github.com/longhorn/longhorn/issues/2629 for reference, the fundamental issue is the same but has a different cause)
I believe I’ve identified one path that gets to this state, but it specifically deals with a restarted kubelet due to a mismatch between desired and actual state. It’s quite rare that someone would hit this but it’s extremely reproducible if manipulated properly.
Essentially, if a pod is created and uses a secret, the secret is deleted after the pod is created/running, and then the kubelet is restarted, the kubelet will emit the MountVolume.SetUp error messages as it attempts to reconcile desired state against the contents on disk.
The reconciler has a syncStates function (https://github.com/kubernetes/kubernetes/blob/v1.21.1/pkg/kubelet/volumemanager/reconciler/reconciler.go#L385) that inspects the kubelet directory and attempts to reconcile the desired/actual state of the world with it. If you look at https://github.com/kubernetes/kubernetes/blob/v1.21.1/pkg/kubelet/volumemanager/reconciler/reconciler.go#L415-L424 you can see that if the volume exists in the desired state of the world (which it should, because the pod is there and is using the volume), then it marks the volume as in use, and the kubelet will then attempt to MountVolume.SetUp the volume again. The issue is, if the secret was deleted, the MountVolume.SetUp will fail and the Actual State of the World will never be populated with the secret volume.
The problem with this is that, on termination of a pod, when the kubelet is determining which volumes to remove in order to finish terminating the pod, it doesn't attempt to unmount the secret, because the secret isn't in the actual state of the world: https://github.com/kubernetes/kubernetes/blob/v1.21.1/pkg/kubelet/volumemanager/populator/desired_state_of_world_populator.go#L281-L289
Interestingly, if I umount the tmpfs mount of the secret that it is complaining about, the kubelet will terminate the pod as it considers it "orphaned", i.e. with my pod stuck in terminating, I see logs like:
then after a
umount /var/lib/kubelet/pods/45446e4b-dd76-4339-96fa-0d6881250103/volumes/kubernetes.io~secret/rke2-ingress-nginx-v3-token-6m4fk
I see logs like:
and the pod disappears.
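For anyone who wants to try this path, a rough sketch of the repro and the umount workaround described above could look like the following; the pod, secret, and volume names (demo-pod, demo-secret, demo) are hypothetical placeholders, the kubelet restart has to run on the node hosting the pod, and the kubelet unit name may vary by install:
# Hypothetical repro of the deleted-secret + kubelet-restart path (names are placeholders).
kubectl create secret generic demo-secret --from-literal=key=value
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - name: demo
      mountPath: /etc/demo
      readOnly: true
  volumes:
  - name: demo
    secret:
      secretName: demo-secret
EOF
# Once the pod is Running, delete the secret out from under it, then restart the
# kubelet on the node hosting the pod.
kubectl delete secret demo-secret
sudo systemctl restart kubelet
# Deleting the pod should now leave it stuck in Terminating with MountVolume.SetUp
# errors, since the secret volume never makes it into the actual state of the world.
kubectl delete pod demo-pod --wait=false
kubectl get pod demo-pod
# Workaround from the comment above: unmount the secret's tmpfs so the kubelet treats
# the volume as orphaned and finishes terminating the pod. The directory under
# kubernetes.io~secret is named after the pod volume ("demo" here).
POD_UID=$(kubectl get pod demo-pod -o jsonpath='{.metadata.uid}')
sudo umount "/var/lib/kubelet/pods/${POD_UID}/volumes/kubernetes.io~secret/demo"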
@ehashman can we unstale this?
I’m also getting this issue, and nothing is fixing it. Everything was working fine until the last week or so, and I’ve made no changes to any of my configs. Literally the only thing I’ve done on the affected machine is apt-get update for the latest packages on Ubuntu 18.04.
I’ve tried:
Thanks for the help, @fntlnz!
This should be fixed via https://github.com/kubernetes/kubernetes/pull/110670 now.
/close
@gnufied to be fair, https://github.com/kubernetes/kubernetes/issues/96635 is a duplicate of this one. This one existed almost 2 years earlier.
@brandond you mean this bit from @Oats87? Is this the simplest repro?