kubernetes: Kubelet gets stuck trying to inspect a container whose image has been cleaned up
What happened: Kubelet loops trying to inspect the docker container of a pod whose image has been deleted (names and hashes replaced for brevity):
E1023 00:01:11.896466 4234 generic.go:241] PLEG: Ignoring events for pod $POD_NAME: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900142 4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900176 4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900185 4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900204 4234 generic.go:271] PLEG: pod $POD_NAME failed reinspection: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907868 4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907896 4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907903 4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907917 4234 generic.go:241] PLEG: Ignoring events for pod $POD_NAME: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911373 4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911413 4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911420 4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911434 4234 generic.go:271] PLEG: pod $POD_NAME failed reinspection: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
As a result, the pod gets stuck in the `Terminating` status.
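One way to confirm on the node that the kubelet really is chasing a missing image (the container and image IDs below are the same placeholders used in the logs above):

```
# Look up the image ID that the stuck container still references.
docker inspect --format '{{.Image}}' "$CONTAINER_SHA"

# Ask the daemon for that image; on an affected node this fails with
# "No such image" -- the same lookup the kubelet's ContainerStatus call hits.
docker image inspect "sha256:$IMAGE_SHA"
```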
What you expected to happen: The pod finishes normally
How to reproduce it (as minimally and precisely as possible): This is happening on machines that run frequently scheduled jobs using a variety of large images, each taking up ~25% of disk space; we suspect this is triggering the kubelet's image cleanup. Perhaps related to https://github.com/kubernetes/kubernetes/issues/59564 as well
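For context, the kubelet's image garbage collection is driven by disk-usage thresholds; the flags below are the standard knobs and the values are the upstream defaults, shown only for illustration (not our production settings):

```
# Kubelet image GC flags (defaults shown): GC starts once image disk usage
# exceeds the high threshold and removes least-recently-used images until
# usage drops back below the low threshold.
kubelet \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=80 \
  --minimum-image-ttl-duration=2m
```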
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-29T16:15:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-825e5d", GitCommit:"825e5de08cb05714f9b224cd6c47d9514df1d1a7", GitTreeState:"clean", BuildDate:"2019-08-18T03:58:32Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: AWS
- OS (e.g. `cat /etc/os-release`):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
- Kernel (e.g. `uname -a`): Linux (hostname) 4.14.138-114.102.amzn2.x86_64 #1 SMP Thu Aug 15 15:29:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Install tools:
- Network plugin and version (if this is a network-related bug):
- Others:
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 22 (10 by maintainers)
We are hitting this consistently when using a combination of init containers and `docker system prune`. Prune will clean the init container's image, and then when the pod gets deleted (e.g. by a new deployment), it gets stuck in Deleting state.

It seems that the error getting pod status from the pod cache blocks the pod sync function: a dead loop.
I solved the problem when I pulled the image again.
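For others hitting this, that workaround amounts to re-pulling the stuck pod's image on the affected node so the inspect call can succeed and the Terminating pod gets cleaned up; the variable below is a placeholder for whatever image reference the pod spec uses:

```
# $POD_IMAGE is a placeholder: use the image reference from the stuck pod's
# spec (the local image ID from the kubelet logs cannot be pulled directly).
docker pull "$POD_IMAGE"
```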
I've also stumbled across this. Our clusters on <=1.13 never ran into this problem (we were running `docker system prune` daily), but on our 1.17 clusters we run into this off and on. The result is:
- `docker system prune --all --force` runs and cleans up image B
- kubelet attempts to inspect pod A's containers and tries to look up image B (v1.18.0-alpha.0.1500+d02cde705f8120)

A short-term fix seems to be to turn off the `docker system prune` cron.

/assign
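Rather than turning the prune cron off entirely, one less destructive option might be to prune only images above a certain age, so images of recently run pods stay around until the kubelet is done with them; the 72h window below is just an example:

```
# Remove only unused images older than 72h instead of everything unused;
# tune the window to the cadence of your jobs/deployments.
docker image prune --all --force --filter "until=72h"
```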
The reason may be here