kubernetes: Kubelet gets stuck trying to inspect a container whose image has been cleaned up

What happened: The kubelet loops trying to inspect the docker container of a pod whose image has been deleted (names and hashes replaced for brevity):

E1023 00:01:11.896466    4234 generic.go:241] PLEG: Ignoring events for pod $POD_NAME: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900142    4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900176    4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900185    4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:11.900204    4234 generic.go:271] PLEG: pod $POD_NAME failed reinspection: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907868    4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907896    4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907903    4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.907917    4234 generic.go:241] PLEG: Ignoring events for pod $POD_NAME: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911373    4234 remote_runtime.go:282] ContainerStatus "$CONTAINER_SHA" from runtime service failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911413    4234 kuberuntime_container.go:393] ContainerStatus for $CONTAINER_SHA error: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911420    4234 kuberuntime_manager.go:866] getPodContainerStatuses for pod "sim-orv-upl-left-yield-157176217719-mfz75_simian-prod(d1fc4030-f51e-11e9-805e-02a3d02b55ba)" failed: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"
E1023 00:01:12.911434    4234 generic.go:271] PLEG: pod $POD_NAME failed reinspection: rpc error: code = Unknown desc = unable to inspect docker image "sha256:$IMAGE_SHA" while inspecting docker container "$CONTAINER_SHA": no such image: "sha256:$IMAGE_SHA"

As a result, the pod gets stuck in the Terminating status.

What you expected to happen: The pod terminates normally.

How to reproduce it (as minimally and precisely as possible): This is happening on machines running frequently scheduled jobs whose large, varying images each take up ~25% of disk space; we suspect this triggers the kubelet's image cleanup. It may also be related to https://github.com/kubernetes/kubernetes/issues/59564.
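For context, the kubelet's image garbage collection is driven by disk-usage thresholds, so a handful of images that each take ~25% of the disk can push usage over the high watermark quickly. A minimal illustration of the relevant kubelet flags (the values shown are the upstream defaults, not necessarily our configuration):

# Kubelet image GC thresholds (defaults shown; illustration only).
# Once disk usage crosses the high threshold, the kubelet removes unused
# images until usage drops back below the low threshold.
kubelet \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=80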

Anything else we need to know?:
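To confirm the condition on an affected node, a quick diagnostic sketch (assumes dockershim and shell access to the node, with $CONTAINER_SHA and $IMAGE_SHA as in the logs above):

# The container record still exists and references the image ID...
docker inspect --format '{{.Image}}' $CONTAINER_SHA
# ...but the image itself is gone, which is exactly what the kubelet trips over:
docker image inspect sha256:$IMAGE_SHA    # fails with "No such image"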

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.7", GitCommit:"6f482974b76db3f1e0f5d24605a9d1d38fad9a2b", GitTreeState:"clean", BuildDate:"2019-03-29T16:15:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-825e5d", GitCommit:"825e5de08cb05714f9b224cd6c47d9514df1d1a7", GitTreeState:"clean", BuildDate:"2019-08-18T03:58:32Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: AWS
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a): Linux (hostname) 4.14.138-114.102.amzn2.x86_64 #1 SMP Thu Aug 15 15:29:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Install tools:

  • Network plugin and version (if this is a network-related bug):

  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 6
  • Comments: 22 (10 by maintainers)

Most upvoted comments

We are hitting this consistently when using a combination of init containers and docker system prune. Prune removes the init container's image, and when the pod is later deleted (e.g. by a new deployment) it gets stuck in the Terminating state.
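A reproduction sketch along those lines (pod and image names are arbitrary placeholders; assumes shell access to the node running the pod):

# 1. Create a pod with an init container:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: prune-repro
spec:
  initContainers:
  - name: init
    image: busybox:1.31
    command: ["sh", "-c", "true"]
  containers:
  - name: main
    image: nginx:1.17
EOF
# 2. On the node hosting the pod, prune; this removes the exited init
#    container and its now-unreferenced image:
docker system prune --all --force
# 3. Delete the pod; it hangs in Terminating because the kubelet can no
#    longer inspect a container whose image has been removed:
kubectl delete pod prune-repro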


It seems that the error getting the pod status from the pod cache blocks the pod sync function, so it ends up in a dead loop.

I solved the problem by re-pulling the image.
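A workaround sketch along those lines (the repository name is a placeholder; note that the sha256 in the kubelet logs is the local image ID, which may not match the registry digest):

# On the affected node, restore the image the kubelet is trying to inspect:
docker pull registry.example.com/my-image:my-tag
# or, if the registry digest is known:
docker pull registry.example.com/my-image@sha256:$IMAGE_SHA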

I've also stumbled across this. Our clusters on <=1.13 never ran into this problem (we were running docker system prune daily), but our 1.17 clusters hit it intermittently. The result is:

  1. k8s pod A runs with image B, terminates.
  2. docker system prune --all --force runs, cleans up image B
  3. kubelet attempts to inspect pod A’s containers, and tries to look up image B
  4. kubelet panics; see the log below (stack trace from a build of v1.18.0-alpha.0.1500+d02cde705f8120):
Feb 19 14:57:53 host123 kubelet[72693]: W0219 14:57:53.398884   72693 docker_container.go:353] ignore error image "sha256:<somesha>" not found while inspecting docker container "<anothersha>": no such image: "sha256:<somesha>"
Feb 19 14:57:53 host123 kubelet[72693]: panic: runtime error: invalid memory address or nil pointer dereference
Feb 19 14:57:53 host123 kubelet[72693]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x32b18aa]
Feb 19 14:57:53 host123 kubelet[72693]: goroutine 1301 [running]:
Feb 19 14:57:53 host123 kubelet[72693]: k8s.io/kubernetes/pkg/kubelet/dockershim.(*dockerService).ContainerStatus(0xc000db40b0, 0x49f3800, 0xc0013dbe60, 0xc001827280, 0xc000db40b0, 0xc0013dbe60, 0xc002427a80)
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/dockershim/docker_container.go:418 +0x6fa
Feb 19 14:57:53 host123 kubelet[72693]: k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2._RuntimeService_ContainerStatus_Handler(0x4246940, 0xc000db40b0, 0x49f3800, 0xc0013dbe60, 0xc001921d40, 0x0, 0x49f3800, 0xc0013dbe60, 0xc0015be870, 0x42)
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/cri-api/pkg/apis/runtime/v1alpha2/api.pb.go:7861 +0x217
Feb 19 14:57:53 host123 kubelet[72693]: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).processUnaryRPC(0xc000b2d080, 0x4a54ba0, 0xc000e02180, 0xc00057a600, 0xc0006862a0, 0x6d73e48, 0x0, 0x0, 0x0)
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/server.go:995 +0x460
Feb 19 14:57:53 host123 kubelet[72693]: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).handleStream(0xc000b2d080, 0x4a54ba0, 0xc000e02180, 0xc00057a600, 0x0)
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/server.go:1275 +0xd97
Feb 19 14:57:53 host123 kubelet[72693]: k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc000aa56b0, 0xc000b2d080, 0x4a54ba0, 0xc000e02180, 0xc00057a600)
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/server.go:710 +0xbb
Feb 19 14:57:53 host123 kubelet[72693]: created by k8s.io/kubernetes/vendor/google.golang.org/grpc.(*Server).serveStreams.func1
Feb 19 14:57:53 host123 kubelet[72693]: /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/google.golang.org/grpc/server.go:708 +0xa1
Feb 19 14:57:53 host123 systemd[1]: kubelet.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 19 14:57:53 host123 systemd[1]: Unit kubelet.service entered failed state.
Feb 19 14:57:53 host123 systemd[1]: kubelet.service failed.
Feb 19 14:58:03 host123 systemd[1]: kubelet.service holdoff time over, scheduling restart.
Feb 19 14:58:03 host123 systemd[1]: Stopped Kubelet.

A short-term fix seems to be turning off the docker system prune cron job.
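If the nodes still need periodic cleanup, a less aggressive variant (a sketch, not an official recommendation) is to prune only dangling layers and leave whole-image removal to the kubelet's own image GC:

# Removes only untagged/dangling layers; tagged images, including those still
# referenced by kubelet-managed containers, are left in place:
docker image prune --force
# Disk-pressure-driven removal of unused images is then handled by the
# kubelet's image GC (--image-gc-high-threshold / --image-gc-low-threshold).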