kubernetes: ImageGCFailed, unable to delete images and reclaim disk.

Kubernetes version: 1.6.2 (master and nodes)
Environment: GKE

What happened:

The following has been happening for the last week or two.

I noticed loads of pods being evicted with the following message from kubectl describe:

Node:		gke-prow-build-pool-a89df2af-4bc8/
Status:		Failed
Reason:		Evicted
Message:	The node was low on resource: nodefs.
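
These evicted pods stick around as failed pod objects, so they are easy to enumerate. A rough sketch for listing and clearing them afterwards (the grep/awk pipeline below is just a convenience, not anything from the kubelet; adjust to taste):

# List evicted pods across all namespaces; they remain as Failed pod objects.
kubectl get pods --all-namespaces | grep Evicted

# Delete them once inspected (assumes the grep pattern matches only what you want gone).
kubectl get pods --all-namespaces | grep Evicted | awk '{print $2 " --namespace=" $1}' | xargs -L1 kubectl delete pod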

The node shows as Ready but also reports disk pressure in the output of kubectl get no:

status:
  conditions:
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T15:43:02Z
    message: kubelet has disk pressure
    reason: KubeletHasDiskPressure
    status: "True"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T01:07:06Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready

kubectl describe no shows lots of ImageGCFailed events.

  FirstSeen	LastSeen	Count	From						SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----						-------------	--------	------			-------
  4h		19s		564	kubelet, gke-prow-build-pool-a89df2af-4bc8			Warning		EvictionThresholdMet	Attempting to reclaim nodefs
  4h		18s		54	kubelet, gke-prow-build-pool-a89df2af-4bc8			Warning		ImageGCFailed		(events with common reason combined)
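
A quick way to surface these events cluster-wide, for anyone trying to correlate (just a sketch; the grep pattern is simply the two reasons shown above):

kubectl get events --all-namespaces | grep -E 'ImageGCFailed|EvictionThresholdMet'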

The kubelet logs show that it’s failing to delete the images and free up disk space. For each image, it logs this every 10 seconds:

I0509 17:59:31.183907    1453 image_gc_manager.go:335] [imageGCManager]: Removing image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" to free 1105710474 bytes
E0509 17:59:31.186643    1453 remote_image.go:124] RemoveImage "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" from image service failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
E0509 17:59:31.186705    1453 kuberuntime_image.go:126] Remove image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
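
As a stopgap sketch (not a fix): the RemoveImage calls fail because exited containers still reference the images, so clearing the exited containers on the node lets the next GC pass reclaim the space. The container ID below is just the one from the log above; run this on the node itself:

# The image can't be deleted while the exited container 8641d5395d30 still references it.
docker rm 8641d5395d30

# Or clear every exited container so the kubelet's next image GC pass can reclaim space.
docker rm $(docker ps -a --filter status=exited --quiet)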

What you expected to happen:

I would be happy if the node were marked unschedulable when it’s out of disk. I would also be happy if the images were successfully cleaned up. As it is, the node just evicts any pod that attempts to run on it.

How to reproduce it:

I don’t know how to reproduce from scratch, but I’ve cordoned this node and can give access to someone for debugging.

Please let me know if you need more information, and apologies if this is a dupe.

cc @kubernetes/sig-node-bugs

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 39 (39 by maintainers)

Most upvoted comments

kubectl get no -ojson | jq -r '.items[] | select(.status.conditions[] | select(.status == "True") | select(.type == "DiskPressure")) | .metadata.name'

Lists nodes with disk pressure. There’s probably a better way.
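
You can also pipe that straight into cordon while debugging (a sketch, assuming you really do want every node under disk pressure cordoned):

kubectl get no -ojson | jq -r '.items[] | select(.status.conditions[] | select(.status == "True") | select(.type == "DiskPressure")) | .metadata.name' | xargs -r kubectl cordon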

@vishh shouldn’t we avoid scheduling new pods to nodes with disk pressure?

#45896 has been merged, which should solve this issue. /close

The PR, which I believe will make it into 1.7: https://github.com/kubernetes/kubernetes/pull/45896

I am looking at one of them, and all of the disk usage is coming from var/docker/overlay. I see ~150 containers, a couple of them using as much as 7 GB of space each.
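
For reference, roughly how to check that (a sketch; the overlay path is an assumption and varies with the node’s Docker storage configuration):

# Per-container writable-layer sizes as Docker reports them.
docker ps -a --size

# Largest overlay directories on disk; adjust the path for the node in question.
du -sh /var/lib/docker/overlay/* 2>/dev/null | sort -h | tail -20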

It doesn’t worry me that we fail to clean up images. That happens whenever we try to clean up an image belonging to a terminated container, but it could also be indicative of issues with Docker, which is why we report it. I am confused by three things:

  1. Why are we scheduling new pods to the node that has disk pressure?
  2. Why are new pods not being rejected by the kubelet?
  3. If we are evicting lots of pods, why are we unable to reduce disk pressure?
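
For anyone tuning around this in the meantime, the relevant kubelet knobs are the image GC thresholds and the eviction thresholds. This is a sketch only; the values below are illustrative, not recommendations, and defaults vary by release (check kubelet --help for your version):

# Illustrative kubelet flags, not recommended values:
#   image GC kicks in above the high threshold and frees space down to the low threshold;
#   eviction-hard triggers pod eviction when nodefs drops below the given threshold.
kubelet \
  --image-gc-high-threshold=85 \
  --image-gc-low-threshold=80 \
  --eviction-hard='nodefs.available<10%' \
  --eviction-minimum-reclaim='nodefs.available=500Mi'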