kubernetes: ImageGCFailed, unable to delete images and reclaim disk.
Kubernetes version: 1.6.2 (master and nodes)
Environment: GKE
What happened:
The following has been happening for the last week or two.
I noticed loads of pods being evicted with the following message from kubectl describe:
Node: gke-prow-build-pool-a89df2af-4bc8/
Status: Failed
Reason: Evicted
Message: The node was low on resource: nodefs.
The node shows ready but also that it has disk pressure from kubectl get no:
status:
  conditions:
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T15:43:02Z
    message: kubelet has disk pressure
    reason: KubeletHasDiskPressure
    status: "True"
    type: DiskPressure
  - lastHeartbeatTime: 2017-05-09T17:38:33Z
    lastTransitionTime: 2017-05-09T01:07:06Z
    message: kubelet is posting ready status. AppArmor enabled
    reason: KubeletReady
    status: "True"
    type: Ready
kubectl describe no shows a large number of ImageGCFailed events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4h 19s 564 kubelet, gke-prow-build-pool-a89df2af-4bc8 Warning EvictionThresholdMet Attempting to reclaim nodefs
4h 18s 54 kubelet, gke-prow-build-pool-a89df2af-4bc8 Warning ImageGCFailed (events with common reason combined)
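For context (my own note, not from the thread): EvictionThresholdMet and image GC are governed by kubelet flags along these lines; the values below are illustrative, not necessarily what this node ran with.

$ kubelet ... \
    --eviction-hard=nodefs.available<10% \
    --image-gc-high-threshold=85 \
    --image-gc-low-threshold=80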
Kubelet logs show that it’s failing to delete the images and free up disk space. For each image, it logs the following every 10 seconds:
I0509 17:59:31.183907 1453 image_gc_manager.go:335] [imageGCManager]: Removing image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" to free 1105710474 bytes
E0509 17:59:31.186643 1453 remote_image.go:124] RemoveImage "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" from image service failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
E0509 17:59:31.186705 1453 kuberuntime_image.go:126] Remove image "sha256:fa60023475d842a7a62d38fa27a0d3f6fd672be5ea1f09e6d07f8459d2c0c60a" failed: rpc error: code = 2 desc = Error response from daemon: conflict: unable to delete fa60023475d8 (must be forced) - image is being used by stopped container 8641d5395d30
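The error means docker refuses to remove the image because a stopped container still references it. As a manual workaround (my sketch, not something suggested in the thread), removing the exited containers on the node unblocks image GC:

$ # on the affected node: find exited containers pinning images
$ docker ps -a --filter status=exited --format '{{.ID}} {{.Image}}'
$ # remove the stopped container named in the error, then the image
$ docker rm 8641d5395d30
$ docker rmi fa60023475d8

Normally the kubelet’s own container GC removes dead containers before image GC runs, which is part of what makes this failure surprising.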
What you expected to happen:
I would be happy if the node were marked unschedulable when it’s out of disk. I would also be happy if the images were successfully cleaned up. As it is, the node just evicts any pod that attempts to run on it.
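For now, the workaround is to cordon the node by hand:

$ kubectl cordon gke-prow-build-pool-a89df2af-4bc8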
How to reproduce it:
I don’t know how to reproduce from scratch, but I’ve cordoned this node and can give access to someone for debugging.
Please let me know if you need more information, and apologies if this is a dupe.
About this issue
- State: closed
- Created 7 years ago
- Comments: 39 (39 by maintainers)
This lists nodes with disk pressure; there’s probably a better way.
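The command itself didn’t survive in this excerpt; a jsonpath query along these lines does the job (my reconstruction):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' | grep True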
@vishh shouldn’t we avoid scheduling new pods to nodes with disk pressure?
#45896 has been merged, which should solve this issue. /close
The PR, which I believe will make it into 1.7: https://github.com/kubernetes/kubernetes/pull/45896
I am looking at one of them, and all of the disk usage is coming from /var/lib/docker/overlay. I see ~150 containers, a couple using as much as 7G of space each.
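(To see where that space goes, something like this on the node; the path assumes docker’s default overlay storage location:)

$ sudo du -sh /var/lib/docker/overlay/* | sort -rh | head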
It doesn’t worry me that we fail to clean up images. It happens whenever we try to clean up an image belonging to a terminated container, but it can also indicate problems with docker, which is why we report it. I am confused by three things: