kubernetes: `failed to garbage collect required amount of images. Wanted to free 473842483 bytes, but freed 0 bytes`

What happened: I’ve been seeing a number of evictions recently that appear to be due to disk pressure:

$ kubectl get pod kumo-go-api-d46f56779-jl6s2 --namespace=kumo-main -o yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: 2018-12-06T10:05:25Z
  generateName: kumo-go-api-d46f56779-
  labels:
    io.kompose.service: kumo-go-api
    pod-template-hash: "802912335"
  name: kumo-go-api-d46f56779-jl6s2
  namespace: kumo-main
  ownerReferences:
  - apiVersion: extensions/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: kumo-go-api-d46f56779
    uid: c0a9355e-f780-11e8-b336-42010aa80057
  resourceVersion: "11617978"
  selfLink: /api/v1/namespaces/kumo-main/pods/kumo-go-api-d46f56779-jl6s2
  uid: 7337e854-f93e-11e8-b336-42010aa80057
spec:
  containers:
  - env:
    - redacted...
    image: gcr.io/<redacted>/kumo-go-api@sha256:c6a94fc1ffeb09ea6d967f9ab14b9a26304fa4d71c5798acbfba5e98125b81da
    imagePullPolicy: Always
    name: kumo-go-api
    ports:
    - containerPort: 5000
      protocol: TCP
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-t6jkx
      readOnly: true
  dnsPolicy: ClusterFirst
  nodeName: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-t6jkx
    secret:
      defaultMode: 420
      secretName: default-token-t6jkx
status:
  message: 'The node was low on resource: nodefs.'
  phase: Failed
  reason: Evicted
  startTime: 2018-12-06T10:05:25Z

Taking a look at kubectl get events, I see these warnings:

$ kubectl get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                                                   KIND      SUBOBJECT   TYPE      REASON          SOURCE                                                         MESSAGE
2m          13h          152       gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91   Node                  Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   (combined from similar events): failed to garbage collect required amount of images. Wanted to free 473948979 bytes, but freed 0 bytes
37m         37m          1         gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e3127ebc715c3   Node                  Warning   ImageGCFailed   kubelet, gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s   failed to garbage collect required amount of images. Wanted to free 473674547 bytes, but freed 0 bytes

Digging a bit deeper:

$ kubectl get event gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91 -o yaml
apiVersion: v1
count: 153
eventTime: null
firstTimestamp: 2018-12-07T11:01:06Z
involvedObject:
  kind: Node
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
  uid: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
kind: Event
lastTimestamp: 2018-12-08T00:16:09Z
message: '(combined from similar events): failed to garbage collect required amount
  of images. Wanted to free 474006323 bytes, but freed 0 bytes'
metadata:
  creationTimestamp: 2018-12-07T11:01:07Z
  name: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  namespace: default
  resourceVersion: "381976"
  selfLink: /api/v1/namespaces/default/events/gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s.156e07f40b90ed91
  uid: 65916e4b-fa0f-11e8-ae9a-42010aa80058
reason: ImageGCFailed
reportingComponent: ""
reportingInstance: ""
source:
  component: kubelet
  host: gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s
type: Warning

There’s actually remarkably little here. This message doesn’t say anything about why image GC was initiated or why it was unable to recover more space.

What you expected to happen: Image GC to work correctly, or at least for pods not to be scheduled onto nodes that do not have sufficient disk space.

How to reproduce it (as minimally and precisely as possible): Run and stop as many pods as possible on a node in order to encourage disk pressure. Then observe these errors.
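One way to confirm that the node is really reporting disk pressure at the time (a sketch, reusing the node name from the pod spec above):

$ kubectl get node gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s \
    -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
# prints "True" while the kubelet considers the node to be under disk pressure
$ kubectl describe node gke-kumo-customers-n1-standard-1-pree-0cd7990c-jg9s | grep -A 8 Conditions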

Anything else we need to know?: n/a

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.7", GitCommit:"0c38c362511b20a098d7cd855f1314dad92c2780", GitTreeState:"clean", BuildDate:"2018-08-20T10:09:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10+", GitVersion:"v1.10.7-gke.11", GitCommit:"fa90543563c9cfafca69128ce8cd9ecd5941940f", GitTreeState:"clean", BuildDate:"2018-11-08T20:22:21Z", GoVersion:"go1.9.3b4", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: GKE
  • OS (e.g. from /etc/os-release): I’m running macOS 10.14; the nodes are running Container-Optimized OS (COS).
  • Kernel (e.g. uname -a): Darwin D-10-19-169-80.dhcp4.washington.edu 18.0.0 Darwin Kernel Version 18.0.0: Wed Aug 22 20:13:40 PDT 2018; root:xnu-4903.201.2~1/RELEASE_X86_64 x86_64
  • Install tools: n/a
  • Others: n/a

/kind bug

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 54
  • Comments: 65 (11 by maintainers)

Most upvoted comments

Faced the same problem.

Running kubectl drain --delete-local-data --ignore-daemonsets $NODE_IP && kubectl uncordon $NODE_IP was enough to free the disk space.

FWIW, “Boot disk size in GB (per node)” was set to the minimum, 10 GB.

The issue for me was caused by a container consuming a lot of disk space in a short amount of time. This happened on multiple nodes. The container was evicted (every pod on the node was), but the disk space was not reclaimed by the kubelet.

I had to run du -h /var/lib/docker/overlay | sort -h to find which containers were doing this and delete their directories manually. This brought the nodes out of disk pressure and they recovered (one of them needed a reboot -f).
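For what it’s worth, if you can get onto the node, letting the container runtime prune unused data is usually safer than deleting overlay directories by hand. A sketch, assuming a Docker-based node (with the crictl equivalent for containerd/CRI-O nodes):

$ docker system prune -a -f    # remove stopped containers, unused images and dangling build cache
$ crictl rmi --prune           # containerd/CRI-O: remove images not referenced by any container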

I just had the same issue on a customer’s RKE2 (v1.23.6+rke2r2) Kubernetes 1.23.6 cluster. The errors in the kubelet log were:

I0607 16:27:03.708243    7302 image_gc_manager.go:310] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=89 highThreshold=85 amountToFree=4305076224 lowThreshold=80
E0607 16:27:03.710093    7302 kubelet.go:1347] "Image garbage collection failed multiple times in a row" err="failed to garbage collect required amount of images. Wanted to free 4305076224 bytes, but freed 0 bytes"

The actual problem was an “underlying” full /var partition. Originally, Longhorn wrote its data to /var/lib/longhorn on a dedicated /var partition, which at some point nearly ran full. To resolve that, they simply added another disk (and therefore partition) mounted explicitly at /var/lib/longhorn, without first deleting the old data in /var/lib/longhorn on the original /var partition. As a result, df -i and df -h showed plenty of free space and inodes, but only because /var/lib/longhorn had meanwhile become a separate partition/mount; the stale data was still filling the underlying /var filesystem.

To resolve this without downtime or any impact on reads/writes in the /var/lib/longhorn directory, we chose to bind mount the original /var partition elsewhere and clean up the stale /var/lib/longhorn/* data underneath the new mount:

mkdir /mnt/temp-root
mount --bind /var /mnt/temp-root        # expose the underlying /var, bypassing the /var/lib/longhorn mount
ls -la /mnt/temp-root/lib/longhorn      # double-check this is the stale copy of the data
rm -rf /mnt/temp-root/lib/longhorn/*
umount /mnt/temp-root
rmdir /mnt/temp-root
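A quick sanity check afterwards (a suggested addition, not part of the original procedure):

df -h /var                                                    # usage on the underlying /var partition should have dropped
kubectl get events -A --field-selector reason=ImageGCFailed   # no new ImageGCFailed events should show up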

Source and explanation of a “bind mount”: https://unix.stackexchange.com/questions/198590/what-is-a-bind-mount

Thanks, @andrecp, for the hint regarding “dropped mounts” (https://github.com/kubernetes/kubernetes/issues/71869#issuecomment-791794306)!!

Regards, Philip

I changed the node disk size from 10 GiB to 20 GiB and the error didn’t appear again.
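On GKE (as in this report) the boot disk of an existing node pool can’t be resized in place, so the usual approach is to create a new node pool with a larger disk and migrate the workloads over. A sketch with placeholder names:

$ gcloud container node-pools create bigger-disk-pool \
    --cluster=<cluster-name> --zone=<zone> \
    --machine-type=n1-standard-1 --disk-size=50
$ kubectl cordon <old-node>   # then drain the old nodes and delete the old pool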

@rubencabrera In the log you posted:

no such image: “sha256:redacted”

Did you have a chance to verify whether the underlying image existed or not?

Thanks

@hgokavarapuz Are you sure that this actually fixes the problem, and doesn’t just mean it takes more image downloads before the bug occurs?

In my case this was happening within the GKE allowed disk sizes, so I’d say there’s definitely still some sort of bug in GKE here at least.

It would also be good to have some sort of official position on the minimum disk size required to run Kubernetes on a node without hitting this error. Otherwise it’s not clear how large the volumes must be in order to be within spec for running Kubernetes.

I had the same issue on EKS; changing the node group’s default disk size from 20 GB to 40 GB helped.
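If the node group happens to be managed with eksctl (an assumption; adjust for however it is actually provisioned), the disk size is the volumeSize field on the node group, in GiB. A sketch of the ClusterConfig fragment:

# eksctl ClusterConfig fragment
managedNodeGroups:
  - name: workers           # placeholder name
    instanceType: m5.large  # placeholder instance type
    volumeSize: 40          # GiB; was 20 before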

Same issue here. I extended the EBS volumes thinking that would fix it. Using AMI k8s-1.10-debian-jessie-amd64-hvm-ebs-2018-08-17 (ami-009b9699070ffc46f).

I’m having some garbage collection and disk space issues too (I run microk8s). I’ve changed the settings for image-gc-high-threshold and image-gc-low-threshold, and I’m now looking at changing maximum-dead-containers, since the default is supposedly -1, which means unlimited. It sounds like garbage collection isn’t really set up by default, which seems crazy! I’m happy to be told otherwise. Would this ticket (and there are a lot of posts on the microk8s repo too) be resolved by simply having sane defaults for garbage collection? (A sketch of these kubelet flags follows below.)

It sounds like even a not-ideal default garbage collection would be better than none! I’ve also had this problem with Docker in the past, where it just kept endless ephemeral disks around after updating a container.
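For reference, on microk8s these are plain kubelet flags; a sketch of tightening them might look like this (the args-file path is what a snap-based microk8s install typically uses, and the values are only examples):

# /var/snap/microk8s/current/args/kubelet
--image-gc-high-threshold=70    # start image GC once the image filesystem is 70% full (kubelet default: 85)
--image-gc-low-threshold=60     # delete images until usage drops below 60% (kubelet default: 80)
--maximum-dead-containers=50    # deprecated flag; the default -1 means no global limit

# restart so the kubelet picks up the new flags
$ microk8s stop && microk8s start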

/remove-lifecycle stale

Has anyone figured out a way to persuade k8s garbage collection to fire on a disk that isn’t the root filesystem? We have to use a secondary (SSD) disk for /var/lib/docker to address EKS performance issues (see https://github.com/awslabs/amazon-eks-ami/issues/454), but garbage collection doesn’t fire and we sometimes overflow that secondary disk.
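One way to see which filesystem the kubelet actually treats as the image filesystem is the kubelet summary API (a sketch; <node-name> is a placeholder and jq is assumed to be installed):

$ kubectl get --raw "/api/v1/nodes/<node-name>/proxy/stats/summary" \
    | jq '{nodeFs: .node.fs, imageFs: .node.runtime.imageFs}'
# if imageFs is not on the secondary disk, image GC will not act on that disk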