kubernetes: kubelet Image garbage collection failed: unable to find data for container /

Cluster 1.2.2 settings:

AWS_DEFAULT_PROFILE=default

export DOCKER_STORAGE=btrfs
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=us-west-2a
export KUBE_ENABLE_CLUSTER_LOGGING=false
export KUBE_ENABLE_CLUSTER_MONITORING=none
export MULTIZONE=1
export NODE_ROOT_DISK_SIZE=32
export NODE_SIZE=m4.xlarge
export NUM_NODES=5

When free disk space on the node drops low enough that image GC should be triggered, nothing happens.

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       32G   28G  3.0G  91% /
udev             10M     0   10M   0% /dev
tmpfs           3.2G  282M  2.9G   9% /run
tmpfs           7.9G  1.1M  7.9G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.9G     0  7.9G   0% /sys/fs/cgroup
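
For context, the kubelet's image GC is driven by root-disk usage thresholds; a minimal sketch of the relevant kubelet flags follows (the values shown are illustrative, and the defaults have varied across releases):

# Image GC is meant to start once disk usage exceeds the high threshold and
# reclaim images until usage drops back below the low threshold.
kubelet \
  --image-gc-high-threshold=90 \
  --image-gc-low-threshold=80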

# journalctl -u kubelet | grep -i garbage
Apr 20 17:48:20 ip-172-20-0-149 kubelet[4441]: E0420 17:48:20.680505    4441 kubelet.go:956] Image garbage collection failed: unable to find data for container /
May 19 18:36:30 ip-172-20-0-149 kubelet[27507]: E0519 18:36:30.108168   27507 kubelet.go:956] Image garbage collection failed: unable to find data for container /

cAdvisor output looks OK:

# curl http://127.0.0.1:4194/validate/
...
Docker driver setup: [Supported and recommended]
    Docker exec driver is native-0.2. Storage driver is aufs.
    Docker container state directory is at "/var/lib/docker/containers" and is accessible.


Block device setup: [Supported and recommended]
    At least one device supports 'cfq' I/O scheduler. Some disk stats can be reported.
     Disk "xvda" Scheduler type "cfq".
...

Everything else in the cluster seems to be working. Any ideas on how to debug? For now I've been manually removing dangling and older Docker images.
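
For anyone needing the same workaround, a sketch of a one-liner that removes dangling (untagged) images; it only treats the symptom, not the GC failure itself:

# docker rmi $(docker images --filter "dangling=true" -q)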

Most upvoted comments

This error may or may not be benign. It usually occurs when the kubelet tries to get metrics before the first metrics have been collected. That is normally not a problem: the kubelet eventually retries, and should succeed once metrics collection has started.

@bamb00, my best guess is that this is benign in your case, since I see "Started kubelet v1.5.2" right before each error. This indicates to me that the kubelet had just started. If the kubelet is continuously restarting every minute, you may have other problems. If you are continuously getting this error when the kubelet has not recently started, then there may still be issues with metrics collection.
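
A quick way to check whether each occurrence coincides with a restart is to look at the log context just before the error (a sketch; the exact "Started kubelet" wording varies by version):

# journalctl -u kubelet | grep -B 5 "Image garbage collection failed"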

For anyone else who thinks they may be having metrics collection issues, look for the following log lines (in kubelet.log) to help debug:

  • This indicates that metrics collection has started: Start housekeeping for container "/"
  • This indicates that stats collection failed, and may be a sign of problems: Failed to update stats for container "/"
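
Assuming the kubelet logs to the journal as shown earlier, a quick way to search for both lines:

# journalctl -u kubelet | grep -E 'Start housekeeping for container|Failed to update stats for container'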

@ichekrygin, @phagunbaya: if you look through the Kubernetes source code across tags, you'll see that https://github.com/kubernetes/kubernetes/pull/42916 was merged in v1.7.0-alpha.1.
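
For reference, one way to confirm this from a clone of kubernetes/kubernetes (a sketch; take the merge commit hash from the PR page and substitute it for the placeholder):

# git log --all --oneline --grep='Merge pull request #42916'
# git tag --contains <merge-commit-sha>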

Experiencing the same on Kubernetes v1.6.2 on Azure.

curl http://127.0.0.1:4194/validate/ never returns a response.