kubernetes: kubelet Image garbage collection failed: unable to find data for container /
Cluster 1.2.2 settings:
AWS_DEFAULT_PROFILE=default
export DOCKER_STORAGE=btrfs
export KUBERNETES_PROVIDER=aws
export KUBE_AWS_ZONE=us-west-2a
export KUBE_ENABLE_CLUSTER_LOGGING=false
export KUBE_ENABLE_CLUSTER_MONITORING=none
export MULTIZONE=1
export NODE_ROOT_DISK_SIZE=32
export NODE_SIZE=m4.xlarge
export NUM_NODES=5
When free disk space on a node drops low enough that image GC should kick in, nothing happens.
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 32G 28G 3.0G 91% /
udev 10M 0 10M 0% /dev
tmpfs 3.2G 282M 2.9G 9% /run
tmpfs 7.9G 1.1M 7.9G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 7.9G 0 7.9G 0% /sys/fs/cgroup
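For context, image GC is controlled by the kubelet flags --image-gc-high-threshold and --image-gc-low-threshold; the defaults vary by release, but 91% root-disk usage should be well past the high threshold. A quick way to check what the running kubelet was started with (no output means the defaults are in effect):
# ps aux | grep kubelet | grep -o 'image-gc-[a-z-]*-threshold=[0-9]*'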
# journalctl -u kubelet | grep -i garbage
Apr 20 17:48:20 ip-172-20-0-149 kubelet[4441]: E0420 17:48:20.680505 4441 kubelet.go:956] Image garbage collection failed: unable to find data for container /
May 19 18:36:30 ip-172-20-0-149 kubelet[27507]: E0519 18:36:30.108168 27507 kubelet.go:956] Image garbage collection failed: unable to find data for container /
cAdvisor output looks OK:
# curl http://127.0.0.1:4194/validate/
...
Docker driver setup: [Supported and recommended]
Docker exec driver is native-0.2. Storage driver is aufs.
Docker container state directory is at "/var/lib/docker/containers" and is accessible.
Block device setup: [Supported and recommended]
At least one device supports 'cfq' I/O scheduler. Some disk stats can be reported.
Disk "xvda" Scheduler type "cfq".
...
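Since the error complains about missing data for container "/", it may also be worth asking cAdvisor directly whether it has stats for the root cgroup. A hedged check against the standard cAdvisor REST API path (truncated here because the response is a large JSON blob):
# curl -s http://127.0.0.1:4194/api/v1.3/containers/ | head -c 300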
Everything else in the cluster seems to be working. Any ideas on how to debug? For now I manually removed dangling and older Docker images.
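For reference, removing dangling images manually can be done along these lines; this is a sketch, not necessarily the exact commands used, and it is the older-Docker form since docker image prune is only available on newer Docker:
# docker rmi $(docker images -q -f dangling=true)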
About this issue
- Original URL
- State: closed
- Created 8 years ago
- Reactions: 8
- Comments: 31 (14 by maintainers)
This error may or may not be benign. It usually occurs when the kubelet tries to get metrics before the first metrics have been collected. That is normally not a problem: the kubelet retries, and should succeed once metrics collection has started.
@bamb00, my best guess is that this is benign in your case, since I see
Started kubelet v1.5.2
right before each error, which indicates to me that the kubelet had just started. If the kubelet is restarting continuously every minute, you may have other problems. If you keep getting this error when the kubelet has not recently restarted, then there may still be an issue with metrics collection.
For anyone else who thinks they may be having metrics collection issues, look for the following log lines (in kubelet.log) to help debug. This one indicates that metrics collection has started:
Start housekeeping for container "/"
This one indicates that stats collection failed, and may be a sign of problems:
Failed to update stats for container "/"
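Putting those together, a hedged one-liner to scan a journald-managed kubelet for both markers (adjust to grep kubelet.log directly if you log to a file):
# journalctl -u kubelet | grep -E 'Start housekeeping for container "/"|Failed to update stats for container "/"'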
@ichekrygin, @phagunbaya: if you look at the Kubernetes source code across tags, you'll see that https://github.com/kubernetes/kubernetes/pull/42916 was merged in v1.7.0-alpha.1.
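To tell whether a cluster already includes that change, it is enough to compare the reported server version against v1.7.0; a hedged check (the --short flag exists on kubectl of this era):
# kubectl version --short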
Experiencing the same on Kubernetes v1.6.2 on Azure.
curl http://127.0.0.1:4194/validate/
never returns a response.
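If /validate/ hangs, it may be worth first confirming that something is actually listening on the cAdvisor port (4194 is the default); a hedged check:
# ss -tlnp | grep 4194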