kubernetes: Orphaned pods fail to get cleaned up
Kubernetes version
Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.6", GitCommit:"e569a27d02001e343cb68086bc06d47804f62af6", GitTreeState:"clean", BuildDate:"2016-11-12T05:16:27Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}
Environment:
- Cloud provider or hardware configuration:
AWS
- OS (e.g. from /etc/os-release):
Ubuntu 16.04.1 LTS
- Kernel (e.g. uname -a):
Linux 4.4.0-53-generic #74-Ubuntu SMP Fri Dec 2 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
What happened: syslog is getting spammed every 2 seconds with these kubelet errors:
Dec 9 13:14:02 ip-10-50-242-179 start-kubelet.sh[31129]: E1209 13:14:02.300355 31129 kubelet_volumes.go:159] Orphaned pod "ff614192-bcc4-11e6-a20e-0a591a8e83d7" found, but error open /var/lib/kubelet/pods/ff614192-bcc4-11e6-a20e-0a591a8e83d7/volumes: no such file or directory occured during reading volume dir from disk
Dec 9 13:14:02 ip-10-50-242-179 start-kubelet.sh[31129]: E1209 13:14:02.300373 31129 kubelet_getters.go:249] Could not read directory /var/lib/kubelet/pods/ff769116-bcf4-11e6-a20e-0a591a8e83d7/volumes: open /var/lib/kubelet/pods/ff769116-bcf4-11e6-a20e-0a591a8e83d7/volumes: no such file or directory
We get the above 2 log entries for every non-running pod (2150 of them) every 2 seconds, so the logs grow into the gigabytes pretty quickly.
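A rough way to confirm how many distinct pods the kubelet is complaining about is to count the unique UIDs in these messages (a sketch; it assumes the messages land in /var/log/syslog, so adjust the path for your logging setup):
~# grep -o 'Orphaned pod "[0-9a-f-]*"' /var/log/syslog | sort -u | wc -l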
There are 2160 pods in /var/lib/kubelet/pods/
~# ls /var/lib/kubelet/pods/ | wc -l
2160
But only 10 are running and attached to volumes:
~# df -h | grep kubelet
/dev/xvdf 256G 232M 256G 1% /var/lib/kubelet
tmpfs 7.4G 8.0K 7.4G 1% /var/lib/kubelet/pods/5b884f1f-bbcd-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/secrets
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/5b884f1f-bbcd-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-lfu24
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/15302286-bbaa-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-m0h9s
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/b0395433-a546-11e6-9670-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-n79fe
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/1198c11a-bd25-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-np531
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/473d7d51-bd25-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-smuz3
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/e17b1a95-bd36-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-1xs9g
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/2a36441b-bd57-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-qbw68
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/cf6c04f4-bd64-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/default-token-n79fe
tmpfs 7.4G 8.0K 7.4G 1% /var/lib/kubelet/pods/24130c15-bdf5-11e6-98c0-0615e1fbbfc7/volumes/kubernetes.io~secret/secrets
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/24130c15-bdf5-11e6-98c0-0615e1fbbfc7/volumes/kubernetes.io~secret/default-token-9ksrm
tmpfs 7.4G 12K 7.4G 1% /var/lib/kubelet/pods/a271290c-bdf6-11e6-98c0-0615e1fbbfc7/volumes/kubernetes.io~secret/default-token-n79fe
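The pods the kubelet is tripping over are exactly the entries under /var/lib/kubelet/pods/ whose volumes/ subdirectory no longer exists on disk. A read-only sketch to list them (nothing is deleted or modified):
# list pod directories that no longer have a volumes/ subdirectory
for d in /var/lib/kubelet/pods/*/; do
  [ -d "${d}volumes" ] || echo "no volumes dir: $d"
done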
About this issue
- State: closed
- Created 8 years ago
- Reactions: 2
- Comments: 59 (48 by maintainers)
Commits related to this issue
- Merge pull request #38909 from jingxu97/Dec/mounterfix Automatic merge from submit-queue (batch tested with PRs 38909, 39213) Add path exist check in getPodVolumePathListFromDisk Add the path exist... — committed to kubernetes/kubernetes by deleted user 8 years ago
- app-admin/kubelet-wrapper: mark kubelet datadir volume as a recursive mount So far `/var/lib/kubelet` was mounted as an implicit non-recursive mount. This changes the wrapper to an explicit recursive... — committed to lucab/coreos-overlay by lucab 7 years ago
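The kubelet-wrapper commit above addresses the rkt side of this: when /var/lib/kubelet is bind-mounted into the kubelet container non-recursively, tmpfs secret mounts that already exist on the host are not visible inside the container after a kubelet restart. A sketch of the idea, with the caveat that the exact variable name (RKT_RUN_ARGS vs. RKT_OPTS) and the recursive volume option depend on your kubelet-wrapper and rkt versions:
# make the kubelet data dir a recursive host volume so tmpfs mounts under
# /var/lib/kubelet/pods/ remain visible to the kubelet running under rkt fly
RKT_RUN_ARGS="--volume var-lib-kubelet,kind=host,source=/var/lib/kubelet,readOnly=false,recursive=true \
  --mount volume=var-lib-kubelet,target=/var/lib/kubelet"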
TL;DR: It seems to be a problem when running the kubelet in rkt fly on CoreOS. I opened an issue at CoreOS (https://github.com/coreos/bugs/issues/1831).
This currently happens on the system:
But it can't be moved, since it is still mounted. So, as you mentioned, the kubelet does not consider the volume to be a tmpfs.
Hmm. There was a crashed kubelet (out of space on this node).
Some mounts are gone.
Turn on more logging
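One way to do that, assuming the kubelet is launched through start-kubelet.sh as in the log lines above, is to raise the glog verbosity and follow the cleanup loop:
# add --v=4 to the kubelet flags in start-kubelet.sh, restart the kubelet,
# then watch the orphaned-pod messages as they come in
tail -f /var/log/syslog | grep --line-buffered "Orphaned pod"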
Okay, let's drill down on this one. According to https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/empty_dir/empty_dir_linux.go#L37, the kubelet detects a tmpfs by checking the filesystem type magic, so just to be sure, let's verify that this works as expected.
This works:
Trying this on the affected node, with the real file
Surprise, this is a tmpfs as expected. So what else could this be? I noticed that Type:61267 tells us we are on an ext4 mount point, so the kubelet is most likely hitting / instead.
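For reference, the same check can be made from the shell: tmpfs reports the magic 0x01021994, while 61267 is 0xef53, the ext2/3/4 superblock magic. The path below is just one of the secret mounts from the df output above:
# print the filesystem type (name and hex magic) of a secret mount path
stat -f -c 'type=%T magic=0x%t' \
  /var/lib/kubelet/pods/5b884f1f-bbcd-11e6-a20e-0a591a8e83d7/volumes/kubernetes.io~secret/secrets
# on the host this should report tmpfs (0x1021994); from the kubelet's point of
# view it reports the ext2/3/4 magic (0xef53) instead, matching Type:61267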
Sure enough, the kubelet is running as a rkt fly container.
Well, this would have been discoverable without the lines of code, but anyway.
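Since the kubelet lives in its own mount namespace under rkt fly, it is worth comparing what the host and the kubelet actually see under /var/lib/kubelet/pods/. A hedged sketch, assuming nsenter is available and the kubelet process can be found via pgrep:
# tmpfs mounts as seen by the host
grep /var/lib/kubelet/pods /proc/mounts
# the same view from inside the kubelet's mount namespace
nsenter --target "$(pgrep -o -f kubelet)" --mount \
  grep /var/lib/kubelet/pods /proc/mounts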
It is quite simple to reproduce this behavior. Start a pod with a Secret, then stop the kubelet on the node; I left it off until the API server noticed. Start it again (with the help of kubelet-wrapper, of course) and wait until the API server shows the node as Ready. Make sure the pod is still running on that node, then delete it with kubectl. Voilà, one more orphaned pod with the same symptoms.
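Roughly, that reproduction boils down to the following sequence (the manifest and pod names are placeholders, and on CoreOS the kubelet is started via kubelet-wrapper rather than a plain systemd unit):
# 1. start a pod that mounts a Secret (any pod with a secret volume will do)
kubectl apply -f pod-with-secret.yaml
# 2. stop the kubelet on the node and wait until the API server notices
systemctl stop kubelet
# 3. start it again and wait until the node reports Ready
systemctl start kubelet
# 4. with the pod still running on that node, delete it
kubectl delete pod pod-with-secret
# -> the pod's directory stays behind under /var/lib/kubelet/pods/ and the
#    "Orphaned pod ... no such file or directory" errors start appearing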
I'm trying to gather the logs now, but FYI, after logging into the machine I saw this:
@jingxu97 we consider the logs a bit sensitive; can I pass them to you on the Kubernetes Slack server?
In our case, we have a new cluster (no upgrade) using v1.5.1 and still see a lot of these errors:
I’m able to reproduce it as follows:
We are running the kubelet as a rkt (v1.20.0) container on CoreOS:
I experience the same issue when a k8s node is rebooted. After the restart, the kubelet is unable to clean up the crashed Docker containers. The k8s version is 1.4.
As a quick-and-dirty workaround, container cleanup will continue if the “volumes” directory is created manually:
tail -n 100 /var/log/kubernetes/kubelet.log \
  | grep "found, but error open /var/lib/kubelet/pods" \
  | perl -pe 's#^.*(/var/lib/kubelet/pods/.+/volumes).+$#$1#' \
  | sort -u | xargs -r mkdir -p
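A variant of the same idea that works directly from the pods directory instead of the log file (again only a stop-gap; the -d guard keeps it from touching pods that still have their volumes):
# recreate the missing volumes/ directory for every leftover pod directory
for d in /var/lib/kubelet/pods/*/; do
  [ -d "${d}volumes" ] || mkdir -p "${d}volumes"
done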