kubernetes: EmptyDir not being cleaned up after pod terminated with open file handles
What happened?
When pods are terminated on our nodes they get stuck in the Terminating state. Reviewing the kubelet logs shows that kubelet is unable to delete the emptyDir we have defined for the pod, because another process is using the folder:
39707 Sep 21 10:04 Error kubelet 0 E0921 10:04:47.581699 1752 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/empty-dir/e6461804-3188-4189-952b-9be22ef072f8-artifact-dir
podName:e6461804-3188-4189-952b-9be22ef072f8 nodeName:}" failed. No retries permitted until 2022-09-21 10:06:49.5816991 +0000 GMT m=+77903.602635501
(durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"artifact-dir\" (UniqueName:
\"kubernetes.io/empty-dir/e6461804-3188-4189-952b-9be22ef072f8-artifact-dir\") pod \"e6461804-3188-4189-952b-9be22ef072f8\" (UID:
\"e6461804-3188-4189-952b-9be22ef072f8\") : remove
c:\\var\\lib\\kubelet\\pods\\e6461804-3188-4189-952b-9be22ef072f8\\volumes\\kubernetes.io~empty-dir\\artifact-dir\\CBE_DL: The process cannot access the file because
it is being used by another process."
Reviewing the open file handles on the system, it appears that only kubelet has one open, as all the containers within this pod have stopped running:
PS C:\Windows\System32\config\systemprofile> handle64 e6461804-3188-4189-952b-9be22ef072f8
Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com
kubelet.exe pid: 1304 type: File A4C: C:\var\lib\kubelet\pods\e6461804-3188-4189-952b-9be22ef072f8\volumes\kubernetes.io~empty-dir\artifact-dir\CBE_DL
Manually closing the file handle that kubelet had open, using the Sysinternals handle tool, then allowed kubelet to finish terminating the pod:
handle -c A4C -p 1304
Nthandle v4.22 - Handle viewer
Copyright (C) 1997-2019 Mark Russinovich
Sysinternals - www.sysinternals.com
A4C: File (R--) C:\var\lib\kubelet\pods\e6461804-3188-4189-952b-9be22ef072f8\volumes\kubernetes.io~empty-dir\artifact-dir\CBE_DL
Close handle A4C in kubelet.exe (PID 1304)? (y/n) y
Handle closed.
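In case it is useful to anyone else hitting this, the two manual steps above can be scripted roughly as follows. This is only a sketch of what I did by hand, assuming handle64.exe (Sysinternals) is on the PATH and you know the UID of the stuck pod:

# Rough sketch: automate the handle64 lookup / handle -c close shown above.
# The pod UID below is the one from this report; substitute your own.
$podUid = 'e6461804-3188-4189-952b-9be22ef072f8'

# handle64 prints matches like:
#   kubelet.exe  pid: 1304  type: File  A4C: C:\var\lib\kubelet\pods\<uid>\volumes\...
handle64.exe -nobanner $podUid | ForEach-Object {
    if ($_ -match 'kubelet\.exe\s+pid:\s*(\d+)\s+type:\s*File\s+([0-9A-Fa-f]+):\s*(.+)$') {
        $kubeletPid = $Matches[1]
        $fileHandle = $Matches[2]
        Write-Host "Closing handle $fileHandle in kubelet.exe (PID $kubeletPid): $($Matches[3])"
        handle64.exe -nobanner -c $fileHandle -p $kubeletPid -y   # -y skips the y/n confirmation prompt
    }
}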
What did you expect to happen?
Kubelet removes the directory and finishes terminating the pod.
How can we reproduce it (as minimally and precisely as possible)?
I’m not exactly sure why it’s getting stuck here. A Python script in our init container downloads files into the volume, which is then mounted in the main container. The only thing I wondered about is that we mount the volume at its base in the init container, but the main container mounts subPaths that the init container has set up within the volume.
Our pod spec, as minimal as I figured was required, would be something like:
apiVersion: v1
kind: Pod
spec:
  containers:
  - image: exampleImageName:latest
    name: exampleImage
    volumeMounts:
    - mountPath: /Downloads
      name: artifact-dir
      subPath: CBE_DL
    - mountPath: /ISAPI/Documents
      name: artifact-dir
      subPath: ISAPIDocuments
  initContainers:
  - image: initContainerImage:latest
    name: download-artifacts
    volumeMounts:
    - mountPath: /DOWNLOADS
      name: artifact-dir
  nodeSelector:
    kubernetes.io/os: windows
  volumes:
  - name: artifact-dir
    emptyDir: {}
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:45:37Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.14-eks-18ef993", GitCommit:"ac73613dfd25370c18cbbbc6bfc65449397b35c7", GitTreeState:"clean", BuildDate:"2022-07-06T18:06:50Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
BuildNumber Caption OSArchitecture Version
17763 Microsoft Windows Server 2019 Datacenter 64-bit 10.0.17763
Install tools
Container runtime (CRI) and version (if applicable)
docker version
Server: Mirantis Container Runtime
Engine:
Version: 20.10.9
API version: 1.41 (minimum version 1.24)
Go version: go1.16.12m2
Git commit: 9b96ce992b
Built: 12/21/2021 21:33:06
OS/Arch: windows/amd64
Experimental: false
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- State: open
- Created 2 years ago
- Reactions: 1
- Comments: 27 (12 by maintainers)
I can confirm the issue is reproduced without using emptyDir.

Indeed, I agree. Another effect I encountered (though I need to attempt to reproduce it to be sure) is that in some cases this causes notably higher CPU usage, as the kubelet builds up a large queue of failed cleanup operations that it keeps retrying.
Ahhh, thanks for the clarification!
I can confirm this issue is still occurring.
It can be minimally reproduced by creating a pod/deployment with at least two volume mounts, both backed by either a ConfigMap or a Secret. Using Sysinternals Process Explorer I was able to confirm that kubelet is the process holding the lock. For some as yet unknown reason, kubelet successfully deletes the directory for one of the volumes but fails to clean up all of them.
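For reference, a manifest along these lines should be enough to hit it on a Windows node (the names, image and ConfigMap contents here are placeholders I picked, not taken from the original report):

apiVersion: v1
kind: ConfigMap
metadata:
  name: repro-config
data:
  example.txt: hello
---
apiVersion: v1
kind: Pod
metadata:
  name: configmap-unmount-repro
spec:
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: pause
    image: mcr.microsoft.com/oss/kubernetes/pause:3.6   # placeholder; any Windows-compatible image that keeps running
    volumeMounts:
    - name: config-a
      mountPath: /config-a
    - name: config-b
      mountPath: /config-b
  volumes:
  - name: config-a
    configMap:
      name: repro-config
  - name: config-b
    configMap:
      name: repro-config

Apply the manifest, wait for the pod to start, then delete it and watch whether it stays in Terminating with the same UnmountVolume.TearDown errors in the kubelet log.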
Might be able to help out with investigating the root cause and potentially providing a fix.
Repro’d using the YAML shared by Mark with the following platform details
I saw someone run into a similar issue and can reproduce it with the following YAML deployments.
When you try to delete the pod, it gets stuck in a Terminating state and I see the following error messages in the kubelet log.