kubernetes: [Failing Test] [sig-storage] subPath should unmount if pod is deleted while kubelet is down (ci-kubernetes-e2e-gci-gce-serial)
Which jobs are failing:
ci-kubernetes-e2e-gci-gce-serial (gce-cos-master-serial)
Which test(s) are failing:
Four tests matching Kubernetes e2e suite.[sig-storage] * Volumes [Driver: *] [Testpattern: *] subPath should unmount if pod is * deleted while kubelet is down [Disruptive][Slow][LinuxOnly] are failing:
- Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
- Kubernetes e2e suite.[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
- Kubernetes e2e suite.[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
- Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
Since when has it been failing:
Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly] has been failing since June 29th, 2021 (though the first failure was different from the more recent ones). Commit 1151dc1ee, infra commit a97a41ecc, build.
The other three tests have been failing since July 9th, 2021. Commit 617064d73, infra commit 27c63cf72, build.
Testgrid link:
Testgrid: https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-cos-master-serial&width=20
Prow history: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial
Reason for failure:
For CSI Volumes:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:348
Jul 9 10:01:28.012: while cleaning up resource
Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        [
            {
                msg: "persistent Volume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 not deleted by dynamic provisioner: PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s",
                err: {
                    s: "PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s",
                },
            },
        ],
    ]
    persistent Volume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 not deleted by dynamic provisioner: PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:183
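For context on the "still exists within 5m0s" part: the suite's cleanup polls until the dynamically provisioned PV is gone and gives up after a timeout. A minimal sketch of that kind of wait using client-go (the function name, package, and 2-second poll interval are illustrative, not the e2e framework's actual helper):

```go
package e2eutil

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForPVDeletion polls until the named PersistentVolume is deleted or the
// timeout expires (the failure above used a 5m0s timeout).
func waitForPVDeletion(ctx context.Context, c kubernetes.Interface, pvName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := c.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil // the dynamic provisioner removed the PV
		}
		if err != nil {
			return false, err // unexpected API error, stop polling
		}
		return false, nil // PV still exists, keep waiting
	})
}
```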
For In-tree Volumes:
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:341
Jul 9 14:04:08.625: Expected grep stdout to be empty (i.e. no subpath mount found).
Expected
<string>: 963 355 0:40 /provisioning-7593/provisioning-7593 /var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw shared:27 - tmpfs tmpfs rw
966 125 0:40 /provisioning-7593/provisioning-7593 /var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
965 383 0:40 /provisioning-7593/provisioning-7593 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
964 381 0:40 /provisioning-7593/provisioning-7593 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
to be empty
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/utils.go:202
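For context on the grep assertion: the check looks at the node's mount table for anything still mounted under /var/lib/kubelet/pods/<pod-UID>/volume-subpaths/ after the pod is deleted. A rough, standalone sketch of such a check (the function name is made up, and it reads the local /proc/self/mountinfo, whereas the real test runs the grep on the node over SSH):

```go
package e2eutil

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// leftoverSubpathMounts returns mountinfo entries that still reference the
// pod's volume-subpaths directory; after a clean teardown there should be none.
func leftoverSubpathMounts(podUID string) ([]string, error) {
	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	needle := fmt.Sprintf("/pods/%s/volume-subpaths/", podUID)
	var leftovers []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if line := scanner.Text(); strings.Contains(line, needle) {
			leftovers = append(leftovers, line)
		}
	}
	return leftovers, scanner.Err()
}
```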
Anything else we need to know:
There is another open issue for failing tests in this job: https://github.com/kubernetes/kubernetes/issues/102865. The failures reported there are about missing metrics and were occurring before the unmounting problems, so I opened a new issue even though it's the same SIG. These tests don't look related to me, but I don't have enough context; maybe they somehow are.
/sig storage
/kind failing-test
/milestone 1.22
Commits related to this issue
- Fix disruptive subPath test failures This PR fixes two disruptive subpath test failures. 1. disable --check-volume-lifecycle check 2. skip hostpath driver tests on graceful pod deletion test too. ... — committed to jingxu97/kubernetes by jingxu97 3 years ago
- Fix disruptive subPath test failures This PR fixes two disruptive subpath test failures. 1. disable --check-volume-lifecycle check 2. skip hostpath driver tests on graceful pod deletion test too. ... — committed to ehashman/kubernetes by jingxu97 3 years ago
- Fix disruptive subPath test failures This PR fixes two disruptive subpath test failures. 1. disable --check-volume-lifecycle check 2. skip hostpath driver tests on graceful pod deletion test too. ... — committed to msau42/kubernetes by jingxu97 3 years ago
- Fix disruptive subPath test failures This PR fixes two disruptive subpath test failures. 1. disable --check-volume-lifecycle check 2. skip hostpath driver tests on graceful pod deletion test too. ... — committed to openshift/kubernetes by jingxu97 3 years ago
Marking release-blocker per @ehashman's note on Slack: https://kubernetes.slack.com/archives/C2C40FMNF/p1627323614028800
/cc @kubernetes/release-team-leads
(my close/reopen yesterday was accidental, sorry)
We can use existing issue https://github.com/kubernetes/kubernetes/issues/79980 to track the real fix.
/close
There are three possible flows during kubelet restart:
1. In the graceful pod deletion case, the pod still exists (it is not considered terminated by isPodTerminated(); note that this logic changed somewhat after #102344, see the comments there), so the pod's volume is added back to the desired state, and the reconciler tries to set up the volume and fill in the actual state. When the pod's volume is finally removed from the desired state, the reconciler triggers teardown using information from the actual state to unpublish and unstage the volume.
2. If the pod is force deleted, the pod's volume is not added back to the desired state. If volume reconstruction succeeds (the volume spec can be reconstructed), the volume is added back into the actual state, and the reconciler then triggers teardown from the actual state to unpublish and unstage the volume (similar to case 1).
3. The last case is when reconstruction fails. In this csi-hostpath test, reconstruction fails because of an issue related to isLikelyNotMountPoint: it fails to recognize the mount point for the csi-hostpath driver because that driver uses a bind mount of the same device (a sketch of the heuristic follows this list). In this case the reconciler tries to clean up mounts directly, but only the published volume path, not the staged volume path (that logic is not implemented because the information is missing).
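On the isLikelyNotMountPoint point in case 3: the fast heuristic only compares a directory's device number with its parent's, so a bind mount that stays on the same filesystem (as the csi-hostpath staging/publish paths do) looks like an ordinary directory. A minimal, Linux-only sketch of that heuristic, simplified from what the mount utilities do (not the exact implementation):

```go
package e2eutil

import (
	"os"
	"path/filepath"
	"syscall"
)

// isLikelyNotMountPoint mimics the fast heuristic: if a directory has the same
// device number as its parent, assume it is NOT a mount point. A bind mount of
// a directory on the same filesystem keeps the parent's device number, so the
// heuristic wrongly reports "not a mount point", which is why reconstruction
// fails for the csi-hostpath driver.
func isLikelyNotMountPoint(path string) (bool, error) {
	st, err := os.Lstat(path)
	if err != nil {
		return true, err
	}
	parent, err := os.Lstat(filepath.Dir(path))
	if err != nil {
		return true, err
	}
	pathDev := st.Sys().(*syscall.Stat_t).Dev
	parentDev := parent.Sys().(*syscall.Stat_t).Dev
	return pathDev == parentDev, nil
}
```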
As a workaround, we can disable --check-volume-lifecycle for now, if that does not cause other issues. Then we will work on making sure the csi-hostpath driver can pass the volume mount-point check during reconstruction.
@verult @msau42 I was looking at this today too, trying to understand the underlying cause. That --check-volume-lifecycle is why the failures are being reported now, but the part that I didn't understand is why NodeUnstageVolume is not being called for these volumes. In this instance: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/1418942168139042816/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_subPath/should_unmount_if_pod_is_gracefully_deleted_while_kubelet_is_down_Disruptive_Slow_LinuxOnly_/csi-hostpathplugin-0-hostpath.log the DeleteVolume request fails because a staged path is still present. And in the kubelet log, NodeUnpublishVolume is called: https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/1418942168139042816/artifacts/bootstrap-e2e-minion-group-wfgl/kubelet.log
Then we hit this error:
And NodeUnstageVolume is never called for this volume (same for the other failures I looked at), so DeleteVolume keeps failing. So I think you're right that it's triggered by reconstruction, and it looks similar to https://github.com/kubernetes/kubernetes/issues/85476, except in this case the restart happened before NodeUnpublishVolume.
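In other words, the teardown that should follow pod deletion is NodeUnpublishVolume, then NodeUnstageVolume; only once the staging path is gone can the external-provisioner's DeleteVolume pass the lifecycle check. A hedged sketch of that sequence against the CSI Go bindings (the function and the path arguments are placeholders, not kubelet code):

```go
package e2eutil

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// teardownVolume shows the expected call order. In the failures above,
// NodeUnpublishVolume runs but NodeUnstageVolume never does, so the staged
// path survives and DeleteVolume keeps failing the --check-volume-lifecycle check.
func teardownVolume(ctx context.Context, node csi.NodeClient, ctrl csi.ControllerClient,
	volumeID, targetPath, stagingPath string) error {
	// 1. Remove the per-pod publish (bind) mount.
	if _, err := node.NodeUnpublishVolume(ctx, &csi.NodeUnpublishVolumeRequest{
		VolumeId:   volumeID,
		TargetPath: targetPath,
	}); err != nil {
		return err
	}
	// 2. Remove the global staging mount; this is the step that gets skipped
	//    when reconstruction only knows about the published path.
	if _, err := node.NodeUnstageVolume(ctx, &csi.NodeUnstageVolumeRequest{
		VolumeId:          volumeID,
		StagingTargetPath: stagingPath,
	}); err != nil {
		return err
	}
	// 3. Only now can the backing volume be deleted (normally driven by the
	//    external-provisioner; shown here just to illustrate the ordering).
	_, err := ctrl.DeleteVolume(ctx, &csi.DeleteVolumeRequest{VolumeId: volumeID})
	return err
}
```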