kubernetes: [Failing Test] [sig-storage] subPath should unmount if pod is deleted while kubelet is down (ci-kubernetes-e2e-gci-gce-serial)

Which jobs are failing:

ci-kubernetes-e2e-gci-gce-serial (gce-cos-master-serial)

Which test(s) are failing:

Four tests matching Kubernetes e2e suite.[sig-storage] * Volumes [Driver: *] [Testpattern: *] subPath should unmount if pod is * deleted while kubelet is down [Disruptive][Slow][LinuxOnly]:

  • Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
  • Kubernetes e2e suite.[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
  • Kubernetes e2e suite.[sig-storage] In-tree Volumes [Driver: hostPathSymlink] [Testpattern: Inline-volume (default fs)] subPath should unmount if pod is gracefully deleted while kubelet is down [Disruptive][Slow][LinuxOnly]
  • Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly]

Since when has it been failing:

Kubernetes e2e suite.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (default fs)] subPath should unmount if pod is force deleted while kubelet is down [Disruptive][Slow][LinuxOnly] has been failing since June 29th, 2021 (though the first failure was different from the more recent ones). Commit 1151dc1ee, infra commit a97a41ecc, build.

The other three tests have been failing since July 9th, 2021. Commit 617064d73, infra commit 27c63cf72, build.

Testgrid link:

Testgrid: https://k8s-testgrid.appspot.com/sig-release-master-informing#gce-cos-master-serial&width=20
Prow history: https://prow.k8s.io/job-history/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial

Reason for failure:

For CSI Volumes:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:348
Jul  9 10:01:28.012: while cleaning up resource
Unexpected error:
    <errors.aggregate | len:1, cap:1>: [
        [
            {
                msg: "persistent Volume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 not deleted by dynamic provisioner: PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s",
                err: {
                    s: "PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s",
                },
            },
        ],
    ]
    persistent Volume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 not deleted by dynamic provisioner: PersistentVolume pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6 still exists within 5m0s
occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:183
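
For context, the "still exists within 5m0s" error comes from the test's cleanup step, which waits for the dynamically provisioned PV to be removed after the PVC is deleted. The wait amounts to something like the following simplified client-go sketch (an approximation, not the e2e framework's own helper; the PV name is just the one from the failure above):

    // Approximate sketch of the cleanup wait that is timing out above: poll until
    // the dynamically provisioned PV object is gone, give up after five minutes.
    // Uses plain client-go; the real test uses the e2e framework's helpers.
    package main

    import (
        "context"
        "fmt"
        "time"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func waitForPVDeleted(c kubernetes.Interface, pvName string, timeout time.Duration) error {
        return wait.PollImmediate(5*time.Second, timeout, func() (bool, error) {
            _, err := c.CoreV1().PersistentVolumes().Get(context.TODO(), pvName, metav1.GetOptions{})
            if apierrors.IsNotFound(err) {
                return true, nil // PV is gone: the provisioner cleaned it up
            }
            if err != nil {
                return false, err
            }
            return false, nil // PV still exists, keep polling
        })
    }

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)
        // PV name taken from the failure output above.
        if err := waitForPVDeleted(client, "pvc-8c22a1cc-d586-4f3b-a6ce-eba8c7425fa6", 5*time.Minute); err != nil {
            fmt.Println("persistent volume was not deleted by the dynamic provisioner:", err)
        }
    }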

For In-tree Volumes:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/testsuites/subpath.go:341
Jul  9 14:04:08.625: Expected grep stdout to be empty (i.e. no subpath mount found).
Expected
    <string>: 963 355 0:40 /provisioning-7593/provisioning-7593 /var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw shared:27 - tmpfs tmpfs rw
    966 125 0:40 /provisioning-7593/provisioning-7593 /var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
    965 383 0:40 /provisioning-7593/provisioning-7593 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
    964 381 0:40 /provisioning-7593/provisioning-7593 /home/kubernetes/containerized_mounter/rootfs/var/lib/kubelet/pods/1cab2854-f6e2-400c-9136-9986d769b168/volume-subpaths/test-volume/test-container-subpath-inlinevolume-ttr8/0 rw,nosuid,nodev,noexec shared:27 - tmpfs tmpfs rw
    
to be empty
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/utils/utils.go:202
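
For context, the check that produced this output greps the node's mountinfo for leftover subpath mounts belonging to the deleted pod. A rough, self-contained approximation of that check is below (this is not the e2e utility itself; the pod UID is taken from the output above):

    // Rough approximation of the check that fails above (not the e2e utility
    // itself): after the pod is deleted, no mountinfo entry for that pod's
    // volume-subpaths directory should remain on the node.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func leftoverSubpathMounts(podUID string) ([]string, error) {
        f, err := os.Open("/proc/self/mountinfo")
        if err != nil {
            return nil, err
        }
        defer f.Close()

        needle := fmt.Sprintf("/var/lib/kubelet/pods/%s/volume-subpaths/", podUID)
        var leftovers []string
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if strings.Contains(scanner.Text(), needle) {
                leftovers = append(leftovers, scanner.Text())
            }
        }
        return leftovers, scanner.Err()
    }

    func main() {
        // Pod UID taken from the failure output above; on a healthy node this
        // prints nothing once the pod has been torn down.
        mounts, err := leftoverSubpathMounts("1cab2854-f6e2-400c-9136-9986d769b168")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        for _, m := range mounts {
            fmt.Println(m)
        }
    }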

Anything else we need to know:

There is another open issue for failing tests in this job: https://github.com/kubernetes/kubernetes/issues/102865. The failures reported there are about missing metrics and were occurring before the unmounting problems started, so I opened a new issue even though it's the same SIG. The two sets of failures don't look related to me, but I don't have enough context; maybe they somehow are.

/sig storage
/kind failing-test
/milestone 1.22

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 46 (44 by maintainers)

Most upvoted comments

/cc @kubernetes/release-team-leads

(my close/reopen yesterday was accidental, sorry)

We can use existing issue https://github.com/kubernetes/kubernetes/issues/79980 to track the real fix.

/close

There are three possible flows during a kubelet restart:

  1. In the graceful-delete case, the pod still exists (it is not considered terminated by isPodTerminated(); note this logic changed somewhat after #102344, see the comments there), so the pod volume is added back to the desired state, and the reconciler tries to set up the volume and fill in the actual state. When the pod volume is finally deleted from the desired state, the reconciler triggers teardown using the information from the actual state to unpublish and unstage the volume.

  2. If the pod is force deleted, the pod volume is not added back to the desired state. If volume reconstruction succeeds (i.e. the volume spec can be reconstructed), the volume is added back into the actual state, and the reconciler then triggers teardown using the information from the actual state to unpublish and unstage the volume (this part is similar to case 1).

  3. The last case is that reconstruction fails. In this csi-hostpath test, reconstruction fails because of an issue related to isLikelyNotMountPoint: the mount-point check fails for the csi-hostpath driver because it uses a bind mount of the same device (a minimal sketch of that heuristic follows this list). In this case, the reconciler tries to clean up mounts directly, but only the published volume path, not the staged volume path (that logic is not implemented due to missing information).
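
For reference, here is a minimal, Linux-only sketch of the device-comparison heuristic that an isLikelyNotMountPoint-style check relies on; this is an illustration, not the kubelet's code, showing why a bind mount of the same device defeats it:

    // Minimal illustration (not kubelet code) of the device-comparison heuristic
    // behind IsLikelyNotMountPoint-style checks: a path is treated as a mount
    // point only if its device differs from its parent's. A bind mount of the
    // same device keeps the parent's st_dev, so the heuristic reports "not a
    // mount point" and reconstruction cannot identify the staged/published path.
    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "syscall"
    )

    func isLikelyNotMountPoint(path string) (bool, error) {
        var stat, parentStat syscall.Stat_t
        if err := syscall.Lstat(path, &stat); err != nil {
            return true, err
        }
        if err := syscall.Lstat(filepath.Dir(path), &parentStat); err != nil {
            return true, err
        }
        // Same device as the parent => "likely not a mount point". This is
        // exactly what happens for the csi-hostpath driver's bind mounts.
        return stat.Dev == parentStat.Dev, nil
    }

    func main() {
        if len(os.Args) < 2 {
            fmt.Fprintln(os.Stderr, "usage: mountcheck <path>")
            os.Exit(1)
        }
        notMnt, err := isLikelyNotMountPoint(os.Args[1])
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%s likely not a mount point: %v\n", os.Args[1], notMnt)
    }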

As a workaround, we can disable check-volume-lifecycle for now, if that does not cause other issues. Then we will work on making sure the csi-hostpath driver can pass the volume mount-point check during reconstruction.

@verult @msau42 I was looking at this today too, trying to understand the underlying cause. That --check-volume-lifecycle flag is why the failures are being reported now, but the part I didn't understand is why NodeUnstageVolume is not being called for these volumes. In this instance (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/1418942168139042816/artifacts/_sig-storage_CSI_Volumes/_Driver_csi-hostpath_/_Testpattern_Dynamic_PV_default_fs_subPath/should_unmount_if_pod_is_gracefully_deleted_while_kubelet_is_down_Disruptive_Slow_LinuxOnly_/csi-hostpathplugin-0-hostpath.log), the DeleteVolume request fails because a staged path is still present:

E0724 17:28:16.113610       1 server.go:110] GRPC error: rpc error: code = Internal desc = Volume '4fa69976-eca4-11eb-8ee4-6acdf5008fc2' is still used (attached: false, staged: [/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-e4a92b67-2be1-4fd6-91c2-25e9d0d0898d/globalmount], published: []) by 'bootstrap-e2e-minion-group-wfgl' node

And in the kubelet log (https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-serial/1418942168139042816/artifacts/bootstrap-e2e-minion-group-wfgl/kubelet.log), NodeUnpublishVolume is called:

I0724 17:27:41.022546   13572 csi_client.go:357] kubernetes.io/csi: calling NodeUnpublishVolume rpc: [volid=4fa69976-eca4-11eb-8ee4-6acdf5008fc2, target_path=/var/lib/kubelet/pods/8cd17601-db7d-40eb-b048-74aa13971930/volumes/kubernetes.io~csi/pvc-e4a92b67-2be1-4fd6-91c2-25e9d0d0898d/mount

Then we hit this error:

E0724 17:27:41.028389   13572 operation_generator.go:881] UnmountVolume.MarkVolumeAsUnmounted failed for volume "" (UniqueName: "pvc-e4a92b67-2be1-4fd6-91c2-25e9d0d0898d") pod "8cd17601-db7d-40eb-b048-74aa13971930" (UID: "8cd17601-db7d-40eb-b048-74aa13971930") : no volume with the name "pvc-e4a92b67-2be1-4fd6-91c2-25e9d0d0898d" exists in the list of attached volumes

And NodeUnstageVolume is never called for this volume (the same holds for the other failures I looked at), so DeleteVolume keeps failing. So I think you're right that it's triggered by reconstruction; it looks similar to https://github.com/kubernetes/kubernetes/issues/85476, except that in this case the restart happened before NodeUnpublishVolume.
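
To make that ordering concrete, here is a much-simplified sketch (hypothetical types and names, not the kubelet's code) of why NodeUnstageVolume never runs once the volume is missing from the actual state of world:

    // Much-simplified sketch with hypothetical types (not kubelet code) of the
    // ordering described above: NodeUnstageVolume is only issued for volumes the
    // kubelet still tracks in its actual state of world. When reconstruction
    // after a restart fails to re-add the volume, MarkVolumeAsUnmounted errors
    // out, unstaging never happens, and the driver keeps rejecting DeleteVolume.
    package main

    import "fmt"

    type actualStateOfWorld struct {
        stagedVolumes map[string]bool // volume name -> staged on this node
    }

    func (asw *actualStateOfWorld) markVolumeAsUnmounted(volumeName string) error {
        if !asw.stagedVolumes[volumeName] {
            return fmt.Errorf("no volume with the name %q exists in the list of attached volumes", volumeName)
        }
        // Would record that no pod mounts the volume any more, which is what
        // later allows the reconciler to issue NodeUnstageVolume.
        return nil
    }

    func teardown(asw *actualStateOfWorld, volumeName string) {
        // Analogous to NodeUnpublishVolume: remove the per-pod publish path.
        fmt.Println("NodeUnpublishVolume:", volumeName)

        if err := asw.markVolumeAsUnmounted(volumeName); err != nil {
            // This corresponds to the kubelet error quoted above; teardown stops
            // here and NodeUnstageVolume is never issued for the volume.
            fmt.Println("UnmountVolume.MarkVolumeAsUnmounted failed:", err)
            return
        }

        // Only reached when the volume is still tracked in the actual state.
        fmt.Println("NodeUnstageVolume:", volumeName)
    }

    func main() {
        // After a kubelet restart where reconstruction failed, the actual state
        // no longer contains the volume:
        asw := &actualStateOfWorld{stagedVolumes: map[string]bool{}}
        teardown(asw, "pvc-e4a92b67-2be1-4fd6-91c2-25e9d0d0898d")
    }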