kubernetes: CSI volume reconstruction does not work for ephemeral volumes

When a pod is marked as deleted while kubelet is down or being restarted, the newly started kubelet does not clean up the pod’s CSI filesystem volumes.

The newly started kubelet tries to reconstruct the volume using the CSI plugin’s ConstructVolumeSpec function. This part appears to work: the CSI volume plugin loads its JSON file.

But then VolumeManager checks whether the volume is still mounted in the /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount directory.

There are two issues:

  1. CSI does not require volumes to be presented as mounts. They can be plain directories with files in them. This will be the case for most in-line volumes.

  2. Even if the CSI driver used mount, kubelet mounts it into /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d/mount. Checking /var/lib/kubelet/pods/9440e7e5-d454-4555-84b7-d72e43ec4b3a/volumes/kubernetes.io~csi/pvc-45640a32-4ba3-4a7d-ad4b-087281f1460d does not make sense. Kubelet checks the right directory given by GetPath().
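
To illustrate the first point, a presence check that does not depend on the mount table could look at the directory contents instead. The following is only a sketch, not kubelet code: `volumeDirPopulated` and `exampleCheck` are hypothetical names, and treating "directory is non-empty" as "volume data is still there" is an assumption that may not hold for every CSI driver.

```go
package main

import (
	"errors"
	"io/fs"
	"os"
	"path/filepath"
)

// volumeDirPopulated reports whether dir exists and holds at least one entry.
// Hypothetical helper for illustration; kubelet has no function by this name.
func volumeDirPopulated(dir string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil // nothing left on disk
	}
	if err != nil {
		return false, err
	}
	return len(entries) > 0, nil
}

// exampleCheck builds an empty and a populated directory under a temporary
// root and runs the check on both.
func exampleCheck() (empty, populated bool, err error) {
	root, err := os.MkdirTemp("", "csi-volume-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(root)

	emptyDir := filepath.Join(root, "empty")
	fullDir := filepath.Join(root, "full")
	for _, d := range []string{emptyDir, fullDir} {
		if err := os.Mkdir(d, 0o755); err != nil {
			return false, false, err
		}
	}
	if err := os.WriteFile(filepath.Join(fullDir, "data"), []byte("x"), 0o644); err != nil {
		return false, false, err
	}

	if empty, err = volumeDirPopulated(emptyDir); err != nil {
		return false, false, err
	}
	populated, err = volumeDirPopulated(fullDir)
	return empty, populated, err
}
```

Neither directory in `exampleCheck` is a mount point, yet the populated one is still distinguishable from the empty one, which is the situation described for in-line volumes that are plain directories.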

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 29 (23 by maintainers)

Most upvoted comments

  1. The IsLikelyNotMountPoint() call that verult already called out, which is known not to work for bind mounts. The mount point is never found, so volume reconstruction fails with “… is not mounted”, the PVC never gets added to the list of attached volumes, and that’s why we end up hitting this error:

There is ongoing work to change IsNotMountPoint to use the openat2(2) syscall, via MountedFast, to detect mount points. With openat2(2), bind mounts are detected correctly and quickly, but it requires kernel v5.6 or later. We could use openat2(2) in IsLikelyNotMountPoint as well. That would resolve the bind mount issue at least for kernel v5.6+, and we could then focus on how to solve it for older kernels.
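
For context on why bind mounts slip through, here is a rough sketch of a device-comparison heuristic of the kind IsLikelyNotMountPoint is understood to rely on: compare the device ID of the path with that of its parent. This is a simplified illustration under that assumption, not the actual mount-utils code; `deviceID`, `likelyNotMountPoint` and `examplePlainDir` are hypothetical names, and the code is Unix-only because it uses `syscall.Stat_t`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// deviceID returns the st_dev value of path (Unix-only).
func deviceID(path string) (uint64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return 0, fmt.Errorf("no raw stat data for %q", path)
	}
	return uint64(st.Dev), nil
}

// likelyNotMountPoint returns true when path and its parent directory sit on
// the same device. A bind mount whose source is on the same filesystem also
// shares the parent's device, so it is reported as "not a mount point" too.
func likelyNotMountPoint(path string) (bool, error) {
	self, err := deviceID(path)
	if err != nil {
		return true, err
	}
	parent, err := deviceID(filepath.Dir(filepath.Clean(path)))
	if err != nil {
		return true, err
	}
	return self == parent, nil
}

// examplePlainDir runs the heuristic on a freshly created subdirectory.
func examplePlainDir() (bool, error) {
	root, err := os.MkdirTemp("", "mount-sketch")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	sub := filepath.Join(root, "volume")
	if err := os.Mkdir(sub, 0o755); err != nil {
		return false, err
	}
	return likelyNotMountPoint(sub)
}
```

A plain subdirectory and a same-filesystem bind mount look identical to this comparison, which matches the “… is not mounted” failure described above.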

I added a couple of extra debug lines to the actual_state_of_world.go file and can see that when DeletePodFromVolume checks whether the volume exists, it doesn’t find it in the asw.attachedVolumes map, like so:

func (asw *actualStateOfWorld) DeletePodFromVolume(
	podName volumetypes.UniquePodName, volumeName v1.UniqueVolumeName) error {
	asw.Lock()
	defer asw.Unlock()

	volumeObj, volumeExists := asw.attachedVolumes[volumeName]
	klog.InfoS("DEBUG:", "volumeName", volumeName)
	klog.InfoS("DEBUG:", "volumeObj", volumeObj, "volumeExists", volumeExists, "asw.attachedVolumes", asw.attachedVolumes)
	klog.Info("DEBUG:", "volumeExists", volumeExists)
	if !volumeExists {
		return fmt.Errorf(
			"no volume with the name %q exists in the list of attached volumes",
			volumeName)
	}

	_, podExists := volumeObj.mountedPods[podName]
	if podExists {
		delete(asw.attachedVolumes[volumeName].mountedPods, podName)
	}

	return nil
}

Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.677329    4740 actual_state_of_world.go:662] "DEBUG:" volumeName=pvc-295aa817-8cdd-4dac-818a-6c81b79778f5
Oct 07 00:04:30 e2e-test-build-minion-group-klvd kubelet[4740]: I1007 00:04:30.679563    4740 actual_state_of_world.go:663] "DEBUG:" volumeObj={volumeName: mountedPods:map[] spec:<nil> pluginName: pluginIsAttachable:false deviceMountState: devicePath: deviceMountPath: volumeInUseErrorForExpansion:false} volumeExists=false

Perhaps we should look up the name of the attached volume differently for ephemeral volumes?
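
The all-empty volumeObj in the log above is what a Go map lookup returns for a missing key: the zero value of the element type plus `false`. A minimal stand-alone sketch of that behavior follows. The types are trimmed-down, hypothetical stand-ins for the kubelet ones, plain strings replace the real name types, and locking is omitted.

```go
package main

import "fmt"

// attachedVolume is a reduced, hypothetical stand-in for the kubelet struct.
type attachedVolume struct {
	volumeName  string
	pluginName  string
	mountedPods map[string]struct{}
}

// stateOfWorld holds attached volumes keyed by volume name.
type stateOfWorld struct {
	attachedVolumes map[string]attachedVolume
}

// deletePodFromVolume mirrors the shape of the lookup quoted above: a missing
// key yields a zero-value attachedVolume and ok == false, which becomes the
// "no volume with the name" error.
func (s *stateOfWorld) deletePodFromVolume(podName, volumeName string) error {
	vol, ok := s.attachedVolumes[volumeName]
	if !ok {
		return fmt.Errorf(
			"no volume with the name %q exists in the list of attached volumes",
			volumeName)
	}
	// Deleting a key that is absent from mountedPods is a no-op.
	delete(vol.mountedPods, podName)
	return nil
}
```

If reconstruction never adds the volume to the map, or adds it under a different key than the one used at teardown, every later call for that name takes the error branch.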