longhorn: [BUG] Application couldn't mount the volume after expanding the volume

Describe the bug The application (Harbor in this case) couldn't mount the volume after it was expanded on the Longhorn UI

To Reproduce Steps to reproduce the behavior:

  1. Scale down the application replica to 0
  2. Expand the volume from 200GB to 400GB on Longhorn UI
  3. The volume expanded successfully on the Longhorn UI
  4. Scale up the application replica to 1
  5. The application failed to mount the volume, with the error MountVolume.Setup failed while expanding volume for the volume shown on the Rancher UI

Expected behavior The application should be able to mount the volume

Log If applicable, add the Longhorn managers’ log when the issue happens.

You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment:

  • Longhorn version: v1.0.2
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 10
  • Node config
    • OS type and version: RHEL 7.7
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster: 40

Additional context The problematic volume is pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d; it looks like the volume expansion was successful:

2021-03-24T23:51:44.700769039Z time="2021-03-24T23:51:44Z" level=info msg="Prepare to start frontend blockdev then try to expand the file system for volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"
2021-03-24T23:51:45.094561658Z time="2021-03-24T23:51:45Z" level=info msg="The volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d is unmounted, hence it will be temporarily mounted on /tmp/mnt/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d for file system expansion"
2021-03-24T23:51:45.629420993Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-1767df06] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:51:45.631298571Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-f06ca60b] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:51:45.632146834Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-44a7ef88] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:52:06.040960744Z time="2021-03-24T23:52:06Z" level=info msg="Succeeded to expand the file system for the volume "pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d
2021-03-24T23:52:06.04171147Z time="2021-03-24T23:52:06Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\", UID:\"bbad3afc-34f4-43d5-bc4e-85169a6df781\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"248509076\", FieldPath:\"\"}): type: 'Normal' reason: 'SucceededExpansion' Succeeds to expand the volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d to size 429496729600, will automatically detach it if it's not DR volume"

But there are Invalid volume state attached for expansion errors in the csi-resizer.log:

2021-03-25T02:53:40.722630606Z I0325 02:53:40.721645       1 connection.go:180] GRPC call: /csi.v1.Controller/ControllerExpandVolume
2021-03-25T02:53:40.722676042Z I0325 02:53:40.721766       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"harbor", Name:"harbor-harbor-registry", UID:"2a83b182-2a43-4560-b3c8-b18cfcb2073d", APIVersion:"v1", ResourceVersion:"248672449", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d
2021-03-25T02:53:40.724432489Z I0325 02:53:40.721661       1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":429496729600},"volume_id":"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"}
2021-03-25T02:53:40.725461649Z time="2021-03-25T02:53:40Z" level=info msg="GRPC call: /csi.v1.Controller/ControllerExpandVolume"
2021-03-25T02:53:40.730727271Z time="2021-03-25T02:53:40Z" level=info msg="GRPC request: {\"capacity_range\":{\"required_bytes\":429496729600},\"volume_id\":\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\"}"
2021-03-25T02:53:40.730829275Z time="2021-03-25T02:53:40Z" level=info msg="ControllerServer ControllerExpandVolume req: volume_id:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\" capacity_range:<required_bytes:429496729600 > "
2021-03-25T02:53:40.735159769Z 172.18.3.21 - - [25/Mar/2021:02:53:40 +0000] "GET /v1/volumes/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d HTTP/1.1" 200 6934 "" "Go-http-client/1.1"
2021-03-25T02:53:40.735429951Z time="2021-03-25T02:53:40Z" level=error msg="GRPC error: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion"
2021-03-25T02:53:40.736007718Z I0325 02:53:40.735408       1 connection.go:183] GRPC response: {}
2021-03-25T02:53:40.736039847Z I0325 02:53:40.735906       1 connection.go:184] GRPC error: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
2021-03-25T02:53:40.736118164Z E0325 02:53:40.735994       1 controller.go:360] Resize volume "pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d" by resizer "driver.longhorn.io" failed: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
2021-03-25T02:53:40.736193951Z I0325 02:53:40.736103       1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"harbor", Name:"harbor-harbor-registry", UID:"2a83b182-2a43-4560-b3c8-b18cfcb2073d", APIVersion:"v1", ResourceVersion:"248672449", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d failed: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion

We found that the problematic PVC has a Resizing status in its conditions section when checked with kubectl get pvc <PVC> -o yaml
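
For reference, here is a minimal client-go sketch (not from the original report; the namespace and PVC name are taken from the csi-resizer events above, and the kubeconfig setup is assumed) that surfaces the same stuck condition programmatically:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pvc, err := client.CoreV1().PersistentVolumeClaims("harbor").Get(
		context.TODO(), "harbor-harbor-registry", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// A healthy PVC has no Resizing condition; the problematic one keeps it indefinitely.
	for _, cond := range pvc.Status.Conditions {
		if cond.Type == corev1.PersistentVolumeClaimResizing {
			fmt.Printf("PVC stuck in Resizing since %v: %s\n", cond.LastTransitionTime, cond.Message)
		}
	}
}
```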

We managed to resolve the issue by recreating the PVC and pointing it back to the expanded volume: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes

gz#16045

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 22 (16 by maintainers)

Most upvoted comments

@khushboo-rancher great job on catching this 🚀 🚀 🚀

@cclhsu No, your problem is actually caused by modifying the raw cached instance in the cache. https://github.com/longhorn/longhorn-manager/blob/acb92e26e0b94b157dba18764033db32fe8c54f6/manager/volume.go#L627 https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/datastore/kubernetes.go#L437

Since the above is an admission controller, the PVC would never actually get updated. But since the cached instance is modified, the next call on the same manager compares against the cached instance instead of the real PVC.
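
A minimal sketch of the problematic pattern (illustrative names, not the actual longhorn-manager code): the object comes straight from the informer cache and is mutated in place, so even when the update call is rejected, the cache already carries the new value:

```go
package expansion

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	listerv1 "k8s.io/client-go/listers/core/v1"
)

func resizePVCUnsafe(client kubernetes.Interface, lister listerv1.PersistentVolumeClaimLister,
	ns, name string, newSize resource.Quantity) error {
	pvc, err := lister.PersistentVolumeClaims(ns).Get(name)
	if err != nil {
		return err
	}
	// BUG: pvc is the shared cache object, so this mutates the informer cache directly.
	pvc.Spec.Resources.Requests[corev1.ResourceStorage] = newSize
	// If the update is rejected (e.g. by an admission webhook), the API server keeps the
	// old size, but the local cache now claims the new one, and later comparisons on the
	// same manager are made against the stale cached instance.
	_, err = client.CoreV1().PersistentVolumeClaims(ns).Update(context.TODO(), pvc, metav1.UpdateOptions{})
	return err
}
```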

Just leaving some notes here related to the expansion/resizer:

We may need to modify the implementation of the UI expansion: if there is a PVC, we need to update the PVC size directly.

Previously, Longhorn needed to support Kubernetes v1.15, in which the CSI resizer may not be available. Now that Longhorn supports Kubernetes v1.16+ only, the CSI resizer should be available by default. Hence we can rely directly on CSI expansion when Longhorn receives a UI expansion request for a volume backed by a PVC.
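
A minimal sketch of that idea, assuming client-go (the function name is illustrative, not the actual manager code): translate a UI expansion request for a PVC-backed volume into a PVC spec update, so the external csi-resizer drives the expansion:

```go
package expansion

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func expandViaPVC(client kubernetes.Interface, ns, name string, newSize resource.Quantity) error {
	pvc, err := client.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Bumping spec.resources.requests.storage is what the external csi-resizer reacts to;
	// Longhorn then only needs to serve the resulting ControllerExpandVolume call.
	pvc.Spec.Resources.Requests[corev1.ResourceStorage] = newSize
	_, err = client.CoreV1().PersistentVolumeClaims(ns).Update(context.TODO(), pvc, metav1.UpdateOptions{})
	return err
}
```

For example, the expansion from this report would roughly correspond to expandViaPVC(client, "harbor", "harbor-harbor-registry", resource.MustParse("400Gi")).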

Here is a case of how this could happen, as mentioned during the meeting:

  • scale down the workload
  • use the Longhorn UI to expand the volume
  • wait for the expansion to finish

These two steps could be combined

  • scale up the workload
  • update the PVC size to match the new size

The error happens here

  • now the csi-resizer cannot call expand / the CSI plugin errors out since we don't support online expansion

Therefore the PVC is stuck in the Resizing state.

The same issue exists for the persistent volume during creation/cleanup: https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/datastore/kubernetes.go#L413

Return deep copies or, where needed, rename the functions to add a *RO suffix.
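
A minimal sketch of that fix (names are illustrative, loosely modeled on the datastore linked above): the default getter returns a deep copy, while a *RO variant hands out the shared cache object for read-only callers:

```go
package expansion

import (
	corev1 "k8s.io/api/core/v1"
	listerv1 "k8s.io/client-go/listers/core/v1"
)

type DataStore struct {
	pvLister listerv1.PersistentVolumeLister
}

// GetPV returns a deep copy of the cached PV, so callers may mutate the result freely.
func (s *DataStore) GetPV(name string) (*corev1.PersistentVolume, error) {
	pv, err := s.pvLister.Get(name)
	if err != nil {
		return nil, err
	}
	return pv.DeepCopy(), nil
}

// GetPVRO returns the shared cache object; the *RO suffix signals it must not be modified.
func (s *DataStore) GetPVRO(name string) (*corev1.PersistentVolume, error) {
	return s.pvLister.Get(name)
}
```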

Expansion of a volume whose PVC was not provisioned dynamically (volume, PV, and PVC created using the LH GUI) - Fail - Inconsistent behavior. The first time, the UI shows the error message unable to expand volume volume-test-1: persistentvolumeclaims "volume-test-1" is forbidden: only dynamically provisioned pvc can be resized and the storageclass that provisions the pvc must support resize and blocks the expansion. The second time, the volume gets expanded.

Per discussion with @khushboo-rancher on the test steps:

  1. Create a volume using LH GUI.
  2. Create pv/pvc using LH GUI.
  3. Click expand and do the expansion - shows an error.
  4. Click expand again and do the expansion - works fine.

It seems that if we expand immediately after creating the PVC, the PVC is still in the Pending state and we hit the error.

NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test   Pending    test     1Gi        RWO            longhorn       4s

If we wait a few seconds for the PVC to reach the Bound state, the expansion succeeds.

NAME   STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test   Bound    test     1Gi        RWO            longhorn       25s

Should the UI check the PVC condition before enabling the expand option?
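
A minimal sketch of such a guard (illustrative, not the actual Longhorn UI/manager code): only offer or accept the expand action once the PVC has reached the Bound phase:

```go
package expansion

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func checkExpandable(client kubernetes.Interface, ns, name string) error {
	pvc, err := client.CoreV1().PersistentVolumeClaims(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// A Pending PVC (e.g. one just created via the LH GUI) cannot be resized yet.
	if pvc.Status.Phase != corev1.ClaimBound {
		return fmt.Errorf("PVC %s/%s is %s; expansion should only be allowed once it is Bound",
			ns, name, pvc.Status.Phase)
	}
	return nil
}
```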

As Joshua mentioned above, the resizer error is caused by modifying the PVC size while the volume is attached.

As for the mount failure log, it's probably a transient error. The volume mount eventually works: https://github.com/longhorn/longhorn/issues/1712

2021-03-24T23:53:10.928118928Z time="2021-03-24T23:53:10Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume"
2021-03-24T23:53:10.930552124Z time="2021-03-24T23:53:10Z" level=info msg="GRPC request: {\"target_path\":\"/var/lib/kubelet/pods/cb3073a2-7f9f-499c-85b9-c95965565687/volumes/kubernetes.io~csi/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"baseImage\":\"\",\"fromBackup\":\"\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1579154955466-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\"}"
2021-03-24T23:53:10.930650724Z time="2021-03-24T23:53:10Z" level=info msg="NodeServer NodePublishVolume req: volume_id:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\" target_path:\"/var/lib/kubelet/pods/cb3073a2-7f9f-499c-85b9-c95965565687/volumes/kubernetes.io~csi/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d/mount\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"baseImage\" value:\"\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"numberOfReplicas\" value:\"3\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1579154955466-8081-driver.longhorn.io\" > "
2021-03-24T23:53:10.933349331Z time="2021-03-24T23:53:10Z" level=error msg="GRPC error: rpc error: code = InvalidArgument desc = There is no block device frontend for volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"
2021-03-24T23:53:11.980883489Z time="2021-03-24T23:53:11Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetCapabilities"
2021-03-24T23:53:11.981705602Z time="2021-03-24T23:53:11Z" level=info msg="GRPC request: {}"
2021-03-24T23:53:11.983570369Z time="2021-03-24T23:53:11Z" level=info msg="GRPC response: {\"capabilities\":[{\"Type\":{\"Rpc\":{\"type\":3}}}]}"
2021-03-24T23:53:12.002008284Z time="2021-03-24T23:53:12Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetCapabilities"
2021-03-24T23:53:12.003977656Z time="2021-03-24T23:53:12Z" level=info msg="GRPC request: {}"
2021-03-24T23:53:12.007987441Z time="2021-03-24T23:53:12Z" level=info msg="GRPC response: {\"capabilities\":[{\"Type\":{\"Rpc\":{\"type\":3}}}]}"
2021-03-24T23:53:12.010250653Z time="2021-03-24T23:53:12Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetCapabilities"
2021-03-24T23:53:12.01089224Z time="2021-03-24T23:53:12Z" level=info msg="GRPC request: {}"
2021-03-24T23:53:12.012280188Z time="2021-03-24T23:53:12Z" level=info msg="GRPC response: {\"capabilities\":[{\"Type\":{\"Rpc\":{\"type\":3}}}]}"
2021-03-24T23:53:12.013979146Z time="2021-03-24T23:53:12Z" level=info msg="GRPC call: /csi.v1.Node/NodeGetCapabilities"
2021-03-24T23:53:12.014730136Z time="2021-03-24T23:53:12Z" level=info msg="GRPC request: {}"
2021-03-24T23:53:12.016300926Z time="2021-03-24T23:53:12Z" level=info msg="GRPC response: {\"capabilities\":[{\"Type\":{\"Rpc\":{\"type\":3}}}]}"
2021-03-24T23:53:12.026208326Z time="2021-03-24T23:53:12Z" level=info msg="GRPC call: /csi.v1.Node/NodePublishVolume"
2021-03-24T23:53:12.029471023Z time="2021-03-24T23:53:12Z" level=info msg="GRPC request: {\"target_path\":\"/var/lib/kubelet/pods/cb3073a2-7f9f-499c-85b9-c95965565687/volumes/kubernetes.io~csi/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d/mount\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"baseImage\":\"\",\"fromBackup\":\"\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1579154955466-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\"}"
2021-03-24T23:53:12.029652373Z time="2021-03-24T23:53:12Z" level=info msg="NodeServer NodePublishVolume req: volume_id:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\" target_path:\"/var/lib/kubelet/pods/cb3073a2-7f9f-499c-85b9-c95965565687/volumes/kubernetes.io~csi/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d/mount\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"baseImage\" value:\"\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"numberOfReplicas\" value:\"3\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1579154955466-8081-driver.longhorn.io\" > "
2021-03-24T23:53:12.623907954Z time="2021-03-24T23:53:12Z" level=debug msg="NodePublishVolume: done MountVolume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"