longhorn: [BUG] Application couldn't mount the volume after expanded the volume
Describe the bug Application (Harbor in this case) couldn’t mount volume after expanded on the Longhorn UI
To Reproduce Steps to reproduce the behavior:
- Scale down the application replica to 0
- Expand the volume from 200GB to 400GB on Longhorn UI
- The volume expanded successfully on the Longhorn UI
- Scale up the application replica to 1
- The application failed to mount the volume, seeing error
MountVolume.Setup failed while expanding volume for volumeon Rancher UI
Expected behavior The application should be able to mount the volume
Log If applicable, add the Longhorn managers’ log when the issue happens.
You can also attach a Support Bundle here. You can generate a Support Bundle using the link at the footer of the Longhorn UI.
Environment:
- Longhorn version: v1.0.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 10
- Node config
- OS type and version: RHEL 7.7
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
- Number of Longhorn volumes in the cluster: 40
Additional context
Problematic volume is pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d, it looks like the volume expansion was successful:
2021-03-24T23:51:44.700769039Z time="2021-03-24T23:51:44Z" level=info msg="Prepare to start frontend blockdev then try to expand the file system for volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"
2021-03-24T23:51:45.094561658Z time="2021-03-24T23:51:45Z" level=info msg="The volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d is unmounted, hence it will be temporarily mounted on /tmp/mnt/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d for file system expansion"
2021-03-24T23:51:45.629420993Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-1767df06] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:51:45.631298571Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-f06ca60b] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:51:45.632146834Z [pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d-r-44a7ef88] time="2021-03-24T23:51:45Z" level=info msg="Reloading the revision counter before processing the first write, the current revision cache is 1117780, the latest revision counter in file is 1117780"
2021-03-24T23:52:06.040960744Z time="2021-03-24T23:52:06Z" level=info msg="Succeeded to expand the file system for the volume "pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d
2021-03-24T23:52:06.04171147Z time="2021-03-24T23:52:06Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\", UID:\"bbad3afc-34f4-43d5-bc4e-85169a6df781\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"248509076\", FieldPath:\"\"}): type: 'Normal' reason: 'SucceededExpansion' Succeeds to expand the volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d to size 429496729600, will automatically detach it if it's not DR volume"
But there are Invalid volume state attached for expansion errors in the csi-resizer.log:
2021-03-25T02:53:40.722630606Z I0325 02:53:40.721645 1 connection.go:180] GRPC call: /csi.v1.Controller/ControllerExpandVolume
2021-03-25T02:53:40.722676042Z I0325 02:53:40.721766 1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"harbor", Name:"harbor-harbor-registry", UID:"2a83b182-2a43-4560-b3c8-b18cfcb2073d", APIVersion:"v1", ResourceVersion:"248672449", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d
2021-03-25T02:53:40.724432489Z I0325 02:53:40.721661 1 connection.go:181] GRPC request: {"capacity_range":{"required_bytes":429496729600},"volume_id":"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d"}
2021-03-25T02:53:40.725461649Z time="2021-03-25T02:53:40Z" level=info msg="GRPC call: /csi.v1.Controller/ControllerExpandVolume"
2021-03-25T02:53:40.730727271Z time="2021-03-25T02:53:40Z" level=info msg="GRPC request: {\"capacity_range\":{\"required_bytes\":429496729600},\"volume_id\":\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\"}"
2021-03-25T02:53:40.730829275Z time="2021-03-25T02:53:40Z" level=info msg="ControllerServer ControllerExpandVolume req: volume_id:\"pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d\" capacity_range:<required_bytes:429496729600 > "
2021-03-25T02:53:40.735159769Z 172.18.3.21 - - [25/Mar/2021:02:53:40 +0000] "GET /v1/volumes/pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d HTTP/1.1" 200 6934 "" "Go-http-client/1.1"
2021-03-25T02:53:40.735429951Z time="2021-03-25T02:53:40Z" level=error msg="GRPC error: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion"
2021-03-25T02:53:40.736007718Z I0325 02:53:40.735408 1 connection.go:183] GRPC response: {}
2021-03-25T02:53:40.736039847Z I0325 02:53:40.735906 1 connection.go:184] GRPC error: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
2021-03-25T02:53:40.736118164Z E0325 02:53:40.735994 1 controller.go:360] Resize volume "pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d" by resizer "driver.longhorn.io" failed: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
2021-03-25T02:53:40.736193951Z I0325 02:53:40.736103 1 event.go:209] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"harbor", Name:"harbor-harbor-registry", UID:"2a83b182-2a43-4560-b3c8-b18cfcb2073d", APIVersion:"v1", ResourceVersion:"248672449", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-2a83b182-2a43-4560-b3c8-b18cfcb2073d failed: rpc error: code = FailedPrecondition desc = Invalid volume state attached for expansion
We found the problematic PVC has resizing status in the condition section when we check with kubectl get pvc <PVC> -o yaml
We managed to resolve the issue by recreating a new PVC to pointing back to the expanded volume: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#recovering-from-failure-when-expanding-volumes
gz#16045
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 22 (16 by maintainers)
@khushboo-rancher great job on catching this 🚀 🚀 🚀
@cclhsu No your problem is actually caused by modifying the raw cached instance in the cache. https://github.com/longhorn/longhorn-manager/blob/acb92e26e0b94b157dba18764033db32fe8c54f6/manager/volume.go#L627 https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/datastore/kubernetes.go#L437
Since the above is an admission controller, the pvc would never actually get updated. But since the cached instance is modified, the next call on the same manager will compare the cached instance instead of the real pvc.
Just leaving some notes here related to the expansion/resizer:
We may need to modify the implementation of the UI expansion: If there is a PVC, we need to directly update the PVC size.
Previously, Longhorn needs to support Kubernetes v1.15, in which version the CSI resizer may be not available. But now, Longhorn supports Kubernetes v1.16+ only, the CSI resizer should be available by default. Hence we can directly rely on the CSI expansion when Longhorn receives an expansion request for a volume handled by a PVC from UI.
Here is a case how this could happen as mentioned during the meeting:
These 2 steps could be combined
Error happens here
Therefore pvc is stuck in
rezisingstate.Same issue for the persistent volume during creation / cleanup https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/datastore/kubernetes.go#L413
return deep copies or where needed rename the functions to add a *RO suffix
Per discussion with @khushboo-rancher on test step.
It seems that if we expand after creating PVC immediately, PVC is still in state of
Pendingand will encounter the error.If we wait a few seconds for PVC to turn into state
Bound, expanding will succeed.Should UI check the PVC condition before enable
expandoption?As Joshua mentioned above, the resizer error is by modifying the PVC size when the volume is attached.
As for the mount failure log, it’s probably a transient errors. The volume mount works finally: https://github.com/longhorn/longhorn/issues/1712