longhorn: [BUG] Failure to create a volume on a retry since v1.1.1
Describe the bug We started upgrading our production clusters from v1.1.0 to v1.1.1 and found several issues. The root issue is timeouts on CSI operations (causing even a detach to last for tens of minutes) during and after the engine upgrades. We can't pin down the cause of this yet, but the timeouts uncovered another, easily traceable issue.
When creating a new volume and the first attempt fails due to a timeout, the second attempt (as an automatic retry) fails in the CSI provisioner with the following error:
E0524 17:29:20.796731 1 controller.go:939] error syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf": failed to provision volume with StorageClass "longhorn-db": rpc error: code = AlreadyExists desc = CreateVolume: cannot change volume size from 109051904 to 107374183
This leads to the volume never reporting successful provisioning even though it was created in Longhorn.
We managed to trace the logic: the volume size is rounded in the CreateVolume function, but the rounded size never gets persisted back to the original volume spec. So when a retry happens, it finds the already-created volume through the LH API but fails validation because the requested size (not rounded) differs from the size of the existing volume.
This should be easy to fix by rounding the requested size before looking up the existing volume in LH; a rough sketch follows the links below.
Validation check: https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/csi/controller_server.go#L115-L119
Volume size rounding: https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/csi/controller_server.go#L192-L193
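For illustration, here is a minimal sketch of that idea, assuming the 2*1024*1024 alignment implied by the reproduce steps and using illustrative helper names (this is not the actual longhorn-manager code, just the shape of the fix):

```go
package csi

import "fmt"

// Sketch only: illustrative names, not the actual longhorn-manager code.

// roundUpSize rounds a requested byte count up to the assumed 2*1024*1024
// alignment used for Longhorn volume sizes (e.g. 107374183 -> 109051904).
func roundUpSize(requested int64) int64 {
	const alignment int64 = 2 * 1024 * 1024
	if requested <= 0 {
		return alignment
	}
	if rem := requested % alignment; rem != 0 {
		return requested + alignment - rem
	}
	return requested
}

// checkExistingVolume compares the *rounded* requested size against the size
// of a volume that already exists in Longhorn, so a retried CreateVolume call
// after a timeout is treated as idempotent instead of failing validation.
func checkExistingVolume(existingSizeBytes, requestedBytes int64) error {
	if existingSizeBytes != roundUpSize(requestedBytes) {
		return fmt.Errorf("cannot change volume size from %d to %d",
			existingSizeBytes, requestedBytes)
	}
	return nil
}
```

With that change, the retried request in the log below (required_bytes 107374183) would match the 109051904-byte volume that was already created instead of failing with AlreadyExists.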
To Reproduce Steps to reproduce the behavior:
- Create a new PVC with a requested size that's not a multiple of 2*1024*1024 (e.g. 107374183)
- Make the first LH volume creation request fail somehow (in our case, timeouts)
- See the reported error that never goes away
Expected behavior The retried CSI operations should be idempotent.
Log
W0524 17:29:20.796712 1 controller.go:916] Retrying syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf", failure 7
E0524 17:29:20.796731 1 controller.go:939] error syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf": failed to provision volume with StorageClass "longhorn-db": rpc error: code = AlreadyExists desc = CreateVolume: cannot change volume size from 109051904 to 107374183
I0524 17:29:20.796751 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-lh", UID:"63b55846-8795-41f6-8cbf-986a7a402ccf", APIVersion:"v1", ResourceVersion:"303698001", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume wi
I0524 17:31:28.799565 1 controller.go:562] CreateVolumeRequest {Name:pvc-63b55846-8795-41f6-8cbf-986a7a402ccf CapacityRange:required_bytes:107374183 VolumeCapabilities:[mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[numberOfReplicas:3 recurringJobs:[{"cron":"10 */12 * * *","n
I0524 17:31:28.799649 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-lh", UID:"63b55846-8795-41f6-8cbf-986a7a402ccf", APIVersion:"v1", ResourceVersion:"303698001", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning
I0524 17:31:28.799632 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":107374183},"name":"pvc-63b55846-8795-41f6-8cbf-986a7a402ccf","parameters":{"numberOfReplicas":"3","recurringJobs":"[{\"cron\":\"10 */12 * * *\",\"name\":\"backup\",\"retain\":2,\"task\":\"backup\"}]","staleReplicaTime
I0524 17:31:28.808656 1 controller.go:1051] Final error received, removing PVC 63b55846-8795-41f6-8cbf-986a7a402ccf from claims in progress
Environment:
- Longhorn version: 1.1.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS 1.17
- Number of management node in the cluster: unknown
- Number of worker node in the cluster: 25
- Node config
- OS type and version: EKS-optimized Amazon Linux 2
- CPU per node: varies
- Memory per node: varies
- Disk type(e.g. SSD/NVMe): NVMe
- Network bandwidth between the nodes: varies
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster: 400+
Additional context
Although it's an issue in itself, this wouldn't be triggered without the timeouts, which are causing us significant problems (successfully getting a new PVC attached can take an hour) while upgrading the engines in production. This seems to be related to scale, as we couldn't reproduce it on much smaller staging clusters. It doesn't seem to be related to available resources, as no OOM, throttling, or iowait occurs. The issue even persists after stopping the engine upgrade, but seems to resolve temporarily after rolling the longhorn-manager DaemonSet in the cluster. After the upgrade is complete it seems to work fine.
We’ll try to reproduce this in an isolated environment and report findings in a separate issue/Slack before upgrading the rest of our clusters.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (13 by maintainers)
Thanks @joshimoo It works like a charm!
Reproduce/Test steps:
- Set the timeout in the csi-provisioner deployment to 2s (kubectl edit deployment csi-provisioner -n longhorn-system)
@yasker We're still gradually rolling out 1.1.1 across the fleet. So far we've only encountered this issue in one (the largest) cluster out of the 3 upgraded, which was manually "fixed". I'm not sure how to reliably test the RC with this fix unless this issue pops up on the other clusters later.
We’ve done our own internal load testing on a dedicated cluster and couldn’t reproduce the observed timeouts with a similar number of nodes and volumes, so we’re still unsure how to get into the same state that triggered this size bug to begin with. Perhaps there were some external factors at that time. We’ll keep observing while we roll out 1.1.1 and report if this reoccurs.
@PhanLe1010 for QA, they can modify the timeout on the CSI provisioner deployment manually after it's created, to change the timeout param. This seems to work; I just verified that it doesn't get overwritten. So if they set the timeout to something like 5/10s, the provisioner will trigger a second call, which should go through the existVol call.
Just an idea though 😃
One issue shouldn’t track 2 different bugs. This issue is tracking the rounding bug. The slow volume operations bug is tracked at https://github.com/longhorn/longhorn/issues/2697
Closing this one
@PhanLe1010 for an artificial test, you can let the CSI create call time out before returning a response, either in code or by setting the --timeout param of the provisioner to a low value. I changed this to 2m5s since the full creation needs to poll for the volume to be created.
If you lower the timeout, the second call should trigger the existing-volume case, since the existVol path is much faster (no polling for creation). https://github.com/longhorn/longhorn-manager/blob/779149f6f3a8556abff08e12feec6ec0646c3831/csi/deployment.go#L138
But since both approaches require a code change, it's probably more consistent to just spin before the initial creation response and let the first call terminate.
I.e. add a delay before the response; that will ensure the first call times out, so the second call takes the existing-volume path. https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/csi/controller_server.go#L216
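For reference, a minimal sketch of that suggestion, using a hypothetical wrapper name (the real change would simply be a time.Sleep just before the return at the line linked above, with a delay longer than the provisioner's --timeout):

```go
package csi

import (
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// Sketch only: a temporary test hack, not production code. Delaying the
// successful CreateVolume response past the csi-provisioner --timeout forces
// the first call to fail on the provisioner side, so the automatic retry
// exercises the existing-volume path in CreateVolume.
func delayedCreateVolumeResponse(resp *csi.CreateVolumeResponse, delay time.Duration) (*csi.CreateVolumeResponse, error) {
	time.Sleep(delay) // e.g. longer than a --timeout of 2m5s
	return resp, nil
}
```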
@excieve
Regarding the timeout during the automatic engine upgrade process, what was the value you set for the setting Concurrent Automatic Engine Upgrade Per Node Limit? I was able to reproduce the slow attaching/detaching behavior with a big value for that setting (e.g., 20). However, if the value is small (e.g., 2 or 3) I don't see the problem. See below for more details.
Reproduce steps:
- Attach many volumes (vol1 to vol20) to node-1
- Set Concurrent Automatic Engine Upgrade Per Node Limit to 20
- While the automatic engine upgrades are running, attach vol21 to node-1 and wait for vol21 to finish attaching
- If I instead set Concurrent Automatic Engine Upgrade Per Node Limit to 2, I observe that it only takes 10s for vol21 to finish attaching
The above behavior makes sense because when Concurrent Automatic Engine Upgrade Per Node Limit is set to a high value, Longhorn tries to upgrade many volumes at the same time, which consumes resources on the nodes.
If you didn't set a big value for the setting Concurrent Automatic Engine Upgrade Per Node Limit, then it could be something else. I have scanned over the automatic engine upgrade code base but haven't found anything suspicious. I will continue to investigate when I receive your response.