longhorn: [BUG] Failure to create a volume on a retry since v1.1.1
Describe the bug We started upgrading our production clusters from v1.1.0 to v1.1.1 and found several issues. The root issue is timeouts on CSI operations (causing even a detach to last for tens of minutes) during and after the engine upgrades. We can't pin down the cause of this yet, but the timeouts uncovered another, easily traceable issue.
When creating a new volume and the first attempt fails due to a timeout, the second attempt (as an automatic retry) fails in the CSI provisioner with the following error:
E0524 17:29:20.796731 1 controller.go:939] error syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf": failed to provision volume with StorageClass "longhorn-db": rpc error: code = AlreadyExists desc = CreateVolume: cannot change volume size from 109051904 to 107374183
This leads to the volume never reporting successful provisioning even though it was created in Longhorn.
We managed to trace the logic: the volume size is rounded in the CreateVolume function, but the rounded size never gets persisted back to the original volume spec. So when a retry happens, it finds the already-created volume through the LH API but fails validation because the requested size (not rounded) differs from the size of the existing volume.
This should be easy to fix by rounding the requested size before looking up the existing volume in LH; a rough sketch follows the links below.
Validation check: https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/csi/controller_server.go#L115-L119
Volume size rounding: https://github.com/longhorn/longhorn-manager/blob/ac065a8deb96193ffec142bcc7e7345615b57b8d/csi/controller_server.go#L192-L193
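For illustration, here is a minimal sketch of that idea, assuming the 2*1024*1024 alignment implied by the reproduce steps and using illustrative helper names (this is not the actual longhorn-manager code, just the shape of the fix):

```go
package csi

import "fmt"

// Sketch only: illustrative names, not the actual longhorn-manager code.

// roundUpSize rounds a requested byte count up to the assumed 2*1024*1024
// alignment used for Longhorn volume sizes (e.g. 107374183 -> 109051904).
func roundUpSize(requested int64) int64 {
	const alignment int64 = 2 * 1024 * 1024
	if requested <= 0 {
		return alignment
	}
	if rem := requested % alignment; rem != 0 {
		return requested + alignment - rem
	}
	return requested
}

// checkExistingVolume compares the *rounded* requested size against the size
// of a volume that already exists in Longhorn, so a retried CreateVolume call
// after a timeout is treated as idempotent instead of failing validation.
func checkExistingVolume(existingSizeBytes, requestedBytes int64) error {
	if existingSizeBytes != roundUpSize(requestedBytes) {
		return fmt.Errorf("cannot change volume size from %d to %d",
			existingSizeBytes, requestedBytes)
	}
	return nil
}
```

With that change, the retried request in the log below (required_bytes 107374183) would match the 109051904-byte volume that was already created instead of failing with AlreadyExists.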
To Reproduce Steps to reproduce the behavior:
- Create a new PVC with a requested size that's not a multiple of 2*1024*1024 (e.g. 107374183)
- Make the first LH volume creation request fail somehow (in our case, timeouts)
- See the reported error that never goes away
Expected behavior The retried CSI operations should be idempotent.
Log
W0524 17:29:20.796712 1 controller.go:916] Retrying syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf", failure 7
E0524 17:29:20.796731 1 controller.go:939] error syncing claim "63b55846-8795-41f6-8cbf-986a7a402ccf": failed to provision volume with StorageClass "longhorn-db": rpc error: code = AlreadyExists desc = CreateVolume: cannot change volume size from 109051904 to 107374183
I0524 17:29:20.796751 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-lh", UID:"63b55846-8795-41f6-8cbf-986a7a402ccf", APIVersion:"v1", ResourceVersion:"303698001", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume wi
I0524 17:31:28.799565 1 controller.go:562] CreateVolumeRequest {Name:pvc-63b55846-8795-41f6-8cbf-986a7a402ccf CapacityRange:required_bytes:107374183 VolumeCapabilities:[mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER > ] Parameters:map[numberOfReplicas:3 recurringJobs:[{"cron":"10 */12 * * *","n
I0524 17:31:28.799649 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-lh", UID:"63b55846-8795-41f6-8cbf-986a7a402ccf", APIVersion:"v1", ResourceVersion:"303698001", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning
I0524 17:31:28.799632 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":107374183},"name":"pvc-63b55846-8795-41f6-8cbf-986a7a402ccf","parameters":{"numberOfReplicas":"3","recurringJobs":"[{\"cron\":\"10 */12 * * *\",\"name\":\"backup\",\"retain\":2,\"task\":\"backup\"}]","staleReplicaTime
I0524 17:31:28.808656 1 controller.go:1051] Final error received, removing PVC 63b55846-8795-41f6-8cbf-986a7a402ccf from claims in progress
Environment:
- Longhorn version: 1.1.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: EKS 1.17
- Number of management node in the cluster: unknown
- Number of worker node in the cluster: 25
- Node config
- OS type and version: EKS-optimized Amazon Linux 2
- CPU per node: varies
- Memory per node: varies
- Disk type(e.g. SSD/NVMe): NVMe
- Network bandwidth between the nodes: varies
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster: 400+
Additional context
Although it's an issue in itself, this wouldn't be triggered without the timeouts, which are causing us significant problems (successfully getting a new PVC attached can take an hour) while upgrading the engines in production. This seems to be related to scale, as we couldn't reproduce it on much smaller staging clusters. It doesn't seem to be related to available resources, as no OOM, throttling, or iowait occurs. The issue even persists after stopping the engine upgrade, but seems to resolve temporarily after rolling the longhorn-manager DaemonSet in the cluster. After the upgrade is complete it seems to work fine.
We’ll try to reproduce this in an isolated environment and report findings in a separate issue/Slack before upgrading the rest of our clusters.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 17 (13 by maintainers)
Thanks @joshimoo It works like a charm!
Reproduce/Test steps:
- Set the timeout in the csi-provisioner deployment to 2s (kubectl edit deployment csi-provisioner -n longhorn-system)
@yasker We're still gradually rolling out 1.1.1 across the fleet. So far we've only encountered this issue in one (the largest) cluster out of the 3 upgraded, which was manually "fixed". I'm not sure how to reliably test the RC with this fix unless this issue pops up on the other clusters later.
We’ve done our own internal load testing on a dedicated cluster and couldn’t reproduce the observed timeouts with a similar number of nodes and volumes, so we’re still unsure how to get into the same state that triggered this size bug to begin with. Perhaps there were some external factors at that time. We’ll keep observing while we roll out 1.1.1 and report if this reoccurs.
@PhanLe1010 for QA, they can modify the timeout on the CSI provisioner deployment manually after it's created, to change the timeout param. This seems to work; I just verified that it doesn't get overwritten. So if they set the timeout to something like 5/10s, the provisioner will trigger a second call, which should go through the existVol call.
Just an idea though 😃
One issue shouldn’t track 2 different bugs. This issue is tracking the rounding bug. The slow volume operations bug is tracked at https://github.com/longhorn/longhorn/issues/2697
Closing this one
@PhanLe1010 for an artificial test, you can let the CSI create call time out before returning a response, either in code or by setting the --timeout param of the provisioner to a low value. I changed this to 2m5s since the full creation needs to poll for the volume to be created.
If you lower the timeout, the second call should trigger the existing-volume case, since the existVol path is much faster (no polling for creation). https://github.com/longhorn/longhorn-manager/blob/779149f6f3a8556abff08e12feec6ec0646c3831/csi/deployment.go#L138
But since both approaches require a code change, it's probably more consistent to just spin before the initial creation response and let the first call terminate.
I.e. add a delay before the response; that will ensure the first call times out, so the second call takes the existing-volume path. https://github.com/longhorn/longhorn-manager/blob/e5a418136cf3b9932203cd7c6202a91ce4a4f17a/csi/controller_server.go#L216
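For reference, a minimal sketch of that suggestion, using a hypothetical wrapper name (the real change would simply be a time.Sleep just before the return at the line linked above, with a delay longer than the provisioner's --timeout):

```go
package csi

import (
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// Sketch only: a temporary test hack, not production code. Delaying the
// successful CreateVolume response past the csi-provisioner --timeout forces
// the first call to fail on the provisioner side, so the automatic retry
// exercises the existing-volume path in CreateVolume.
func delayedCreateVolumeResponse(resp *csi.CreateVolumeResponse, delay time.Duration) (*csi.CreateVolumeResponse, error) {
	time.Sleep(delay) // e.g. longer than a --timeout of 2m5s
	return resp, nil
}
```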
@excieve
Regarding the timeout during the automatic engine upgrade process, what was the value you set for the setting Concurrent Automatic Engine Upgrade Per Node Limit? I was able to reproduce the slow attaching/detaching behavior with a big value for that setting (e.g., 20). However, if the value is small (e.g., 2 or 3) I don't see the problem. See below for more details.
Reproduce steps:
- Attach many volumes (vol1 to vol20) to node-1
- Set Concurrent Automatic Engine Upgrade Per Node Limit to 20
- While the automatic engine upgrades are running, attach vol21 to node-1 and wait for vol21 to finish attaching
- If I instead set Concurrent Automatic Engine Upgrade Per Node Limit to 2, I observe that it only takes 10s for vol21 to finish attaching
The above behavior makes sense because when Concurrent Automatic Engine Upgrade Per Node Limit is set to a high value, Longhorn tries to upgrade many volumes at the same time, which consumes resources on the nodes.
If you didn't set a big value for the setting Concurrent Automatic Engine Upgrade Per Node Limit, then it could be something else. I have scanned over the automatic engine upgrade code base but haven't found anything suspicious. I will continue to investigate when I receive your response.