longhorn: [BUG] Encrypted Volume does not expand beyond 15.5Ti
Describe the bug
One of my volumes failed to expand and is now stuck in an attach-detach loop. What should I do?
Engine failed or partially failed to expand the size at 2023-11-22T09:20:21.159044598Z: the expansion failed since all replica expansion failed: tcp://10.42.0.198:10040: error: code = ResultUnknown, message = failed to expand replica 10.42.0.198:10040 from remote: failed to unmarshal gRPC error message, gRPC err: rpc error: code = DeadlineExceeded desc = context deadline exceeded, json error: invalid character 'c' looking for beginning of value
The volume reports a size of 0 bytes, and the filesystem agrees. It's probably unrecoverable, which I don't mind: I was migrating data off another volume to move it to a different disk, because setting up a replica for a 15TB volume does not work for me (it always faults partway through and starts over, never completing). But I would like to leave the issue here to see if there's some process that might be improved, or a guideline on volume expansion.
I don't know if it is relevant, but the volume's filesystem was at 100% usage.
To Reproduce
Hard to reproduce reliably: attempt a volume expansion and have it fail.
Expected behavior
Volume expansion doesn't fail, or if it fails, the volume is recovered.
Support bundle for troubleshooting
https://drive.google.com/file/d/1RrlKyeDOD52VqbKfIuAL8DgxnqMETDbn/view?usp=sharing
Environment
- Longhorn version: 1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s 1.26
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 1
- Node config
- OS type and version: Fedora 38
- Kernel version: 6.5.10-200.fc38.x86_64
- CPU per node: 4
- Memory per node: 32GB
- Disk type(e.g. SSD/NVMe/HDD): HDD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 20
- Impacted Longhorn resources: Volume with failed expansion
- Volume names: pvc-e19d261c-0fe8-41b0-81ea-bf9287d278ec
About this issue
- State: closed
- Created 7 months ago
- Comments: 17 (11 by maintainers)
Huge thanks to you as well @davidfrickert. If you hadn't continued your analysis and determined the threshold over which the expansion would fail, I'd still be spinning my wheels.
I think we cannot remove this limitation in Longhorn. Fundamentally, v1 volumes work by creating sparse files that are the size of the volume you want. Even though we do not populate them with data until the workload writes some, the underlying file system must allow a file of the correct size to be created. (Though obviously we should fail in some way that doesn't cause a SIGSEGV and corrupt your replica.)
I think the ultimate solution is either xfs or (eventually) the v2 engine. The v2 engine does not operate on top of a mounted file system, so this limitation will not exist.
Will you check https://github.com/longhorn/longhorn/issues/7423 and consider closing this issue out in favor of that one? Now that we know the real cause of the failure, I think it would be clearest to continue there.
It seems that this is indeed a limitation of using ext4 as the underlying file system. A user in the above-linked issues switched to an underlying XFS file system and had success.
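For anyone who wants to check a disk before expanding, the sparse-file constraint described above can be probed directly. The following is only a rough sketch, not part of Longhorn: the helper name `supports_sparse_file` is made up for illustration, and it assumes the replica data directory is a regular mounted path you can write to. It creates a temporary file in that directory and tries to extend it to the target size without writing any data. ext4 with 4 KiB blocks is commonly documented as capping a single file at about 16 TiB, so a probe near that size would be expected to fail there and succeed on XFS, but the exact threshold on your system may differ.

```python
import errno
import os
import tempfile

TIB = 1 << 40  # bytes in one tebibyte


def supports_sparse_file(directory: str, size_bytes: int) -> bool:
    """Return True if a sparse file of size_bytes can be created in directory.

    Hypothetical helper for illustration only; Longhorn does not ship this.
    """
    # The temporary file is deleted automatically when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as probe:
        try:
            # ftruncate() extends the file without writing data blocks, so on
            # file systems with sparse-file support the probe uses almost no
            # disk space.
            os.ftruncate(probe.fileno(), size_bytes)
        except OSError as exc:
            # EFBIG (or EINVAL on some systems) means the file system
            # refuses a file of this size.
            if exc.errno in (errno.EFBIG, errno.EINVAL):
                return False
            raise
        return os.fstat(probe.fileno()).st_size == size_bytes
```

A call such as `supports_sparse_file("/var/lib/longhorn", 20 * TIB)` would then report whether the file system under that path accepts a 20 TiB file (`/var/lib/longhorn` is Longhorn's default data path, but yours may be configured differently). This only tests the file-size limit; it says nothing about whether the expansion itself will succeed.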