longhorn: [BUG] Encrypted Volume does not expand beyond 15.5Ti

Describe the bug (🐛 if you encounter this issue)

A volume of mine failed to expand and is now stuck in an attach-detach loop. What should I do?

Engine failed or partially failed to expand the size at 2023-11-22T09:20:21.159044598Z: the expansion failed since all replica expansion failed: tcp://10.42.0.198:10040: error: code = ResultUnknown, message = failed to expand replica 10.42.0.198:10040 from remote: failed to unmarshal gRPC error message, gRPC err: rpc error: code = DeadlineExceeded desc = context deadline exceeded, json error: invalid character 'c' looking for beginning of value

The volume reports a size of 0 bytes, and the filesystem agrees. It’s probably unrecoverable, which I don’t mind: I was migrating data from another volume in order to move it to another disk, because building a replica for a 15TB volume does not work for me (it always faults partway through and starts over, never completing). I would still like to leave the issue here in case the volume expansion process, or the guidelines around it, can be improved.

I don’t know if it is relevant, but the volume’s filesystem was at 100% usage.

To Reproduce

This is a bit difficult to reproduce, but attempt a volume expansion that fails.

Expected behavior

Volume expansion doesn’t fail, or if it does fail, the volume is recovered.

Support bundle for troubleshooting

https://drive.google.com/file/d/1RrlKyeDOD52VqbKfIuAL8DgxnqMETDbn/view?usp=sharing

Environment

Longhorn version: 1.5.1
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s 1.26
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 1
Node config
- OS type and version: Fedora 38
- Kernel version: 6.5.10-200.fc38.x86_64
- CPU per node: 4
- Memory per node: 32GB
- Disk type(e.g. SSD/NVMe/HDD): HDD
- Network bandwidth between the nodes:
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 20
Impacted Longhorn resources: Volume with failed expansion
- Volume names: pvc-e19d261c-0fe8-41b0-81ea-bf9287d278ec

About this issue

State: closed
Created 7 months ago
Comments: 17 (11 by maintainers)

Most upvoted comments

Will do. Thanks @ejweber for looking into it.

Huge thanks to you as well @davidfrickert. If you hadn’t continued your analysis and determined the threshold over which the expansion would fail, I’d still be spinning my wheels.

ejweber on Dec 22, 2023

I see… that does sound quite odd. It would be nice to find out why this condition exists, document it, and develop a plan to remove the limitation!

I think we cannot remove this limitation in Longhorn. Fundamentally, v1 volumes work by creating sparse files that are the size of the volume you want. Even though we do not populate them with data until the workload writes some, the underlying file system must allow a file of the correct size to be created. (Though obviously we should fail in some way that doesn’t cause a SIGSEGV and corrupt your replica.)
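To make the sparse-file point concrete, here is a minimal Python sketch, not Longhorn code. The helper names `can_hold_sparse_file` and `sparse_usage` are hypothetical, and the assumption is a Linux host where the directory you pass in lives on the same file system as the replica data. The first helper asks the file system whether it will accept a file of a given logical size; the second shows that such a file consumes almost no blocks until something writes to it.

```python
import errno
import os
import tempfile
from typing import Tuple


def can_hold_sparse_file(directory: str, size_bytes: int) -> bool:
    """Return True if a sparse file of size_bytes can be created in directory.

    Hypothetical helper for illustration only; not part of Longhorn.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            # Sets the logical size only; no data blocks are written.
            os.ftruncate(fd, size_bytes)
        except OSError as exc:
            if exc.errno == errno.EFBIG:  # "File too large"
                return False
            raise
        return True
    finally:
        os.close(fd)
        os.unlink(path)


def sparse_usage(directory: str, size_bytes: int) -> Tuple[int, int]:
    """Create a sparse file, return (logical_size, allocated_bytes), delete it."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.ftruncate(fd, size_bytes)
        st = os.fstat(fd)
        # st_blocks counts the 512-byte units actually allocated on disk.
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    target = tempfile.gettempdir()  # assumption: substitute the replica data path
    print(can_hold_sparse_file(target, 16 * 2**40))  # depends on the file system
    print(sparse_usage(target, 64 * 2**20))
```

Running a probe like `can_hold_sparse_file` before an expansion is one way the failure could be reported cleanly instead of surfacing in the middle of the resize; whether Longhorn should do this check itself is a separate design question.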

I think the ultimate solution is either xfs or (eventually) the v2 engine. The v2 engine does not operate on top of a mounted file system, so this limitation will not exist.

Will you check https://github.com/longhorn/longhorn/issues/7423 and consider closing this issue out in favor of that one? Now that we know the real cause of the failure, I think it would be clearest to continue there.

ejweber on Dec 22, 2023

root@eweber-v126-worker-9c1451b4-kgxdq:~# df -T /
Filesystem     Type 1K-blocks    Used Available Use% Mounted on
/dev/vda1      ext4 162406320 8447280 153942656   6% /

root@eweber-v126-worker-9c1451b4-kgxdq:~# truncate -s 16T too_big
truncate: failed to truncate 'too_big' at 17592186044416 bytes: File too large

root@eweber-v126-worker-9c1451b4-kgxdq:~# truncate -s 16384G too_big
truncate: failed to truncate 'too_big' at 17592186044416 bytes: File too large

root@eweber-v126-worker-9c1451b4-kgxdq:~# truncate -s 16383G not_too_big
# No failure.

It seems that this is indeed a limitation of using ext4 as the underlying file system. A user in the above-linked issues switched to an underlying XFS file system and had success.
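The numbers in the transcript line up with the usual ext4 arithmetic. As a rough sketch, assuming the default 4 KiB block size (not confirmed for this node) and the 32-bit logical block numbers that ext4 extents use, the ceiling works out to about 16 TiB. The exact cutoff sits just below 2^44 bytes, so the code treats 16 TiB as an exclusive upper bound, which agrees with all three truncate results above. The `fits` helper is hypothetical.

```python
GIB = 2**30
TIB = 2**40

BLOCK_SIZE = 4096        # assumption: default ext4 block size
LOGICAL_BLOCK_BITS = 32  # ext4 extents address file blocks with 32-bit numbers

# Approximate per-file ceiling: number of addressable blocks times block size.
EXT4_FILE_LIMIT = (2**LOGICAL_BLOCK_BITS) * BLOCK_SIZE


def fits(size_bytes: int) -> bool:
    """Hypothetical check: can ext4 (4 KiB blocks) hold a file this large?"""
    return size_bytes < EXT4_FILE_LIMIT


if __name__ == "__main__":
    print(EXT4_FILE_LIMIT == 16 * TIB)  # the limit equals 16 TiB
    print(fits(16 * TIB))               # truncate -s 16T     -> rejected
    print(fits(16384 * GIB))            # truncate -s 16384G  -> rejected
    print(fits(16383 * GIB))            # truncate -s 16383G  -> accepted
```

A file system created with a different block size would have a different ceiling, so it is worth checking the actual block size before relying on this figure.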

ejweber on Dec 22, 2023