longhorn: [BUG] Expansion error still occurs in 1.4.2
Describe the bug (🐛 if you encounter this issue)
Issue #5513 was documented as resolved in the release notes of 1.4.2. Yesterday I upgraded from 1.4.1 to 1.4.2. This morning, after a node restart, the error occurred again.
Expansion Error: BUG: The expected size 2147483648 of engine pvc-97135cac-3890-42b1-bc32-80a0edec1b2e-e-6151baeb should not be smaller than the current size 5368709120
Note: You can cancel the expansion to avoid volume crash
Stopping the expansion throws an error that the expansion has not started yet. Due to the node restart we only have 1 replica instead of 3, because the rebuild can't start, and we are not able to recover from this state except by restoring from backup. Because there still seems to be a probability that this issue can occur at any time, and there is no good way to recover, we have to stop the rollout of version 1.4.2.
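For reference, the message above is an engine expansion sanity check refusing a target size that is smaller than the current size. The snippet below is only a minimal Python illustration of that invariant (Longhorn itself is written in Go; this is not its actual code), using the byte counts from the error message:

```python
GIB = 1024 ** 3

def check_expansion(expected_size: int, current_size: int) -> None:
    """Refuse an 'expansion' whose target is smaller than the current size."""
    if expected_size < current_size:
        raise ValueError(
            f"BUG: The expected size {expected_size} should not be "
            f"smaller than the current size {current_size}"
        )

# Sizes from the error message: a 2 GiB target against a 5 GiB engine trips the check.
check_expansion(2 * GIB, 5 * GIB)  # raises ValueError
```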
To Reproduce
It's not really reproducible; it happens from time to time after a node restart.
Expected behavior
The issue should not occur. It would at least be helpful to know how to recover from this state.
Log or Support bundle
Environment
- Longhorn version: 1.4.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher Catalog App
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE1 / K8S v1.24.13
- Number of management nodes in the cluster: 3
- Number of worker nodes in the cluster: 4
- Node config
- OS type and version: CentOS 7
- CPU per node: 16 vCPU
- Memory per node: 64GB
- Disk type (e.g. SSD/NVMe): SSD
- Network bandwidth between the nodes: 2 - 6 Gbit/s
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Cloud/KVM
- Number of Longhorn volumes in the cluster: 110
Additional context
cc @longhorn/dev
I was able to trigger this using the modifications I have made for #5845 on master-head. Because of those modifications, the "targeted" replica was not expanded. However, there is clearly still the potential for incorrect expansion in all released versions of Longhorn.
I triggered it in a three-node cluster with 50 volumes (150 replicas), all actively mounted to a toy nginx workload. I used the below script to repeatedly kill an instance-manager pod. It took six iterations before a log message surfaced indicating that the modifications prevented an issue.
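The original script is not included in this excerpt; what follows is a hypothetical Python reconstruction of that kind of chaos loop, assuming `kubectl` access to the cluster. The namespace, kill interval, and iteration count are assumptions, not the values actually used.

```python
#!/usr/bin/env python3
"""Illustrative chaos loop: repeatedly delete a random instance-manager pod.

Reconstruction for illustration only, not the original script; the namespace,
interval, and iteration count below are assumptions.
"""
import random
import subprocess
import time

NAMESPACE = "longhorn-system"  # assumed default Longhorn namespace
INTERVAL_SECONDS = 120         # assumed pause between kills
ITERATIONS = 20                # assumed upper bound on iterations

def instance_manager_pods():
    """Return the names of all instance-manager pods in the namespace."""
    out = subprocess.run(
        ["kubectl", "-n", NAMESPACE, "get", "pods", "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line.split("/", 1)[-1] for line in out.splitlines()
            if "instance-manager" in line]

def main():
    for i in range(1, ITERATIONS + 1):
        pods = instance_manager_pods()
        if not pods:
            print(f"iteration {i}: no instance-manager pods found, waiting")
        else:
            victim = random.choice(pods)
            print(f"iteration {i}: deleting {victim}")
            subprocess.run(
                ["kubectl", "-n", NAMESPACE, "delete", "pod", victim],
                check=True,
            )
        time.sleep(INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```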
The killed instance-manager pod logged the following:
The longhorn-manager pod on the same node logged the following:
I have a complete support bundle from moments after it occurred, so hopefully I can understand what went wrong.
Longhorn v1.4.3 should be out tomorrow. It includes an additional fix for a reliably reproducible variant of this issue we were tracking in https://github.com/longhorn/longhorn/issues/6217. If you are able, please upgrade your v1.4.2 cluster to v1.4.3 and see if the problem persists.
To be clear, if a volume has already been hit by the inappropriate expansion issue, the new fix will not recover it. However, we are very interested to know whether the fix prevents new inappropriate expansions. If you have data (good or bad) after an upgrade, please let us know here.
Upgraded our cluster to 1.4.3. Now observing…