longhorn: [BUG] Persistent volume is not ready for workloads
Describe the bug (🐛 if you encounter this issue)
Sometimes we encounter issues where we are not able to mount a Longhorn volume to a pod. The pod is not able to start and the following errors are visible:
- Kubernetes events for the failing pod:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
To Reproduce
The problem cannot be easily reproduced - it fails randomly.
Expected behavior
Volumes work fine and can be mounted to the pods.
Support bundle for troubleshooting
We cannot send a support bundle for security reasons, but we can provide logs and details - see below:
- Kubernetes (and Longhorn) nodes:
ip-X-X-X-57..compute.internal
ip-X-X-X-142.compute.internal
ip-X-X-X-140.compute.internal
- Pod name:
ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516
- Pod events:
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
instance-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 310
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 311
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 312
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 313
tgtd: bs_longhorn_request(97) fail to write at 10737352704 for 65536
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 65536 10737352704, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 314
tgtd: bs_longhorn_request(97) fail to write at 4337664 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 4337664, Success
response_process: Receive error for response 3 of seq 315
tgtd: bs_longhorn_request(97) fail to write at 37912576 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 37912576, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:20Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="failed to sync engine for longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: failed to start rebuild for pvc-506e824d-414f-43ce-af59-5821b2b9accf-r-6093cefb of pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: timed out waiting for the condition" node=ip-10-44-45-142.eu-central-1.compute.internal
longhorn-csi-plugin on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:07:34Z" level=error msg="ControllerPublishVolume: err: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads"
csi-attacher on ip-X-X-X-142.compute.internal node:
I0918 03:07:34.632251 1 csi_handler.go:234] Error processing "csi-635290b8ff08b07c1e7e1bdf2434aec2d8e8ef39dd611f725f8f3da595713bf5": failed to attach: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
longhorn-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:00Z" level=error msg="Failed to rebuild replica X.X.X.245:10205" controller=longhorn-engine engine=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="proxyServer=X.X.X.201:8501 destination=X.X.X.201:10079: failed to add replica tcp://X.X.X.245:10205 for volume: rpc error: code = Unknown desc = failed to create replica tcp://X.X.X.245:10205 for volume X.X.X.201:10079: rpc error: code = Unknown desc = cannot get valid result for remain snapshot" node=ip-X-X-X-142.eu-central-1.compute.internal volume=pvc-506e824d-414f-43ce-af59-5821b2b9accf
time="2023-09-18T03:08:00Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf" controller=longhorn-volume error="failed to sync longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf: failed to reconcile volume state for pvc-506e824d-414f-43ce-af59-5821b2b9accf: no healthy or scheduled replica for starting" node=ip-X-X-X-142.eu-central-1.compute.internal
Environment
- Longhorn version:
v1.5.1
- Installation method:
helm
- Kubernetes distro and version:
AWS EKS, version v1.26.6
- Number of worker nodes in the cluster: 3
- Machine type:
m5.4xlarge
- Number of Longhorn volumes in the cluster: tens of volumes created dynamically as temporary storage for CICD builds (Longhorn + Argo Workflows)
- Impacted Longhorn resources:
- Volume names:
pvc-506e824d-414f-43ce-af59-5821b2b9accf (only an example)
Additional context
Cluster Autoscaler is enabled on the cluster - the Kubernetes Cluster Autoscaler Enabled (Experimental) setting is enabled in the Longhorn configuration.
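For reference, that setting can be checked directly on the cluster; a rough sketch (the setting name kubernetes-cluster-autoscaler-enabled is my assumption of how current Longhorn releases expose it, please verify against your version):
kubectl -n longhorn-system get settings.longhorn.io kubernetes-cluster-autoscaler-enabled -o jsonpath='{.value}'
# should print "true" on this cluster if the setting is enabled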
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Reactions: 10
- Comments: 38 (24 by maintainers)
When the volume is requested to attach to node 142, it is, in the meantime, triggered to detach from node 57. The newly created replicas are mistakenly deleted by the detachment operation.
The script repeatedly attaches and detaches a pod using the share manager.
Currently, I cannot reproduce the faulty volume issue (all replicas being mistakenly deleted). But, from the test, we can sometimes observe that one of the attaching volume's replica instances is mistakenly deleted, so mistakenly deleting all replicas is possible.
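For illustration only - the actual test script is not part of this thread - a minimal loop in that spirit, assuming a placeholder pod.yaml that mounts an existing RWX (share-manager-backed) Longhorn PVC and names the pod rwx-test:
# Repeatedly create and delete a pod that mounts a shared (RWX) Longhorn PVC,
# so the share manager attaches and detaches the volume in quick succession.
for i in $(seq 1 50); do
  kubectl apply -f pod.yaml
  kubectl wait --for=condition=Ready pod/rwx-test --timeout=120s
  kubectl delete -f pod.yaml --wait=true
done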
@james-munson Can you help take a look at this issue and see if I missed something? If possible, see if we can have a fix for the race issue.
cc @shuo-wu @innobead
As we discussed last time, this part is problematic, which may lead to unexpected and unnecessary detachment after a node is temporarily unavailable (kubelet/network down). In fact, keeping volume.Spec.NodeID the same as ShareManager.Status.OwnerID is unnecessary.
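For reference, both fields can be inspected directly on the CRs; a rough example, with the volume name taken from this issue and the field paths assumed from the Longhorn CRDs:
# the share manager CR is assumed to carry the same name as the volume
kubectl -n longhorn-system get volumes.longhorn.io pvc-506e824d-414f-43ce-af59-5821b2b9accf -o jsonpath='{.spec.nodeID}'
kubectl -n longhorn-system get sharemanagers.longhorn.io pvc-506e824d-414f-43ce-af59-5821b2b9accf -o jsonpath='{.status.ownerID}'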
The share-manager-controller workflow can be like the following:
Each step in the pipeline is a separate pod with a shared Longhorn volume. It means that if you have 10 steps, you can think of them as 10 pods with the same shared Longhorn volume mounted.
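To make that concrete, a simplified sketch of what one such step pod looks like (all names and the PVC are illustrative, not our real manifests):
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pipeline-step-1              # illustrative name; each pipeline step gets its own pod
spec:
  restartPolicy: Never
  containers:
  - name: step
    image: alpine
    command: ["sh", "-c", "ls /src && echo step done"]
    volumeMounts:
    - name: shared
      mountPath: /src
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: shared-workspace-pvc  # RWX PVC backed by a Longhorn volume
EOF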
kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml - no problem, but I am able to do it only for the current cluster shape (the problematic volume was already removed) - volumes_attachments_engines_replicas.log
Regarding pvc-506e824d-414f-43ce-af59-5821b2b9accf: the pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 was not able to start due to the problem which I've described in the bug description, and it was removed.
Yeah, it's similar. On our side we create new pod definitions (new steps) instead of scaling a deployment, but the logic is the same - creating and removing pods with mounting/unmounting a Longhorn volume.
Got it. Longhorn introduced a new attachment/detachment mechanism in v1.5.0. Not sure if it is related; it is still under investigation. Ref: https://github.com/longhorn/longhorn/issues/3715
cc @PhanLe1010
Thanks guys, we will verify this fix in 1.5.4.
@roger-ryao, that looks good. I would be up for building a 1.5.1-based private build (this fix is also being backported to 1.5.4) if @ajoskowski would be up for pre-testing it before 1.5.4 releases.
I think this is a good candidate for backport to 1.4 and 1.5. @innobead do you agree?
I also tried a repro with @Phan's idea of using a deployment with an RWX volume and scaling it up and down quickly. Specifically, I used the rwx example from https://github.com/longhorn/longhorn/examples/rwx/rwx-nginx-deployment.yaml, although I modified the container slightly to include the hostname in the periodic writes to the shared volume.
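The exact snippet is not included here; a loop along these lines (illustrative, not the literal change; the mount path is a placeholder, adjust to the example's mountPath):
# Include the pod's hostname in each periodic write so concurrent writers are distinguishable.
while true; do
  echo "$(hostname) $(date)" >> /data/out.txt
  sleep 1
done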
Even with scaling up to 3 and down to 0 at 10-second intervals (far faster than the attach and detach can be accomplished), no containers crashed and no replicas were broken. Kubernetes and Longhorn are untroubled by the fact that creating and terminating resources overlap. In fact, I revised the wait after scale-up to 60 seconds and the wait after scale-down to 0, so new pods were created immediately; that just had the effect of attaching and writing from the new pods while the old ones were still detaching. So for some interval, there were 6 pods writing to the volume without trouble. I conclude from that test that this is likely not a representative repro of the situation in this issue.
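For completeness, the scaling pattern amounts to something like the following (the deployment name is a placeholder for the one in the rwx example):
# Flap the deployment between 3 replicas and 0 every 10 seconds to race attach against detach.
while true; do
  kubectl scale deployment/rwx-test --replicas=3
  sleep 10
  kubectl scale deployment/rwx-test --replicas=0
  sleep 10
done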
Thanks @ajoskowski! The provided yaml https://github.com/longhorn/longhorn/files/12734412/volumes_attachments_engines_replicas.log doesn't show anything abnormal, as it was captured when the problem was not present.
We will try to reproduce the issue in the lab.
Yeah, some messages are mixed together. Can you help provide separate files? Thank you.