longhorn: [BUG] Persistent volume is not ready for workloads

Describe the bug (🐛 if you encounter this issue)

Sometimes we encounter issues where we are not able to mount a Longhorn volume to a pod. The pod is not able to start and the following errors are visible:

  • Kubernetes events for failing pod:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
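When this error appears, the Longhorn volume's actual state can be checked directly on the Volume CR. A minimal sketch of such a check; the status.state and status.robustness fields are assumptions about the Longhorn Volume CR, shown here only for illustration:

# Hypothetical readiness check for the affected volume.
kubectl -n longhorn-system get volumes.longhorn.io pvc-506e824d-414f-43ce-af59-5821b2b9accf \
  -o jsonpath='{.status.state} {.status.robustness}{"\n"}'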

To Reproduce

The problem cannot be easily reproduced - it fails randomly.

Expected behavior

Volumes work correctly and can be mounted to the pods.

Support bundle for troubleshooting

We cannot send a support bundle for security reasons, but we can provide logs and details - see below:

  • Kubernetes (and Longhorn) nodes:
    • ip-X-X-X-57.compute.internal
    • ip-X-X-X-142.compute.internal
    • ip-X-X-X-140.compute.internal
  • Pod name: ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516
  • Pod events:
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
  • instance-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 310
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 311
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 312
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 313
tgtd: bs_longhorn_request(97) fail to write at 10737352704 for 65536
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 65536 10737352704, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 314
tgtd: bs_longhorn_request(97) fail to write at 4337664 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 4337664, Success
response_process: Receive error for response 3 of seq 315
tgtd: bs_longhorn_request(97) fail to write at 37912576 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 37912576, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:20Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="failed to sync engine for longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: failed to start rebuild for pvc-506e824d-414f-43ce-af59-5821b2b9accf-r-6093cefb of pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: timed out waiting for the condition" node=ip-10-44-45-142.eu-central-1.compute.internal
  • longhorn-csi-plugin on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:07:34Z" level=error msg="ControllerPublishVolume: err: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads"
  • csi-attacher on ip-X-X-X-142.compute.internal node:
I0918 03:07:34.632251       1 csi_handler.go:234] Error processing "csi-635290b8ff08b07c1e7e1bdf2434aec2d8e8ef39dd611f725f8f3da595713bf5": failed to attach: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
  • longhorn-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:00Z" level=error msg="Failed to rebuild replica X.X.X.245:10205" controller=longhorn-engine engine=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="proxyServer=X.X.X.201:8501 destination=X.X.X.201:10079: failed to add replica tcp://X.X.X.245:10205 for volume: rpc error: code = Unknown desc = failed to create replica tcp://X.X.X.245:10205 for volume X.X.X.201:10079: rpc error: code = Unknown desc = cannot get valid result for remain snapshot" node=ip-X-X-X-142.eu-central-1.compute.internal volume=pvc-506e824d-414f-43ce-af59-5821b2b9accf
time="2023-09-18T03:08:00Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf" controller=longhorn-volume error="failed to sync longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf: failed to reconcile volume state for pvc-506e824d-414f-43ce-af59-5821b2b9accf: no healthy or scheduled replica for starting" node=ip-X-X-X-142.eu-central-1.compute.internal

Environment

  • Longhorn version: v1.5.1
  • Installation method: helm
  • Kubernetes distro and version: AWS EKS, version v1.26.6
    • Number of worker nodes in the cluster: 3
    • Machine type: m5.4xlarge
  • Number of Longhorn volumes in the cluster: tens of volumes created dynamically as temporary storage for CICD builds (Longhorn + Argo Workflows)
  • Impacted Longhorn resources:
    • Volume names: pvc-506e824d-414f-43ce-af59-5821b2b9accf (only example)

Additional context

Cluster autoscaler is enabled on the cluster, and the Kubernetes Cluster Autoscaler Enabled (Experimental) setting is turned on in the Longhorn configuration.
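For reference, the setting can be confirmed on the Longhorn settings CR; a minimal check, assuming the internal name of "Kubernetes Cluster Autoscaler Enabled (Experimental)" is kubernetes-cluster-autoscaler-enabled:

# Hypothetical check of the autoscaler setting; the setting name is an assumption.
kubectl -n longhorn-system get settings.longhorn.io kubernetes-cluster-autoscaler-enabled -o jsonpath='{.value}{"\n"}'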

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 10
  • Comments: 38 (24 by maintainers)

Most upvoted comments

| Timestamp | Node | Log |
| --- | --- | --- |
| 2023-09-18T03:07:39Z | 57 | Starting share manager |
| 2023-09-18T03:07:39Z | 57 | Created pod for share manager |
| 2023-09-18T03:07:40Z | 142 | The share manager pod is created on node 142, so the share manager's ownership is transferred to 142: Volume got new owner ip-X-X-X-142.compute.internal |
| | 57 | Volume detachment is triggered from node 57 by shouldDoDetach() because there is no attachment ticket. |
| 2023-09-18T03:07:39Z | 142 | The volume is requested to attach: X.X.X.196 - - [18/Sep/2023:03:07:39 +0000] "POST /v1/volumes/pvc-506e824d-414f-43ce-af59-5821b2b9accf?action=attach HTTP/1.1" 200 7163 "" "Go-http-client/1.1" |
| 2023-09-18T03:07:40Z | 142 | Volume got new owner ip-X-X-X-142.compute.internal |

The volume is requested to attach to node 142 while, in the meantime, it is triggered to detach from node 57. The newly created replicas are mistakenly deleted by the detachment operation.

The following script repeatedly attaches and detaches a pod that uses the share manager:

#!/bin/bash

ATTACH_WAIT_SECONDS=120
DETACH_WAIT_SECONDS=300

for ((i=0;i<200;i++))
do
        node=$((i%3+1))
        echo "Iteration #$i at $(date) node ${node}"

        # Change the share manager owner ID and trigger the volume attachment to request a new node
        status="status: {ownerID: rancher60-worker${node}}"
        kubectl -n longhorn-system patch sharemanager pvc-1418f411-ccad-4df9-a66a-050cb80d4645 --type=merge --subresource status --patch "$status"

        kubectl apply -f pod1.yaml

        echo "Attaching"
        c=0
        while [ $c -lt $ATTACH_WAIT_SECONDS ]
        do
                phase=$(kubectl get pod volume-test-1 -o=jsonpath='{.status.phase}' 2>/dev/null)
                if [ "$phase" == "Running" ]; then
                        break
                fi

                sleep 1

                c=$((c+1))
                if [ $c -eq $ATTACH_WAIT_SECONDS ]; then
                        echo "Found error"
                        exit 1
                fi
        done

        echo "Detaching"
        kubectl delete -f pod1.yaml

        # Wait for the Longhorn volume to reach the detached state
        c=0
        while [ $c -lt $DETACH_WAIT_SECONDS ]
        do
                state=$(kubectl -n longhorn-system get volumes pvc-1418f411-ccad-4df9-a66a-050cb80d4645 -o=jsonpath='{.status.state}' 2>/dev/null)
                if [ "$state" == "detached" ]; then
                        break
                fi

                sleep 1

                c=$((c+1))
                if [ $c -eq $DETACH_WAIT_SECONDS ]; then
                        echo "Failed to detach"
                        exit 1
                fi
        done
done
  • sc_pvc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-volv-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: longhorn-test
  resources:
    requests:
      storage: 2Gi
  • pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-test-1
  namespace: default
spec:
  restartPolicy: Always
  containers:
  - name: volume-test-1
    image: ubuntu:20.04
    command:
      - "sleep"
      - "604800"
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: longhorn-volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: longhorn-volv
    persistentVolumeClaim:
      claimName: longhorn-volv-pvc

Currently, I cannot reproduce the faulty volume issue (all replicas mistakenly deleted). However, from the test we can sometimes observe that one of the attaching volume's replica instances is mistakenly deleted, so mistakenly deleting all replicas is possible.

  • What we can improve
    • Improve the logging in the share manager controller and volume_attachment_controller to help the debugging process
    • Make sure there is no ongoing replica deletion operation before attaching a volume (see the sketch after this list)
    • Starting a share manager in the current design is lengthy. Is it possible to simplify the flow?
      • The share manager is picked by the status.ownerID
      • The volume is requested to attach to status.ownerID
      • The share manager controller creates a share manager pod
      • If the pod is created on another node, share-manager ownership is transferred to that node
      • The volume is requested to detach
      • The volume is requested to attach
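A minimal sketch of the pre-attach check suggested above, assuming the replica CRs carry a longhornvolume=<volume-name> label and that an in-progress deletion shows up as a non-empty metadata.deletionTimestamp (both are assumptions about the Longhorn CRs, not confirmed here):

VOL=pvc-506e824d-414f-43ce-af59-5821b2b9accf
# Print each replica of the volume together with its deletionTimestamp;
# an empty second column means no deletion is in progress for that replica.
kubectl -n longhorn-system get replicas.longhorn.io -l "longhornvolume=$VOL" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.deletionTimestamp}{"\n"}{end}'

If any replica still shows a deletionTimestamp, the attach could be deferred until that deletion completes.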

@james-munson Can you help take a look at this issue and see if I missed something? If possible, see if we can have a fix for the race issue.

cc @shuo-wu @innobead

As we discussed last time, this part is problematic; it may lead to unexpected and unnecessary detachment after a node is temporarily unavailable (kubelet/network down). In fact, keeping volume.Spec.NodeID the same as ShareManager.Status.OwnerID is unnecessary.

The share-manager-controller workflow can be like the following:

  1. Start the share manager pod scheduled by Kubernetes.
  2. Set the volume attachment ticket to the pod node
  3. Unset the volume attachment ticket when the volume or the share manager pod is in an error state or is stopping/stopped (an example of inspecting the tickets follows this list)
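For steps 2 and 3, a minimal way to look at the tickets on the affected volume, assuming the Longhorn VolumeAttachment CR is named after the volume and keeps its tickets under spec.attachmentTickets (an assumption about the v1.5.x attach/detach mechanism):

# Hypothetical inspection of the attachment tickets for the volume from this issue;
# the CR name and the spec.attachmentTickets field path are assumptions.
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-506e824d-414f-43ce-af59-5821b2b9accf -o jsonpath='{.spec.attachmentTickets}'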

Does the CI/CD process involve restarting nodes or pods? The share manager recovery backend logs that it is removing NFS client entries on multiple occasions.

Each step in the pipeline is a separate pod with a shared Longhorn volume. That means if you have 10 steps, you can think of them as 10 pods, each mounting the shared Longhorn volume.

@ajoskowski Could we have:

  • the yaml output of kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace
  • Is the workload pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 stuck right now?
  • Regarding kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml - no problem, but I am only able to do it for the current cluster shape (the problematic volume was already removed) - volumes_attachments_engines_replicas.log
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace - I do not see any logs that contain information about pvc-506e824d-414f-43ce-af59-5821b2b9accf
  • Pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 was not able to start due to the problem I've described in the bug description, and it has since been removed

In an effort to reproduce, do you think the following steps are similar to your CI pipeline, @ajoskowski?

  1. Create a RWX PVC
  2. Create a deployment using the PVC
  3. Repeatedly and quickly scale the deployment up from 0 to 5 and back down to 0
  4. Verify whether any pod gets stuck coming up

Yeah, it’s similar. On our side we create new pod definitions (new steps) instead of scaling a deployment, but the logic is the same - creating and removing pods that mount/unmount a Longhorn volume.

The question is: is this valid for Longhorn? Should it support such cases? We have used Longhorn this way for ~2 years and we didn't observe such problems in the past.

Got it. Longhorn introduced a new Attachment/Detachment mechanism in v1.5.0. Not sure if it is related; it is still under investigation. Ref: https://github.com/longhorn/longhorn/issues/3715

cc @PhanLe1010

Thanks guys, we will verify this fix in 1.5.4

@roger-ryao, that looks good. I would be up for building a 1.5.1-based private build (this fix is also being backported to 1.5.4) if @ajoskowski would be up for pre-testing it before 1.5.4 releases.

I think this is a good candidate for backport to 1.4 and 1.5. @innobead do you agree?

I also tried a repro with @Phan’s idea of using a deployment with an RWX volume and scaling it up and down quickly. Specifically, I used the rwx example from https://github.com/longhorn/longhorn/examples/rwx/rwx-nginx-deployment.yaml, although I modified the container slightly to do this

      containers:
        - image: ubuntu:xenial
          imagePullPolicy: IfNotPresent
          command: [ "/bin/sh", "-c" ]
          args:
            - sleep 10; touch /data/index.html; while true; do  echo "`hostname` `date`" >> /data/index.html; sleep 1; done;

to include the hostname in the periodic writes to the shared volume.

Even with scaling up to 3 and down to 0 at 10-second intervals (far faster than the attach and detach can be accomplished) no containers crashed, and no replicas were broken. Kubernetes and Longhorn are untroubled by the fact that creating and terminating resources overlap. In fact, I revised the time after scale up to 60 seconds, and the time after scale down to 0, so new pods were created immediately, and that just had the effect of attaching and writing from the new pods while the old ones were still detaching. So for some interval, there were 6 pods writing to the volume without trouble. I conclude from that test that this is likely not a representative repro of the situation in this issue.
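For reference, the scale test can be driven by a simple loop like the one below; the deployment name rwx-test and the timings are assumptions, not the exact script used in the test above.

# Hypothetical scale up/down loop for the RWX deployment test;
# adjust the deployment name, namespace, and sleep times to match the actual setup.
for i in $(seq 1 50)
do
        kubectl scale deployment rwx-test --replicas=3
        sleep 10
        kubectl scale deployment rwx-test --replicas=0
        sleep 10
done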

Thanks @ajoskowski! The provided yaml https://github.com/longhorn/longhorn/files/12734412/volumes_attachments_engines_replicas.log doesn’t show anything abnormal, as it was taken when the problem was not present.

We will try to reproduce the issue in the lab.

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@ajoskowski Sorry, I got confused when reading the log file. Does the all-longhorn-manager.log include all longhorn-manager pods’ logs or only one longhorn-manager pod?

All logs from all longhorn-manager instances - that is, a combined set of logs from the 3 instances. Do you need them in separate files?

Yeah, some messages are mixed together. Can you help provide separate files? Thank you.