longhorn: [BUG] Persistent volume is not ready for workloads

Describe the bug (🐛 if you encounter this issue)

Sometimes we encounter issues where we are not able to mount a Longhorn volume to a pod. The pod is not able to start and the following errors are visible:

  • Kubernetes events for failing pod:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
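When this error appears, the Longhorn volume's actual state can be checked directly on the Volume CR. A minimal sketch of such a check; the status.state and status.robustness fields are assumptions about the Longhorn Volume CR, shown here only for illustration:

# Hypothetical readiness check for the affected volume.
kubectl -n longhorn-system get volumes.longhorn.io pvc-506e824d-414f-43ce-af59-5821b2b9accf \
  -o jsonpath='{.status.state} {.status.robustness}{"\n"}'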

To Reproduce

The problem cannot be easily reproduced - it fails randomly.

Expected behavior

Volumes work correctly and can be mounted to the pods.

Support bundle for troubleshooting

We cannot send a support bundle for security reasons, but we can provide logs and details - see below:

  • Kubernetes (and Longhorn) nodes:
    • ip-X-X-X-57.compute.internal
    • ip-X-X-X-142.compute.internal
    • ip-X-X-X-140.compute.internal
  • Pod name: ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516
  • Pod events:
AttachVolume.Attach failed for volume "pvc-506e824d-414f-43ce-af59-5821b2b9accf" : rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: chdir to cwd ("/src") set in config.json failed: stale NFS file handle: unknown
  • instance-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 310
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 311
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 312
tgtd: bs_longhorn_request(111) fail to read at 0 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 28 -14 4096 0, Success
response_process: Receive error for response 3 of seq 313
tgtd: bs_longhorn_request(97) fail to write at 10737352704 for 65536
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 65536 10737352704, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
response_process: Receive error for response 3 of seq 314
tgtd: bs_longhorn_request(97) fail to write at 4337664 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 4337664, Success
response_process: Receive error for response 3 of seq 315
tgtd: bs_longhorn_request(97) fail to write at 37912576 for 4096
tgtd: bs_longhorn_request(210) io error 0xc27700 2a -14 4096 37912576, Success
[pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a] time="2023-09-18T03:08:01Z" level=error msg="I/O error" error="no backend available"
time="2023-09-18T03:08:20Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="failed to sync engine for longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: failed to start rebuild for pvc-506e824d-414f-43ce-af59-5821b2b9accf-r-6093cefb of pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a: timed out waiting for the condition" node=ip-10-44-45-142.eu-central-1.compute.internal
  • longhorn-csi-plugin on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:07:34Z" level=error msg="ControllerPublishVolume: err: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads"
  • csi-attacher on ip-X-X-X-142.compute.internal node:
I0918 03:07:34.632251       1 csi_handler.go:234] Error processing "csi-635290b8ff08b07c1e7e1bdf2434aec2d8e8ef39dd611f725f8f3da595713bf5": failed to attach: rpc error: code = Aborted desc = volume pvc-506e824d-414f-43ce-af59-5821b2b9accf is not ready for workloads
  • longhorn-manager on ip-X-X-X-142.compute.internal node:
time="2023-09-18T03:08:00Z" level=error msg="Failed to rebuild replica X.X.X.245:10205" controller=longhorn-engine engine=pvc-506e824d-414f-43ce-af59-5821b2b9accf-e-0df2341a error="proxyServer=X.X.X.201:8501 destination=X.X.X.201:10079: failed to add replica tcp://X.X.X.245:10205 for volume: rpc error: code = Unknown desc = failed to create replica tcp://X.X.X.245:10205 for volume X.X.X.201:10079: rpc error: code = Unknown desc = cannot get valid result for remain snapshot" node=ip-X-X-X-142.eu-central-1.compute.internal volume=pvc-506e824d-414f-43ce-af59-5821b2b9accf
time="2023-09-18T03:08:00Z" level=error msg="Failed to sync Longhorn volume longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf" controller=longhorn-volume error="failed to sync longhorn-system/pvc-506e824d-414f-43ce-af59-5821b2b9accf: failed to reconcile volume state for pvc-506e824d-414f-43ce-af59-5821b2b9accf: no healthy or scheduled replica for starting" node=ip-X-X-X-142.eu-central-1.compute.internal

Environment

  • Longhorn version: v1.5.1
  • Installation method: helm
  • Kubernetes distro and version: AWS EKS, version v1.26.6
    • Number of worker nodes in the cluster: 3
    • Machine type: m5.4xlarge
  • Number of Longhorn volumes in the cluster: tens of volumes created dynamically as temporary storage for CICD builds (Longhorn + Argo Workflows)
  • Impacted Longhorn resources:
    • Volume names: pvc-506e824d-414f-43ce-af59-5821b2b9accf (only example)

Additional context

Cluster autoscaler is enabled on the cluster, and the Kubernetes Cluster Autoscaler Enabled (Experimental) setting is turned on in the Longhorn configuration.
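For reference, the setting can be confirmed on the Longhorn settings CR; a minimal check, assuming the internal name of "Kubernetes Cluster Autoscaler Enabled (Experimental)" is kubernetes-cluster-autoscaler-enabled:

# Hypothetical check of the autoscaler setting; the setting name is an assumption.
kubectl -n longhorn-system get settings.longhorn.io kubernetes-cluster-autoscaler-enabled -o jsonpath='{.value}{"\n"}'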

About this issue

  • Original URL
  • State: closed
  • Created 9 months ago
  • Reactions: 10
  • Comments: 38 (24 by maintainers)

Most upvoted comments

| Timestamp | Node | Log |
| --- | --- | --- |
| 2023-09-18T03:07:39Z | 57 | Starting share manager |
| 2023-09-18T03:07:39Z | 57 | Created pod for share manager |
| 2023-09-18T03:07:40Z | 142 | The share manager pod is created on node 142, so the share manager's ownership is transferred to 142: Volume got new owner ip-X-X-X-142.compute.internal |
| | 57 | Volume detachment is triggered from node 57 by shouldDoDetach() because there is no attachment ticket. |
| 2023-09-18T03:07:39Z | 142 | The volume is requested to attach: X.X.X.196 - - [18/Sep/2023:03:07:39 +0000] "POST /v1/volumes/pvc-506e824d-414f-43ce-af59-5821b2b9accf?action=attach HTTP/1.1" 200 7163 "" "Go-http-client/1.1" |
| 2023-09-18T03:07:40Z | 142 | Volume got new owner ip-X-X-X-142.compute.internal |

The volume is requested to attach to node 142 while, in the meantime, it is triggered to detach from node 57. The newly created replicas are mistakenly deleted by the detachment operation.

The following script repeatedly attaches and detaches a pod that uses the share manager:

#!/bin/bash

ATTACH_WAIT_SECONDS=120
DETACH_WAIT_SECONDS=300

for ((i=0;i<200;i++))
do
        node=$((i%3+1))
        echo "Iteration #$i at $(date) node ${node}"

        # Change the share manager owner ID and trigger the volume attachment to request a new node
        status="status: {ownerID: rancher60-worker${node}}"
        kubectl -n longhorn-system patch sharemanager pvc-1418f411-ccad-4df9-a66a-050cb80d4645 --type=merge --subresource status --patch "$status"

        kubectl apply -f pod1.yaml

        echo "Attaching"
        c=0
        while [ $c -lt $ATTACH_WAIT_SECONDS ]
        do
                phase=$(kubectl get pod volume-test-1 -o=jsonpath='{.status.phase}' 2>/dev/null)
                if [ "$phase" == "Running" ]; then
                        break
                fi

                sleep 1

                c=$((c+1))
                if [ $c -eq $ATTACH_WAIT_SECONDS ]; then
                        echo "Found error"
                        exit 1
                fi
        done

        echo "Detaching"
        kubectl delete -f pod1.yaml

        # Wait for the Longhorn volume to reach the detached state
        c=0
        while [ $c -lt $DETACH_WAIT_SECONDS ]
        do
                state=$(kubectl -n longhorn-system get volumes pvc-1418f411-ccad-4df9-a66a-050cb80d4645 -o=jsonpath='{.status.state}' 2>/dev/null)
                if [ "$state" == "detached" ]; then
                        break
                fi

                sleep 1

                c=$((c+1))
                if [ $c -eq $DETACH_WAIT_SECONDS ]; then
                        echo "Failed to detach"
                        exit 1
                fi
        done
done
  • sc_pvc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: longhorn-test
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"
  staleReplicaTimeout: "2880"
  fromBackup: ""
  fsType: "ext4"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-volv-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: longhorn-test
  resources:
    requests:
      storage: 2Gi
  • pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-test-1
  namespace: default
spec:
  restartPolicy: Always
  containers:
  - name: volume-test-1
    image: ubuntu:20.04
    command:
      - "sleep"
      - "604800"
    imagePullPolicy: IfNotPresent
    volumeMounts:
    - name: longhorn-volv
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: longhorn-volv
    persistentVolumeClaim:
      claimName: longhorn-volv-pvc

Currently, I cannot reproduce the faulty volume issue (all replicas mistakenly deleted). However, from the test we can sometimes observe that one of the attaching volume's replica instances is mistakenly deleted, so mistakenly deleting all replicas is possible.

  • What we can improve
    • Improve the logging in the share manager controller and volume_attachment_controller to help the debugging process
    • Make sure there is no ongoing replica deletion operation before attaching a volume (see the sketch after this list)
    • Starting a share manager in the current design is lengthy. Is it possible to simplify the flow?
      • The share manager is picked by the status.ownerID
      • The volume is requested to attach to status.ownerID
      • The share manager controller creates a share manager pod
      • If the pod is created on another node, share-manager ownership is transferred to that node
      • The volume is requested to detach
      • The volume is requested to attach
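A minimal sketch of the pre-attach check suggested above, assuming the replica CRs carry a longhornvolume=<volume-name> label and that an in-progress deletion shows up as a non-empty metadata.deletionTimestamp (both are assumptions about the Longhorn CRs, not confirmed here):

VOL=pvc-506e824d-414f-43ce-af59-5821b2b9accf
# Print each replica of the volume together with its deletionTimestamp;
# an empty second column means no deletion is in progress for that replica.
kubectl -n longhorn-system get replicas.longhorn.io -l "longhornvolume=$VOL" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.deletionTimestamp}{"\n"}{end}'

If any replica still shows a deletionTimestamp, the attach could be deferred until that deletion completes.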

@james-munson Can you help take a look at this issue and see if I missed something? If possible, see if we can have a fix for the race issue.

cc @shuo-wu @innobead

As we discussed last time, this part is problematic; it may lead to unexpected and unnecessary detachment after a node is temporarily unavailable (kubelet/network down). In fact, keeping volume.Spec.NodeID the same as ShareManager.Status.OwnerID is unnecessary.

The share-manager-controller workflow can be like the following:

  1. Start the share manager pod scheduled by Kubernetes.
  2. Set the volume attachment ticket to the pod node
  3. Unset the volume attachment ticket when the volume or the share manager pod is in an error state or is stopping/stopped (an example of inspecting the tickets follows this list)
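For steps 2 and 3, a minimal way to look at the tickets on the affected volume, assuming the Longhorn VolumeAttachment CR is named after the volume and keeps its tickets under spec.attachmentTickets (an assumption about the v1.5.x attach/detach mechanism):

# Hypothetical inspection of the attachment tickets for the volume from this issue;
# the CR name and the spec.attachmentTickets field path are assumptions.
kubectl -n longhorn-system get volumeattachments.longhorn.io \
  pvc-506e824d-414f-43ce-af59-5821b2b9accf -o jsonpath='{.spec.attachmentTickets}'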

Does the CI/CD process involve restarting nodes or pods? The share manager recovery backend logs that it is removing NFS client entries on multiple occasions.

Each step in the pipeline is a separate pod with a shared Longhorn volume. That means if you have 10 steps, you can think of them as 10 pods, each mounting the shared Longhorn volume.

@ajoskowski Could we have:

  • the yaml output of kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace
  • Is the workload pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 stuck right now?
  • Regarding kubectl get volumes.longhorn.io,volumeattachments.longhorn.io,engines.longhorn.io,replicas.longhorn.io -n longhorn-system -oyaml - no problem, but I am only able to do it for the current cluster shape (the problematic volume was already removed) - volumes_attachments_engines_replicas.log
  • logs from longhorn-csi-plugin-xxx pods in longhorn-system namespace - I do not see any logs that contain information about pvc-506e824d-414f-43ce-af59-5821b2b9accf
  • Pod ci-state-pr-2463-env-doaks-prod6-uaenorth-v2wnw-override-refs-in-tf-modules-407269516 was not able to start due to the problem I've described in the bug description, and it has since been removed

In an effort to reproduce, do you think the following steps are similar to your CI pipeline, @ajoskowski?

  1. Create a RWX PVC
  2. Create a deployment using the PVC
  3. Repeatedly and quickly scale the deployment up from 0 to 5 and back down to 0
  4. Verify whether any pod gets stuck coming up

Yeah, it’s similar. On our side we create new pod definitions (new steps) instead of scaling a deployment, but the logic is the same - creating and removing pods that mount/unmount a Longhorn volume.

The question is: is this valid for Longhorn? Should it support such cases? We have used Longhorn this way for ~2 years and we didn't observe such problems in the past.

Got it. Longhorn introduced a new Attachment/Detachment mechanism in v1.5.0. Not sure if it is related; it is still under investigation. Ref: https://github.com/longhorn/longhorn/issues/3715

cc @PhanLe1010

Thanks guys, we will verify this fix in 1.5.4

@roger-ryao, that looks good. I would be up for building a 1.5.1-based private build (this fix is also being backported to 1.5.4) if @ajoskowski would be up for pre-testing it before 1.5.4 releases.

I think this is a good candidate for backport to 1.4 and 1.5. @innobead do you agree?

I also tried a repro with @Phan’s idea of using a deployment with an RWX volume and scaling it up and down quickly. Specifically, I used the rwx example from https://github.com/longhorn/longhorn/examples/rwx/rwx-nginx-deployment.yaml, although I modified the container slightly to do this

      containers:
        - image: ubuntu:xenial
          imagePullPolicy: IfNotPresent
          command: [ "/bin/sh", "-c" ]
          args:
            - sleep 10; touch /data/index.html; while true; do  echo "`hostname` `date`" >> /data/index.html; sleep 1; done;

to include the hostname in the periodic writes to the shared volume.

Even with scaling up to 3 and down to 0 at 10-second intervals (far faster than the attach and detach can be accomplished) no containers crashed, and no replicas were broken. Kubernetes and Longhorn are untroubled by the fact that creating and terminating resources overlap. In fact, I revised the time after scale up to 60 seconds, and the time after scale down to 0, so new pods were created immediately, and that just had the effect of attaching and writing from the new pods while the old ones were still detaching. So for some interval, there were 6 pods writing to the volume without trouble. I conclude from that test that this is likely not a representative repro of the situation in this issue.
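For reference, the scale test can be driven by a simple loop like the one below; the deployment name rwx-test and the timings are assumptions, not the exact script used in the test above.

# Hypothetical scale up/down loop for the RWX deployment test;
# adjust the deployment name, namespace, and sleep times to match the actual setup.
for i in $(seq 1 50)
do
        kubectl scale deployment rwx-test --replicas=3
        sleep 10
        kubectl scale deployment rwx-test --replicas=0
        sleep 10
done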

Thanks @ajoskowski! The provided yaml https://github.com/longhorn/longhorn/files/12734412/volumes_attachments_engines_replicas.log doesn’t show anything abnormal, as it was taken when the problem was not present.

We will try to reproduce the issue in the lab.

Longhorn manager logs from all (3) instances: all-longhorn-manager.log

@ajoskowski Sorry, I got confused when reading the log file. Does the all-longhorn-manager.log include all longhorn-manager pods’ logs or only one longhorn-manager pod?

All logs from all longhorn-manager instances - that is, a combined set of logs from the 3 instances. Do you need them in separate files?

Yeah, some messages are mixed together. Can you help provide separate files? Thank you.