longhorn: [BUG] Volumes failing to mount because of engine upgradedReplicaAddressMap reference

Describe the bug (🐛 if you encounter this issue)

Pods mounting Longhorn volumes stay in ContainerCreating because of DeadlineExceeded timeouts:

Warning FailedAttachVolume 2s attachdetach-controller AttachVolume.Attach failed for volume "pvc-4d169587-a573-4e5a-a7d7-76326e07082a" : rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163

To Reproduce

  1. Rancher cluster with multiple pods, with multiple PVCs mounted, on Longhorn 1.4.3
  2. Update the cluster with concurrentAutomaticEngineUpgradePerNodeLimit set to 1 (one way to apply this setting is sketched after this list)
  3. Some pods should get stuck in ContainerCreating
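
For reference, a minimal sketch of how this setting can be applied; the Helm value and Setting CR names below are assumptions based on the standard Longhorn chart, not taken from this report:

# Via Helm values (then helm upgrade the longhorn release):
defaultSettings:
  concurrentAutomaticEngineUpgradePerNodeLimit: "1"

# Or directly on the Longhorn Setting custom resource:
kubectl -n longhorn-system patch settings.longhorn.io \
  concurrent-automatic-engine-upgrade-per-node-limit \
  --type merge -p '{"value":"1"}'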

Expected behavior

Pods can start up again after the update with an updated image.

Support bundle for troubleshooting

Can provide if needed.

Environment

  • Longhorn version: 1.4.3 -> 1.5.1
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm via Fleet
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.24.16
    • Number of management nodes in the cluster: 3
    • Number of worker nodes in the cluster: 3-42
  • Node config
    • OS type and version: Ubuntu 20.04.6 LTS
    • Kernel version: 5.4.0-149-generic
    • CPU per node: 20
    • Memory per node: 51
    • Disk type (e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMware / Baremetal / OTC
  • Number of Longhorn volumes in the cluster: up to 300

Additional context

CSI attacher logs:

I0922 09:27:31.321426       1 connection.go:193] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I0922 09:27:31.321443       1 connection.go:194] GRPC request: {"node_id":"worker2-az2","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"dataLocality":"disabled","fromBackup":"","fsType":"ext4","numberOfReplicas":"3","staleReplicaTimeout":"30","storage.kubernetes.io/csiProvisionerIdentity":"1677452580801-8081-driver.longhorn.io"},"volume_id":"pvc-4d169587-a573-4e5a-a7d7-76326e07082a"}
I0922 09:27:56.776914       1 connection.go:200] GRPC response: {}
I0922 09:27:56.776957       1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = volume pvc-1f15233c-88a6-4afa-b509-0ba5de7a6f01 failed to attach to node worker6-az2 with attachmentID csi-7a148c2c38d27d609a57560f74d486053a2f35fdf632aab1340ccc478e82ec6f
I0922 09:27:56.776985       1 csi_handler.go:591] Saving attach error to "csi-7a148c2c38d27d609a57560f74d486053a2f35fdf632aab1340ccc478e82ec6f"

Longhorn manager logs:

10.42.2.28 - - [22/Sep/2023:09:45:52 +0000] "GET /v1/volumes/pvc-4d169587-a573-4e5a-a7d7-76326e07082a HTTP/1.1" 200 7138 "" "Go-http-client/1.1"
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Volume, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\",\"recurring-job-group.longhorn.io/default\":\"enabled\",\"setting.longhorn.io/remove-snapshots-during-filesystem-trim\":\"ignored\",\"setting.longhorn.io/replica-auto-balance\":\"ignored\",\"setting.longhorn.io/snapshot-data-integrity\":\"ignored\"}}]" service=admissionWebhook
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Replica, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhorn.io/backing-image\":\"\",\"longhorndiskuuid\":\"3e9af771-0505-4c2d-ab69-c954870f83dc\",\"longhornnode\":\"worker9-az1\",\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}}]" service=admissionWebhook
10.42.2.28 - - [22/Sep/2023:09:45:52 +0000] "POST /v1/volumes/pvc-4d169587-a573-4e5a-a7d7-76326e07082a?action=attach HTTP/1.1" 200 6105 "" "Go-http-client/1.1"
time="2023-09-22T09:45:52Z" level=info msg="Creating instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759"
time="2023-09-22T09:45:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759\", UID:\"c0de5bdc-6ed9-464f-ad25-c28902bb7b0c\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"279418379\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759"
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Replica, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhorn.io/backing-image\":\"\",\"longhorndiskuuid\":\"20903d19-9f5f-4d1a-8bf3-fc9c52fe8c66\",\"longhornnode\":\"worker3-az1\",\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}}]" service=admissionWebhook
time="2023-09-22T09:45:52Z" level=info msg="Creating instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2"
time="2023-09-22T09:45:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2\", UID:\"cc9a80ef-fbd6-4f1e-a5e3-d8b0628d5733\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"279418383\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2"
time="2023-09-22T09:45:53Z" level=error msg="Failed to run engine live upgrade" controller=longhorn-engine engine=pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3 error="failed to live upgrade image for pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3: cannot get client for engine pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3: engine is not running" node=worker2-az2
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, Storage IP 10.42.11.145"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, IP 10.42.11.145"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, Port 10051"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, Storage IP 10.42.6.247"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, IP 10.42.6.247"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, Port 10000"

longhorn-csi-plugin/longhorn-csi-plugin logs:

time="2023-09-22T09:48:52Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:48:52Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"
time="2023-09-22T09:48:52Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"worker2-az2\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677452580801-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}"
time="2023-09-22T09:48:52Z" level=info msg="Volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a with accessMode rwo requesting publishing with attachInput &{Resource:{Id: Type: Links:map[] Actions:map[]} AttachedBy: AttacherType:csi-attacher AttachmentID:csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163 DisableFrontend:false HostId:worker2-az2}" component=csi-controller-server function=publishVolume
time="2023-09-22T09:50:22Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:50:22Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"
time="2023-09-22T09:50:22Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"worker2-az2\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677452580801-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}"
time="2023-09-22T09:50:22Z" level=info msg="Volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a with accessMode rwo requesting publishing with attachInput &{Resource:{Id: Type: Links:map[] Actions:map[]} AttachedBy: AttacherType:csi-attacher AttachmentID:csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163 DisableFrontend:false HostId:worker2-az2}" component=csi-controller-server function=publishVolume
time="2023-09-22T09:24:49Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:24:49Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"

Most upvoted comments

Reproducing steps:

  1. Install Longhorn v1.5.1
  2. Deploy engine image longhornio/longhorn-engine:v1.4.3 using Longhorn UI
  3. Create a testvol
  4. Upgrade engine image for the volume from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:v1.4.3
  5. Attach the volume to a node
  6. Upgrade engine image for the volume from longhornio/longhorn-engine:v1.4.3 to longhornio/longhorn-engine:v1.5.1. Quickly detach the volume while the upgrade is in progress
  7. Reattach the volume. Check whether the volume is able to finish attaching, and whether you can see orphan replica references in engine.Spec.upgradedReplicaAddressMap (a check command is sketched after this list)
  8. If the issue is not hit, repeat this process

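A quick way to check step 7, assuming the engine CRs carry the same longhornvolume label that the replicas show in the manager logs above (an assumption, not confirmed in this thread):

kubectl -n longhorn-system get engines.longhorn.io -l longhornvolume=testvol \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.upgradedReplicaAddressMap}{"\n"}{end}'
# A non-empty map printed after the volume has finished attaching indicates the orphan replica references.
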
Verified as passing on longhorn master (longhorn-manager 286e6e).

Repeated the steps below many times and did not encounter the volume attach failure issue:

  1. Install Longhorn master-head
  2. Deploy engine image longhornio/longhorn-engine:v1.5.1 using Longhorn UI
  3. Create a testvol
  4. Upgrade engine image for the volume from longhornio/longhorn-engine:master-head to longhornio/longhorn-engine:v1.5.1
  5. Attach the volume to a node
  6. Upgrade engine image for the volume from longhornio/longhorn-engine:v1.5.1 to longhornio/longhorn-engine:master-head. Quickly detach the volume while the upgrade is in progress
  7. Reattach the volume.
  8. The volume attached, and the value of engine.Spec.upgradedReplicaAddressMap is {}

Watching the engine YAML during the reproduction, the behavior seems to agree with the code analysis at https://github.com/longhorn/longhorn/issues/6762#issuecomment-1740137202:

➜  .kube k get engine -n longhorn-system -oyaml -w | grep 'engineImage\|upgradedReplicaAddressMap\|currentImage\|apiVersion' -A 1

# initially engine has image v1.4.3 in both spec and status
apiVersion: longhorn.io/v1beta2
kind: Engine
--
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
--
  engineImage: longhornio/longhorn-engine:v1.4.3
  frontend: blockdev
--
  upgradedReplicaAddressMap: {}
  volumeName: testvol
--
  currentImage: longhornio/longhorn-engine:v1.4.3
  currentReplicaAddressMap:
--

# engine.spec is changed to v1.5.1 but status is still v1.4.3
apiVersion: longhorn.io/v1beta2
kind: Engine
--
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
--
  engineImage: longhornio/longhorn-engine:v1.5.1
  frontend: blockdev
--
  upgradedReplicaAddressMap:
    testvol-r-1e6f72ef: 10.42.94.12:10010
--
  currentImage: longhornio/longhorn-engine:v1.4.3
  currentReplicaAddressMap:
--

# volume starts detaching so the engine starts stopping
# engine.spec is v1.5.1 and status is v1.4.3

apiVersion: longhorn.io/v1beta2
kind: Engine
--
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
--
  engineImage: longhornio/longhorn-engine:v1.5.1
  frontend: blockdev
--
  upgradedReplicaAddressMap:
    testvol-r-1e6f72ef: 10.42.94.12:10010
--
  currentImage: longhornio/longhorn-engine:v1.4.3
  currentReplicaAddressMap:
--


# engine is stopped and there are orphan replica references in upgradedReplicaAddressMap
# engine.spec is v1.5.1 and engine.status.currentImage is cleared

apiVersion: longhorn.io/v1beta2
kind: Engine
--
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
--
  engineImage: longhornio/longhorn-engine:v1.5.1
  frontend: blockdev
--
  upgradedReplicaAddressMap:
    testvol-r-1e6f72ef: 10.42.94.12:10010
--
  currentImage: ""
  currentReplicaAddressMap:

@FrankYang0529 Shuo is off. Please try this and work with @PhanLe1010 or @derekbit. Thanks.

Thanks @puffitos

Log analysis

Tracking the volume pvc-cd43fc29-7407-4167-9583-589d5c76644e in the support bundle supportbundle_7f6b4866-3804-491f-afd0-a5593b50c69a_2023-09-21T12-51-44Z, I see that:

  1. The workload pod sherloq-dagster-postgresql-0 is stuck because CSI ControllerPublish cannot succeed in attaching the volume
  2. CSI ControllerPublish cannot succeed for two reasons:
    1. The volume is somehow attached to node worker9-az1 by longhorn-ui (meaning that someone manually attached this volume using the Longhorn UI, perhaps in an effort to unblock it). Workaround: manually detach this volume using the Longhorn UI @puffitos
    2. The second reason blocking CSI ControllerPublish is that the endpoint of the volume (indicated by engine.Status.endpoint of the engine pvc-cd43fc29-7407-4167-9583-589d5c76644e-e-3294725d) is empty. This blocks the CSI ControllerPublish call as indicated here
  3. The endpoint of the volume is empty because engine.Spec.upgradedReplicaAddressMap is non-empty, which blocks the engine monitor from syncing and updating engine.Status here
  4. The same problem happens with other volumes as well, for example pvc-86463b26-dac8-4000-b281-4d6be0ee684c and pvc-7e2932a6-55a6-4909-bedb-de469fa8b0d4. In total, 3 volumes are currently stuck in this state (a query sketched after this list can help find them).
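
To find other engines stuck in this state, a query along the following lines should work. It is a sketch only: it assumes jq is available, uses the CR field names spec.upgradedReplicaAddressMap and status.endpoint referenced above, and was not run against the support bundle:

kubectl -n longhorn-system get engines.longhorn.io -o json \
  | jq -r '.items[]
      | select((.spec.upgradedReplicaAddressMap | length) > 0 and (.status.endpoint // "") == "")
      | .metadata.name'
# Prints engines that still hold leftover replica references while their volume endpoint is empty.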

Code analysis

It looks like the direct reason that blocks the flow is the non-empty engine.Spec.upgradedReplicaAddressMap. I am investigating the volume controller code flow to see how it is possible that the volume finished the engine image upgrade but engine.Spec.upgradedReplicaAddressMap was not cleaned up.

From a first round of code flow review, it seems this might happen like this:

  1. volume.Spec.EngineImage is updated to v1.5.1
  2. engine.Spec.EngineImage is updated to v1.5.1 and engine.Spec.UpgradedReplicaAddressMap is set to a new set of replicas: link
  3. The volume controller waits for the engine controller to finish upgrading and set e.Status.CurrentImage to v1.5.1. However, the engine might crash before the engine controller sets e.Status.CurrentImage to v1.5.1.
  4. When the engine/replica controller stops the engine/replica, it clears engine.Status.CurrentImage and replica.Status.CurrentImage: link
  5. Also, the volume controller blindly sets volume.Status.CurrentImage to v1.5.1 as well: link
  6. When the volume is reattached, the engine/replica are started and the engine/replica controller blindly sets engine.Status.CurrentImage and replica.Status.CurrentImage to v1.5.1, because spec.EngineImage is v1.5.1 at this time. See link
  7. At this moment, both status and spec of the engine/replica/volume CRs are at v1.5.1. However, engine.Spec.UpgradedReplicaAddressMap is NOT cleaned up, because the volume controller no longer considers the volume to be in the upgrade process; i.e., this check over here stops the cleanup. (A command to confirm this end state is sketched after this list.)
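
For reference, a quick way to confirm the end state in step 7 on an affected engine could look like the following; the engine name is a placeholder, and the field names (spec.engineImage, status.currentImage, spec.upgradedReplicaAddressMap) are the ones shown in the engine YAML above:

# Expect both images to report v1.5.1 while upgradedReplicaAddressMap is still non-empty
kubectl -n longhorn-system get engine <engine-name> \
  -o jsonpath='{.spec.engineImage}{"\n"}{.status.currentImage}{"\n"}{.spec.upgradedReplicaAddressMap}{"\n"}'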

cc @shuo-wu @derekbit @james-munson WDYT?

Next steps

To confirm the above theory, I am trying to find a way to reproduce the problem.

Update: it looks like we are able to reproduce it: https://github.com/longhorn/longhorn/issues/6762#issuecomment-1744062922

Followup question

@puffitos Is it correct that some of the workloads here crashed/restarted during the engine upgrade?

@PhanLe1010 sorry for the confusion; we've had this problem in multiple clusters and have only pulled one support bundle. In that bundle, the volume pvc-cd43fc29-7407-4167-9583-589d5c76644e should be affected by the problem.

@puffitos yes, it is perfectly acceptable

Understood. The bundle is quite large (a couple hundred MB), so I'll need to upload it to Dropbox or Google Drive. Is that acceptable?