longhorn: [BUG] Volumes failing to mount because of engine upgradedReplicaAddressMap reference
Describe the bug (🐛 if you encounter this issue)
Pods mounting Longhorn volumes stay stuck in ContainerCreating because of DeadlineExceeded timeouts:
Warning FailedAttachVolume 2s attachdetach-controller AttachVolume.Attach failed for volume "pvc-4d169587-a573-4e5a-a7d7-76326e07082a" : rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163
To Reproduce
- Rancher cluster with multiple pods, each with multiple PVCs mounted, on Longhorn 1.4.3
- Update the cluster with `concurrentAutomaticEngineUpgradePerNodeLimit` set to 1 (see the sketch after this list)
- Some pods should get stuck in `ContainerCreating`
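Since this setting is what triggers the automatic engine upgrades involved here, below is a minimal sketch (not from the original report) of checking it programmatically before the update. It assumes the camelCase Helm value corresponds to the Longhorn setting named `concurrent-automatic-engine-upgrade-per-node-limit`, stored as a `Setting` object with a top-level `value` field in the `longhorn-system` namespace.

```go
// check_setting.go: print the Longhorn setting that enables automatic engine
// upgrades after a Longhorn upgrade. Setting name and CRD layout are the
// assumptions stated above.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	settings := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "settings"}
	s, err := client.Resource(settings).Namespace("longhorn-system").
		Get(context.TODO(), "concurrent-automatic-engine-upgrade-per-node-limit", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Longhorn stores setting values as strings; any value above "0" lets the
	// manager upgrade volume engines automatically, which is what races with
	// attach/detach in this report.
	fmt.Println("concurrent-automatic-engine-upgrade-per-node-limit =", s.Object["value"])
}
```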
Expected behavior
Pods can start up again after the update, using the upgraded engine image.
Support bundle for troubleshooting
Can provide if needed.
Environment
- Longhorn version: 1.4.3 -> 1.5.1
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): helm via Fleet
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE v1.24.16
- Number of management node in the cluster: 3
- Number of worker node in the cluster: 3-42
- Node config
- OS type and version: Ubuntu 20.04.6 LTS
- Kernel version: 5.4.0-149-generic
- CPU per node: 20
- Memory per node: 51
- Disk type (e.g. SSD/NVMe/HDD): SSD
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMware / Baremetal / OTC
- Number of Longhorn volumes in the cluster: up to 300
Additional context
CSI attacher logs:
I0922 09:27:31.321426 1 connection.go:193] GRPC call: /csi.v1.Controller/ControllerPublishVolume
I0922 09:27:31.321443 1 connection.go:194] GRPC request: {"node_id":"worker2-az2","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"dataLocality":"disabled","fromBackup":"","fsType":"ext4","numberOfReplicas":"3","staleReplicaTimeout":"30","storage.kubernetes.io/csiProvisionerIdentity":"1677452580801-8081-driver.longhorn.io"},"volume_id":"pvc-4d169587-a573-4e5a-a7d7-76326e07082a"}
I0922 09:27:56.776914 1 connection.go:200] GRPC response: {}
I0922 09:27:56.776957 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = volume pvc-1f15233c-88a6-4afa-b509-0ba5de7a6f01 failed to attach to node worker6-az2 with attachmentID csi-7a148c2c38d27d609a57560f74d486053a2f35fdf632aab1340ccc478e82ec6f
I0922 09:27:56.776985 1 csi_handler.go:591] Saving attach error to "csi-7a148c2c38d27d609a57560f74d486053a2f35fdf632aab1340ccc478e82ec6f"
Longhorn manager logs:
10.42.2.28 - - [22/Sep/2023:09:45:52 +0000] "GET /v1/volumes/pvc-4d169587-a573-4e5a-a7d7-76326e07082a HTTP/1.1" 200 7138 "" "Go-http-client/1.1"
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Volume, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\",\"recurring-job-group.longhorn.io/default\":\"enabled\",\"setting.longhorn.io/remove-snapshots-during-filesystem-trim\":\"ignored\",\"setting.longhorn.io/replica-auto-balance\":\"ignored\",\"setting.longhorn.io/snapshot-data-integrity\":\"ignored\"}}]" service=admissionWebhook
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Replica, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhorn.io/backing-image\":\"\",\"longhorndiskuuid\":\"3e9af771-0505-4c2d-ab69-c954870f83dc\",\"longhornnode\":\"worker9-az1\",\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}}]" service=admissionWebhook
10.42.2.28 - - [22/Sep/2023:09:45:52 +0000] "POST /v1/volumes/pvc-4d169587-a573-4e5a-a7d7-76326e07082a?action=attach HTTP/1.1" 200 6105 "" "Go-http-client/1.1"
time="2023-09-22T09:45:52Z" level=info msg="Creating instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759"
time="2023-09-22T09:45:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759\", UID:\"c0de5bdc-6ed9-464f-ad25-c28902bb7b0c\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"279418379\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759"
time="2023-09-22T09:45:52Z" level=info msg="Request (user: system:serviceaccount:longhorn-system:longhorn-service-account, longhorn.io/v1beta2, Kind=Replica, namespace: longhorn-system, name: pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2, operation: UPDATE) patchOps: [{\"op\": \"replace\", \"path\": \"/metadata/labels\", \"value\": {\"longhorn.io/backing-image\":\"\",\"longhorndiskuuid\":\"20903d19-9f5f-4d1a-8bf3-fc9c52fe8c66\",\"longhornnode\":\"worker3-az1\",\"longhornvolume\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}}]" service=admissionWebhook
time="2023-09-22T09:45:52Z" level=info msg="Creating instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2"
time="2023-09-22T09:45:52Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2\", UID:\"cc9a80ef-fbd6-4f1e-a5e3-d8b0628d5733\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"279418383\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2"
time="2023-09-22T09:45:53Z" level=error msg="Failed to run engine live upgrade" controller=longhorn-engine engine=pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3 error="failed to live upgrade image for pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3: cannot get client for engine pvc-4d169587-a573-4e5a-a7d7-76326e07082a-e-d8bd44e3: engine is not running" node=worker2-az2
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, Storage IP 10.42.11.145"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, IP 10.42.11.145"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-a1575759 starts running, Port 10051"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, Storage IP 10.42.6.247"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, IP 10.42.6.247"
time="2023-09-22T09:45:53Z" level=warning msg="Instance pvc-4d169587-a573-4e5a-a7d7-76326e07082a-r-fcf6d7f2 starts running, Port 10000"
longhorn-csi-plugin/longhorn-csi-plugin logs:
time="2023-09-22T09:48:52Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:48:52Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"
time="2023-09-22T09:48:52Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"worker2-az2\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677452580801-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}"
time="2023-09-22T09:48:52Z" level=info msg="Volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a with accessMode rwo requesting publishing with attachInput &{Resource:{Id: Type: Links:map[] Actions:map[]} AttachedBy: AttacherType:csi-attacher AttachmentID:csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163 DisableFrontend:false HostId:worker2-az2}" component=csi-controller-server function=publishVolume
time="2023-09-22T09:50:22Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:50:22Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"
time="2023-09-22T09:50:22Z" level=info msg="ControllerPublishVolume: req: {\"node_id\":\"worker2-az2\",\"volume_capability\":{\"AccessType\":{\"Mount\":{\"fs_type\":\"ext4\"}},\"access_mode\":{\"mode\":1}},\"volume_context\":{\"dataLocality\":\"disabled\",\"fromBackup\":\"\",\"fsType\":\"ext4\",\"numberOfReplicas\":\"3\",\"staleReplicaTimeout\":\"30\",\"storage.kubernetes.io/csiProvisionerIdentity\":\"1677452580801-8081-driver.longhorn.io\"},\"volume_id\":\"pvc-4d169587-a573-4e5a-a7d7-76326e07082a\"}"
time="2023-09-22T09:50:22Z" level=info msg="Volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a with accessMode rwo requesting publishing with attachInput &{Resource:{Id: Type: Links:map[] Actions:map[]} AttachedBy: AttacherType:csi-attacher AttachmentID:csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163 DisableFrontend:false HostId:worker2-az2}" component=csi-controller-server function=publishVolume
time="2023-09-22T09:24:49Z" level=warning msg="Timeout wTile waiting for volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a state volume published" component=csi-controller-server function=waitForVolumeState
time="2023-09-22T09:24:49Z" level=error msg="ControllerPublishVolume: err: rpc error: code = DeadlineExceeded desc = volume pvc-4d169587-a573-4e5a-a7d7-76326e07082a failed to attach to node worker2-az2 with attachmentID csi-af7050e7ae2c0ec71993783c35748dc8584e7a0b4f84de588ad0714968a1c163"
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 18 (14 by maintainers)
Reproducing steps:
- Deploy engine image `longhornio/longhorn-engine:v1.4.3` using Longhorn UI.
- Create and attach a volume `testvol`.
- Downgrade the volume's engine image from `longhornio/longhorn-engine:v1.5.1` to `longhornio/longhorn-engine:v1.4.3`.
- Upgrade the engine image from `longhornio/longhorn-engine:v1.4.3` to `longhornio/longhorn-engine:v1.5.1`. Quickly detach the volume while the upgrade is in progress.
- Check `engine.Spec.upgradedReplicaAddressMap`.

Verified pass on longhorn master (longhorn-manager 286e6e). Repeated the steps below many times, did not encounter the volume attach failure issue:

- Deploy engine image `longhornio/longhorn-engine:v1.5.1` using Longhorn UI.
- Create and attach a volume.
- Downgrade the volume's engine image from `longhornio/longhorn-engine:master-head` to `longhornio/longhorn-engine:v1.5.1`.
- Upgrade the engine image from `longhornio/longhorn-engine:v1.5.1` to `longhornio/longhorn-engine:master-head`. Quickly detach the volume while the upgrade is in progress.

Watching the engine YAML while reproducing (one way to do this is sketched below), it seems to agree with the code analysis at https://github.com/longhorn/longhorn/issues/6762#issuecomment-1740137202
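To make the "watching the engine YAML" step easier to repeat, here is a minimal sketch (not from this thread) that watches Longhorn Engine objects through the Kubernetes dynamic client and prints `spec.upgradedReplicaAddressMap` together with `status.currentImage` on every change. The `longhorn.io/v1beta2` group/version and the `upgradedReplicaAddressMap` field come from the logs and analysis in this issue; the `engines` resource name, the `status.currentImage` path, and the kubeconfig handling are assumptions.

```go
// watch_engines.go: print spec.upgradedReplicaAddressMap (and, if present,
// status.currentImage) every time a Longhorn Engine object changes, so the
// live-upgrade/detach race can be observed while reproducing.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	engines := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "engines"}
	w, err := client.Resource(engines).Namespace("longhorn-system").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		e, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		// Field names are assumptions; adjust them to the CRD version in your cluster.
		m, _, _ := unstructured.NestedStringMap(e.Object, "spec", "upgradedReplicaAddressMap")
		img, _, _ := unstructured.NestedString(e.Object, "status", "currentImage")
		fmt.Printf("%s %s: upgradedReplicaAddressMap=%v currentImage=%s\n", ev.Type, e.GetName(), m, img)
	}
}
```

Run it in one terminal while performing the upgrade/detach steps above; a map that stays non-empty after the upgrade has stopped matches the stuck state analyzed later in this thread.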
@FrankYang0529 Shuo is off. Please try this and work with @PhanLe1010 or @derekbit. Thanks.
Thanks @puffitos
Log analysis
Tracking the volume `pvc-cd43fc29-7407-4167-9583-589d5c76644e` in the support bundle `supportbundle_7f6b4866-3804-491f-afd0-a5593b50c69a_2023-09-21T12-51-44Z`, I see that:
- The pod `sherloq-dagster-postgresql-0` is stuck because CSI ControllerPublish cannot succeed in attaching the volume.
- The volume is also requested to attach to node `worker9-az1` by longhorn-ui (it means that somehow someone manually attached this volume using Longhorn UI, perhaps in an effort to unblock this volume). Workaround: manually detach this volume using Longhorn UI @puffitos
- The status of the engine (`pvc-cd43fc29-7407-4167-9583-589d5c76644e-e-3294725d`) is empty. This blocks the CSI ControllerPublish call as indicated here.
- The engine status (`pvc-cd43fc29-7407-4167-9583-589d5c76644e-e-3294725d`) is empty because engine.Spec.upgradedReplicaAddressMap is non-empty. This blocks the engine monitor from syncing and updating engine.Status here.
- The same applies to `pvc-86463b26-dac8-4000-b281-4d6be0ee684c` and `pvc-7e2932a6-55a6-4909-bedb-de469fa8b0d4`; there are in total 3 volumes currently stuck in this state (see the sketch below for one way to list them).
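For operators hitting the same symptom, here is a minimal diagnostic sketch (not part of the original analysis) that lists Longhorn Engine objects and reports any whose `spec.upgradedReplicaAddressMap` is still non-empty, the condition identified above as blocking the engine monitor and, in turn, CSI ControllerPublish. The field path and API group/version follow this thread; the `engines` resource name and the kubeconfig handling are assumptions.

```go
// find_stuck_engines.go: list Longhorn Engine objects and report the ones that
// still carry a non-empty spec.upgradedReplicaAddressMap, i.e. the state this
// issue describes as blocking engine status sync and CSI ControllerPublish.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	engines := schema.GroupVersionResource{Group: "longhorn.io", Version: "v1beta2", Resource: "engines"}
	list, err := client.Resource(engines).Namespace("longhorn-system").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, e := range list.Items {
		m, found, _ := unstructured.NestedStringMap(e.Object, "spec", "upgradedReplicaAddressMap")
		if found && len(m) > 0 {
			// The engine name starts with the PVC name, which is enough to
			// identify the affected volume.
			fmt.Printf("engine %s still has upgradedReplicaAddressMap: %v\n", e.GetName(), m)
		}
	}
}
```

Each engine it prints corresponds to a volume stuck in the state described above; in the analyzed support bundle there were three such volumes.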
Code analysis

Looks like the direct reason that blocked the flow is the non-empty `engine.Spec.upgradedReplicaAddressMap`. I am investigating the volume controller code flow to see why it is possible that the volume finished the engine image upgrade but the `engine.Spec.upgradedReplicaAddressMap` is not cleaned up. From the first round of code flow review, it seems that this might happen like this:
cc @shuo-wu @derekbit @james-munson WDYT?
Next steps
To confirm the above theory, I am trying to find a way to reproduce the problem
Update: Looks like we are able to reproduce it https://github.com/longhorn/longhorn/issues/6762#issuecomment-1744062922
Followup question
@puffitos Is it correct that some of the workloads here crashed/restarted during the engine upgrade?
@PhanLe1010 sorry for the confusion, we’ve had this problem in multiple clusters and have only collected one support bundle. In that bundle, the volume pvc-cd43fc29-7407-4167-9583-589d5c76644e should be affected by the problem.
Understood. The bundle is quite large (a couple hundred MB), so I’ll need to upload it to Dropbox or Google Drive. Is that acceptable?
@puffitos yes, it is perfectly acceptable