longhorn: [BUG] test_recurring_jobs_allow_detached_volume failed
Describe the bug (🐛 if you encounter this issue)
I have observed that the test cases test_recurring_jobs_allow_detached_volume and test_recurring_jobs_when_volume_detached_unexpectedly are failing intermittently on SLES and SLE-Micro. The volume backup status shows Error. I also tried to reproduce the issue manually and was able to reproduce it.
To Reproduce
Steps to reproduce the behavior:
- https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/
- Verify the test result of `test_recurring_jobs_allow_detached_volume`

Alternatively,
- Given `allow-recurring-job-while-volume-detached` set to `true`.
- And a volume created and attached.
- And 50 MB of data written to the volume.
- And the volume detached.
- When a recurring job is created that runs every minute.
- And wait for the backup to complete.
Expected behavior
We should have consistent test results on all distros.
Log or Support bundle
supportbundle_623f96f1-0ad6-487b-b6e2-7f961da0af8f_2023-06-14T08-34-19Z.zip
supportbundle_623f96f1-0ad6-487b-b6e2-7f961da0af8f_2023-06-14T14-23-13Z.zip
Environment
- Longhorn version: v1.5.0-rc2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.25.9+k3s1
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
  - OS type and version: SLE-Micro 5.3
  - CPU per node: 4
  - Memory per node: 16G
  - Disk type (e.g. SSD/NVMe): SSD
  - Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster:
Additional context
About this issue
- State: closed
- Created a year ago
- Comments: 15 (13 by maintainers)
Thanks @ChanYiLin for the excellent explanation! I agree with your analysis that the error popped from `enableBackupMonitor`, which is part of the function `checkMonitor`. Your fix is also very clever! 🚀

From the log I found that it deleted the attachment, then the update failed, so for the next run it re-does the sync.
The error happens in `checkMonitor()` when enabling the `syncMonitor`: https://github.com/longhorn/longhorn-manager/blob/master/controller/backup_controller.go#L646-L652
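To make the failure path concrete, here is a rough, hypothetical Go sketch of the flow described above (the real code lives in longhorn-manager's `controller/backup_controller.go` with different signatures; the `backup` struct, `reconcile`, and the simulated connection error are illustrative only): an error returned while enabling the monitor bubbles up through `checkMonitor`, and the reconcile then records the backup as `Error`, which is the state the test observes.

```go
package example

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-in for the backup status tracked by the controller.
type backup struct {
	State   string
	Message string
}

// enableBackupMonitor stands in for the call that failed in the logs,
// e.g. because the engine connection was not ready yet (simulated here).
func enableBackupMonitor() error {
	return errors.New("failed to connect to engine")
}

// checkMonitor propagates the failure to its caller.
func checkMonitor() error {
	if err := enableBackupMonitor(); err != nil {
		return fmt.Errorf("checkMonitor: %w", err)
	}
	return nil
}

// reconcile marks the backup as Error when checkMonitor fails, even though a
// later retry could have succeeded once the engine became reachable.
func reconcile(b *backup) {
	if err := checkMonitor(); err != nil {
		b.State = "Error"
		b.Message = err.Error()
	}
}
```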
Verified on v1.5.x-head 20230620
Verified on master-head 20230620
Result Passed
The `test_recurring_jobs_allow_detached_volume` test has passed.
The v1.5.x-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4236/
The master-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4260/

The `test_recurring_jobs_when_volume_detached_unexpectedly` test has passed.
The v1.5.x-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4237/
The master-head test results: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4261/

I think there is a problem in the backup monitoring instead of the backup controller @ChanYiLin
However, my head has stopped working now so I will do some more investigations and update tomorrow.
Oh, I got one
https://github.com/longhorn/longhorn-manager/blob/af2f111f6dfba09aaf24eaf2860ef3300af96c2a/controller/backup_controller.go#L321-L338
This issue is caused by the status not being updated to k8s (etcd); because the second defer function executes first, it uses the in-memory status, which is not empty, so it detaches the volume.
Maybe we can move the second defer function to the first position, so it runs after the controller updates the backup status, and get the backup again before reading the status. Then, if the status has not been updated, it won't detach the volume.
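Here is a minimal, self-contained sketch of the ordering problem (not the actual longhorn-manager code; the `backupStatus` struct, `reconcile`, and `updateFails` flag are stand-ins): deferred functions run in LIFO order, so the cleanup defer registered second runs before the status-update defer, and acts on in-memory state that may never have reached the API server.

```go
package main

import "fmt"

// Hypothetical stand-in for backup.Status; not the real longhorn-manager type.
type backupStatus struct {
	State string
}

// reconcile sketches the ordering problem: deferred functions run in LIFO
// order, so the cleanup defer (registered second) runs BEFORE the
// status-update defer (registered first) has persisted anything.
func reconcile(persisted *backupStatus, updateFails bool) {
	inMemory := *persisted // working copy mutated during the sync

	// Registered first => runs LAST: persist the in-memory status.
	defer func() {
		if updateFails {
			fmt.Println("status update conflicted; next reconcile must redo the sync")
			return
		}
		*persisted = inMemory
	}()

	// Registered second => runs FIRST: cleanup reads the in-memory status,
	// which may never reach the API server if the update above fails.
	defer func() {
		if inMemory.State == "Completed" {
			fmt.Println("detaching volume based on in-memory state:", inMemory.State)
		}
	}()

	inMemory.State = "Completed" // the sync work itself
}

func main() {
	status := &backupStatus{State: "InProgress"}
	reconcile(status, true) // simulate the failed status update
	fmt.Println("persisted state after reconcile:", status.State)
}
```

Reordering the defers and re-fetching the backup before reading its status, as proposed, would make the detach decision depend on what was actually persisted.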
WDYT?
cc @innobead @PhanLe1010
manager’s log
another manager’s log for the `engine`

I kind of found the root cause; it is a flaky situation, and I might need some time to summarize the issue.
After some investigation, I think the error is because of the following flow in `checkMonitor`. I can see the log when reproducing the issue.
I think that is why the backup failed and the message was related to the connection.
The next reconcile loop should reattach the volume again, but as you can see from the above logs, it failed to sync the VolumeAttachment because it had been modified.
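For reference, a common client-go pattern for this class of "the object has been modified" error is to re-fetch the latest object and retry the update on conflict. This is a general illustration only, not the Longhorn fix: it uses the built-in `storage.k8s.io` VolumeAttachment type and a hypothetical `mutate` callback for brevity, whereas Longhorn manages its own VolumeAttachment CRD.

```go
package example

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateVolumeAttachment retries on optimistic-concurrency conflicts instead
// of failing the reconcile outright. mutate applies the desired change to the
// freshly fetched object.
func updateVolumeAttachment(ctx context.Context, client kubernetes.Interface, name string,
	mutate func(*storagev1.VolumeAttachment)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always work on the latest resourceVersion to avoid the
		// "the object has been modified" error.
		va, err := client.StorageV1().VolumeAttachments().Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		mutate(va)
		_, err = client.StorageV1().VolumeAttachments().Update(ctx, va, metav1.UpdateOptions{})
		return err
	})
}
```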
cc @innobead @PhanLe1010