longhorn: [BUG] After crashed engine process, volume stuck in `Unknown` state

Describe the bug (๐Ÿ› if you encounter this issue)

Test case test_recurring_jobs_when_volume_detached_unexpectedly randomly fails intermittently on master-head and v1.5.x-head with a failure rate of approximately 2 out of 20.

E AssertionError: value = healthy
E.volume[key] = unknown
E.volume = {
    'accessMode': 'rwo',
    'backendStoreDriver': 'v1',
    'backingImage': '',
    'backupCompressionMethod': 'lz4',
    'backupStatus': [{
        'backupURL': '',
        'error': 'proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to get backup-97f0ecc4095b460b backup status: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
        'progress': 54,
        'replica': 'tcp://10.42.1.10:10000',
        'size': '',
        'snapshot': 'recurrin-ff62e23a-fa5d-4c60-8214-72b854e38f02',
        'state': 'Error'
    }, {
        'backupURL': '',
        'error': 'error getting snapshot \'recurrin-cf217437-01d7-4f1b-acc9-1fd9d513632c\', engine \'longhorn-testvol-0gbgrf-e-b242988d\': proxyServer=10.42.1.10:8501 destination=10.42.2.9:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
        'progress': 0,
        'replica': '',
        'size': '',
        'snapshot': '',
        'state': 'Error'
    }],
    'cloneStatus': {
        'snapshot': '',
        'sourceVolume': '',
        'state': ''
    },
    'conditions': {
        'restore': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'False'
        },
        'scheduled': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'True'
        },
        'toomanysnapshots': {
            'lastProbeTime': '',
            'lastTransitionTime': '2023-09-12T06:30:13Z',
            'message': '',
            'reason': '',
            'status': 'False'
        }
    },
    'controllers': [{
        'actualSize': '1150484480',
        'address': '',
        'currentImage': '',
        'endpoint': '',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'hostId': '',
        'instanceManagerName': '',
        'isExpanding': False,
        'lastExpansionError': '',
        'lastExpansionFailedAt': '',
        'lastRestoredBackup': '',
        'name': 'longhorn-testvol-0gbgrf-e-b242988d',
        'requestedBackupRestore': '',
        'running': False,
        'size': '2147483648',
        'unmapMarkSnapChainRemovedEnabled': False
    }],
    'created': '2023-09-12 06:30:12 +0000 UTC',
    'currentImage': 'longhornio/longhorn-engine:master-head',
    'dataLocality': 'disabled',
    'dataSource': '',
    'disableFrontend': False,
    'diskSelector': [],
    'encrypted': False,
    'engineImage': 'longhornio/longhorn-engine:master-head',
    'fromBackup': '',
    'frontend': 'blockdev',
    'kubernetesStatus': {
        'lastPVCRefAt': '',
        'lastPodRefAt': '2023-09-12T06:30:54Z',
        'namespace': 'default',
        'pvName': 'longhorn-testvol-0gbgrf-pv',
        'pvStatus': 'Bound',
        'pvcName': 'longhorn-testvol-0gbgrf-pvc',
        'workloadsStatus': [{
            'podName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88-xmxzx',
            'podStatus': 'Succeeded',
            'workloadName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88',
            'workloadType': 'ReplicaSet'
        }]
    },
    'lastAttachedBy': '',
    'lastBackup': '',
    'lastBackupAt': '',
    'migratable': False,
    'name': 'longhorn-testvol-0gbgrf',
    'nodeSelector': [],
    'numberOfReplicas': 3,
    'offlineReplicaRebuilding': 'disabled',
    'offlineReplicaRebuildingRequired': False,
    'purgeStatus': None,
    'ready': True,
    'rebuildStatus': [],
    'recurringJobSelector': None,
    'replicaAutoBalance': 'ignored',
    'replicaDiskSoftAntiAffinity': 'ignored',
    'replicaSoftAntiAffinity': 'ignored',
    'replicaZoneSoftAntiAffinity': 'ignored',
    'replicas': [{
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-0ff7d9b8',
        'diskID': '2a1fc81e-3ab8-463f-a818-b008f7e858a3',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-49',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-0600c183',
        'running': False
    }, {
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-5b039c13',
        'diskID': '51cbadf6-8b10-438a-b1ce-8ae00126e6ce',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-97',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-319ab5fc',
        'running': False
    }, {
        'address': '',
        'backendStoreDriver': 'v1',
        'currentImage': '',
        'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-c166e1b5',
        'diskID': '4d810ed2-9514-4b04-9559-0b2f6d830310',
        'diskPath': '/var/lib/longhorn/',
        'engineImage': 'longhornio/longhorn-engine:master-head',
        'failedAt': '',
        'hostId': 'ip-10-0-2-213',
        'instanceManagerName': '',
        'mode': '',
        'name': 'longhorn-testvol-0gbgrf-r-895589ea',
        'running': False
    }],
    'restoreInitiated': False,
    'restoreRequired': False,
    'restoreStatus': [],
    'restoreVolumeRecurringJob': 'ignored',
    'revisionCounterDisabled': False,
    'robustness': 'unknown',
    'shareEndpoint': '',
    'shareState': '',
    'size': '2147483648',
    'snapshotDataIntegrity': 'ignored',
    'staleReplicaTimeout': 0,
    'standby': False,
    'state': 'detached',
    'unmapMarkSnapChainRemoved': 'ignored',
    'volumeAttachment': {
        'attachments': {},
        'volume': 'longhorn-testvol-0gbgrf'
    }
}

To Reproduce

Excute test case test_recurring_jobs_when_volume_detached_unexpectedly

Expected behavior

We should have consistent test results on all distro.

Support bundle for troubleshooting

longhorn-tests-regression-4900-bundle (1).zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.1+k3s1
    • Number of management node in the cluster:1
    • Number of worker node in the cluster:3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node: 4 cores
    • Memory per node: 16GB
    • Disk type(e.g. SSD/NVMe/HDD): SSD
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:1
  • Impacted Longhorn resources:
    • Volume names: longhorn-testvol-0gbgrf

Additional context

N/A

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 2
  • Comments: 18 (11 by maintainers)

Most upvoted comments

@nitendra-suse Letโ€™s move back to the implement pipeline.

@roger-ryao as you are the original reporter, please check the verification result by @nitendra-suse .

Here is the log after the fix You can see that volume needed some time to update its state to detaching, before that, the attachment ticket of backup controller was true After that, attachment controller would then update the satisfied to false.

## attachment ticket of backup controller is satisfied, but engine desireState is stopped
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: attached, volume.Robustness: healthy\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=warning msg="Failed to get snapshot recurrin-be1c0640-132c-4695-9c4c-5b1cf621923d when syncing backup status" controller=longhorn-backup error="proxyServer=10.42.3.8:8501 destination=10.42.3.8:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.3.8:10010: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.3.8:10010: connect: connection refused\"" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"

## volume attachment controller update the ticket to no satisfied because volume is now in detaching state
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: backup-controller-backup-a38cd85ddfc34e94, ticket: &{backup-controller-backup-a38cd85ddfc34e94 backup-controller ip-10-0-2-104 map[disableFrontend:any] 0}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-104, attachmentTicket.NodeID: ip-10-0-2-104\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: detaching\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: false\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-a38cd85ddfc34e94\":{\"id\":\"backup-controller-backup-a38cd85ddfc34e94\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-26T09:32:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-a38cd85ddfc34e94\":{\"id\":\"backup-controller-backup-a38cd85ddfc34e94\",\"satisfied\":false,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"False\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-26T09:32:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-26T09:32:11Z" level=warning msg="Failed to get snapshot recurrin-be1c0640-132c-4695-9c4c-5b1cf621923d when syncing backup status" controller=longhorn-backup error="proxyServer=10.42.3.8:8501 destination=10.42.3.8:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.3.8:10010: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.3.8:10010: connect: connection refused\"" node=ip-10-0-2-104
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG checkMonitor] volume.State: detaching, volume.Robustness: unknown\n"
time="2023-09-26T09:32:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-a38cd85ddfc34e94" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-a38cd85ddfc34e94: waiting for engine longhorn-testvol-1qqw62-e-0e7026ad to be running before enabling backup monitor" node=ip-10-0-2-104

## attachment ticket of backup controller is now not satisfied
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-a38cd85ddfc34e94\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-a38cd85ddfc34e94 "
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG backup controller] false\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-26T09:32:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: detaching, volume.Robustness: unknown\n"

I believe so. User can only take backup of attached volume from UI. In this case, since the volume is already attached, volume would not be detached by snapshot-controller and volume-attachment-controller after the creation of snapshot. So there is no detaching and attaching between taking snapshot and taking backup.

This issue happens because the volume is detaching by previous controller and attaching by the next controller. There is a chance that the attachment ticket of the next controller might get satisfied immediately due to the volume is not fully detached (state is not updated yet). Thus the following code logic gets mislead.

The flow of the test is

  1. recurring job creates snapshot (volume would attach and detach)
  2. recurring job creates backup (volume would attach and detach)
  3. crash engine so the Volume is in faulted state and will come back in a while (volume would detach)
  4. recurring job retries
  5. recurring job creates snapshot (volume would attach and detach)
  6. recurring job creates backup (volume would attach and detach)

The root cause of the issue can be found from my test with Debug message Here are the logs

## Snapshot had been created, so deleted the attachment ticket
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG snapshot controller] Snapshot created so delete delete Attachment Ticket, snapshot: recurrin-f4852903-2357-46ec-9037-16137d6aea80\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"type\":\"snapshot-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80, ticket: &{snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80 snapshot-controller ip-10-0-2-223 map[disableFrontend:any] 0}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-223, attachmentTicket.NodeID: ip-10-0-2-223\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: attached\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: running\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"

## Attachment was deleted, so the new VA status was empty, and the nodeID of the Volume was also empty so Longhorn could detach the Volume 
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"ip-10-0-2-223\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{}}\n"

## First time backup controller create attacment but failed because CR conflicts (Not a big deal)
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\":{\"id\":\"snapshot-controller-recurrin-f4852903-2357-46ec-9037-16137d6aea80\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:06Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}"
time="2023-09-22T10:37:11Z" level=error msg="Failed to sync Longhorn backup longhorn-system/backup-660589ed3d0b44b8" controller=longhorn-backup error="longhorn-backup: failed to sync backup longhorn-system/backup-660589ed3d0b44b8: handleAttachmentTicketCreation: failed to create/update attachment: Operation cannot be fulfilled on volumeattachments.longhorn.io \"longhorn-testvol-1czczl\": the object has been modified; please apply your changes to the latest version and try again" node=ip-10-0-2-223

## backup controller create attachment ticket 2nd time
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] Create Backup so Create attachment ticket, backup: backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old spec: {\"attachmentTickets\":{},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: old status: {\"attachmentTicketStatuses\":{}}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new spec: {\"attachmentTickets\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"type\":\"backup-controller\",\"nodeID\":\"ip-10-0-2-223\",\"parameters\":{\"disableFrontend\":\"any\"},\"generation\":0}},\"volume\":\"longhorn-testvol-1czczl\"}"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volumeattachment update]: new status: {\"attachmentTicketStatuses\":{}}"

# At the same time, Engine was being deleted
time="2023-09-22T10:37:11Z" level=info msg="Deleting instance longhorn-testvol-1czczl-e-db3461c5"
time="2023-09-22T10:37:11Z" level=info msg="Deleting engine instance" controller=longhorn-engine engine=longhorn-testvol-1czczl-e-db3461c5 node=ip-10-0-2-223

## Volume Attachment Controller handled Backup Controler Attachment Ticket
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] update VA status, ID: backup-controller-backup-660589ed3d0b44b8, ticket: &{backup-controller-backup-660589ed3d0b44b8 backup-controller ip-10-0-2-223 map[disableFrontend:any] 0}\n"

## Just in the same reconcile loop, the ticket was satisfied beacuse the volume was attached and the engine was still running
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] Create VA status for the first time, attachmentTicketID: backup-controller-backup-660589ed3d0b44b8\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.CurrentNodeID: ip-10-0-2-223, attachmentTicket.NodeID: ip-10-0-2-223\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] vol.Status.State: attached\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] satisfied: true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG volume attachment controller] engine.Status.CurrentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new Volume spec: {\"size\":\"2147483648\",\"frontend\":\"blockdev\",\"fromBackup\":\"\",\"restoreVolumeRecurringJob\":\"ignored\",\"dataSource\":\"\",\"dataLocality\":\"disabled\",\"staleReplicaTimeout\":0,\"nodeID\":\"\",\"migrationNodeID\":\"\",\"engineImage\":\"\",\"image\":\"longhornio/longhorn-engine:master-head\",\"backingImage\":\"\",\"Standby\":false,\"diskSelector\":[],\"nodeSelector\":[],\"disableFrontend\":false,\"revisionCounterDisabled\":false,\"unmapMarkSnapChainRemoved\":\"ignored\",\"replicaSoftAntiAffinity\":\"ignored\",\"replicaZoneSoftAntiAffinity\":\"ignored\",\"replicaDiskSoftAntiAffinity\":\"ignored\",\"lastAttachedBy\":\"\",\"accessMode\":\"rwo\",\"migratable\":false,\"encrypted\":false,\"numberOfReplicas\":3,\"replicaAutoBalance\":\"ignored\",\"snapshotDataIntegrity\":\"ignored\",\"backupCompressionMethod\":\"lz4\",\"backendStoreDriver\":\"v1\",\"offlineReplicaRebuilding\":\"disabled\"}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controller] update Volume Attachement status\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] old VA Status: {\"attachmentTicketStatuses\":{}}\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG Volume Attachment Controoler] new VA Status: {\"attachmentTicketStatuses\":{\"backup-controller-backup-660589ed3d0b44b8\":{\"id\":\"backup-controller-backup-660589ed3d0b44b8\",\"satisfied\":true,\"conditions\":[{\"type\":\"Satisfied\",\"status\":\"True\",\"lastProbeTime\":\"\",\"lastTransitionTime\":\"2023-09-22T10:37:11Z\",\"reason\":\"\",\"message\":\"\"}],\"generation\":0}}}\n"

## In Backup Controller, since the attachment was satisfied and the engine was running so it tried to request engine to create backup
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] attachment ticket\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller]: attachmentTicketStatuses key: backup-controller-backup-660589ed3d0b44b8 "
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG backup controller] true\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG sync snapshot status] volume.State: attached, volume.Robustness: healthy\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] engine.currentState: running, engine.desireState: stopped\n"
time="2023-09-22T10:37:11Z" level=info msg="[DEBUG checkMonitor] volume.State: attached, volume.Robustness: healthy\

time="2023-09-22T10:37:11Z" level=info msg="Deleting instance longhorn-testvol-1czczl-e-db3461c5"
time="2023-09-22T10:37:11Z" level=info msg="Deleting engine instance" controller=longhorn-engine engine=longhorn-testvol-1czczl-e-db3461c5 node=ip-10-0-2-223
time="2023-09-22T10:37:11Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"longhorn-testvol-1czczl-e-db3461c5\", UID:\"f26c0786-d60d-4a7e-883d-8883ceaca977\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"12008\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops longhorn-testvol-1czczl-e-db3461c5"

## Failed to create Backup so the test failed.
time="2023-09-22T10:37:11Z" level=warning msg="Cannot take snapshot backup" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223
time="2023-09-22T10:37:11Z" level=warning msg="Failed to enable the backup monitor for backup backup-660589ed3d0b44b8" backup=backup-660589ed3d0b44b8 controller=longhorn-backup error="proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to backup snapshot recurrin-f4852903-2357-46ec-9037-16137d6aea80 to backup-660589ed3d0b44b8: rpc error: code = Unknown desc = failed to get volume 10.42.2.9:10010: rpc error: code = Unavailable desc = error reading from server: EOF" node=ip-10-0-2-223

The race condition flow

  1. When snapshot was created, Snapshot Controller deleted the attachment ticket to detach the volume
  2. Volume Controller detached the volume, set the v.Status.State = longhorn.VolumeStateDetaching and set the e.Spec.DesiredState = stopped here
  3. Before the state was updated, Backup Controller created the attachment ticket
  4. Volume Controller updated status follow the order Replica->Engine->Volume here
  5. After Volume Controller updated Engine spec and status
  • so the e.Spec.DesiredState=stopped and e.Status.CurrentState=running
  1. VolumeAttachment Controller updated the status of the attachment ticket and it was satisfied because volume was still attached. (But the volume was about to be detached, just because the state was not updated yet)
  2. Engine Controller requested IM to delete the engine process, but not yet updated engine state
  3. Backup Controller verified in checkMonitor() that
  • Attachment ticket was satisfied here
  • Engine was running here
  1. Backup Controller requested Engine to create backup and failed

I think there might have more race condition case for engine/volume/volumeattachment statue like this one. We might need to review and see if we can improve from volumeattachment side, so it wonโ€™t satisfied just because the volume is not fully detached from the previous ticket? cc @PhanLe1010 @derekbit @innobead @roger-ryao @longhorn/dev-control-plane

I triggered the test case 60 times, and all tests passed successfully. I believe that @ChanYiLin has already resolved the issue.

Let me run again on my system.Thanks.

Hi @innobead volume attachment is only introduced after 1.5.0, we donโ€™t have to backport to v1.4.x