longhorn: [BUG] migration test cases could fail due to unexpected volume controllers and replicas status

Describe the bug (🐛 if you encounter this issue)

In Longhorn master or v1.5.x, migration-related test cases such as test_migration_with_failed_replica, test_migration_with_unscheduled_replica, test_migration_with_restore_volume and test_migration_with_rebuilding_replica can fail randomly because the volume's controllers and replicas are not in the expected state:

    def wait_for_volume_migration_node(client, volume_name, node_id):
        ready = False
        for i in range(RETRY_COUNTS):
            v = client.by_id_volume(volume_name)
            engines = v.controllers
            replicas = v.replicas
            if len(engines) == 1 and len(replicas) == v.numberOfReplicas:
                e = engines[0]
                if e.endpoint != "":
                    assert e.hostId == node_id
                    ready = True
                    break
            time.sleep(RETRY_INTERVAL)
>       assert ready
E       AssertionError

The volume status is:

{

    "accessMode": "rwx",
    "actions": {
        "activate": "…/v1/volumes/longhorn-testvol-bl8klw?action=activate",
        "attach": "…/v1/volumes/longhorn-testvol-bl8klw?action=attach",
        "cancelExpansion": "…/v1/volumes/longhorn-testvol-bl8klw?action=cancelExpansion",
        "detach": "…/v1/volumes/longhorn-testvol-bl8klw?action=detach",
        "engineUpgrade": "…/v1/volumes/longhorn-testvol-bl8klw?action=engineUpgrade",
        "expand": "…/v1/volumes/longhorn-testvol-bl8klw?action=expand",
        "pvCreate": "…/v1/volumes/longhorn-testvol-bl8klw?action=pvCreate",
        "pvcCreate": "…/v1/volumes/longhorn-testvol-bl8klw?action=pvcCreate",
        "recurringJobAdd": "…/v1/volumes/longhorn-testvol-bl8klw?action=recurringJobAdd",
        "recurringJobDelete": "…/v1/volumes/longhorn-testvol-bl8klw?action=recurringJobDelete",
        "recurringJobList": "…/v1/volumes/longhorn-testvol-bl8klw?action=recurringJobList",
        "replicaRemove": "…/v1/volumes/longhorn-testvol-bl8klw?action=replicaRemove",
        "snapshotBackup": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotBackup",
        "snapshotCRCreate": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotCRCreate",
        "snapshotCRDelete": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotCRDelete",
        "snapshotCRGet": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotCRGet",
        "snapshotCRList": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotCRList",
        "snapshotCreate": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotCreate",
        "snapshotDelete": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotDelete",
        "snapshotGet": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotGet",
        "snapshotList": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotList",
        "snapshotPurge": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotPurge",
        "snapshotRevert": "…/v1/volumes/longhorn-testvol-bl8klw?action=snapshotRevert",
        "trimFilesystem": "…/v1/volumes/longhorn-testvol-bl8klw?action=trimFilesystem",
        "updateBackupCompressionMethod": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateBackupCompressionMethod",
        "updateDataLocality": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateDataLocality",
        "updateOfflineReplicaRebuilding": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateOfflineReplicaRebuilding",
        "updateReplicaAutoBalance": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateReplicaAutoBalance",
        "updateReplicaCount": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateReplicaCount",
        "updateReplicaSoftAntiAffinity": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateReplicaSoftAntiAffinity",
        "updateReplicaZoneSoftAntiAffinity": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateReplicaZoneSoftAntiAffinity",
        "updateSnapshotDataIntegrity": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateSnapshotDataIntegrity",
        "updateUnmapMarkSnapChainRemoved": "…/v1/volumes/longhorn-testvol-bl8klw?action=updateUnmapMarkSnapChainRemoved",
    },
    "backendStoreDriver": "v1",
    "backingImage": "",
    "backupCompressionMethod": "lz4",
    "backupStatus": [ ],
    "cloneStatus": {
        "snapshot": "",
        "sourceVolume": "",
        "state": "",
    },
    "conditions": {
        "restore": {
            "lastProbeTime": "",
            "lastTransitionTime": "2023-06-28T11:38:06Z",
            "message": "",
            "reason": "",
            "status": "False",
            "type": "restore",
        },
        "scheduled": {
            "lastProbeTime": "",
            "lastTransitionTime": "2023-06-28T11:38:06Z",
            "message": "",
            "reason": "",
            "status": "True",
            "type": "scheduled",
        },
        "toomanysnapshots": {
            "lastProbeTime": "",
            "lastTransitionTime": "2023-06-28T11:38:06Z",
            "message": "",
            "reason": "",
            "status": "False",
            "type": "toomanysnapshots",
        },
    },
    "controllers": [
        {
            "actualSize": "4096",
            "address": "10.42.2.8",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "endpoint": "/dev/longhorn/longhorn-testvol-bl8klw",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "hostId": "ip-10-0-1-146",
            "instanceManagerName": "instance-manager-6c67701f85b0ae508d92d085b8b2c3ad",
            "isExpanding": false,
            "lastExpansionError": "",
            "lastExpansionFailedAt": "",
            "lastRestoredBackup": "",
            "name": "longhorn-testvol-bl8klw-e-413a828c",
            "requestedBackupRestore": "",
            "running": true,
            "size": "16777216",
            "unmapMarkSnapChainRemovedEnabled": false,
        },
        {
            "actualSize": "4096",
            "address": "10.42.1.9",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "endpoint": "/dev/longhorn/longhorn-testvol-bl8klw",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "hostId": "ip-10-0-1-21",
            "instanceManagerName": "instance-manager-408a4a130067c1351be9778cfa8b9ff7",
            "isExpanding": false,
            "lastExpansionError": "",
            "lastExpansionFailedAt": "",
            "lastRestoredBackup": "",
            "name": "longhorn-testvol-bl8klw-e-d3c442ff",
            "requestedBackupRestore": "",
            "running": true,
            "size": "16777216",
            "unmapMarkSnapChainRemovedEnabled": false,
        },
    ],
    "created": "2023-06-28 11:38:05 +0000 UTC",
    "currentImage": "longhornio/longhorn-engine:master-head",
    "dataLocality": "disabled",
    "dataSource": "",
    "disableFrontend": false,
    "diskSelector": [ ],
    "encrypted": false,
    "engineImage": "longhornio/longhorn-engine:master-head",
    "fromBackup": "",
    "frontend": "blockdev",
    "id": "longhorn-testvol-bl8klw",
    "kubernetesStatus": {
        "lastPVCRefAt": "",
        "lastPodRefAt": "",
        "namespace": "",
        "pvName": "",
        "pvStatus": "",
        "pvcName": "",
        "workloadsStatus": null,
    },
    "lastAttachedBy": "",
    "lastBackup": "",
    "lastBackupAt": "",
    "links": {
        "self": "…/v1/volumes/longhorn-testvol-bl8klw",
    },
    "migratable": true,
    "name": "longhorn-testvol-bl8klw",
    "nodeSelector": [ ],
    "numberOfReplicas": 3,
    "offlineReplicaRebuilding": "disabled",
    "offlineReplicaRebuildingRequired": false,
    "purgeStatus": [
        {
            "actions": null,
            "error": "",
            "isPurging": false,
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-6801502c",
            "state": "",
        },
        {
            "actions": null,
            "error": "",
            "isPurging": false,
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-35f42942",
            "state": "",
        },
        {
            "actions": null,
            "error": "",
            "isPurging": false,
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-5ca1080c",
            "state": "",
        },
        {
            "actions": null,
            "error": "",
            "isPurging": false,
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-71c1a3b0",
            "state": "",
        },
    ],
    "ready": true,
    "rebuildStatus": [ ],
    "recurringJobSelector": null,
    "replicaAutoBalance": "ignored",
    "replicaSoftAntiAffinity": "ignored",
    "replicaZoneSoftAntiAffinity": "ignored",
    "replicas": [
        {
            "address": "10.42.3.9",
            "backendStoreDriver": "v1",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "dataPath": "/var/lib/longhorn/replicas/longhorn-testvol-bl8klw-c13409ff",
            "diskID": "b318893d-402d-49d3-abc5-6ed557895b25",
            "diskPath": "/var/lib/longhorn/",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "failedAt": "",
            "hostId": "ip-10-0-1-39",
            "instanceManagerName": "instance-manager-0b165eb6a49550ac97473a12c0045a78",
            "mode": "RW",
            "name": "longhorn-testvol-bl8klw-r-35f42942",
            "running": true,
        },
        {
            "address": "10.42.1.9",
            "backendStoreDriver": "v1",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "dataPath": "/var/lib/longhorn/replicas/longhorn-testvol-bl8klw-4be8b763",
            "diskID": "7a05bd54-7bf8-4e75-abf2-05497610b825",
            "diskPath": "/var/lib/longhorn/",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "failedAt": "",
            "hostId": "ip-10-0-1-21",
            "instanceManagerName": "instance-manager-408a4a130067c1351be9778cfa8b9ff7",
            "mode": "",
            "name": "longhorn-testvol-bl8klw-r-5ca1080c",
            "running": true,
        },
        {
            "address": "10.42.1.9",
            "backendStoreDriver": "v1",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "dataPath": "/var/lib/longhorn/replicas/longhorn-testvol-bl8klw-4be8b763",
            "diskID": "7a05bd54-7bf8-4e75-abf2-05497610b825",
            "diskPath": "/var/lib/longhorn/",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "failedAt": "",
            "hostId": "ip-10-0-1-21",
            "instanceManagerName": "instance-manager-408a4a130067c1351be9778cfa8b9ff7",
            "mode": "RW",
            "name": "longhorn-testvol-bl8klw-r-6801502c",
            "running": true,
        },
        {
            "address": "10.42.3.9",
            "backendStoreDriver": "v1",
            "currentImage": "longhornio/longhorn-engine:master-head",
            "dataPath": "/var/lib/longhorn/replicas/longhorn-testvol-bl8klw-c13409ff",
            "diskID": "b318893d-402d-49d3-abc5-6ed557895b25",
            "diskPath": "/var/lib/longhorn/",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "failedAt": "",
            "hostId": "ip-10-0-1-39",
            "instanceManagerName": "instance-manager-0b165eb6a49550ac97473a12c0045a78",
            "mode": "",
            "name": "longhorn-testvol-bl8klw-r-71c1a3b0",
            "running": true,
        },
        {
            "address": "",
            "backendStoreDriver": "v1",
            "currentImage": "",
            "dataPath": "/var/lib/longhorn/replicas/longhorn-testvol-bl8klw-1c4c6b0d",
            "diskID": "d2828bc3-5989-4e00-8249-123f20ddad9d",
            "diskPath": "/var/lib/longhorn/",
            "engineImage": "longhornio/longhorn-engine:master-head",
            "failedAt": "2023-06-28T11:38:39Z",
            "hostId": "ip-10-0-1-146",
            "instanceManagerName": "",
            "mode": "",
            "name": "longhorn-testvol-bl8klw-r-a6b7051b",
            "running": false,
        },
    ],
    "restoreInitiated": false,
    "restoreRequired": false,
    "restoreStatus": [
        {
            "actions": null,
            "backupURL": "",
            "error": "",
            "filename": "",
            "isRestoring": false,
            "lastRestored": "",
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-6801502c",
            "state": "",
        },
        {
            "actions": null,
            "backupURL": "",
            "error": "",
            "filename": "",
            "isRestoring": false,
            "lastRestored": "",
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-35f42942",
            "state": "",
        },
        {
            "actions": null,
            "backupURL": "",
            "error": "",
            "filename": "",
            "isRestoring": false,
            "lastRestored": "",
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-5ca1080c",
            "state": "",
        },
        {
            "actions": null,
            "backupURL": "",
            "error": "",
            "filename": "",
            "isRestoring": false,
            "lastRestored": "",
            "links": null,
            "progress": 0,
            "replica": "longhorn-testvol-bl8klw-r-71c1a3b0",
            "state": "",
        },
    ],
    "restoreVolumeRecurringJob": "ignored",
    "revisionCounterDisabled": false,
    "robustness": "degraded",
    "shareEndpoint": "",
    "shareState": "",
    "size": "16777216",
    "snapshotDataIntegrity": "ignored",
    "staleReplicaTimeout": 0,
    "standby": false,
    "state": "attached",
    "type": "volume",
    "unmapMarkSnapChainRemoved": "ignored",
    "volumeAttachment": {
        "attachments": {
            "test-attachment-ticket-lhgbeu": {
                "attachmentID": "test-attachment-ticket-lhgbeu",
                "attachmentType": "csi-attacher",
                "conditions": [
                    {
                        "lastProbeTime": "",
                        "lastTransitionTime": "2023-06-28T11:38:49Z",
                        "message": "The migrating attachment ticket is satisfied",
                        "reason": "",
                        "status": "True",
                        "type": "Satisfied",
                    },
                ],
                "nodeID": "ip-10-0-1-21",
                "parameters": {
                    "disableFrontend": "false",
                    "lastAttachedBy": "",
                },
                "satisfied": true,
            },
        },
        "volume": "longhorn-testvol-bl8klw",
    },

}

The volume has 2 controllers instead of 1 and 5 replicas instead of numberOfReplicas (3), so the wait condition is never satisfied and the test case fails.
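To make the failing condition easier to read off a volume dump, here is a minimal sketch that lists which parts of the wait condition are not met. It works on a plain dict shaped like the JSON above. The name `migration_wait_blockers` is hypothetical and is not part of longhorn-tests.

```python
def migration_wait_blockers(volume, node_id):
    """Hypothetical helper: list the reasons the wait condition is not met.

    `volume` is a plain dict shaped like the volume JSON in this report.
    """
    reasons = []
    engines = volume.get("controllers", [])
    replicas = volume.get("replicas", [])
    expected_replicas = volume.get("numberOfReplicas")

    if len(engines) != 1:
        reasons.append(f"expected 1 controller, found {len(engines)}")
    if len(replicas) != expected_replicas:
        reasons.append(
            f"expected {expected_replicas} replicas, found {len(replicas)}"
        )
    if len(engines) == 1:
        engine = engines[0]
        if engine.get("endpoint", "") == "":
            reasons.append("engine endpoint is empty")
        elif engine.get("hostId") != node_id:
            # The real test asserts on this case instead of retrying.
            reasons.append(
                f"engine is on {engine.get('hostId')}, expected {node_id}"
            )
    return reasons


# Trimmed-down copy of the stuck volume shown above (only the fields used).
stuck_volume = {
    "numberOfReplicas": 3,
    "controllers": [
        {"hostId": "ip-10-0-1-146",
         "endpoint": "/dev/longhorn/longhorn-testvol-bl8klw"},
        {"hostId": "ip-10-0-1-21",
         "endpoint": "/dev/longhorn/longhorn-testvol-bl8klw"},
    ],
    "replicas": [
        {"name": "longhorn-testvol-bl8klw-r-35f42942"},
        {"name": "longhorn-testvol-bl8klw-r-5ca1080c"},
        {"name": "longhorn-testvol-bl8klw-r-6801502c"},
        {"name": "longhorn-testvol-bl8klw-r-71c1a3b0"},
        {"name": "longhorn-testvol-bl8klw-r-a6b7051b"},
    ],
}

for reason in migration_wait_blockers(stuck_volume, "ip-10-0-1-21"):
    print(reason)
```

For the volume in this report, the sketch prints "expected 1 controller, found 2" and "expected 3 replicas, found 5", which matches the state the test is stuck in.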

To Reproduce

Run test case test_migration_with_*

Expected behavior

The migration test cases pass consistently: after the migration, the volume has a single controller on the target node and numberOfReplicas replicas.

Log or Support bundle

supportbundle_e4760442-f449-47a1-9bd1-dab5a00e97c5_2023-06-28T12-14-54Z.zip

Environment

  • Longhorn version: master-head or v1.5.x-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management node in the cluster:
    • Number of worker node in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

Test results:

  • https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/28/testReport/junit/tests/test_migration/test_migration_with_failed_replica/
  • https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-arm64/31/testReport/tests/test_migration/test_migration_with_rebuilding_replica/
  • https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/533/testReport/tests/test_migration/test_migration_with_rebuilding_replica/
  • https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/27/testReport/tests/test_migration/test_migration_with_restore_volume_nfs_/
  • https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/24/testReport/tests/test_migration/test_migration_with_unscheduled_replica/
  • https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/526/testReport/tests/test_migration/test_migration_with_rebuilding_replica/
  • https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-tests-sles-arm64/522/testReport/tests/test_migration/test_migration_with_unscheduled_replica/

About this issue

  • State: closed
  • Created a year ago
  • Comments: 22 (19 by maintainers)

Most upvoted comments

Ran test_migration_with_unscheduled_replica and test_migration_with_failed_replica 20 times each with @PhanLe1010’s dev image; both passed on 1.5.x.

Update: test_migration_with_rebuilding_replica passed 5 times. We can probably check the rest once the PR is merged.

Root Cause Analysis

In almost all support bundles collected when the issue happens, I see that the migration is blocked at the checkMigratingEngineSyncSnapshots step. That step checks and waits until the snapshot chain in the old engine is the same as the one in the new engine. However, sometimes this condition is never met because the snapshot creation times do not match, as shown in the attached screenshot (Screenshot from 2023-06-28 15-53-23).

The snapshot creation timestamp is fetched from one of the RW replicas. It is the time at which the snapshot file was created on that replica’s disk, so it can differ between RW replicas. When each engine fetches it from a different RW replica, the creation timestamps might not match.

Proposal

The role of the function checkMigratingEngineSyncSnapshots is to wait for the snapshot chain of the new engine (which is empty at the beginning) to be populated so that we don’t accidentally delete snapshot CRs. We should modify checkMigratingEngineSyncSnapshots so that it only checks and waits for all snapshot names in the old engine to appear in the new engine. The snapshot creation timestamps can differ, which is fine and should not block the migration flow.
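The difference between the current check and the proposal can be sketched as follows. This is only an illustration in Python, not the longhorn-manager code (which is written in Go). The function names, the dict layout (snapshot name mapped to its info) and the timestamps are assumptions made up for the example.

```python
def snapshots_strictly_equal(old_engine_snaps, new_engine_snaps):
    """Strict check: names AND snapshot info (including created time) must match."""
    return old_engine_snaps == new_engine_snaps


def snapshot_names_synced(old_engine_snaps, new_engine_snaps):
    """Proposed check: every snapshot name known to the old engine must also
    appear in the new engine; creation timestamps are ignored."""
    return set(old_engine_snaps) <= set(new_engine_snaps)


# Illustrative data: each engine read "created" from a different RW replica,
# so the same snapshot carries slightly different timestamps.
old_engine = {
    "snap-a": {"created": "2023-06-28T11:38:20Z"},
    "volume-head": {"created": "2023-06-28T11:38:20Z"},
}
new_engine = {
    "snap-a": {"created": "2023-06-28T11:38:21Z"},
    "volume-head": {"created": "2023-06-28T11:38:21Z"},
}

print(snapshots_strictly_equal(old_engine, new_engine))  # blocks migration
print(snapshot_names_synced(old_engine, new_engine))     # lets it proceed
print(snapshot_names_synced(old_engine, {}))             # new engine still empty
```

The name-only check still returns False while the new engine's snapshot chain is empty, so it keeps the original purpose of waiting until the chain is populated.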

@khushboo-rancher Could you help double-check that this test plan still passes: #5992 (comment)?

Yes, tested on master-head (longhorn-manager 4a57a72) and v1.5.x-head (longhorn-manager a9bf977), this migration test plan still works.

Verified as passing on v1.5.x-head (longhorn-manager ba2d3d1) by running test_migration_with_unscheduled_replica, test_migration_with_failed_replica, test_migration_with_restore_volume and test_migration_with_rebuilding_replica 10 times.

All test cases passed: https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/33/

Thanks @yangchiu. The migration tests work on v1.5.x-head on my setup too.

@khushboo-rancher Could you help double-check that this test plan still passes: https://github.com/longhorn/longhorn/issues/5992#issuecomment-1563873360 ?

Verified as passing on master-head (longhorn-manager 9bc7e07) by running test_migration_with_unscheduled_replica, test_migration_with_failed_replica, test_migration_with_restore_volume and test_migration_with_rebuilding_replica 10 times.

All test cases passed: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/4384/

Waiting for v1.5.x-head test result now.

You are correct, @innobead. It should not happen in 1.4.x and 1.3.x.

This issue is a side effect of https://github.com/longhorn/longhorn-manager/pull/1922

@roger-ryao could it be a timeout issue? Was the volume eventually migrated to the new node?

Probably not. It’s stuck in the same state forever: http://34.228.2.191:30007/#/volume/longhorn-testvol-bl8klw http://34.228.2.191:30007/v1/volumes/longhorn-testvol-bl8klw

@yangchiu was this also seen before in other release branches?

cc @longhorn/qa

Hi @innobead: I first observed the test case test_migration_with_failed_replica failing on 1.5.0-RC3. I didn’t encounter this failure in my personal test records for 1.4.x and 1.3.x.