longhorn: [BUG] A race after a node reboot leads to I/O errors with migratable volumes

Describe the bug (šŸ› if you encounter this issue)

A hard reboot on a Harvester node can lead to a race that results in I/O errors for running VMs. The VMs cannot proceed without being stopped and started again.

Investigation is ongoing. I will update this description as I understand more.

It seems like our migration logic makes a bad decision when there is a replica starting on an old node but delayed by backing image download. The volume finishes attaching to the new node while ignoring the replica. A workload on the new node can even proceed to start using the volume. However, once the delay is over and the replica begins rebuilding on the old node, this logic crashes the engine on the new node, leading to I/O errors and stuck VMs.

It’s not clear whether this can be reproduced without:

  • A hung VolumeAttachment on the old node like in https://github.com/longhorn/longhorn/issues/6919 (see the check sketched after this list). This keeps us in a permanent state of migration when there is really no need for migration at all.
  • Backing image downloads delaying replica rebuilding.
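A quick way to check for the first condition (a rough sketch; substitute the VolumeAttachment name reported for the volume's PV):
# A hung VolumeAttachment keeps the PV "attached" to the rebooted node even though nothing needs it there.
kubectl get volumeattachment -o wide
# Inspect the suspect attachment to see which node and PV it pins.
kubectl describe volumeattachment <volumeattachment-name>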

To Reproduce

These steps don’t reproduce 100% of the time. Working on a simpler, more effective set.

  1. Run a Harvester VM on one node.
  2. Hard reboot the node.
  3. Wait for the VM to come up on a different node; it experiences I/O errors and remains permanently paused.

Expected behavior

Support bundle for troubleshooting

Environment

  • Longhorn version: v1.4.3
  • Harvester version: v1.2.0
  • Two nodes:
  • pnws6 (original home of the VM pod)
  • bt64n (eventual home of the VM pod)
  • Longhorn volume used as the root disk for the VM pod

Additional context

Order of events:

  1. Node pnws6 is down unexpectedly.
  2. Longhorn manager on node bt64n eventually notices and marks instances as unknown.
  3. Node pnws6 comes back.
  4. Longhorn manager on node pnws6 marks instances as error.
  5. Longhorn manager on node pnws6 decides to reattach the volume. Should it do that? Need to check this part.
2023-10-24T19:47:45.523938549Z time="2023-10-24T19:47:45Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:47:45.524094543Z time="2023-10-24T19:47:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\", UID:\"22e45430-af68-41df-b114-bd7b9495ed7b\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16032097\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d dead unexpectedly, reattach the volume"
  6. A replica and an engine are started on pnws6.
  7. The replica can’t be added immediately to the engine because it is waiting on a backing image download. However, the engine can still run because there is a replica on bt64n.
2023-10-24T19:48:01.681196556Z time="2023-10-24T19:48:01Z" level=info msg="Creating instance pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471"
2023-10-24T19:48:01.681237847Z time="2023-10-24T19:48:01Z" level=debug msg="Replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 is waiting for backing image default-image-rz9sk downloading file to node harvester-pnws6 disk 75b99d80-2fd8-449f-a459-13f89c4752c2, the current state is unknown" controller=longhorn-replica dataPath= node=harvester-pnws6 nodeID=harvester-pnws6 ownerID=harvester-pnws6 replica=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471
2023-10-24T19:48:01.681269333Z time="2023-10-24T19:48:01Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471\", UID:\"83460c3f-1102-40d7-ab05-4ea12619ddda\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16032719\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471"
  8. The VM layer decides to start the VM pod on bt64n.
  9. The CSI plugin attaches the volume to bt64n. Since the volume is already attached to pnws6, Longhorn considers this a migration. The VM layer does not consider it to be a migration; to it, the VM pod is dead and can safely be started somewhere else.
  10. The migration logic is confused by the existence of a replica that has no mode in the replica mode map of the engine on pnws6, but it allows the migration to proceed.
2023-10-24T19:48:19.459475928Z time="2023-10-24T19:48:19Z" level=warning msg="unexpected mode  for the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471, will ignore it and continue migration" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
  11. The backing image download finishes and the replica on pnws6 prepares to rebuild.
  12. Eventually, the migration engine is ready, the attachment to bt64n completes, and the VM pod starts. We can never actually complete the migration because Kubernetes does not ask us to detach from pnws6.
  13. The VM pod starts using the volume.
  14. Replica rebuilding starts. As soon as Longhorn manager on pnws6 detects the rebuilding (WriteOnly) replica in its engine’s replica mode map, it reverts the migration and crashes the migration engine on bt64n (see the sketch after this list for a way to watch this).
2023-10-24T19:48:34.828775246Z time="2023-10-24T19:48:34Z" level=debug msg="Finished snapshot purge, will start rebuilding then" controller=longhorn-engine engine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8 node=harvester-pnws6 volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.850506663Z time="2023-10-24T19:48:34Z" level=warning msg="unexpected mode  for the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471, will ignore it and continue migration" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.852754435Z time="2023-10-24T19:48:34Z" level=info msg="volume migration engine is ready" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.950305262Z time="2023-10-24T19:48:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8\", UID:\"0c0a298c-2d24-4b53-9fd9-b859f9d22558\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16033390\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Start rebuilding replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 with Address 10.52.1.196:10075 for normal engine pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8 and volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d"
2023-10-24T19:48:36.629692166Z time="2023-10-24T19:48:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8\", UID:\"0c0a298c-2d24-4b53-9fd9-b859f9d22558\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16033428\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Detected rebuilding replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 (10.52.1.196:10075)"
2023-10-24T19:48:36.719829650Z time="2023-10-24T19:48:36Z" level=warning msg="Error syncing Longhorn snapshot longhorn-system/fdc3b3e4-ab1e-4149-bdfe-fb2d84076819" controller=longhorn-snapshot error="longhorn-snapshot: failed to sync snapshot longhorn-system/fdc3b3e4-ab1e-4149-bdfe-fb2d84076819: getTheOnlyEngineCRforSnapshot: found more than 1 engines for volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" node=harvester-pnws6
2023-10-24T19:48:36.720044621Z time="2023-10-24T19:48:36Z" level=debug msg="Need to revert rather than starting migration since the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 is mode WriteOnly, which means the rebuilding is in progress" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:36.720112171Z time="2023-10-24T19:48:36Z" level=warning msg="The migration engine or all migration replicas crashed, will clean them up now" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
  15. The VM experiences I/O errors attempting to write to a crashed volume.
  16. Longhorn reattaches the volume to bt64n, but the VM can never recover without being restarted.
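To watch the transition described in steps 11-16 on a live cluster, the Longhorn engine CRs can be inspected directly (a sketch using the names from the logs above; they will differ elsewhere):
# During the migration there are two engine CRs for the volume; the second one is the migration engine on bt64n.
kubectl -n longhorn-system get engines.longhorn.io | grep pvc-476a4873-d166-4e6e-8bac-ccf01083e14d

# The revert fires once the old engine reports the rebuilding replica in its replica mode map.
kubectl -n longhorn-system get engines.longhorn.io pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8 -o jsonpath='{.status.replicaModeMap}'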

About this issue

  • State: closed
  • Created 8 months ago
  • Reactions: 4
  • Comments: 19 (18 by maintainers)

Most upvoted comments

Reproduce (somewhat complicated):

  1. Install Longhorn on a three node cluster.
  2. Create a migratable storage class that uses a relatively large backing image (I used 1 GiB).
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: test-sc
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"
  migratable: "true"
  backingImage: "test"
  backingImageDataSourceType: "download"
  backingImageDataSourceParameters: '{"url": "https://eweber-backing-image-test.sfo3.digitaloceanspaces.com/test.raw"}'
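Assuming the manifest above is saved as test-sc.yaml (hypothetical filename), apply it; once a volume is later provisioned from this class, Longhorn creates the backing image and its per-disk download progress can be checked:
kubectl apply -f test-sc.yaml

# Shows the backing image and its download state on each disk.
kubectl -n longhorn-system get backingimages.longhorn.io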
  3. Create a PVC that references the storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  storageClassName: test-sc
  resources:
    requests:
      storage: 1Gi
  4. Determine the name of the PV bound to the PVC.
eweber@laptop:~/longhorn/examples/rwx> k get pvc
NAME       STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
test-pvc   Bound    pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8   1Gi        RWX            test-sc        49m
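Equivalently, the PV name can be pulled out with a jsonpath query (just a convenience):
kubectl get pvc test-pvc -o jsonpath='{.spec.volumeName}'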
  5. Choose a node to manually attach the PV to (this simulates the “old” node in the issue description).
eweber@laptop:~/longhorn/examples/rwx> kl get node
NAME                                STATUS   ROLES                       AGE    VERSION
eweber-v125-master-e5399264-88sld   Ready    control-plane,etcd,master   111d   v1.25.12+rke2r1
eweber-v125-worker-e472db53-9kz5b   Ready    worker                      111d   v1.25.12+rke2r1
eweber-v125-worker-e472db53-n8vj8   Ready    worker                      111d   v1.25.12+rke2r1
eweber-v125-worker-e472db53-xjmzr   Ready    worker                      107d   v1.25.12+rke2r1
  6. Create a volume attachment to manually attach the bound PV to the chosen node. (Modify the appropriate fields.)
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: test-va-1
spec:
  attacher: driver.longhorn.io
  nodeName: eweber-v125-worker-e472db53-9kz5b # Modify for cluster.
  source:
    persistentVolumeName: pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8 # Modify for cluster.
eweber@laptop:~/longhorn/examples/rwx> k get volumeattachment
NAME        ATTACHER             PV                                         NODE                                ATTACHED   AGE
test-va-1   driver.longhorn.io   pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8   eweber-v125-worker-e472db53-9kz5b   true       51m
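As a sanity check, the Longhorn volume CR (named after the PV) should now report itself attached to the chosen node:
kubectl -n longhorn-system get volumes.longhorn.io pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8 -o jsonpath='{.status.state} {.status.currentNodeID}'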
  7. Cordon the old node and delete its running replica.
eweber@laptop:~/longhorn/examples/rwx> k cordon eweber-v125-worker-e472db53-9kz5b
node/eweber-v125-worker-e472db53-9kz5b cordoned

eweber@laptop:~/longhorn/examples/rwx> kl get replica
NAME                                                  STATE     NODE                                DISK                                   INSTANCEMANAGER                                     IMAGE                                    AGE
pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-0feed53c   running   eweber-v125-worker-e472db53-9kz5b   960eccfd-bd02-47eb-995e-ab569503424c   instance-manager-de0a44c527b5c06489d7a009124e4fd3   longhornio/longhorn-engine:master-head   56m
pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-3b9e3cb8   running   eweber-v125-worker-e472db53-xjmzr   0c519f3b-b88e-4e1a-830b-cdb9df99152c   instance-manager-c839c5bbab46d866e8380cbdb9460b4a   longhornio/longhorn-engine:master-head   2m38s
pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-6a299cb4   running   eweber-v125-worker-e472db53-n8vj8   d859aad8-29ff-43a6-bb7e-f09ce6a2d93d   instance-manager-211d7cd4337ea60af17807834c0c7055   longhornio/longhorn-engine:master-head   26m

eweber@laptop:~/longhorn/examples/rwx> kl delete replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-0feed53c
replica.longhorn.io "pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-0feed53c" deleted
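With the node cordoned and its replica gone, the volume should drop to degraded and stay there, because the replacement replica cannot be scheduled yet (a quick check):
kubectl -n longhorn-system get volumes.longhorn.io pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8 -o jsonpath='{.status.robustness}'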
  8. Edit the backing image spec so that a copy of the backing image no longer exists on the old node (a non-interactive way to do this is sketched after the YAML below).
eweber@laptop:~/longhorn/examples/rwx> kl get lhn -oyaml eweber-v125-worker-e472db53-9kz5b | grep UUID
      diskUUID: 960eccfd-bd02-47eb-995e-ab569503424c
eweber@laptop:~/longhorn/examples/rwx> kl edit lhbi
Edit cancelled, no changes made.
eweber@laptop:~/longhorn/examples/rwx> kl get lhbi -oyaml
apiVersion: v1
items:
- apiVersion: longhorn.io/v1beta2
  kind: BackingImage
  metadata:
    creationTimestamp: "2023-10-27T19:30:31Z"
    finalizers:
    - longhorn.io
    generation: 13
    labels:
      longhorn.io/component: backing-image
      longhorn.io/managed-by: longhorn-manager
    name: test
    namespace: longhorn-system
    resourceVersion: "82175437"
    uid: 23f93b3d-4e09-4515-95de-6a0fa5645d0c
  spec:
    checksum: ""
    disks:
      0c519f3b-b88e-4e1a-830b-cdb9df99152c: ""
      d859aad8-29ff-43a6-bb7e-f09ce6a2d93d: ""
    sourceParameters:
      url: https://eweber-backing-image-test.sfo3.digitaloceanspaces.com/test.raw
    sourceType: download
  status:
    checksum: 64d15f7127000ae48594b933954c4e929eaef1046e07e2ddae75c1f442aeeed88516661cf639a6560274f3d8edcbf9c32b8d69e84992ff4d0b888438ea38d00a
    diskFileStatusMap:
      0c519f3b-b88e-4e1a-830b-cdb9df99152c:
        lastStateTransitionTime: "2023-10-30T15:52:45Z"
        message: ""
        progress: 100
        state: ready
      d859aad8-29ff-43a6-bb7e-f09ce6a2d93d:
        lastStateTransitionTime: "2023-10-27T22:16:02Z"
        message: ""
        progress: 100
        state: ready
    diskLastRefAtMap: {}
    ownerID: eweber-v125-worker-e472db53-9kz5b
    size: 1073741824
    uuid: 715aca40
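A non-interactive way to perform the edit from step 8 is a JSON patch that drops the old node's disk entry from the backing image spec (a sketch; use the disk UUID grepped from the node object above, and note an admission webhook may reject it on some versions):
# Remove the old node's disk from spec.disks so no copy of the backing image remains there.
kubectl -n longhorn-system patch backingimages.longhorn.io test --type=json \
  -p='[{"op": "remove", "path": "/spec/disks/960eccfd-bd02-47eb-995e-ab569503424c"}]'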
  9. Prepare, but DO NOT APPLY, a pod that will run on a “new” node and use the volume already attached to the old node.
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
  namespace: default
spec:
  restartPolicy: Always
  containers:
    - name: volume-test
      image: nginx:stable-alpine
      imagePullPolicy: IfNotPresent
      volumeDevices:
        - devicePath: /dev/longhorn/test-volume
          name: test-volume
      args: ["sh", "-c", "set -e; while true; do dd if=/dev/urandom of=/dev/longhorn/test-volume bs=512 count=1; sleep 1; done"]
  nodeName: eweber-v125-worker-e472db53-xjmzr # Modify for cluster.
  volumes:
    - name: test-volume
      persistentVolumeClaim:
        claimName: test-pvc
  10. At this point:
  • There is a degraded volume with a replica that cannot be scheduled because a node is cordoned.
  • When the node is uncordoned, it will have to redownload the backing image.
  • The volume is attached to the cordoned node and has a running engine.
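These three conditions can be confirmed before continuing (a rough check using the objects created above):
# The volume is degraded: one replica slot cannot be scheduled while the node is cordoned.
kubectl -n longhorn-system get volumes.longhorn.io pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8 -o jsonpath='{.status.robustness}'

# The backing image no longer has a ready copy on the old node's disk.
kubectl -n longhorn-system get backingimages.longhorn.io test -o jsonpath='{.status.diskFileStatusMap}'

# The volume is still attached to the cordoned node and has a running engine there.
kubectl -n longhorn-system get engines.longhorn.io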
  11. Uncordon the node, wait a few seconds, and apply the pod.
eweber@laptop:~/longhorn/examples/rwx> k uncordon eweber-v125-worker-e472db53-9kz5b; sleep 3; k apply -f 6961-pod.yaml 
node/eweber-v125-worker-e472db53-9kz5b uncordoned
pod/test-pod created
  12. Wait for some time. At first, the pod runs and is healthy, but after a while it crashes.
eweber@laptop:~/longhorn/examples/rwx> k get pod
NAME       READY   STATUS             RESTARTS      AGE
test-pod   0/1     CrashLoopBackOff   2 (27s ago)   118s

eweber@laptop:~/longhorn/examples/rwx> k logs test-pod 
dd: can't open '/dev/longhorn/test-volume': No such device or address
  13. Check the logs to confirm the order of events:
# The replica tries to run on the old node.
[longhorn-manager-v26xg] time="2023-10-30T16:27:55Z" level=info msg="Schedule replica to node eweber-v125-worker-e472db53-9kz5b" dataDirectoryName=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-6c6f0c0a disk=960eccfd-bd02-47eb-995e-ab569503424c diskPath=/var/lib/longhorn/ replica=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e
[longhorn-manager-v26xg] time="2023-10-30T16:27:55Z" level=info msg="Creating instance pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e"
[longhorn-manager-v26xg] time="2023-10-30T16:27:55Z" level=info msg="Asking backing image test to download file to node eweber-v125-worker-e472db53-9kz5b disk 960eccfd-bd02-47eb-995e-ab569503424c" controller=longhorn-replica node=eweber-v125-worker-e472db53-9kz5b nodeID=eweber-v125-worker-e472db53-9kz5b ownerID=eweber-v125-worker-e472db53-9kz5b replica=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e

# The replica can't actually start until the backing image is downloaded.
[longhorn-manager-v26xg] time="2023-10-30T16:27:55Z" level=info msg="Waiting for backing image test to download file to node eweber-v125-worker-e472db53-9kz5b disk 960eccfd-bd02-47eb-995e-ab569503424c, the current state is " controller=longhorn-replica node=eweber-v125-worker-e472db53-9kz5b nodeID=eweber-v125-worker-e472db53-9kz5b ownerID=eweber-v125-worker-e472db53-9kz5b replica=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e

# The migration process starts.
[longhorn-manager-v26xg] time="2023-10-30T16:27:59Z" level=info msg="Cloned a new matching replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-3b6ac38c from pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-3b9e3cb8" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8
[longhorn-manager-v26xg] time="2023-10-30T16:27:59Z" level=info msg="Cloned a new matching replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-deaa372c from pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-6a299cb4" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8

# The migration process is a bit confused by the state of the waiting replica.
[longhorn-manager-v26xg] time="2023-10-30T16:27:59Z" level=warning msg="Unexpected mode  for the current replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e, will ignore it and continue migration" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-1 migrationNodeID=eweber-v125-worker-e472db53-xjmzr node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8

# The migration engine becomes ready. Attachment is allowed to complete.
[longhorn-manager-v26xg] time="2023-10-30T16:28:04Z" level=info msg="volume migration engine is ready" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-1 migrationNodeID=eweber-v125-worker-e472db53-xjmzr node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8

# The backing image finished downloading and rebuilding begins. It is very quick.
[longhorn-manager-v26xg] time="2023-10-30T16:29:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-0\", UID:\"27764883-7d66-4a0e-9266-71363721c433\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"82178850\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Start rebuilding replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e with Address 10.42.59.178:10000 for normal engine pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-0 and volume pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8"
[longhorn-manager-v26xg] time="2023-10-30T16:29:12Z" level=info msg="Starting snapshot purge after rebuilding" controller=longhorn-engine engine=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-0 node=eweber-v125-worker-e472db53-9kz5b volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8
[longhorn-manager-v26xg] time="2023-10-30T16:29:12Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-0\", UID:\"27764883-7d66-4a0e-9266-71363721c433\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"82178850\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilt' Replica pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-r-7f17614e with Address 10.42.59.178:10000 has been rebuilt for volume pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8"

# This time, the old replica is never caught RO in the `engine.status.replicaModeMap` (too quick/bad timing).
# Instead, we simply detect a newly rebuilt replica. 
# The result is the same. 
# We intentionally crash the migrationEngine (which is already in use).
[longhorn-manager-v26xg] time="2023-10-30T16:29:14Z" level=warning msg="The current available migration replicas do not match the record in the migration engine status, will restart the migration engine then update the replica map" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8-e-1 migrationNodeID=eweber-v125-worker-e472db53-xjmzr node=eweber-v125-worker-e472db53-9kz5b owner=eweber-v125-worker-e472db53-9kz5b shareEndpoint= shareState= state=attached volume=pvc-7d07e98d-f5a1-4194-894f-98c7d81a43a8

Regarding the new race (https://github.com/longhorn/longhorn/issues/6961#issuecomment-1854811140), could you explain a bit more how the data can be corrupted at this step?

This may have been overstated. When I originally detected the issue, I was actually looking into a case in which a replica’s metadata was messed up after the migration completed. I was not able to trace it directly back to this issue, but I realized while investigating that there was an avenue for rebuilding to kick off after the migration engine became ready.

Since we have already allowed the volume to attach to the new node with the new migration engine, the new node may be writing to it. (Usually not, but in strange cases like the Harvester cluster down/up scenario combined with the upstream kubelet bug, it is.) Now, we start rebuilding the volume with the old engine. In fact, if the rebuilding is fast enough, maybe we can even finish the rebuilding and kick off another snapshot purge before the WO replica is detected and the migration is crashed? (I would need to check this part.)

Reopening this one because I found some undesirable behavior when testing its effects in Harvester cluster down/up scenarios.

My PR prevented the volume controller from adding a replica to an “old” engine’s spec.replicaAddressMap if the associated volume was migrating. This prevented the easily reproducible behavior in this issue’s test steps.

However, there’s an apparently less likely race condition that can still occur. In this one:

  • The volume controller adds a replica to spec.replicaAddressMap.
  • The replica must be rebuilt, but autoCleanupSystemGeneratedSnapshots is enabled, so we wait 10 seconds for a snapshot purge.
  • A migration is requested.
  • Existing logic preventing the migration engine from starting if a rebuild is ongoing is ineffective, because it checks the engine’s status.replicaModeMap, which the replica hasn’t been added to yet.
  • New logic from my PR preventing the replica from rebuilding is ineffective, because it prevents replicas from being added to spec.replicaAddressMap. This replica is already there.
  • The engine controller adds the replica to the engine, kicking off a rebuild during the migration. (I think this can lead to potential data corruption.)
  • Eventually, the volume controller detects the extra replica in status.replicaModeMap and crashes the migration engine. (This is the thing we wanted to avoid in the first place).
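The window above can be observed on a live cluster by comparing the two engine fields involved, together with the snapshot-cleanup setting (a sketch; <engine-name> is the degraded volume's current engine, and the setting name assumes the usual Longhorn setting naming):
# The setting that introduces the ~10 second purge delay before rebuilding.
kubectl -n longhorn-system get settings.longhorn.io auto-cleanup-system-generated-snapshot

# spec.replicaAddressMap already contains the waiting replica...
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o jsonpath='{.spec.replicaAddressMap}'

# ...while status.replicaModeMap does not list it yet, so the existing migration guard cannot see it.
kubectl -n longhorn-system get engines.longhorn.io <engine-name> -o jsonpath='{.status.replicaModeMap}'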

I’m not sure yet whether this one is easily reproducible, but I’m testing a fix in my Harvester setup.

@yangchiu In addition to the test scenario, could you please run related (Node down, reboot, cluster restart) negative test cases multiple times on our CI?

Currently, for migratable volumes, we return attachment success if:

  • The volume.status.state == attached, and
  • There is an engine on the attaching node with engine.status.endpoint != "".

The first condition is essentially always true during migration (since the volume is attached to the old node). The second condition becomes true when the migration engine is running successfully. However, even after we return attachment success, we may still revert the migration if anything about the migration “state” changes. In this case, we crashed the migration engine because we found a newly rebuilding replica in the old engine’s replicaModeMap, but we would have also crashed it if, for any other reason, the replicas in the migration engine’s replicaAddressMap did not match the old engine’s.
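For reference, the two conditions can be read straight off the CRs while a migration is in flight (a sketch; names are illustrative):
# Condition 1: the volume stays attached (to the old node) throughout the migration.
kubectl -n longhorn-system get volumes.longhorn.io <volume-name> -o jsonpath='{.status.state}'

# Condition 2: the migration engine on the attaching node has published an endpoint.
kubectl -n longhorn-system get engines.longhorn.io <migration-engine-name> -o jsonpath='{.status.endpoint}'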

We can’t intentionally crash the engine after we already reported it attached, or we will see issues like this one.

Potential solutions I thought up:

@shuo-wu, much of the code I’m digging around in was refactored in https://github.com/longhorn/longhorn/issues/3206 to prevent previous issues we saw with migration. Do you like any of these ideas or have a better one off the top of your head?

We may be able to clean up the logic so that we can avoid crashing the volume on the migrationNode (need to investigate this more). However, if we are never asked to detach from the old node, we will be stuck migrating forever. This will lead to future issues (e.g. a migration to a “third” node can never be triggered).