longhorn: [BUG] A race after a node reboot leads to I/O errors with migratable volumes
Describe the bug (🐛 if you encounter this issue)
A hard reboot on a Harvester node can lead to a race that results in I/O errors for running VMs. The VMs cannot proceed without being stopped and started again.
Investigation is ongoing. I will update this description as I understand more.
It seems like our migration logic makes a bad decision when there is a replica starting on an old node but delayed by backing image download. The volume finishes attaching to the new node while ignoring the replica. A workload on the new node can even proceed to start using the volume. However, once the delay is over and the replica begins rebuilding on the old node, this logic crashes the engine on the new node, leading to I/O errors and stuck VMs.
It's not clear whether this can be reproduced without:
- A hung VolumeAttachment on the old node like in https://github.com/longhorn/longhorn/issues/6919. This keeps us in a permanent state of migration, when there is really no need for migration at all.
- Backing image downloads delaying replica rebuilding.
To Reproduce
These steps don't reproduce 100% of the time. Working on a simpler, more effective set.
- Run a Harvester VM on one node.
- Hard reboot the node.
- Wait for the VM to come up on a different node, where it experiences I/O errors and remains permanently paused.
Expected behavior
Support bundle for troubleshooting
Environment
- Longhorn version: v1.4.3
- Harvester version: v1.2.0
- Two nodes:
  - `pnws6` (original home of the VM pod)
  - `bt64n` (eventual home of the VM pod)
- Longhorn volume used as the root disk for the VM pod
Additional context
Order of events:
- Node `pnws6` is down unexpectedly.
- Longhorn manager on node `bt64n` eventually notices and marks instances as `unknown`.
- Node `pnws6` comes back.
- Longhorn manager on node `pnws6` marks instances as `error`.
- Longhorn manager on node `pnws6` decides to reattach the volume. Should it do that? Need to check this part.
2023-10-24T19:47:45.523938549Z time="2023-10-24T19:47:45Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:47:45.524094543Z time="2023-10-24T19:47:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d\", UID:\"22e45430-af68-41df-b114-bd7b9495ed7b\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16032097\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d dead unexpectedly, reattach the volume"
- A replica and an engine are started on `pnws6`.
- The replica can't be added immediately to the engine because it is waiting on a backing image download. However, the engine can still run because there is a replica on `bt64n`.
2023-10-24T19:48:01.681196556Z time="2023-10-24T19:48:01Z" level=info msg="Creating instance pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471"
2023-10-24T19:48:01.681237847Z time="2023-10-24T19:48:01Z" level=debug msg="Replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 is waiting for backing image default-image-rz9sk downloading file to node harvester-pnws6 disk 75b99d80-2fd8-449f-a459-13f89c4752c2, the current state is unknown" controller=longhorn-replica dataPath= node=harvester-pnws6 nodeID=harvester-pnws6 ownerID=harvester-pnws6 replica=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471
2023-10-24T19:48:01.681269333Z time="2023-10-24T19:48:01Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471\", UID:\"83460c3f-1102-40d7-ab05-4ea12619ddda\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16032719\", FieldPath:\"\"}): type: 'Normal' reason: 'Start' Starts pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471"
- The VM layer decides to start the VM pod on `bt64n`.
- The CSI plugin attaches the volume to `bt64n`. Since the volume is already attached to `pnws6`, Longhorn considers this a migration. The VM layer does not consider it a migration; to it, the VM pod is dead and can safely be started somewhere else.
- The migration logic is confused by the existence of a replica that is not in the replica mode map of the engine on `pnws6`, but it allows the migration to proceed.
2023-10-24T19:48:19.459475928Z time="2023-10-24T19:48:19Z" level=warning msg="unexpected mode for the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471, will ignore it and continue migration" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
- The backing image download finishes and the replica on `pnws6` prepares to rebuild.
- Eventually, the migration engine is ready, the attachment to `bt64n` completes, and the VM pod starts. We can never actually complete the migration because Kubernetes does not ask us to detach from `pnws6`.
- The VM pod starts using the volume.
- Replica rebuilding starts. As soon as the Longhorn manager on `pnws6` detects a WO replica in its engine's replica mode map, it reverts the migration and crashes the migration engine on `bt64n` (see the sketch after this walkthrough).
2023-10-24T19:48:34.828775246Z time="2023-10-24T19:48:34Z" level=debug msg="Finished snapshot purge, will start rebuilding then" controller=longhorn-engine engine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8 node=harvester-pnws6 volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.850506663Z time="2023-10-24T19:48:34Z" level=warning msg="unexpected mode for the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471, will ignore it and continue migration" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.852754435Z time="2023-10-24T19:48:34Z" level=info msg="volume migration engine is ready" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:34.950305262Z time="2023-10-24T19:48:34Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8\", UID:\"0c0a298c-2d24-4b53-9fd9-b859f9d22558\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16033390\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Start rebuilding replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 with Address 10.52.1.196:10075 for normal engine pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8 and volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d"
2023-10-24T19:48:36.629692166Z time="2023-10-24T19:48:36Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-b04a64c8\", UID:\"0c0a298c-2d24-4b53-9fd9-b859f9d22558\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"16033428\", FieldPath:\"\"}): type: 'Normal' reason: 'Rebuilding' Detected rebuilding replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 (10.52.1.196:10075)"
2023-10-24T19:48:36.719829650Z time="2023-10-24T19:48:36Z" level=warning msg="Error syncing Longhorn snapshot longhorn-system/fdc3b3e4-ab1e-4149-bdfe-fb2d84076819" controller=longhorn-snapshot error="longhorn-snapshot: failed to sync snapshot longhorn-system/fdc3b3e4-ab1e-4149-bdfe-fb2d84076819: getTheOnlyEngineCRforSnapshot: found more than 1 engines for volume pvc-476a4873-d166-4e6e-8bac-ccf01083e14d" node=harvester-pnws6
2023-10-24T19:48:36.720044621Z time="2023-10-24T19:48:36Z" level=debug msg="Need to revert rather than starting migration since the current replica pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-r-a1ae3471 is mode WriteOnly, which means the rebuilding is in progress" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
2023-10-24T19:48:36.720112171Z time="2023-10-24T19:48:36Z" level=warning msg="The migration engine or all migration replicas crashed, will clean them up now" accessMode=rwx controller=longhorn-volume frontend=blockdev migratable=true migrationEngine=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d-e-c8c76749 migrationNodeID=harvester-bt64n node=harvester-pnws6 owner=harvester-pnws6 shareEndpoint= shareState= state=attached volume=pvc-476a4873-d166-4e6e-8bac-ccf01083e14d
- The VM experiences I/O errors attempting to write to a crashed volume.
- Longhorn reattaches the volume to `bt64n`, but the VM can never recover without being restarted.
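The revert in the last step amounts to something like the following simplified Go sketch. This is not the actual longhorn-manager code; the type and function names are hypothetical stand-ins for the replica mode map check involved.

```go
package main

import "fmt"

// ReplicaMode mirrors the replica modes an engine reports (RW, WO, ERR),
// matching the WriteOnly mode visible in the logs above.
type ReplicaMode string

const (
	ModeRW  ReplicaMode = "RW"
	ModeWO  ReplicaMode = "WO"
	ModeERR ReplicaMode = "ERR"
)

// shouldRevertMigration is a hypothetical stand-in for the decision the
// volume controller makes while a migration is in progress: if any replica
// in the old engine's replica mode map is not RW (for example a WO replica
// that just started rebuilding), the migration is reverted and the
// migration engine is cleaned up.
func shouldRevertMigration(oldEngineReplicaModeMap map[string]ReplicaMode) (bool, string) {
	for replica, mode := range oldEngineReplicaModeMap {
		if mode != ModeRW {
			return true, fmt.Sprintf("replica %s is mode %s", replica, mode)
		}
	}
	return false, ""
}

func main() {
	modeMap := map[string]ReplicaMode{
		"r-existing": ModeRW,
		"r-a1ae3471": ModeWO, // rebuilding just started on pnws6
	}
	if revert, reason := shouldRevertMigration(modeMap); revert {
		fmt.Println("reverting migration:", reason) // crashes the migration engine on bt64n
	}
}
```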
> Reproduce (somewhat complicated):
This may have been overstated. When I originally detected the issue, I was actually looking into a case in which a replica's metadata was messed up after the migration completed. I was not able to trace it directly back to this issue, but I realized while investigating that there was an avenue for rebuilding to kick off after the migration engine became ready.
Since we have already allowed the volume to attach to the new node with the new migration engine, the new node may be writing to it. (Usually not, but in strange cases like the Harvester cluster down/up scenario combined with the upstream kubelet bug, it is.) Now, we start rebuilding the volume with the old engine. In fact, if the rebuilding is fast enough, maybe we can even finish the rebuilding and kick off another snapshot purge before the WO replica is detected and the migration is crashed? (I would need to check this part.)
Reopening this one because I found some undesirable behavior when testing its effects in Harvester cluster down/up scenarios.
My PR prevented the volume controller from adding a replica to an "old" engine's `spec.replicaAddressMap` if the associated volume was migrating. This prevented the easily reproducible behavior in this issue's test steps.

However, there's an apparently less likely race condition that can still occur. In this one:

- The replica is added to the old engine's `spec.replicaAddressMap`.
- `autoCleanupSystemGeneratedSnapshots` is enabled, so we wait 10 seconds for a snapshot purge.
- The migration check only looks at the old engine's `status.replicaModeMap`, which the replica hasn't been added to yet.
- The guard from my PR only prevents adding the replica to `spec.replicaAddressMap`. This replica is already there.
- Once the purge finishes and rebuilding starts, the replica shows up in the old engine's `status.replicaModeMap` and crashes the migration engine. (This is the thing we wanted to avoid in the first place.)

I'm not sure yet whether this one is easily reproducible, but I'm testing a fix in my Harvester setup.
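To make the loophole concrete, here is a minimal Go sketch of the guard, with hypothetical names and trimmed-down types (not the actual volume controller code). The point is that the guard only blocks new additions; a replica that is already in the map sails through.

```go
package main

import "fmt"

// engineSpec is a hypothetical, trimmed-down view of the engine CR fields
// discussed in this comment.
type engineSpec struct {
	ReplicaAddressMap map[string]string
}

// addReplicaUnlessMigrating sketches the guard: skip adding a replica to
// the old engine's spec.replicaAddressMap while the volume is migrating.
// A replica that was added before the migration started is already in the
// map, so the guard never gets a chance to keep it out.
func addReplicaUnlessMigrating(e *engineSpec, name, addr string, migrating bool) {
	if _, exists := e.ReplicaAddressMap[name]; exists {
		return // already present; nothing for the guard to do
	}
	if migrating {
		return // the guard added by the PR: no new replicas mid-migration
	}
	e.ReplicaAddressMap[name] = addr
}

func main() {
	e := &engineSpec{ReplicaAddressMap: map[string]string{
		// Added before the migration was requested, while the replica was
		// still waiting on the backing image download and snapshot purge.
		"r-a1ae3471": "10.52.1.196:10075",
	}}
	addReplicaUnlessMigrating(e, "r-a1ae3471", "10.52.1.196:10075", true)
	fmt.Println(e.ReplicaAddressMap) // replica still present, so rebuilding can still kick off
}
```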
@yangchiu In addition to the test scenario, could you please run related (Node down, reboot, cluster restart) negative test cases multiple times on our CI?
Currently, for migratable volumes, we return attachment success if:

- `volume.status.state == attached`, and
- `engine.status.endpoint != ""`.

The first condition is generally always true during migration (since the volume is attached to the old node). The second condition becomes true when the migration engine is running successfully. However, even after we return attachment success, we may still revert the migration if anything about the migration "state" changes. In this case, we crashed the migration engine because we found a new WO replica in the old engine's `replicaModeMap`, but we would have also crashed it if, for any other reason, the replicas in the migration engine's `replicaAddressMap` did not match the old engine's.

We can't intentionally crash the engine after we have already reported it attached, or we will see issues like this one.
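Roughly, the success condition above is nothing more than the following check (a minimal sketch with hypothetical stand-in types, not the actual CSI plugin code):

```go
package main

import "fmt"

// Trimmed-down stand-ins for the status fields named in the two
// conditions above.
type volumeStatus struct {
	State string // e.g. "attached"
}

type engineStatus struct {
	Endpoint string // block device endpoint; "" while the engine is not ready
}

// attachmentSucceeded mirrors the two conditions for migratable volumes:
// the volume reports attached (true for the old node throughout a
// migration) and the migration engine has an endpoint.
func attachmentSucceeded(v volumeStatus, migrationEngine engineStatus) bool {
	return v.State == "attached" && migrationEngine.Endpoint != ""
}

func main() {
	v := volumeStatus{State: "attached"}
	e := engineStatus{Endpoint: "/dev/longhorn/pvc-476a4873"} // illustrative endpoint
	fmt.Println(attachmentSucceeded(v, e)) // true, even though the migration may later be reverted
}
```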
Potential solutions I thought up:

- Don't add a replica to `engine.spec.replicaAddressMap` if the volume is migrating, somewhere around https://github.com/longhorn/longhorn-manager/blob/432b6d40f02a6fdffd8389df9d779016775f1d3a/controller/volume_controller.go#L1628-L1631. This disables rebuilding while a migration is ongoing (which could be detrimental for HA), but it should prevent the exact crash outlined in this issue.
- `engine.spec.replicaAddressMap` could still become out of sync between the old and migration engines, causing the crash.

@shuo-wu, much of the code I'm digging around in was refactored in https://github.com/longhorn/longhorn/issues/3206 to prevent previous issues we saw with migration. Do you like any of these ideas or have a better one off the top of your head?
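For reference, the "out of sync" concern boils down to a comparison like this hypothetical helper (not Longhorn code); any divergence between the two maps is currently grounds for reverting the migration:

```go
package main

import "fmt"

// replicaMapsMatch illustrates the consistency expectation discussed above:
// during a migration, the migration engine's spec.replicaAddressMap is
// expected to mirror the old engine's.
func replicaMapsMatch(oldMap, migrationMap map[string]string) bool {
	if len(oldMap) != len(migrationMap) {
		return false
	}
	for name, addr := range oldMap {
		if migrationMap[name] != addr {
			return false
		}
	}
	return true
}

func main() {
	oldEngine := map[string]string{"r-existing": "10.52.0.10:10000", "r-a1ae3471": "10.52.1.196:10075"}
	migrationEngine := map[string]string{"r-existing": "10.52.0.10:10000"}
	fmt.Println(replicaMapsMatch(oldEngine, migrationEngine)) // false -> migration would be reverted
}
```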
We may be able to clean up the logic so that we can avoid crashing the volume on the migration node (need to investigate this more). However, if we are never asked to detach from the old node, we will be stuck migrating forever. This will lead to future issues (e.g. a migration to a "third" node can never be triggered).