longhorn: [BUG] After crashed engine process, volume stuck in `Unknown` state
Describe the bug (🐛 if you encounter this issue)
Test case test_recurring_jobs_when_volume_detached_unexpectedly fails intermittently on master-head and v1.5.x-head, with a failure rate of approximately 2 out of 20 runs.
E AssertionError: value = healthy
E volume[key] = unknown
E volume = {
'accessMode': 'rwo',
'backendStoreDriver': 'v1',
'backingImage': '',
'backupCompressionMethod': 'lz4',
'backupStatus': [{
'backupURL': '',
'error': 'proxyServer=10.42.2.9:8501 destination=10.42.2.9:10010: failed to get backup-97f0ecc4095b460b backup status: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
'progress': 54,
'replica': 'tcp://10.42.1.10:10000',
'size': '',
'snapshot': 'recurrin-ff62e23a-fa5d-4c60-8214-72b854e38f02',
'state': 'Error'
}, {
'backupURL': '',
'error': 'error getting snapshot \'recurrin-cf217437-01d7-4f1b-acc9-1fd9d513632c\', engine \'longhorn-testvol-0gbgrf-e-b242988d\': proxyServer=10.42.1.10:8501 destination=10.42.2.9:10010: failed to list snapshots: rpc error: code = Unknown desc = failed to list replicas for volume 10.42.2.9:10010: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 10.42.2.9:10010: connect: connection refused"',
'progress': 0,
'replica': '',
'size': '',
'snapshot': '',
'state': 'Error'
}],
'cloneStatus': {
'snapshot': '',
'sourceVolume': '',
'state': ''
},
'conditions': {
'restore': {
'lastProbeTime': '',
'lastTransitionTime': '2023-09-12T06:30:13Z',
'message': '',
'reason': '',
'status': 'False'
},
'scheduled': {
'lastProbeTime': '',
'lastTransitionTime': '2023-09-12T06:30:13Z',
'message': '',
'reason': '',
'status': 'True'
},
'toomanysnapshots': {
'lastProbeTime': '',
'lastTransitionTime': '2023-09-12T06:30:13Z',
'message': '',
'reason': '',
'status': 'False'
}
},
'controllers': [{
'actualSize': '1150484480',
'address': '',
'currentImage': '',
'endpoint': '',
'engineImage': 'longhornio/longhorn-engine:master-head',
'hostId': '',
'instanceManagerName': '',
'isExpanding': False,
'lastExpansionError': '',
'lastExpansionFailedAt': '',
'lastRestoredBackup': '',
'name': 'longhorn-testvol-0gbgrf-e-b242988d',
'requestedBackupRestore': '',
'running': False,
'size': '2147483648',
'unmapMarkSnapChainRemovedEnabled': False
}],
'created': '2023-09-12 06:30:12 +0000 UTC',
'currentImage': 'longhornio/longhorn-engine:master-head',
'dataLocality': 'disabled',
'dataSource': '',
'disableFrontend': False,
'diskSelector': [],
'encrypted': False,
'engineImage': 'longhornio/longhorn-engine:master-head',
'fromBackup': '',
'frontend': 'blockdev',
'kubernetesStatus': {
'lastPVCRefAt': '',
'lastPodRefAt': '2023-09-12T06:30:54Z',
'namespace': 'default',
'pvName': 'longhorn-testvol-0gbgrf-pv',
'pvStatus': 'Bound',
'pvcName': 'longhorn-testvol-0gbgrf-pvc',
'workloadsStatus': [{
'podName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88-xmxzx',
'podStatus': 'Succeeded',
'workloadName': 'longhorn-testvol-0gbgrf-dep-56bf7bf88',
'workloadType': 'ReplicaSet'
}]
},
'lastAttachedBy': '',
'lastBackup': '',
'lastBackupAt': '',
'migratable': False,
'name': 'longhorn-testvol-0gbgrf',
'nodeSelector': [],
'numberOfReplicas': 3,
'offlineReplicaRebuilding': 'disabled',
'offlineReplicaRebuildingRequired': False,
'purgeStatus': None,
'ready': True,
'rebuildStatus': [],
'recurringJobSelector': None,
'replicaAutoBalance': 'ignored',
'replicaDiskSoftAntiAffinity': 'ignored',
'replicaSoftAntiAffinity': 'ignored',
'replicaZoneSoftAntiAffinity': 'ignored',
'replicas': [{
'address': '',
'backendStoreDriver': 'v1',
'currentImage': '',
'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-0ff7d9b8',
'diskID': '2a1fc81e-3ab8-463f-a818-b008f7e858a3',
'diskPath': '/var/lib/longhorn/',
'engineImage': 'longhornio/longhorn-engine:master-head',
'failedAt': '',
'hostId': 'ip-10-0-2-49',
'instanceManagerName': '',
'mode': '',
'name': 'longhorn-testvol-0gbgrf-r-0600c183',
'running': False
}, {
'address': '',
'backendStoreDriver': 'v1',
'currentImage': '',
'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-5b039c13',
'diskID': '51cbadf6-8b10-438a-b1ce-8ae00126e6ce',
'diskPath': '/var/lib/longhorn/',
'engineImage': 'longhornio/longhorn-engine:master-head',
'failedAt': '',
'hostId': 'ip-10-0-2-97',
'instanceManagerName': '',
'mode': '',
'name': 'longhorn-testvol-0gbgrf-r-319ab5fc',
'running': False
}, {
'address': '',
'backendStoreDriver': 'v1',
'currentImage': '',
'dataPath': '/var/lib/longhorn/replicas/longhorn-testvol-0gbgrf-c166e1b5',
'diskID': '4d810ed2-9514-4b04-9559-0b2f6d830310',
'diskPath': '/var/lib/longhorn/',
'engineImage': 'longhornio/longhorn-engine:master-head',
'failedAt': '',
'hostId': 'ip-10-0-2-213',
'instanceManagerName': '',
'mode': '',
'name': 'longhorn-testvol-0gbgrf-r-895589ea',
'running': False
}],
'restoreInitiated': False,
'restoreRequired': False,
'restoreStatus': [],
'restoreVolumeRecurringJob': 'ignored',
'revisionCounterDisabled': False,
'robustness': 'unknown',
'shareEndpoint': '',
'shareState': '',
'size': '2147483648',
'snapshotDataIntegrity': 'ignored',
'staleReplicaTimeout': 0,
'standby': False,
'state': 'detached',
'unmapMarkSnapChainRemoved': 'ignored',
'volumeAttachment': {
'attachments': {},
'volume': 'longhorn-testvol-0gbgrf'
}
}
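For context, the assertion above comes from the test suite's polling pattern: the test waits for the volume to report `healthy` and asserts on the last observed value once the retry budget is exhausted. A minimal sketch of that pattern follows; the helper name, retry constants, and fake client are illustrative, not the exact longhorn-tests code:

```python
import time

RETRY_COUNTS = 150    # illustrative retry budget; the real suite defines its own
RETRY_INTERVAL = 1    # seconds between polls

def wait_for_volume_status(client, name, key, value):
    """Poll the volume until volume[key] == value, asserting on the last result."""
    volume = None
    for _ in range(RETRY_COUNTS):
        volume = client.by_id_volume(name)
        if volume[key] == value:
            break
        time.sleep(RETRY_INTERVAL)
    # When the engine process crashed and the volume is stuck with
    # robustness `unknown`, this assertion fails as in the report above.
    assert volume[key] == value, \
        "value = {}\nvolume[{}] = {}".format(value, key, volume[key])
    return volume
```

In the failing runs, `robustness` never leaves `unknown`, so the final assertion fires and dumps the whole volume object shown above.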
To Reproduce
Execute test case test_recurring_jobs_when_volume_detached_unexpectedly.
Expected behavior
We should have consistent test results on all distros.
Support bundle for troubleshooting
longhorn-tests-regression-4900-bundle (1).zip
Environment
- Longhorn version: master-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.1+k3s1
  - Number of management node in the cluster: 1
  - Number of worker node in the cluster: 3
- Node config
  - OS type and version: SLES 15-sp5
  - Kernel version:
  - CPU per node: 4 cores
  - Memory per node: 16GB
  - Disk type (e.g. SSD/NVMe/HDD): SSD
  - Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster: 1
- Impacted Longhorn resources:
  - Volume names: longhorn-testvol-0gbgrf
Additional context
N/A
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Reactions: 2
- Comments: 18 (11 by maintainers)
Hi @nitendra-suse @innobead I triggered the test case 60 times, and all tests passed successfully. I believe that @ChanYiLin has already resolved the issue.
You can review the test results at the following links: https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5013/testReport/tests/test_recurring_job/ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5014/testReport/tests/test_recurring_job/ https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5015/testReport/tests/test_recurring_job/
@nitendra-suse Let's move back to the implement pipeline.
@roger-ryao as you are the original reporter, please check the verification result by @nitendra-suse .
Here is the log after the fix. You can see that the volume needed some time to update its state to detaching; before that, the attachment ticket of the backup controller was still satisfied (true). After the state update, the attachment controller would then set satisfied to false.
I believe so. A user can only take a backup of an attached volume from the UI. In this case, since the volume is already attached, it would not be detached by the snapshot-controller and volume-attachment-controller after the snapshot is created. So there is no detaching and attaching between taking the snapshot and taking the backup.
This issue happens because the volume is being detached by the previous controller and attached by the next controller. There is a chance that the attachment ticket of the next controller gets satisfied immediately because the volume is not fully detached (its state is not updated yet). Thus the following code logic gets misled.
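The timing window described above can be modeled in a few lines. This is an illustrative Python model, not Longhorn's actual Go controller code: the function names and the `pending_detach` flag are hypothetical. The point is that a check which only looks at the (possibly stale) volume state satisfies the next ticket too early, while a check that also refuses to proceed while a detach is still in flight closes the window:

```python
def ticket_satisfied_naive(volume, want_attached):
    # Misled check: trusts the reported state, which may be stale
    # because the previous controller's detach has not updated it yet.
    return want_attached and volume["state"] == "attached"

def ticket_satisfied_guarded(volume, want_attached):
    # Guarded check: a detach still in flight means the volume is not
    # fully detached yet, so the next ticket must not be satisfied
    # until a fresh attach completes.
    if volume["pending_detach"] or volume["state"] != "attached":
        return False
    return want_attached

# Stale snapshot of the race: detach requested, status not yet updated.
stale = {"state": "attached", "pending_detach": True}
```

With `stale`, the naive check returns True (the ticket is satisfied mid-detach, reproducing the bug), while the guarded check returns False and waits for the detach to finish.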
The flow of the test is: the engine process crashes, the replicas enter the faulted state and will come back in a while (the volume would detach).

The root cause of the issue can be found from my test with Debug messages. Here are the logs.

The race condition flow: the controller sets v.Status.State = longhorn.VolumeStateDetaching and sets e.Spec.DesiredState = stopped here. The teardown order is Replica -> Engine -> Volume here. The Engine spec and status show e.Spec.DesiredState = stopped while e.Status.CurrentState = running in the checkMonitor() check.

I think there might be more race condition cases for engine/volume/volumeattachment status like this one. We might need to review and see if we can improve from the volumeattachment side, so a ticket won't be satisfied just because the volume is not fully detached from the previous ticket? cc @PhanLe1010 @derekbit @innobead @roger-ryao @longhorn/dev-control-plane

Let me run again on my system. Thanks.
Hi @innobead, volume attachment was only introduced in v1.5.0; we don't have to backport to v1.4.x.