longhorn: [BUG] Volume could not be remounted after engine process killed

Describe the bug

In test case test_volume_reattach_after_engine_sigkill:

"""
[HA] Test if the volume can be reattached after using SIGKILL
to crash the engine process

1. Create StorageClass and StatefulSet.
2. Write random data to the pod and get the md5sum.
3. Crash the engine process by SIGKILL in the engine manager.
4. Wait for volume to `faulted`, then `healthy`.
5. Wait for K8s to terminate the pod and statefulset to bring pod to
    `Pending`, then `Running`.
6. Check volume path exist in the pod.
7. Check md5sum of the data in the pod.
8. Check new data written to the volume is successful.
"""

After the engine process is killed in step 3 and the volume recovers to healthy in step 4, the StatefulSet pod cannot be recreated to remount the volume:

https://ci.longhorn.io/job/public/job/v1.6.x/job/v1.6.x-longhorn-upgrade-tests-sles-arm64/8/testReport/junit/tests/test_ha/test_volume_reattach_after_engine_sigkill/
https://ci.longhorn.io/job/public/job/master/job/sles/job/arm64/job/longhorn-2-stage-upgrade-tests-sles-arm64/421/testReport/junit/tests/test_ha/test_volume_reattach_after_engine_sigkill/

client = <longhorn.Client object at 0xffff7d497af0>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0xffff7d6c35e0>
storage_class = {'allowVolumeExpansion': True, 'apiVersion': 'storage.k8s.io/v1', 'kind': 'StorageClass', 'metadata': {'name': 'longhorn-test'}, ...}
sts_name = 'longhorn-teststs-hhecdb'
statefulset = {'apiVersion': 'apps/v1', 'kind': 'StatefulSet', 'metadata': {'name': 'longhorn-teststs-hhecdb', 'namespace': 'default...usybox:1.34.0', 'imagePullPolicy': 'IfNotPresent', 'name': 'sleep', ...}], 'terminationGracePeriodSeconds': 10}}, ...}}

    def test_volume_reattach_after_engine_sigkill(client, core_api, storage_class, sts_name, statefulset):  # NOQA
        """
        [HA] Test if the volume can be reattached after using SIGKILL
        to crash the engine process
    
        1. Create StorageClass and StatefulSet.
        2. Write random data to the pod and get the md5sum.
        3. Crash the engine process by SIGKILL in the engine manager.
        4. Wait for volume to `faulted`, then `healthy`.
        5. Wait for K8s to terminate the pod and statefulset to bring pod to
           `Pending`, then `Running`.
        6. Check volume path exist in the pod.
        7. Check md5sum of the data in the pod.
        8. Check new data written to the volume is successful.
        """
        vol_name, pod_name, md5sum = \
            common.prepare_statefulset_with_data_in_mb(
                client, core_api, statefulset, sts_name, storage_class)
    
        crash_engine_process_with_sigkill(client, core_api, vol_name)
    
>       wait_pod_for_remount_request(client, core_api, vol_name, pod_name, md5sum)

test_ha.py:1863: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_ha.py:625: in wait_pod_for_remount_request
    wait_for_pod_remount(core_api, pod_name)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0xffff7d6c35e0>
pod_name = 'longhorn-teststs-hhecdb-0', chk_path = '/data/lost+found'

    def wait_for_pod_remount(core_api, pod_name, chk_path="/data/lost+found"):
        check_command = [
            '/bin/sh',
            '-c',
            'ls ' + chk_path
        ]
    
        ready = False
        for i in range(RETRY_EXEC_COUNTS):
            try:
                output = stream(core_api.connect_get_namespaced_pod_exec,
                                pod_name,
                                'default',
                                command=check_command,
                                stderr=True, stdin=False,
                                stdout=True, tty=False)
                if "Input/output error" not in output:
                    ready = True
                    break
            except Exception:
                pass
            if ready:
                break
            time.sleep(RETRY_EXEC_INTERVAL)
>       assert ready
E       AssertionError

common.py:4932: AssertionError

It seems longhorn-manager is flooded with volume remount request messages:


2024-01-23T00:54:07.193269120Z time="2024-01-23T00:54:07Z" level=info msg="Volume request remount at 2024-01-23T00:54:07Z due to engine detected read-only filesystem" func="controller.(*VolumeController).requestRemountIfFileSystemReadOnly" file="volume_controller.go:1482" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=ip-10-0-2-138 owner=ip-10-0-2-138 state=attached volume=pvc-c0387c67-ce3e-4ce0-a63b-465eeda79781
2024-01-23T00:54:07.193385769Z time="2024-01-23T00:54:07Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-c0387c67-ce3e-4ce0-a63b-465eeda79781\", UID:\"6c272859-9b99-4a8c-b823-532228049b5d\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"153277\", FieldPath:\"\"}): type: 'Normal' reason: 'Remount' Volume pvc-c0387c67-ce3e-4ce0-a63b-465eeda79781 requested remount at 2024-01-23T00:54:07Z due to engine detected read-only filesystem" func="record.(*eventBroadcasterImpl).StartLogging.func1" file="event.go:298"

But the StatefulSet pod is never recreated and stays stuck with Input/output errors:

# kubectl get pods 
NAME                        READY   STATUS    RESTARTS      AGE
longhorn-test-minio         1/1     Running   0             18h
longhorn-test-nfs           1/1     Running   1 (18h ago)   18h
longhorn-test               2/2     Running   0             18h
longhorn-teststs-yoij4o-0   1/1     Running   0             13h
# kubectl exec -it longhorn-teststs-yoij4o-0 -- /bin/sh
/data # echo ok > ok
/bin/sh: can't create ok: Input/output error

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6041/console

To Reproduce

Run test case test_volume_reattach_after_engine_sigkill.

The reproduction rate is roughly 1 in 300 runs.

Expected behavior

Support bundle for troubleshooting

supportbundle_846e2484-667c-4742-b6ce-dbf5fd40a0df_2024-01-23T00-54-01Z.zip

Environment

  • Longhorn version: master-head, v1.6.x-head
  • Impacted volume (PV):
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: v1.28.4+k3s1
    • Number of control plane nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes (Gbps):
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

About this issue

  • Original URL
  • State: closed
  • Created 5 months ago
  • Comments: 19 (18 by maintainers)

Most upvoted comments

I think I confirmed the issue. It is like what @shuo-wu said here.

Take this case for example:

test_ha.py::test_volume_reattach_after_engine_sigkill[41-100] 
Exception waiting for pod pending,could have missed it
wait for pod longhorn-teststs-ca4bzq-0 remount ... (0)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (1)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (2)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (3)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (4)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (5)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (6)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (7)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (8)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (9)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (10)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (11)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (12)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (13)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (14)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (15)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (16)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (17)
wait for pod longhorn-teststs-ca4bzq-0 remount ... (18)
...
  1. We crash the engine and then wait for the volume to become faulted first and then healthy
  2. We then wait for the pod to become Pending first and then Running (restarting)
  3. And then we check whether the volume is remounted by performing I/O
def wait_pod_for_remount_request(...):
    try:
        # this line may fail if the recovery is too quick
        wait_for_volume_faulted(client, volume_name)
    except AssertionError:
        print("\nException waiting for volume faulted,"
              "could have missed it")

    wait_for_volume_healthy(client, volume_name)

    try:
        common.wait_for_pod_phase(core_api, pod_name, pod_phase="Pending")
    except AssertionError:
        print("\nException waiting for pod pending,"
              "could have missed it")
    common.wait_for_pod_phase(core_api, pod_name, pod_phase="Running")

    wait_for_pod_remount(core_api, pod_name)
    ...
def wait_for_pod_remount(core_api, pod_name, chk_path="/data/lost+found"):
    check_command = [
        '/bin/sh',
        '-c',
        'ls ' + chk_path
    ]

    ready = False
    for i in range(RETRY_EXEC_COUNTS):
        try:
            output = stream(core_api.connect_get_namespaced_pod_exec,
                            pod_name,
                            'default',
                            command=check_command,
                            stderr=True, stdin=False,
                            stdout=True, tty=False)
            if "Input/output error" not in output:
                ready = True
                break
        except Exception:
            pass
        if ready:
            break
        time.sleep(RETRY_EXEC_INTERVAL)
    assert ready

The problem here is that

  1. The engine indeed restarted and the volume became healthy
  2. The volume-controller requested a remount because the volume accidentally detached here
  3. But the pod had not restarted yet, and our test assumed it had merely missed the restart because the pod was already Running (see the exception we catch in the test)
  4. Our test then performed some I/O, which made the filesystem read-only
  5. The engine detected the read-only filesystem and then requested a remount every 30 seconds (the CRD reconcile interval)

Our pod deletion is executed in kubernetes_pod_controller.go, which reconciles pods every 30 seconds. If the remount request is also refreshed every 30 seconds and keeps overlapping with the pod reconciliation, the controller never deletes the pod, because we check podStartTime.Before(remountRequestedAt) && timeNow.After(remountRequestedAt.Add(delayDuration)) (see the sketch below).
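
Here is a minimal, self-contained sketch of that timing check (simplified from the condition quoted above; the helper name, concrete timestamps, and the 30-second delay value are illustrative assumptions, not the actual longhorn-manager code):

package main

import (
	"fmt"
	"time"
)

// shouldDeletePod mirrors the quoted condition: the pod is only deleted when
// it started before the remount was requested AND the delay after the remount
// request has already elapsed.
func shouldDeletePod(podStartTime, remountRequestedAt, now time.Time, delay time.Duration) bool {
	return podStartTime.Before(remountRequestedAt) &&
		now.After(remountRequestedAt.Add(delay))
}

func main() {
	delay := 30 * time.Second // assumed delayDuration
	podStart := time.Now().Add(-10 * time.Minute)

	// While the engine keeps reporting the read-only filesystem, the volume
	// controller refreshes RemountRequestedAt roughly every 30 seconds, so at
	// each pod reconciliation the timestamp is too fresh and the check fails.
	justRefreshed := time.Now().Add(-5 * time.Second)
	fmt.Println(shouldDeletePod(podStart, justRefreshed, time.Now(), delay)) // false

	// Only when the timestamp stops being refreshed does the condition become
	// true and the pod finally gets deleted for the remount.
	stale := time.Now().Add(-2 * time.Minute)
	fmt.Println(shouldDeletePod(podStart, stale, time.Now(), delay)) // true
}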

cc @shuo-wu @innobead @yangchiu

I am fine with either. BTW, requeuing an object will add a backoff. There won’t be too much overhead.
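
For reference, a rough sketch of that requeue backoff using the client-go workqueue rate limiter (the object key is a made-up example; this is not longhorn-manager code):

package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The default controller rate limiter combines a per-item exponential
	// backoff (5ms base, capped at 1000s) with an overall token bucket.
	limiter := workqueue.DefaultControllerRateLimiter()

	key := "longhorn-system/pvc-example" // hypothetical object key

	// Each requeue of the same key waits longer before the key is handled
	// again, so repeated requeues do not turn into a tight reconcile loop.
	for i := 0; i < 6; i++ {
		fmt.Printf("requeue #%d -> backoff %v\n", i+1, limiter.When(key))
	}

	// Forget resets the backoff once the object reconciles successfully.
	limiter.Forget(key)
}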

Verified passed on v1.6.x-head (longhorn-manager 706870) by running test case test_volume_reattach_after_engine_sigkill 300 times.

They all passed:
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6102/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6103/
https://ci.longhorn.io/job/private/job/longhorn-tests-regression/6105/

I think what we can do here is to not update the remount request every time we detect the read-only condition. Maybe we update it only every 1 minute, so the pod reconciliation still has a chance to delete the pod (sketched below).
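
A rough sketch of that idea (the helper name and the 1-minute threshold are assumptions for illustration, not the actual longhorn-manager change):

package main

import (
	"fmt"
	"time"
)

// minRemountRefreshInterval is the hypothetical minimum gap between two
// updates of RemountRequestedAt, leaving a window in which the timestamp is
// old enough for the pod controller's deletion condition to hold.
const minRemountRefreshInterval = time.Minute

// maybeRequestRemount only bumps the remount timestamp when the previous
// request is older than the refresh interval; otherwise it keeps it as-is.
func maybeRequestRemount(remountRequestedAt, now time.Time) time.Time {
	if now.Sub(remountRequestedAt) < minRemountRefreshInterval {
		return remountRequestedAt // too recent, do not refresh
	}
	return now
}

func main() {
	recent := time.Now().Add(-30 * time.Second)
	fmt.Println(maybeRequestRemount(recent, time.Now()).Equal(recent)) // true: kept

	old := time.Now().Add(-90 * time.Second)
	fmt.Println(maybeRequestRemount(old, time.Now()).Equal(old)) // false: refreshed
}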

Cleaning up the filesystem read-only condition after setting the remount request won't work, because we periodically sync the condition from the instance-manager every 30 seconds.

I think it is possible that v.Status.RemountRequestedAt keeps updating on every reconcile. I will do a quick test.