longhorn: [TEST] test_csi_umount_when_longhorn_block_device_is_disconnected_unexpectedly failed on v1.4.x pipeline

What’s the test to develop? Please describe

The test case test_csi_umount_when_longhorn_block_device_is_disconnected_unexpectedly works well on the master and v1.5.x pipelines but keeps failing on the v1.4.x pipeline:

def finalizer():
        api = get_core_api_client()
        client = get_longhorn_api_client()
>       delete_and_wait_statefulset(api, client, statefulset_manifest)

common.py:1390: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:632: in delete_and_wait_statefulset
    wait_delete_pod(api, pod['pod_uid'])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0xffff6b6bc460>
pod_uid = 'a682da42-1a1d-451f-b72a-15c50124b56a', namespace = 'default'

    def wait_delete_pod(api, pod_uid, namespace='default'):
        for i in range(DEFAULT_POD_TIMEOUT):
            ret = api.list_namespaced_pod(namespace=namespace)
            found = False
            for item in ret.items:
                if item.metadata.uid == pod_uid:
                    found = True
                    break
            if not found:
                break
            time.sleep(DEFAULT_POD_INTERVAL)
>       assert not found
E       AssertionError

common.py:915: AssertionError
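
In this situation the leftover pod is stuck in Terminating: its deletionTimestamp is set but the object never disappears, so wait_delete_pod keeps finding it until the timeout expires. A minimal diagnostic sketch to confirm that, using the standard kubernetes Python client (it assumes a reachable kubeconfig and is not part of common.py):

    # Diagnostic sketch only (not part of common.py): list pods in the test
    # namespace that have a deletionTimestamp set but still exist, i.e. pods
    # stuck in Terminating.
    from kubernetes import client, config

    config.load_kube_config()
    core_api = client.CoreV1Api()

    for pod in core_api.list_namespaced_pod(namespace='default').items:
        if pod.metadata.deletion_timestamp is not None:
            print(f"{pod.metadata.name} (uid={pod.metadata.uid}) has been "
                  f"terminating since {pod.metadata.deletion_timestamp}, "
                  f"phase={pod.status.phase}")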

Describe the tasks for the test

  • item1

Additional context

N/A

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 17 (16 by maintainers)

Most upvoted comments

This test case with v1.4.x images fails when the Kubernetes version is v1.25.3+k3s1 (which the v1.4.x pipeline currently uses) and passes when the Kubernetes version is v1.27.1+k3s1 (which the v1.5.x and master pipelines currently use).
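
For reference, the Kubernetes version a pipeline cluster is running can be confirmed with the standard kubernetes Python client; this is only a convenience sketch, not part of the test code:

    # Convenience sketch: print the server version of the cluster the current
    # kubeconfig points at, e.g. "v1.25.3+k3s1" or "v1.27.1+k3s1".
    from kubernetes import client, config

    config.load_kube_config()
    print(client.VersionApi().get_code().git_version)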

I think there is an issue here.

According to step 3 and step 4 of this test case, it should wait for the pod to actually terminate before going to the next step:

    3. Delete the workload pod
    4. Verify that the pod is able to terminate and a new pod is able
        to start

But the current implementation doesn't wait for the pod to terminate:

    crash_engine_process_with_sigkill(client, core_api,
                                      sspod_info['pv_name'])
    delete_and_wait_pod(core_api,
                        pod_name=sspod_info['pod_name'],
                        wait=False)

If we remove the wait=False option from the delete_and_wait_pod call so that it waits for the pod termination, the test case fails at this step instead of in the teardown function:

=================================== FAILURES ===================================
___ test_csi_umount_when_longhorn_block_device_is_disconnected_unexpectedly ____

client = <longhorn.Client object at 0x7f4f88be3640>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7f4f88a03a30>
statefulset = {'apiVersion': 'apps/v1', 'kind': 'StatefulSet', 'metadata': {'name': 'block-device-disconnect-unexpectedly-test', 'na...usybox:1.34.0', 'imagePullPolicy': 'IfNotPresent', 'name': 'sleep', ...}], 'terminationGracePeriodSeconds': 10}}, ...}}
storage_class = {'allowVolumeExpansion': True, 'apiVersion': 'storage.k8s.io/v1', 'kind': 'StorageClass', 'metadata': {'name': 'longhorn-test'}, ...}

    @pytest.mark.csi  # NOQA
    def test_csi_umount_when_longhorn_block_device_is_disconnected_unexpectedly(client, core_api, statefulset, storage_class):  # NOQA
        """
        Test CSI umount when Longhorn block device is disconnected unexpectedly
    
        GitHub ticket: https://github.com/longhorn/longhorn/issues/3778
    
        1. Deploy a statefulset that has volumeClaimTemplates with
            volumeMode: Block
        2. Crash the engine process of the volume to simulate Longhorn block
            device is disconnected unexpectedly
        3. Delete the workload pod
        4. Verify that the pod is able to terminate and a new pod is able
            to start
        """
        device_path = "/dev/longhorn/longhorn-test-blk"
        statefulset['spec']['template']['spec']['containers'][0]['volumeMounts'] = [] # NOQA
        statefulset['spec']['template']['spec']['containers'][0]['volumeDevices'] = [ # NOQA
            {'name': 'pod-data', 'devicePath': device_path}
        ]
        statefulset['spec']['volumeClaimTemplates'][0]['spec']['volumeMode'] = 'Block' # NOQA
        statefulset['spec']['replicas'] = 1
        statefulset_name = 'block-device-disconnect-unexpectedly-test'
        update_statefulset_manifests(statefulset,
                                     storage_class,
                                     statefulset_name)
    
        create_storage_class(storage_class)
        create_and_wait_statefulset(statefulset)
        sspod_info = get_statefulset_pod_info(core_api, statefulset)[0]
    
        crash_engine_process_with_sigkill(client, core_api,
                                          sspod_info['pv_name'])
>       delete_and_wait_pod(core_api,
                            pod_name=sspod_info['pod_name'])

test_kubernetes.py:820: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common.py:602: in delete_and_wait_pod
    wait_delete_pod(api, target_pod.metadata.uid, namespace=namespace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7f4f88a03a30>
pod_uid = '86da79d4-2b50-4ce5-a4eb-4fc677b77708', namespace = 'default'

    def wait_delete_pod(api, pod_uid, namespace='default'):
        for i in range(DEFAULT_POD_TIMEOUT):
            ret = api.list_namespaced_pod(namespace=namespace)
            found = False
            for item in ret.items:
                if item.metadata.uid == pod_uid:
                    found = True
                    break
            if not found:
                break
            time.sleep(DEFAULT_POD_INTERVAL)
>       assert not found
E       AssertionError

common.py:940: AssertionError

This means the pod cannot be terminated, which is definitely not the behavior the test case expects.

Unless this test case is simply not applicable to Kubernetes versions <= 1.25, there is an issue here.
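
If we do decide the case only applies to newer clusters, one option is to gate it on the server version. A hypothetical sketch, assuming we skip on anything older than v1.27 (the helper and marker names below are illustrative and do not exist in common.py):

    # Hypothetical sketch: skip the test on clusters older than v1.27, where
    # the upstream fix for unmounting a disconnected block device is missing.
    # `k8s_server_minor` and `requires_k8s_1_27` are illustrative names only.
    import re

    import pytest
    from kubernetes import client as k8s_client, config as k8s_config


    def k8s_server_minor():
        k8s_config.load_kube_config()
        # `minor` can be reported as e.g. "25+", so strip non-digits.
        return int(re.sub(r"\D", "", k8s_client.VersionApi().get_code().minor))


    requires_k8s_1_27 = pytest.mark.skipif(
        k8s_server_minor() < 27,
        reason="CSI umount of a disconnected block device needs the K8s 1.27 fix")

The existing test function would then just carry the extra @requires_k8s_1_27 marker next to @pytest.mark.csi.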

cc @longhorn/qa @innobead

@yangchiu perhaps we can also update the K8s version for the 1.4 pipeline?

This is an upstream issue, https://github.com/longhorn/longhorn/issues/3778#issuecomment-1572999855.

@chriscchien you verified it before 😃. The fix only landed in K8s 1.27, so the failure here is expected.

ref: https://github.com/longhorn/longhorn-tests/pull/1475

This can be reproduced locally with 1.4.4-rc1, according to @nitendra-suse. cc: @innobead @longhorn/qa