longhorn: [BUG] Failed to `check_volume_data` after volume engine upgrade/migration

Describe the bug (🐛 if you encounter this issue)

The following test cases failed on master-head:

  • test_engine_live_upgrade_with_backing_image
  • test_engine_live_upgrade
  • test_auto_upgrade_engine_to_default_version
  • test_migration_confirm
  • test_migration_with_unscheduled_replica
  • test_migration_with_failed_replica
  • test_migration_with_restore_volume

@derekbit and I tested a private longhorn-manager image and found that it seems to be a side effect of longhorn/longhorn-manager#2363.

>       migration_confirm_test(clients, volume_name)

test_migration.py:40: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_migration.py:65: in migration_confirm_test
    check_detached_volume_data(client, volume_name, data)
test_migration.py:496: in check_detached_volume_data
    common.check_volume_data(volume, data)
common.py:2527: in check_volume_data
    check_device_data(dev, data, check_checksum)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dev = '/dev/longhorn/longhorn-testvol-2hmj6g'
data = {'checksum': '0048e07f0e9a2f23ba38c172ca91441c1223f98a39bb992c9434b40044c631c929af781623c5d72787d01052ac1a57a6f5346f02...rnk28nzi219z267btblxs63je7x4aofwa5jy45xbu50jsm6h2ch9mipi3fz1pg8q1omhjt9rw6dwc0iro06us93ms', 'len': 512, 'pos': 5199719}
check_checksum = True

    def check_device_data(dev, data, check_checksum=True):
        r_data = dev_read(dev, data['pos'], data['len'])
        print(f"Device: {dev}, Position: {data['pos']}, Length: {data['len']}")
        print(f"Actual data type: {type(r_data)}, Expected data type: {type(bytes(data['content'], encoding='utf8'))}")
        print(f"Actual data content: {r_data}")
        print(f"Expected data content: {bytes(data['content'], encoding='utf8')}")
>       assert r_data == bytes(data['content'], encoding='utf8')
E       AssertionError

(Screenshot attached: Screenshot_20231220_170641)

To Reproduce

Rerun the job with the private image.

Expected behavior

Test results should be consistent across all branches.

Support bundle for troubleshooting

longhorn-tests-sles-amd64-759-bundle.zip

Environment

  • Longhorn version: master-head
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
    • Number of management node in the cluster: 1
    • Number of worker node in the cluster: 3
  • Node config
    • OS type and version: SLES 15-sp5
    • Kernel version:
    • CPU per node:
    • Memory per node:
    • Disk type(e.g. SSD/NVMe/HDD):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
  • Number of Longhorn volumes in the cluster:
  • Impacted Longhorn resources:
    • Volume names:

Additional context

https://github.com/longhorn/longhorn/issues/7012

About this issue

  • State: closed
  • Created 6 months ago
  • Reactions: 1
  • Comments: 17 (11 by maintainers)


Most upvoted comments

It looks like the instance-manager pod crashed right after the engine upgrade finished: ProcessReplace() does not initialize the Conditions map, which causes the panic "assignment to entry in nil map" when getProcessToUpdateConditions() attempts to add a key to it.

[longhorn-instance-manager] time="2024-01-04T06:35:28Z" level=info msg="Process Manager: successfully replaced process longhorn-testvol-jogcok-e-0" func="process.(*Manager).ProcessReplace" file="process_manager.go:583"
[longhorn-instance-manager] time="2024-01-04T06:35:28Z" level=info msg="Process Manager: successfully unregistered process longhorn-testvol-jogcok-r-9b014111" func="process.(*Manager).unregisterProcess.func1" file="process_manager.go:343"
panic: assignment to entry in nil map

goroutine 14 [running]:
github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).getProcessToUpdateConditions(0xc00051c008?, 0xc000501cc0?)
	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:163 +0x4a5
github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).checkMountPointStatusForEngine(0xc000322800)
	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:147 +0x14a
github.com/longhorn/longhorn-instance-manager/pkg/process.(*Manager).startInstanceConditionCheck(0xc000322800)
	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:130 +0x125
created by github.com/longhorn/longhorn-instance-manager/pkg/process.NewManager in goroutine 1
	/go/src/github.com/longhorn/longhorn-instance-manager/pkg/process/process_manager.go:90 +0x3a5
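For illustration, a minimal Go sketch of that nil-map failure mode and the kind of initialization that avoids it; the struct and function names are stand-ins, not the actual longhorn-instance-manager code:

package main

import "fmt"

// Illustrative stand-in for the per-process state kept by the process
// manager; the real longhorn-instance-manager structs differ.
type processInfo struct {
	Name       string
	Conditions map[string]bool
}

// Mimics the replace path registering the new process. Without the map
// allocation below, Conditions stays nil and the next write to it
// panics with "assignment to entry in nil map".
func replaceProcess(name string) *processInfo {
	return &processInfo{
		Name:       name,
		Conditions: map[string]bool{}, // the missing initialization
	}
}

// Mimics the condition-update path adding a key for the mount-point
// check; this is the kind of write that panicked while Conditions was nil.
func updateConditions(p *processInfo) {
	p.Conditions["MountPointHealthy"] = true
}

func main() {
	p := replaceProcess("longhorn-testvol-e-0")
	updateConditions(p)
	fmt.Println(p.Conditions)
}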

Proposing PR https://github.com/longhorn/longhorn-instance-manager/pull/353, but it still needs some time to test.

So the flow is: engine upgrade init -> engine upgrade finish -> checkMountPointStatusForEngine begins -> panic -> instance-manager pod crashes -> liveness probe fails -> data mount point is broken -> the test cannot read the data or reads wrong data -> test fails.

This also fits the finding of @james-munson described later in this thread.

Thanks @roger-ryao and @james-munson

Analysis:

The PR https://github.com/longhorn/longhorn-manager/pull/2363 is indeed the root cause. We accidentally clean up the replica folder during the engine upgrade/live migration process:

  1. Track the replica testvol-r-8571f4c4 through the engine upgrade/migration process.
  2. A new inactive replica, testvol-r-31debf74, is created to clone the replica testvol-r-8571f4c4:
    2023-12-20T21:57:36.524937622Z time="2023-12-20T21:57:36Z" level=info msg="Cloned a new matching replica testvol-r-31debf74 from testvol-r-8571f4c4" func="controller.(*VolumeController).createAndStartMatchingReplicas" file="volume_controller.go:3739" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=phan-v607-upgrade-fail-pool2-3be6d107-gjwkw owner=phan-v607-upgrade-fail-pool2-3be6d107-gjwkw state=attached volume=testvol
    
  3. Next, we switch the active field of the replicas as a step near the end of the upgrade/migration:
    2023-12-20T21:57:40.716530264Z time="2023-12-20T21:57:40Z" level=info msg="Switching replica testvol-r-8571f4c4 active state from true to false due to longhornio/longhorn-engine:master-head" func="controller.(*VolumeController).switchActiveReplicas" file="volume_controller.go:3770"
    2023-12-20T21:57:40.716859020Z time="2023-12-20T21:57:40Z" level=info msg="Switching replica testvol-r-31debf74 active state from false to true due to longhornio/longhorn-engine:master-head" func="controller.(*VolumeController).switchActiveReplicas" file="volume_controller.go:3770"
    
    So testvol-r-8571f4c4 is switching from true to false and testvol-r-31debf74 is switching from false to true
  4. Note that the replica switch happens only in the local map; it has not been persisted to etcd yet: link
  5. Then, we remove the inactive replica testvol-r-8571f4c4 after the upgrade/migration:
    2023-12-20T21:57:40.887033403Z time="2023-12-20T21:57:40Z" level=info msg="Removing inactive replica testvol-r-8571f4c4" func="controller.(*VolumeController).cleanupCorruptedOrStaleReplicas" file="volume_controller.go:925" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=phan-v607-upgrade-fail-pool2-3be6d107-gjwkw owner=phan-v607-upgrade-fail-pool2-3be6d107-gjwkw state=attached volume=testvol
    
  6. Note that the deleteReplica() function updates the replica testvol-r-8571f4c4 before issuing the delete API request: link
  7. At this point, from etcd's perspective, both the old replica, testvol-r-8571f4c4, and the new replica, testvol-r-31debf74, are inactive. Furthermore, the old replica, testvol-r-8571f4c4, has a deletion timestamp.
  8. This condition triggers the replica controller to immediately clean up the replica folder: link
    2023-12-20T21:57:41.050750098Z time="2023-12-20T21:57:41Z" level=info msg="Cleaning up replica" func="controller.(*ReplicaController).syncReplica" file="replica_controller.go:259" controller=longhorn-replica node=phan-v607-upgrade-fail-pool2-3be6d107-v9rvz nodeID=phan-v607-upgrade-fail-pool2-3be6d107-v9rvz ownerID=phan-v607-upgrade-fail-pool2-3be6d107-v9rvz replica=testvol-r-8571f4c4
    
  9. The fix here would be to not clean up the replica folder if there is more than one replica CR referencing it, even if all of those replica CRs are inactive (see the sketch after this list).
  10. Now that the replica folder has been cleaned up, the new replica, testvol-r-31debf74, cannot find the data folder:
    2023-12-20T21:57:46.298524971Z [testvol-r-31debf74] time="2023-12-20T21:57:46Z" level=error msg="Failed to get size of file /host/var/lib/longhorn/replicas/testvol-7159747e/volume-head-000.img" func=util.GetFileActualSize file="util.go:209" error="no such file or directory"
    
  11. The new replica, testvol-r-31debf74, is killed and restarted. On the next start, it artificially recovers the volume metadata and renders the volume as empty, i.e. the data has been lost:
2023-12-20T21:57:52.980648456Z [testvol-r-31debf74] time="2023-12-20T21:57:52Z" level=info msg="Creating replica /host/var/lib/longhorn/replicas/testvol-7159747e, size 5368709120/512" func="replica.(*Server).Create" file="server.go:52"
2023-12-20T21:57:52.980669401Z time="2023-12-20T21:57:52Z" level=warning msg="Recovering volume metafile /host/var/lib/longhorn/replicas/testvol-7159747e/volume.meta, and replica info: {Size:5368709120 Head: Dirty:false Rebuilding:false Error: Parent: SectorSize:512 BackingFilePath: BackingFile:<nil>}" func="replica.(*Replica).tryRecoverVolumeMetaFile" file="replica.go:1088"
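For illustration, a minimal Go sketch of the guard proposed in step 9 above, assuming the controller can list every replica CR that references the same data directory; all names are illustrative and do not match the actual longhorn-manager code:

package main

import "fmt"

// Illustrative replica CR fields relevant to the cleanup decision; the
// real longhorn-manager types differ.
type replica struct {
	Name          string
	DataDirectory string
	Active        bool
	Deleting      bool
}

// shouldCleanupDataDirectory only allows wiping the on-disk replica
// directory when no other replica CR, active or inactive, still
// references the same directory.
func shouldCleanupDataDirectory(target replica, all []replica) bool {
	if !target.Deleting {
		return false
	}
	for _, r := range all {
		if r.Name == target.Name {
			continue
		}
		// The freshly cloned replica from the upgrade/migration still
		// points at the same directory, so the data must be kept even
		// though both CRs look inactive at this moment.
		if r.DataDirectory == target.DataDirectory {
			return false
		}
	}
	return true
}

func main() {
	oldReplica := replica{Name: "testvol-r-8571f4c4", DataDirectory: "/var/lib/longhorn/replicas/testvol-7159747e", Deleting: true}
	newReplica := replica{Name: "testvol-r-31debf74", DataDirectory: "/var/lib/longhorn/replicas/testvol-7159747e"}
	fmt.Println(shouldCleanupDataDirectory(oldReplica, []replica{oldReplica, newReplica})) // false: keep the folder
}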

Additional context:

Why did the engine upgrade tests pass, as mentioned at https://github.com/longhorn/longhorn-manager/pull/2363#issuecomment-1855094264, but fail now?

Looks like the PR was updated a bit after the test

Verified on master-head 20240101

The test steps

  1. Execute the test case test_engine_upgrade.py::test_engine_live_upgrade_with_intensive_data_writing

Result: failed

  1. The test for engine live upgrade with intensive data writing: after the engine upgrade, we observed data inconsistency.

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5773/testReport/junit/tests/test_engine_upgrade/test_engine_live_upgrade_with_intensive_data_writing_1_10_/

Support bundle: longhorn-tests-regression-5773-bundle.zip

volume_name = 'longhorn-testvol-1ujmvd'
        volume_file_md5sum1 = get_pod_data_md5sum(
            core_api, pod_name, data_path1)
>       assert volume_file_md5sum1 == original_md5sum1
E       assert "md5sum: can't read '/data/test1': Input/output error\n" == '22238c48418b3332be491f6e3fd50276\n'
E         - md5sum: can't read '/data/test1': Input/output error
E         + 22238c48418b3332be491f6e3fd50276

test_engine_upgrade.py:726: AssertionError

https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5773/testReport/junit/tests/test_engine_upgrade/test_engine_live_upgrade_with_intensive_data_writing_5_10_/

volume_name = 'longhorn-testvol-f6x4cn'
        volume_file_md5sum1 = get_pod_data_md5sum(
            core_api, pod_name, data_path1)
>       assert volume_file_md5sum1 == original_md5sum1
E       AssertionError: assert '3de16b1c2fdded2dba4dd218fb8cfd9e\n' == 'f20f790ed460278b53cba8bd3a49a488\n'
E         - 3de16b1c2fdded2dba4dd218fb8cfd9e
E         + f20f790ed460278b53cba8bd3a49a488

test_engine_upgrade.py:747: AssertionError

Hi @innobead: @PhanLe1010 is on PTO until 1/6. I think we need another team member's help to check this issue.

Looks like there is something special about the test cluster. The instance-manager is being restarted due to a failed liveness probe, which in turn crashed the volume and failed the test:

    uid: 837ef9c1-25cd-4615-8d48-53651b4b5e75
  note: 'Liveness probe failed: dial tcp 10.42.3.39:8500: connect: connection refused'

@roger-ryao is helping to create a cluster identical to the one used by the test pipeline.

Created an identical test environment and provided it to @PhanLe1010

test_engine_upgrade.py::test_engine_live_upgrade_with_intensive_data_writing FAILED

======================================================= FAILURES =======================================================
_________________________________ test_engine_live_upgrade_with_intensive_data_writing _________________________________

client = <longhorn.Client object at 0x7fb490af6f10>
core_api = <kubernetes.client.api.core_v1_api.CoreV1Api object at 0x7fb490953a90>
volume_name = 'longhorn-testvol-1yn9ei', pod_make = <function pod_make.<locals>.make_pod at 0x7fb490951d30>

        create_and_wait_pod(core_api, pod)
        common.wait_for_volume_healthy(client, volume_name)
    
        volume_file_md5sum1 = get_pod_data_md5sum(
            core_api, pod_name, data_path1)
>       assert volume_file_md5sum1 == original_md5sum1
E       AssertionError: assert 'ff479b932f4d...8709e85178b\n' == 'ef6c636240e9...62c7a73e2dc\n'
E         - ff479b932f4d89097a3d98709e85178b
E         + ef6c636240e910fd675a362c7a73e2dc

test_engine_upgrade.py:747: AssertionError
============================================ 1 failed in 295.17s (0:04:55) =============================================

cc @ChanYiLin ☝️

The liveness probe is a symptom, not a direct cause. It tries to access /data/ to do its check and is rejected with a connection refused. In the first failure from the support bundle above, in ./logs/longhorn-system/longhorn-manager-5ltt5/longhorn-manager.log, shortly after the engine upgrade is complete, the instance manager monitor is already failing its check several seconds before the liveness probe event:

2024-01-01T09:01:52.024224981Z time="2024-01-01T09:01:52Z" level=info msg="Engine longhorn-testvol-1ujmvd-e-0 has been upgraded from longhornio/longhorn-engine:master-head to longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1" func="controller.(*VolumeController).finishLiveEngineUpgrade" file="volume_controller.go:2877" accessMode=rwo controller=longhorn-volume engine=longhorn-testvol-1ujmvd-e-0 frontend=blockdev migratable=false node=ip-10-0-2-141 owner=ip-10-0-2-141 state=attached volume=longhorn-testvol-1ujmvd volumeDesiredEngineImage="longhornio/longhorn-test:upgrade-test.9-8.5-4.1-1"

2024-01-01T09:01:53.546695089Z time="2024-01-01T09:01:53Z" level=error msg="Failed to receive next item in instance watch" func="controller.(*InstanceManagerMonitor).Run.func2" file="instance_manager_controller.go:1310" controller=longhorn-instance-manager error="rpc error: code = Unavailable desc = error reading from server: EOF" instance manager=instance-manager-0811a3c3be2865140741c4de438628a9 node=ip-10-0-2-141
2024-01-01T09:01:54.547850667Z time="2024-01-01T09:01:54Z" level=error msg="Failed to receive next item in instance watch" func="controller.(*InstanceManagerMonitor).Run.func2" file="instance_manager_controller.go:1310" controller=longhorn-instance-manager error="rpc error: code = Unavailable desc = error reading from server: EOF" instance manager=instance-manager-0811a3c3be2865140741c4de438628a9 node=ip-10-0-2-141

2024-01-01T09:01:55.984904530Z W0101 09:01:55.984782       1 logging.go:59] [core] [Channel #121 SubChannel #122] grpc: addrConn.createTransport failed to connect to {Addr: "10.42.1.9:8501", ServerName: "10.42.1.9:8501", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.1.9:8501: connect: connection refused"
      name: instance-manager-0811a3c3be2865140741c4de438628a9
      namespace: longhorn-system
      resourceVersion: "1308"
      uid: 02bc32f4-6918-45a8-8147-02adf09cf391
    kind: Event
    lastTimestamp: "2024-01-01T09:01:55Z"
    message: 'Liveness probe failed: dial tcp 10.42.1.9:8500: connect: connection refused'

The logs for instance-manager-0811a3c3be2865140741c4de438628a9 are from later, after it has restarted several times during the test.

To isolate the longhorn-manager changes, I ran with builds at the following SHAs from before the regression failures started:

jmunson4suse/longhorn-manager:a42748a4 <- failed with md5 mismatch.
jmunson4suse/longhorn-manager:bece02b4 <- failed with md5 mismatch.
jmunson4suse/longhorn-manager:dead2b7d <- failed to set up test, error in reset_engine_image.
jmunson4suse/longhorn-manager:2fb03c11 <- failed to set up test, error in reset_engine_image.

So it occurs with earlier code, which implies that it’s a combination of changes, not just a longhorn-manager thing.

And yet every run of the v1.5.x branch from https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/248/ onward has been run on

longhorn-manager version: be400cf167198f7a7b438df3de49c2b5df5842c5 

Which is the backport of that same change:

* be400cf1 (HEAD -> v1.5.x, origin/v1.5.x) chore(csi): update default images version
* 8a8b11ab fix(log): mismatched args
* ffc99fab Add a startup probe to the longhorn-csi-plugin container
* 3edd68d7 Increase ignoreKubeletNotReadyTime to 15 seconds
* 4e000bef Fix: data lost during engine upgrade/migration
...

But they have passed test_engine_live_upgrade_with_intensive_data_writing. So this is a little confusing.

Note: the long list of failures in https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/759/ was also seen in runs 760 and 761, but was reduced in run 762 on Dec 22, apparently by the fix-to-the-fix in https://github.com/longhorn/longhorn-manager/pull/2392. Now only test_engine_live_upgrade_with_intensive_data_writing fails, starting with run 767 on Dec 27 and in every run since. That test was run on longhorn-manager SHA b517601153190d762c88e38442e8037ff6b94391. The commits leading up to that were:

* b5176011 chore(csi): update default images version
* a42748a4 test(enforce-engine-upgrade): add volume/engine/replica validation
* bece02b4 refactor(enforce-engine-upgrade): remove hardcoded engine versions
* dead2b7d test(enforce-engine-upgrade): checkEngineUpgradePathSupported
* 3ebdb55d feat(enforce-engine-upgrade): check upgrade path
* 680b0f36 feat(enforce-engine-upgrade): vendor update
* 2fb03c11 chore: update longhron-instance-manager

The previous regression run, 766, did not fail test_engine_live_upgrade_with_intensive_data_writing. It was based on SHA

longhorn-manager version: a42748a41d2d79f3be58e78d808859e6456576cf 

Which suggests that the "update default images version" change might have made a difference.