longhorn: [BUG] Failed to `check_volume_data` after volume engine upgrade/migration
Describe the bug (🐛 if you encounter this issue)
The following test cases failed on master-head:
- test_engine_live_upgrade_with_backing_image
- test_engine_live_upgrade
- test_auto_upgrade_engine_to_default_version
- test_migration_confirm
- test_migration_with_unscheduled_replica
- test_migration_with_failed_replica
- test_migration_with_restore_volume
@derekbit and I tested a private longhorn-manager image and found that it seems to be a side effect of longhorn/longhorn-manager#2363.
> migration_confirm_test(clients, volume_name)
test_migration.py:40:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test_migration.py:65: in migration_confirm_test
check_detached_volume_data(client, volume_name, data)
test_migration.py:496: in check_detached_volume_data
common.check_volume_data(volume, data)
common.py:2527: in check_volume_data
check_device_data(dev, data, check_checksum)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dev = '/dev/longhorn/longhorn-testvol-2hmj6g'
data = {'checksum': '0048e07f0e9a2f23ba38c172ca91441c1223f98a39bb992c9434b40044c631c929af781623c5d72787d01052ac1a57a6f5346f02...rnk28nzi219z267btblxs63je7x4aofwa5jy45xbu50jsm6h2ch9mipi3fz1pg8q1omhjt9rw6dwc0iro06us93ms', 'len': 512, 'pos': 5199719}
check_checksum = True
def check_device_data(dev, data, check_checksum=True):
r_data = dev_read(dev, data['pos'], data['len'])
print(f"Device: {dev}, Position: {data['pos']}, Length: {data['len']}")
print(f"Actual data type: {type(r_data)}, Expected data type: {type(bytes(data['content'], encoding='utf8'))}")
print(f"Actual data content: {r_data}")
print(f"Expected data content: {bytes(data['content'], encoding='utf8')}")
> assert r_data == bytes(data['content'], encoding='utf8')
E AssertionError
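For context, the failing assertion compares raw bytes read back from the Longhorn block device with the data the test wrote before the upgrade/migration. Below is a rough Go re-expression of what the Python helpers dev_read()/check_device_data() do, using the device path and offsets from the failure output above; the expected content is a placeholder, and the function names here are illustrative only.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
)

// readAt mirrors dev_read(): read `length` bytes at offset `pos` from the raw
// Longhorn block device.
func readAt(dev string, pos int64, length int) ([]byte, error) {
	f, err := os.Open(dev)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, length)
	if _, err := f.ReadAt(buf, pos); err != nil {
		return nil, err
	}
	return buf, nil
}

func main() {
	// Values taken from the failure output; the expected bytes are whatever
	// random string the test wrote at this position before the upgrade.
	dev := "/dev/longhorn/longhorn-testvol-2hmj6g"
	expected := []byte("...") // placeholder for data['content']

	got, err := readAt(dev, 5199719, 512)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	if !bytes.Equal(got, expected) {
		fmt.Println("data mismatch after engine upgrade/migration")
	}
}
```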
To Reproduce
Rerun the job with the private image.
Expected behavior
Consistent test results across all branches.
Support bundle for troubleshooting
longhorn-tests-sles-amd64-759-bundle.zip
Environment
- Longhorn version: master-head
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Kubectl
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: K3s
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 3
- Node config
- OS type and version: SLES 15-sp5
- Kernel version:
- CPU per node:
- Memory per node:
- Disk type(e.g. SSD/NVMe/HDD):
- Network bandwidth between the nodes:
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): AWS
- Number of Longhorn volumes in the cluster:
- Impacted Longhorn resources:
- Volume names:
Additional context
About this issue
- Original URL
- State: closed
- Created 6 months ago
- Reactions: 1
- Comments: 17 (11 by maintainers)
Commits related to this issue
- Fix: data lost during engine upgrade/migration Do not cleanup replica folder if there are more than 1 replicas CR for it even if all replica CRs are in active. More details at https://github.com/lon... — committed to PhanLe1010/longhorn-manager by PhanLe1010 6 months ago
- Fix: data lost during engine upgrade/migration Do not cleanup replica folder if there are more than 1 replicas CR for it even if all replica CRs are in active. More details at https://github.com/lon... — committed to longhorn/longhorn-manager by PhanLe1010 6 months ago
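The guard described in these commits can be sketched roughly as follows. This is a hedged illustration with hypothetical type and field names, not the actual longhorn-manager code: before removing a replica's on-disk data directory, count how many replica CRs still reference that directory and skip the cleanup unless the replica being deleted is the only one.

```go
package main

import "fmt"

// Minimal stand-ins for the relevant replica CR fields (hypothetical shapes,
// not the real longhorn-manager types).
type ReplicaSpec struct {
	VolumeName        string
	DataDirectoryName string
	Active            bool
}

type Replica struct {
	Name string
	Spec ReplicaSpec
}

// shouldCleanupDataDir sketches the guard from the commit message: the on-disk
// replica folder is removed only when the replica being deleted is the last CR
// referencing that folder, regardless of the Active flag on the CRs.
func shouldCleanupDataDir(toDelete *Replica, all []*Replica) bool {
	refs := 0
	for _, r := range all {
		if r.Spec.VolumeName == toDelete.Spec.VolumeName &&
			r.Spec.DataDirectoryName == toDelete.Spec.DataDirectoryName {
			refs++
		}
	}
	return refs <= 1
}

func main() {
	// During a live engine upgrade/migration the old and new replica CRs
	// briefly share one data directory, so the cleanup must be skipped.
	oldR := &Replica{Name: "testvol-r-8571f4c4",
		Spec: ReplicaSpec{VolumeName: "testvol", DataDirectoryName: "testvol-dir"}}
	newR := &Replica{Name: "testvol-r-31debf74",
		Spec: ReplicaSpec{VolumeName: "testvol", DataDirectoryName: "testvol-dir"}}

	fmt.Println(shouldCleanupDataDir(oldR, []*Replica{oldR, newR})) // false: keep the folder
	fmt.Println(shouldCleanupDataDir(oldR, []*Replica{oldR}))       // true: safe to clean up
}
```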
It looks like the instance manager pod crashed right after the engine upgrade finished because the function ProcessReplace() does not initialize the Conditions map, which causes the panic "assignment to entry in nil map" when getProcessToUpdateConditions() attempts to add a key to it.
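For illustration only, here is a minimal Go reproduction of that failure mode using generic stand-in types (the real instance-manager struct, fields, and condition keys differ): writing to a map whose zero value is nil panics with "assignment to entry in nil map", which takes down the whole process.

```go
package main

import "fmt"

// Process is a generic stand-in for the instance-manager process state;
// the real type, fields, and condition keys differ.
type Process struct {
	Name       string
	Conditions map[string]bool
}

func main() {
	// Analogous to ProcessReplace() building the state without initializing
	// Conditions: the zero value of a Go map is nil.
	p := &Process{Name: "testvol-e-0"}

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("panic:", r) // panic: assignment to entry in nil map
		}
	}()

	// Analogous to getProcessToUpdateConditions() adding a key: writing to a
	// nil map panics. The straightforward remedy is to initialize the map
	// before use, e.g. p.Conditions = map[string]bool{}.
	p.Conditions["someCondition"] = true
}
```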
Proposing PR https://github.com/longhorn/longhorn-instance-manager/pull/353, but I still need some time to test it.
So the flow is: engine upgrade init -> engine upgrade finish -> checkMountPointStatusForEngine begins -> panic -> instance-manager pod crashes -> liveness probe fails -> data mount point is broken -> the test cannot read the data or reads wrong data -> test fails.
This also fits the finding of @james-munson that:
Thanks @roger-ryao and @james-munson
Analysis:
The PR https://github.com/longhorn/longhorn-manager/pull/2363 is indeed the root cause. We accidentally clean up the replica folder during the engine upgrade/live migration process:
- A new replica, testvol-r-31debf74, is created to clone the replica testvol-r-8571f4c4.
- The active flag of testvol-r-8571f4c4 switches from true to false, and that of testvol-r-31debf74 switches from false to true.
- Longhorn then deletes the old replica, testvol-r-8571f4c4, after the upgrade/migration.
- State of testvol-r-8571f4c4 before issuing a delete API request: link
- At this point, both the old replica, testvol-r-8571f4c4, and the new replica, testvol-r-31debf74, are inactive. Furthermore, the old replica, testvol-r-8571f4c4, has a deletion timestamp.
- The replica process for testvol-r-31debf74 is killed and restarted. On the next start, it artificially recovers the volume metadata and renders the volume as empty, i.e. the data has been lost.

Additional context:
Why did the engine upgrade tests pass, as mentioned at https://github.com/longhorn/longhorn-manager/pull/2363#issuecomment-1855094264, but fail now?
It looks like the PR was updated a bit after those test runs.
Verified on master-head 20240101
The test steps:
- test_engine_upgrade.py::test_engine_live_upgrade_with_intensive_data_writing

Result: failed. Engine live upgrade with intensive data writing: after the engine upgrade, we observed data inconsistency.
- https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5773/testReport/junit/tests/test_engine_upgrade/test_engine_live_upgrade_with_intensive_data_writing_1_10_/
- Support bundle: longhorn-tests-regression-5773-bundle.zip
- https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5773/testReport/junit/tests/test_engine_upgrade/test_engine_live_upgrade_with_intensive_data_writing_5_10_/
Hi @innobead: @PhanLe1010 is on PTO until 1/6. I think we need another team member's help to check this issue.
Created an identical test environment and provided it to @PhanLe1010
The test passed with @PhanLe1010's private image (https://ci.longhorn.io/job/private/job/longhorn-tests-regression/5867/), and https://github.com/longhorn/longhorn-instance-manager/pull/353 has been merged.
cc @ChanYiLin ☝️
The liveness probe is a symptom, not a direct cause. It tries to access /data/ to do its check and is rejected with a connection refused. In the first failure from the support bundle above, in ./logs/longhorn-system/longhorn-manager-5ltt5/longhorn-manager.log, shortly after the engine upgrade is complete, the instance manager monitor is already failing its check several seconds before the liveness probe event:
The logs for instance-manager-0811a3c3be2865140741c4de438628a9 are from later, after it has restarted several times during the test.
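As a rough illustration of the kind of check involved (not the actual instance-manager monitor or liveness probe code; the details here are assumptions): once the mount backing /data/ is broken, even a simple access attempt fails, so both the internal monitor and the probe start reporting errors.

```go
package main

import (
	"fmt"
	"os"
)

// checkDataDir is a stand-in for the health check described above: try to
// access the data path and report failure if the call errors out (for example
// once the instance-manager has crashed and the mount behind it is broken).
func checkDataDir(path string) error {
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("health check failed on %s: %w", path, err)
	}
	return nil
}

func main() {
	if err := checkDataDir("/data"); err != nil {
		fmt.Println(err)
		os.Exit(1) // a failing monitor/probe would report unhealthy here
	}
	fmt.Println("/data is accessible")
}
```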
To isolate the longhorn-manager changes, I ran with builds at the following SHAs from before the regression failures started:
So it occurs with earlier code, which implies that it’s a combination of changes, not just a longhorn-manager thing.
And yet every run in the v1.5.x branch from https://ci.longhorn.io/job/public/job/v1.5.x/job/v1.5.x-longhorn-tests-sles-amd64/248/ and after has been run on
Which is the backport of that same change:
But they have passed test_engine_live_upgrade_with_intensive_data_writing. So this is a little confusing.

Note - the long list of failures in https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/759/ was also seen in 760 and 761, but then reduced in 762 on Dec 22, apparently by the fix-to-the-fix in https://github.com/longhorn/longhorn-manager/pull/2392. Now it is only test_engine_live_upgrade_with_intensive_data_writing that fails, starting in run 767 on Dec 27 and since then. That test was run on longhorn-manager SHA b517601153190d762c88e38442e8037ff6b94391. The commits leading up to that were:

The previous regression run, 766, did not fail test_engine_live_upgrade_with_intensive_data_writing. It was based on SHA

Which suggests that the change in "update default images version" might have made a difference.

cc @PhanLe1010