longhorn: [BUG] The last healthy replica may be evicted or removed

Describe the bug

test_disk_eviction_with_node_level_soft_anti_affinity_disabled failed in master-head edc1b83. Double-verified against the release version: the failure does not happen on v1.3.0.

To Reproduce

Steps to reproduce the behavior:

  1. Setup longhorn with 3 nodes
  2. Deploy longhorn-test
  3. Run test_disk_eviction_with_node_level_soft_anti_affinity_disabled
  4. After test step 6, the volume stays stuck in the attaching state and no replicas exist
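For reference, the stuck state in step 4 can be checked with the Longhorn Python client that longhorn-tests uses. This is a minimal sketch, assuming a client object and volume_name from the test fixtures (the helper name is hypothetical):

import time

# Hedged sketch: poll the volume after step 6. On the failing run it never
# leaves "attaching" and ends up with no replicas at all.
def assert_volume_stuck_attaching(client, volume_name, timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        volume = client.by_id_volume(volume_name)
        if volume.state == "attached":
            return  # the volume recovered; the bug did not reproduce
        time.sleep(2)
    volume = client.by_id_volume(volume_name)
    assert volume.state == "attaching"
    assert len(volume.replicas) == 0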

Expected behavior

The test case should pass.

Log or Support bundle

longhorn-support-bundle_35fabdcc-d73a-4168-a2dd-65c2298709b1_2022-07-15T06-48-21Z.zip

Environment

  • Longhorn version: edc1b83
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): kubectl
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 3
  • Node config
    • OS type and version: Ubuntu 20.04

Additional context

https://ci.longhorn.io/job/public/job/master/job/sles/job/amd64/job/longhorn-tests-sles-amd64/186/testReport/junit/tests/test_node/test_disk_eviction_with_node_level_soft_anti_affinity_disabled/

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (14 by maintainers)

Most upvoted comments

Fixing issue 2 (retaining the evicting replica if it's the only healthy replica) would not resolve the test failure: during the replica removal for the eviction, the other replicas may not have been removed yet, so the guard would not trigger. On the other hand, retaining the last RW replica for both the ReplicaRemove API and the eviction may not work either, since these two operations can be executed simultaneously (which is exactly what the test case does) and there is no lock protecting the replicas. The sketch below illustrates the race.
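A minimal Python sketch of that race, with made-up names purely for illustration (the real logic lives in longhorn-manager, in Go): two check-then-remove paths each pass a "retain the last healthy replica" guard against a stale count, and together they still delete every healthy replica.

# Illustration only; not Longhorn code.
replicas = {"r-old": "RW", "r-new": "RW"}  # two healthy (RW) replicas

def healthy_count():
    return sum(1 for mode in replicas.values() if mode == "RW")

# Interleaving in which both paths check before either one removes:
evict_sees = healthy_count()   # eviction path: sees 2 healthy replicas
api_sees = healthy_count()     # ReplicaRemove path: also sees 2
if evict_sees > 1:
    del replicas["r-old"]      # eviction removes the evicting replica
if api_sees > 1:
    del replicas["r-new"]      # the API removes the other one, on a stale count
assert healthy_count() == 0    # no healthy replica is left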

I will continue the investigation tomorrow.

  1. For the test, it's better to wait for the new replica rebuilding to complete before removing the other replicas.

IMO we shouldn't add the wait, because the same situation could potentially happen to users.

@c3y1huang @innobead Do you think Longhorn can retain the evicting replica if it’s the only healthy replica of the volume?

Sounds good to me.

The cause of this issue:

  1. The test sends an eviction request to Longhorn, and Longhorn creates a new replica before removing the replica on the evicting disk. This means the volume keeps its robustness Healthy during the whole eviction, so the test's robustness checks cannot make it wait for the new replica rebuilding to complete before executing the next step (see the rebuilding-aware check sketched after the second status dump below).
# The volume status at the end of step 6, i.e. before step 7. Some irrelevant info is removed.
{
   "controllers":[
      {
         "actualSize":"4096",
         "address":"10.42.2.104",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "endpoint":"/dev/longhorn/longhorn-testvol-lvv2mx",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "hostId":"shuo-k8s-worker-3",
         "instanceManagerName":"instance-manager-e-e331ff16",
         "isExpanding":false,
         "lastExpansionError":"",
         "lastExpansionFailedAt":"",
         "lastRestoredBackup":"",
         "name":"longhorn-testvol-lvv2mx-e-1bdf3b92",
         "requestedBackupRestore":"",
         "running":true,
         "size":"16777216"
      }
   ],

   "name":"longhorn-testvol-lvv2mx",

   "numberOfReplicas":3,
   "purgeStatus":[
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"longhorn-testvol-lvv2mx-r-77c53703",
         "state":"complete"
      },
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"longhorn-testvol-lvv2mx-r-bd146b02",
         "state":"complete"
      },
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"longhorn-testvol-lvv2mx-r-b5ecc458",
         "state":"complete"
      }
   ],

   "replicas":[
      {
         "address":"10.42.1.70",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/var/lib/longhorn/replicas/longhorn-testvol-lvv2mx-1ee96682",
         "diskID":"7e6676cb-884c-42ec-8f9c-e02bf99df065",
         "diskPath":"/var/lib/longhorn/",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-2",
         "instanceManagerName":"instance-manager-r-c7741ef8",
         "mode":"RW",
         "name":"longhorn-testvol-lvv2mx-r-77c53703",
         "running":true
      },
      {
         "address":"10.42.4.43",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/var/lib/longhorn/replicas/longhorn-testvol-lvv2mx-92f6a442",
         "diskID":"b8906b04-ad0a-4dae-801c-339957621e6e",
         "diskPath":"/var/lib/longhorn/",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-1",
         "instanceManagerName":"instance-manager-r-37b26fba",
         "mode":"RW",
         "name":"longhorn-testvol-lvv2mx-r-b5ecc458",
         "running":true
      },
      {
         "address":"10.42.2.105",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/var/lib/longhorn/replicas/longhorn-testvol-lvv2mx-787e043f",
         "diskID":"516e70f6-24fb-4724-b971-9753f4b30ff8",
         "diskPath":"/var/lib/longhorn/",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-3",
         "instanceManagerName":"instance-manager-r-115aed44",
         "mode":"RW",
         "name":"longhorn-testvol-lvv2mx-r-bd146b02",
         "running":true
      },
      {
         "address":"10.42.2.105",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/tmp/longhorn-test/vol-test/replicas/longhorn-testvol-lvv2mx-0201590b",
         "diskID":"9076b7bb-6ee2-4751-9423-b4dc0ba9cfd8",
         "diskPath":"/tmp/longhorn-test/vol-test",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-3",
         "instanceManagerName":"instance-manager-r-115aed44",
         "mode":"",
         "name":"longhorn-testvol-lvv2mx-r-dfea0dac",
         "running":true
      }
   ],

   "revisionCounterDisabled":false,
   "robustness":"healthy",
   "shareEndpoint":"",
   "shareState":"",
   "size":"16777216",
   "staleReplicaTimeout":0,
   "standby":false,
   "state":"attached"
}
  2. When the replica count is updated to 1 and the replicas on the other nodes (those that do not contain the evicting disk) are removed, the volume contains just 2 replicas: the evicting yet healthy replica (the old one) and a new rebuilding replica. At the next moment, Longhorn blindly removes the only healthy replica for the eviction, regardless of it being the only healthy one.
# The volume status after removing the replicas and before checking the volume data (step 7). Some irrelevant info is removed.

{
   "controllers":[
      {
         "actualSize":"4096",
         "address":"10.42.2.104",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "endpoint":"/dev/longhorn/longhorn-testvol-lvv2mx",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "hostId":"shuo-k8s-worker-3",
         "instanceManagerName":"instance-manager-e-e331ff16",
         "isExpanding":false,
         "lastExpansionError":"",
         "lastExpansionFailedAt":"",
         "lastRestoredBackup":"",
         "name":"longhorn-testvol-lvv2mx-e-1bdf3b92",
         "requestedBackupRestore":"",
         "running":true,
         "size":"16777216"
      }
   ],

   "name":"longhorn-testvol-lvv2mx",

   "numberOfReplicas":1,
   "purgeStatus":[
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"tcp://10.42.1.70:10000",
         "state":"complete"
      },
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"tcp://10.42.2.105:10000",
         "state":"complete"
      },
      {
         "error":"",
         "isPurging":false,
         "progress":100,
         "replica":"longhorn-testvol-lvv2mx-r-b5ecc458",
         "state":"complete"
      }
   ],

   "replicas":[
      {
         "address":"10.42.4.43",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/var/lib/longhorn/replicas/longhorn-testvol-lvv2mx-92f6a442",
         "diskID":"b8906b04-ad0a-4dae-801c-339957621e6e",
         "diskPath":"/var/lib/longhorn/",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-1",
         "instanceManagerName":"instance-manager-r-37b26fba",
         "mode":"RW",
         "name":"longhorn-testvol-lvv2mx-r-b5ecc458",
         "running":true
      },
      {
         "address":"10.42.2.105",
         "currentImage":"longhornio/longhorn-engine:master-head",
         "dataPath":"/tmp/longhorn-test/vol-test/replicas/longhorn-testvol-lvv2mx-0201590b",
         "diskID":"9076b7bb-6ee2-4751-9423-b4dc0ba9cfd8",
         "diskPath":"/tmp/longhorn-test/vol-test",
         "engineImage":"longhornio/longhorn-engine:master-head",
         "failedAt":"",
         "hostId":"shuo-k8s-worker-3",
         "instanceManagerName":"instance-manager-r-115aed44",
         "mode":"",
         "name":"longhorn-testvol-lvv2mx-r-dfea0dac",
         "running":true
      }
   ],

   "revisionCounterDisabled":false,
   "robustness":"healthy",
   "shareEndpoint":"",
   "shareState":"",
   "size":"16777216",
   "staleReplicaTimeout":0,
   "standby":false,
   "state":"attached"
}
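As the two status dumps above show, robustness stays "healthy" while the new replica (mode "") is still rebuilding, so any robustness-based wait returns immediately. A wait based on replica modes would actually block until rebuilding finishes. A minimal sketch, assuming the same Python client the test suite uses (the helper name is hypothetical):

import time

# Hypothetical helper: wait until every replica of the volume is RW.
# Checking volume.robustness == "healthy" is NOT sufficient here, because
# robustness stays "healthy" while the rebuilding replica has mode "".
def wait_for_rebuild_complete(client, volume_name, timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        volume = client.by_id_volume(volume_name)
        modes = [r.mode for r in volume.replicas]
        if modes and all(m == "RW" for m in modes):
            return volume
        time.sleep(2)
    raise AssertionError("replica rebuilding did not complete in time")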

There are 2 issues here:

  1. For the test, it's better to wait for the new replica rebuilding to complete before removing the other replicas (the mode-based wait sketched above would do this).
  2. For Longhorn itself, IMO, it should retain the evicting replica if it's the only healthy replica of the volume (see the sketch below).
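For issue 2, the guard would live in longhorn-manager (in Go); the Python sketch below only illustrates the proposed decision logic, with hypothetical names. Note the caveat from the comment above: without a lock around the replica list, this check alone can still race with the ReplicaRemove API.

# Hypothetical decision logic for the eviction path: skip removing the
# evicting replica while it is the volume's only healthy (RW) replica.
def can_evict_replica(volume, replica):
    healthy = [r for r in volume.replicas
               if r.mode == "RW" and r.failedAt == ""]
    if len(healthy) == 1 and healthy[0].name == replica.name:
        return False  # retain the last healthy replica; retry after rebuilding
    return True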

This issue is probably related to https://github.com/longhorn/longhorn/issues/4294 as well.

Note: this regression comes from https://github.com/longhorn/longhorn-manager/commit/1bdd786158f1162f7a89d2c67a7c69694efd25e3. Setting auto-cleanup-system-generated-snapshot to false allows the test to pass.
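As a workaround, the setting can be flipped through the same Python client; a minimal sketch, assuming an authenticated client object (this follows the usual longhorn-tests pattern, so treat the exact calls as an assumption):

# Disable automatic cleanup of system-generated snapshots, which lets the
# test pass per the note above.
setting = client.by_id_setting("auto-cleanup-system-generated-snapshot")
client.update(setting, value="false")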