longhorn: [BUG] Unable to export RAID1 bdev in degraded state

Describe the bug (🐛 if you encounter this issue)

Unable to export a RAID1 bdev in a degraded state. The RAID1 bdev should be exportable as long as it still has at least one healthy lvol.

To Reproduce

Steps to reproduce the behavior:

  1. Launch SPDK target
  2. Prepare a bdev lvol (see the sketch after these steps)
  3. Create a bdev raid based on the newly created lvol and a non-existing lvol: sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_raid_create -n raid-degraded -r raid1 -b "<A Valid Lvol> <A Non-existing Lvol>"
  4. Create an nvmf subsystem and add the RAID bdev as its namespace:
sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_create_subsystem nqn.2023-01.io.spdk:testvol -a -s SPDK00000000000020 -d SPDK_Controller

sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_subsystem_add_ns nqn.2023-01.io.spdk:testvol raid-degraded

The namespace add command will error out.
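
For step 2, the lvol can be prepared with something like the following (a sketch; the bdev names, lvstore name, and sizes are arbitrary, and a malloc bdev is used here purely as a convenient backing device):

    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_malloc_create -b malloc0 128 512
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_lvol_create_lvstore malloc0 lvs0
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_lvol_create -l lvs0 lvol0 64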

Expected behavior

The degraded RAID bdev can be added as an nvmf subsystem namespace.

Log or Support bundle

If applicable, add the Longhorn managers’ log or support bundle when the issue happens. You can generate a Support Bundle using the link at the footer of the Longhorn UI.

Environment

  • Longhorn version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl):
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version:
    • Number of management nodes in the cluster:
    • Number of worker nodes in the cluster:
  • Node config
    • OS type and version:
    • CPU per node:
    • Memory per node:
    • Disk type (e.g. SSD/NVMe):
    • Network bandwidth between the nodes:
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal):
  • Number of Longhorn volumes in the cluster:

Additional context

cc @longhorn/dev-data-plane

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 30 (28 by maintainers)

Most upvoted comments

If I call bdev_nvme_attach_controller with ctrlr_loss_timeout_sec and reconnect_delay_sec set (a command sketch follows this list), then on the local node, after a while, the remote controller is detached and:

  • the remote bdev disappears
  • the RAID shows the base bdev as not configured, and the discovered base bdev count is correctly decremented
  • I can add this RAID to an nvmf subsystem, and
  • I can connect from Linux to this subsystem and write to its /dev/nvme1n1 device
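
For reference, a sketch of what such an attach call could look like. The flag spellings follow recent SPDK rpc.py and may differ slightly by version; the address, subsystem NQN, and reconnect delay value are placeholders, the controller name Nvme2 is chosen so the bdev shows up as Nvme2n1 (as in the outputs below), and the 10-second loss timeout matches the behavior described later:

    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_nvme_attach_controller -b Nvme2 -t tcp \
        -a <remote-ip> -f ipv4 -s 4420 -n <remote-lvol-subnqn> \
        --ctrlr-loss-timeout-sec 10 --reconnect-delay-sec 2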

Check

  • Node down
  • Wait for a while
  • Node up
  • Check raid behavior

In SPDK Gerrit there is this development under review: https://review.spdk.io/gerrit/c/spdk/spdk/+/16167. It is the last commit of the relation chain, so it contains all of the latest development done on the RAID module. @shuo-wu If you want, you can start working with this version; until it is merged into the master branch, it will be our base version for the replica rebuilding development.

Suppose this is the situation before the node goes down: we have a remote lvol attached as bdev Nvme2n1

    "name": "Nvme2n1",
    "aliases": [
      "1c325c17-2358-4f57-b819-005a68cc1fd0"
    ],
    "product_name": "NVMe disk",
    "block_size": 512,
    "num_blocks": 204800,
    "uuid": "1c325c17-2358-4f57-b819-005a68cc1fd0",
    "assigned_rate_limits": {
      "rw_ios_per_sec": 0,
      "rw_mbytes_per_sec": 0,
      "r_mbytes_per_sec": 0,
      "w_mbytes_per_sec": 0
    },
    "claimed": true,
    "claim_type": "exclusive_write",

and the raid1

        "base_bdevs_list": [
          {
            "name": "Nvme2n1",
            "uuid": "1c325c17-2358-4f57-b819-005a68cc1fd0",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 32768,
            "mode": "rw"
          },
          {
            "name": "lvstore1/lvol2",
            "uuid": "6ce913dd-24a1-4eab-a69b-81a0d139b9d2",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 32768,
            "mode": "rw"
          }
        ]

At some point the remote node goes down and, after 10 seconds of reconnection attempts, the local NVMe controller is deleted. When the remote node comes back up and exports its lvol via nvmf again, we can re-attach to the remote lvol from the local node so that the bdev Nvme2n1 reappears, but it is no longer handled by the RAID. Once a base bdev has been removed from the RAID, it is not automatically re-added. So we are left with only one base bdev even though bdev Nvme2n1 is present (a command sketch for re-attaching and checking follows the outputs below):

        "base_bdevs_list": [
          {
            "name": null,
            "uuid": "00000000-0000-0000-0000-000000000000",
            "is_configured": false,
            "data_offset": 0,
            "data_size": 32768,
            "mode": "rw"
          },
          {
            "name": "lvstore1/lvol2",
            "uuid": "6ce913dd-24a1-4eab-a69b-81a0d139b9d2",
            "is_configured": true,
            "data_offset": 0,
            "data_size": 32768,
            "mode": "rw"
          }
        ]
  {
    "name": "Nvme2n1",
    "aliases": [
      "1c325c17-2358-4f57-b819-005a68cc1fd0"
    ],
    "product_name": "NVMe disk",
    "block_size": 512,
    "num_blocks": 204800,
    "uuid": "1c325c17-2358-4f57-b819-005a68cc1fd0",
    "assigned_rate_limits": {
      "rw_ios_per_sec": 0,
      "rw_mbytes_per_sec": 0,
      "r_mbytes_per_sec": 0,
      "w_mbytes_per_sec": 0
    },
    "claimed": false,
...
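
A sketch of how this can be observed once the remote node is back. The address, subsystem NQN, and RAID bdev name are placeholders; the controller name Nvme2 is assumed so that the bdev reappears as Nvme2n1:

    # Re-attach the remote lvol; the bdev Nvme2n1 comes back, but the RAID slot stays unconfigured
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_nvme_attach_controller -b Nvme2 -t tcp \
        -a <remote-ip> -f ipv4 -s 4420 -n <remote-lvol-subnqn>
    # Check the RAID bdev again; base_bdevs_list still shows the removed slot as unconfigured
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_get_bdevs -b <raid-bdev-name>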

Yes, with these parameters the number of retries is correct. If the remote node goes down while I/O is in progress, reads/writes are stuck for the duration of the reconnection phase (10 seconds in the previous case). After this timeout:

  • the nvme controller is deleted
  • the remote bdev is deleted
  • the RAID loses one base bdev
  • the I/O restarts

@DamiaSan Just noticed that this detection is based on running IO?

We still need a way other than in-flight IO to monitor the connectivity of the lvols, so that rebuilding can be triggered if one lvol somehow goes down (see the sketch below).
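
One possible sketch of such a check, polling the controller state instead of relying on in-flight IO (this assumes jq is available; the 5-second interval is arbitrary and the actual rebuild trigger is left out):

    # Any controller whose state is "failed" (see the bdev_nvme_get_controllers output
    # later in this thread) would be a candidate for triggering a rebuild.
    while true; do
        sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_nvme_get_controllers \
            | jq -r '.[] | select(.ctrlrs[].state == "failed") | .name'
        sleep 5
    done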

The remaining issue

When a remote base bdev is down, we cannot add this RAID bdev as an nvmf subsystem namespace. Besides, if we have already created a device (via the nvme-cli initiator) and are applying IO on top of the RAID bdev, suddenly shutting down a remote base bdev may be a problem as well.

The reproducing step

Test 1 (Verified):

  1. Create a local lvol on node 0.
  2. Create and expose a lvol bdev on node 1.
  3. Attach the remote lvol on node 0.
  4. Create a RAID bdev based on these 2 bdevs (a command sketch follows these steps).
  5. Stop spdk_tgt on node 1, which brings the remote lvol on node 0 down.
  6. Trying to expose the RAID will fail.
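
A sketch of steps 3–4 on node 0, reusing names and addresses that appear in the test results below; treat it as a reconstruction rather than the exact commands used:

    # Step 3: attach the lvol exposed by node 1; it appears locally as lvol1n1
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_nvme_attach_controller -b lvol1 -t tcp \
        -a 24.199.115.72 -f ipv4 -s 4420 -n nqn.2023-01.io.longhorn.spdk:lvol1
    # Step 4: build the RAID1 on top of the local lvol and the attached remote one
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_raid_create -n raid01 -r raid1 -b "spdk-00/lvol0 lvol1n1"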

Test 2 (Unverified):

  1. Create a local lvol on node 0.
  2. Create and expose a lvol bdev on node 1.
  3. Attach the remote lvol on node 0.
  4. Create and expose a RAID bdev based on these 2 bdevs.
  5. Create nvme initiator (nvme-cli discover and connect) for the exposed RAID.
  6. Keep applying IO to the device created by the nvme initiator.
  7. Stop spdk_tgt on node 1, which brings the remote lvol on node 0 down.
  8. Verify whether the IO still works and whether there is any data corruption.

note: “expose a bdev” means creating an nvmf subsystem, a namespace, and a listener for the bdev, e.g. as sketched below
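
A sketch of those expose steps for the RAID case (the transport may already have been created, the serial number is arbitrary, and the listener address is a placeholder):

    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_create_transport -t TCP
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_create_subsystem nqn.2023-01.io.longhorn.spdk:raid01 -a -s SPDK00000000000001 -d SPDK_Controller
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_subsystem_add_ns nqn.2023-01.io.longhorn.spdk:raid01 raid01
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py nvmf_subsystem_add_listener nqn.2023-01.io.longhorn.spdk:raid01 -t tcp -f ipv4 -a <node0-ip> -s 4420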

The related test result

For Test 1, after stopping spdk_tgt on the remote node, getting the RAID does not tell users that one of the base bdevs is invalid:

"raid": {
    "base_bdevs_list": [
        {
            "data_offset": 0,
            "data_size": 25600,
            "is_configured": true,
            "name": "spdk-00/lvol0",
            "uuid": "dcb09a53-549b-4dd4-9612-3f129ca80518"
        },
        {
            "data_offset": 0,
            "data_size": 25600,
            "is_configured": true,
            "name": "lvol1n1",
            "uuid": "2b88167e-322d-42db-982c-12e5281ba904"
        }
    ],
    "num_base_bdevs": 2,
    "num_base_bdevs_discovered": 2,
    "num_base_bdevs_operational": 2,
    "raid_level": "raid1",
    "state": "online",
    "strip_size_kb": 0,
    "superblock": false
}
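
For reference, the state above is what the usual queries report for the RAID; depending on the SPDK revision it shows up under driver_specific in bdev_get_bdevs or via bdev_raid_get_bdevs, e.g.:

    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_get_bdevs -b raid01
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_raid_get_bdevs all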

And trying to expose the RAID shows the weird error below:

{
        "id": 70002,
        "jsonrpc": "2.0",
        "method": "nvmf_create_subsystem",
        "params": {
                "nqn": "nqn.2023-01.io.longhorn.spdk:raid01",
                "allow_any_host": true
        }
}
{
        "id": 70002,
        "jsonrpc": "2.0",
        "result": true
}
{
        "id": 70003,
        "jsonrpc": "2.0",
        "method": "nvmf_subsystem_add_ns",
        "params": {
                "nqn": "nqn.2023-01.io.longhorn.spdk:raid01",
                "namespace": {
                        "bdev_name": "raid01"
                }
        }
}
{
        "id": 70003,
        "jsonrpc": "2.0",
        "error": {
                "code": -32603,
                "message": "Unable to add ns, subsystem in active state"
        }
}
FATA[0000] Failed to run start expose command            error="error sending message, id 70003, method nvmf_subsystem_add_ns, params {Nqn:nqn.2023-01.io.longhorn.spdk:raid01 Namespace:{Nsid:0 BdevName:raid01 Nguid: Eui64: UUID: Anagrpid: PtplFile:} TgtName:}: {\n\t\"code\": -32603,\n\t\"message\": \"Unable to add ns, subsystem in active state\"\n}"

But querying the down remote lvol (which is seen as an NVMe bdev on node 0) with the bdev_nvme_get_controllers API, rather than bdev_get_bdevs, does show the failed state:

        {
                "name": "lvol1",
                "ctrlrs": [
                        {
                                "state": "failed",
                                "cntlid": 3,
                                "trid": {
                                        "trtype": "TCP",
                                        "adrfam": "IPv4",
                                        "traddr": "24.199.115.72",
                                        "trsvcid": "4420",
                                        "subnqn": "nqn.2023-01.io.longhorn.spdk:lvol1"
                                },
                                "host": {
                                        "nqn": "nqn.2014-08.org.nvmexpress:uuid:aef07dc4-f291-48f2-9fee-1541f47a8519",
                                        "addr": "",
                                        "svcid": ""
                                }
                        }
                ]
        }
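
For reference, the two queries being compared are roughly the following (the name filters are optional; the names match the outputs above):

    # bdev_get_bdevs does not surface the failure for the remote lvol's bdev
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_get_bdevs -b lvol1n1
    # bdev_nvme_get_controllers reports the underlying controller state ("failed")
    sudo ~/go/src/github.com/longhorn/spdk/scripts/rpc.py bdev_nvme_get_controllers -n lvol1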

Expected behaviors

  1. Getting the RAID should tell users which lvols are invalid, with a correct num_base_bdevs_operational count.
  2. A RAID containing invalid lvols as well as healthy ones should still work: it should be able to be exposed, and IO should keep functioning.

@shuo-wu I have verified this issue against https://review.spdk.io/gerrit/c/spdk/spdk/+/16167 and it still persists. The RAID bdev is not registered, so nvmf can’t add it to the subsystem. Some work on operational base bdevs has been done in this commit; I left a comment in Gerrit because I think min_base_bdevs_operational (which equals 1 for raid1) should be used to determine whether to configure the RAID bdev.