longhorn: [IMPROVEMENT] Use PDB to protect Longhorn components from unexpected drains

When using kubectl drain on a single-node cluster, there can be cases where it is impossible to correctly clean up/detach a volume, because the CSI sidecars may be evicted before the pod/VolumeAttachment is cleaned up correctly.

It would be useful to have a PDB on the CSI sidecars to prevent all sidecars from becoming unavailable. We could consider removing the PDB once we have verified that no Longhorn volumes are still in use. But even as a first step, just adding a PDB with minAvailable of 1 for the CSI sidecars would be beneficial.
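A minimal sketch of such a PDB, assuming Longhorn runs in the longhorn-system namespace and the attacher pods carry the app=csi-attacher label (the label matches the drain selectors quoted later in this thread; the PDB name is hypothetical):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: longhorn-csi-attacher-pdb   # hypothetical name
  namespace: longhorn-system        # assumed Longhorn namespace
spec:
  minAvailable: 1                   # keep at least one csi-attacher pod running during drains
  selector:
    matchLabels:
      app: csi-attacher             # label taken from the drain selectors used later in this issue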

A similar protection would be useful for the share-manager, since it's possible that the share-manager pod gets evicted before the NFS unmount happens in the csi-plugin, which can stall the kubelet/csi-plugin. In single-node clusters the NFS mount would get stuck, since the share-manager would no longer come up while the node is cordoned during the drain process.
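One way to spot such a stuck NFS mount on the node (a sketch; the mount path below is a placeholder):

# List NFS mounts on the node; a share-manager export shows up as an nfs4 mount
findmnt -t nfs,nfs4

# A stuck mount hangs on access, so bound the check with a timeout
timeout 5 ls /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount || echo "mount appears stuck"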


A separate idea is to provide a script that excludes the Longhorn components/namespace from the drain: if you drain all workloads that use Longhorn volumes on that node, there won't be any active volumes left on it. This does not apply to replicas and the share-manager, since they could be used by a different node in a multi-node cluster.

Similar to what KubeVirt does in its maintenance guide, provide a selector that drains the appropriate components. REF: https://kubevirt.io/user-guide/operations/node_maintenance/#evict-all-vms-from-a-node REF: https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain
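For reference, the KubeVirt guide linked above drains only its own workload pods via a label selector, roughly like this (the kubevirt.io=virt-launcher label comes from that guide; exact flags vary by Kubernetes version):

kubectl drain <node-name> --delete-emptydir-data --ignore-daemonsets=true --force --pod-selector=kubevirt.io=virt-launcher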

Most upvoted comments

Test Plan

1. Basic unit tests

1.1 Single node cluster

1.1.1 RWO volumes

  • Deploy Longhorn
  • Verify that there is no PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook
  • Manually create a PVC (to simulate a volume that has never been attached)
  • Verify that there is no PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook because there is no attached volume
  • Create a deployment that uses one RWO Longhorn volume.
  • Verify that there is a PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook (a quick check command is sketched after this list)
  • Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
  • Drain the node by kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
  • Observe that the workload pods are evicted first -> PDBs of csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook are removed -> csi-attacher, csi-provisioner, longhorn-admission-webhook, longhorn-conversion-webhook, and instance-manager-e pods are evicted -> all volumes are successfully detached
  • Observe that instance-manager-r is NOT evicted. In the current design, if the node contains the only healthy replica of the volume, the instance-manager-r cannot be evicted. This will be addressed in https://github.com/longhorn/longhorn/issues/5549
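A quick way to run the PDB checks in this plan (a sketch; assumes Longhorn is installed in the longhorn-system namespace):

# The component PDBs should appear and disappear as described in the steps above
kubectl get pdb -n longhorn-system -w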

1.1.2 RWX volume

  • Deploy Longhorn
  • Verify that there is no PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook
  • Create a deployment of 2 pods that uses one RWX Longhorn volume.
  • Verify that there is a PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook
  • Drain the node by kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
  • Observe that the workload pods are evicted first -> PDBs of csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook are removed -> csi-attacher, csi-provisioner, longhorn-admission-webhook, longhorn-conversion-webhook, and instance-manager-e pods are evicted -> all volumes are successfully detached
  • Observe that instance-manager-r is NOT evicted. In the current design, if the node contains the only healthy replica of the volume, the instance-manager-r cannot be evicted. This will be addressed in https://github.com/longhorn/longhorn/issues/5549

1.2 Multi-node cluster

  • Deploy Longhorn
  • Verify that there is no PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook
  • Manually create a PVC (to simulate a volume that has never been attached)
  • Verify that there is no PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook because there is no attached volume
  • Create a deployment that uses one RWO Longhorn volume.
  • Verify that there is a PDB for csi-attacher, csi-provisioner, longhorn-admission-webhook, and longhorn-conversion-webhook
  • Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
  • Create a deployment of 2 pods that uses one RWX Longhorn volume.
  • Drain each node one by one with kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force (a loop is sketched after this list)
  • Verify that the drain can finish successfully
  • Uncordon the node and move to the next node
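A rough loop for the drain/verify/uncordon cycle above (a sketch; node names are placeholders):

for node in node-1 node-2 node-3; do   # placeholder node names
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --force
  kubectl uncordon "$node"
  # wait for the node to be schedulable again before moving to the next one
  kubectl wait --for=condition=Ready node/"$node" --timeout=300s
done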

2. Upgrade Kubernetes for k3s cluster with standalone System Upgrade Controller deployment

  • Deploy a 3-node cluster where each node has all roles (master + worker)
  • Install the System Upgrade Controller
  • Deploy Longhorn
  • Manually create a PVC (to simulate a volume that has never been attached)
  • Create a deployment that uses one RWO Longhorn volume.
  • Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
  • Create another deployment of 2 pods that uses one RWX Longhorn volume.
  • Deploy a Plan CR to upgrade Kubernetes, similar to:
    apiVersion: upgrade.cattle.io/v1
    kind: Plan
    metadata:
      name: k3s-server
      namespace: system-upgrade
    spec:
      concurrency: 1
      cordon: true
      nodeSelector:
        matchExpressions:
        - key: node-role.kubernetes.io/master
          operator: In
          values:
          - "true"
      serviceAccountName: system-upgrade
      drain:
        force: true
        skipWaitForDeleteTimeout: 60 # 1.18+ (honor pod disruption budgets up to 60 seconds per pod then moves on)
      upgrade:
        image: rancher/k3s-upgrade
      version: v1.21.11+k3s1
    
    Note that concurrency should be 1 so that nodes are upgraded one by one, version should be a newer K3s version, and the plan should contain the drain stage.
  • Verify that the upgrade went smoothly
  • Exec into the workload pods and make sure the data is still there (a checksum-based check is sketched after this list)
  • Repeat the upgrade process above 5 times to make sure it is stable
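A sketch of the apply-and-verify commands for this section (the deployment name demo, mount path /data, and filenames are hypothetical):

# Before the upgrade, write checksummed test data into the volume
kubectl exec deploy/demo -- sh -c 'head -c 10M /dev/urandom > /data/testfile && md5sum /data/testfile > /data/testfile.md5'

# Apply the upgrade plan
kubectl apply -f k3s-server-plan.yaml   # hypothetical filename

# Watch the upgrade jobs created by the System Upgrade Controller
kubectl -n system-upgrade get plans,jobs

# Confirm each node reports the new K3s version after its turn
kubectl get nodes -o wide

# After the upgrade, verify the data is intact
kubectl exec deploy/demo -- md5sum -c /data/testfile.md5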

3. Upgrade Kubernetes for imported k3s cluster in Rancher

  • Create a 3-node k3s cluster where each node has both the master and worker roles. K3s should be an old version, such as v1.21.9+k3s1, so that we can upgrade multiple times. Instructions for creating such a cluster are here: https://docs.k3s.io/datastore/ha-embedded
  • Import the cluster into Rancher: go to cluster management -> create new cluster -> generic cluster -> follow the instructions there
  • Update the upgrade strategy in cluster management -> click the three-dots menu on the imported cluster -> edit config -> K3s options -> configure drain for both control plane and worker nodes as shown (screenshot omitted)
  • Install Longhorn
  • Manually create a PVC (to simulate a volume that has never been attached)
  • Create a deployment that uses one RWO Longhorn volume.
  • Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
  • Create another deployment of 2 pods that uses one RWX Longhorn volume.
  • Use Rancher to upgrade the cluster to a newer Kubernetes version
  • Verify that the upgrade went smoothly
  • Exec into the workload pods and make sure the data is still there

4. Upgrade Kubernetes for provisioned k3s cluster in Rancher

  • Use Rancher to provision a k3s cluster with an old version, for example v1.22.11+k3s2. The cluster has 3 nodes, each with both the worker and master roles. Set the upgrade strategy as shown (screenshot omitted)
  • Install Longhorn
  • Manually create a PVC (to simulate a volume that has never been attached)
  • Create a deployment that uses one RWO Longhorn volume.
  • Create another deployment that uses one RWO Longhorn volume. Scale down this deployment so that the volume is detached
  • Create another deployment of 2 pods that uses one RWX Longhorn volume.
  • Use Rancher to upgrade the cluster to a newer Kubernetes version
  • Verify that the upgrade went smoothly
  • Exec into the workload pods and make sure the data is still there

After updating the docs for node maintenance and Kubernetes upgrade (we don't have the latter yet, so we can probably create a dedicated page for it; overall it's similar to node maintenance, or we may want to combine the two), let's move this to ready-for-testing.

The goal is to prevent the pods that manage the volume lifecycle from being evicted by the drain operation: basically, have PDBs protecting the Longhorn webhooks, CSI sidecars, and CSI deployer while there are volume engines still running on the node.

After all running volumes are evicted from the draining node, we can delete those PDBs so the drain operation can continue. @PhanLe1010 WDYT?

@PhanLe1010 just noticed the doc has not been updated with the latest excluded-pod selector, including the webhooks. Please help with that update. Thanks.

Assuming that the goal is to get rid of old versions of instance managers, I have tested the following steps (a consolidated script is sketched after the list):

  1. Stop the automatic engine upgrade
  2. Cordon all nodes
  3. For each node:
    1. drain by the command kubectl drain --pod-selector='app!=csi-attacher,app!=csi-provisioner,longhorn.io/component!=instance-manager,app!=longhorn-admission-webhook,app!=longhorn-conversion-webhook' <node> --ignore-daemonsets
    2. uncordon the node
    3. wait for the workloads on this node to fully come back
    4. cordon this node again
    5. move to next node
  4. Wait until all old instance managers are removed by Longhorn
  5. Uncordon all nodes
  6. Enable the automatic engine upgrade
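A consolidated script for the steps above (a sketch; node names and the fixed wait are placeholders, and the selector is the one from step 3):

# Cordon every node first so evicted workloads cannot be rescheduled elsewhere
kubectl cordon node-1 node-2 node-3   # placeholder node names

for node in node-1 node-2 node-3; do
  # evict only user workloads; Longhorn components are excluded by the selector
  kubectl drain "$node" \
    --pod-selector='app!=csi-attacher,app!=csi-provisioner,longhorn.io/component!=instance-manager,app!=longhorn-admission-webhook,app!=longhorn-conversion-webhook' \
    --ignore-daemonsets
  kubectl uncordon "$node"
  sleep 120   # placeholder: wait for the workloads on this node to fully come back
  kubectl cordon "$node"
done

# After Longhorn has removed the old instance managers:
kubectl uncordon node-1 node-2 node-3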

Just leaving this note here for future reference, in case someone wants to shut down all volumes without having to modify the scale of the end-user/application deployments in the cluster, as well as perform a zero-rebuild maintenance.

How useful this is depends on the situation 😃

  • they can cordon each node first
  • then run the below drain command on each node; this will ignore all the instance-managers + webhooks
  • this will only evict the workloads; since each node is cordoned, the scheduler will be unable to place the workloads on any other node, and therefore all volumes and all engine/replica processes will be shut down cleanly.

kubectl drain --pod-selector='app!=csi-attacher,app!=csi-provisioner,longhorn.io/component!=instance-manager,app!=longhorn-admission-webhook,app!=longhorn-conversion-webhook' <node> --ignore-daemonsets

Note from @innobead

So, I think this should be good enough for now, but we should probably also update the upgrade part of our docs regarding drain usage. https://longhorn.io/docs/1.2.2/volumes-and-nodes/maintenance/#updating-the-node-os-or-container-runtime

I tested the label selector idea and it works well as a current workaround; example below. We just need to fix up/standardize our labels; the current instance-manager already has a nice set of labels.

# key-absent selector: matches only pods without the longhorn.io/component label
kubectl drain --pod-selector='!longhorn.io/component' jmoody-lh-work3 --ignore-daemonsets

# key-absent selector combined with not-equal selectors
kubectl drain --pod-selector='!longhorn.io/component,app!=csi-attacher,app!=csi-provisioner' jmoody-lh-work1 --ignore-daemonsets

Workaround for the CSI sidecars getting drained: kubectl drain --pod-selector='app!=csi-attacher,app!=csi-provisioner' <node> --ignore-daemonsets