longhorn: [BUG] Improve Kubernetes node drain support

Describe the bug After rolling k8s nodes hosting StatefulSet workloads in an AWS auto-scaling group (first creating new nodes, then cordoning and draining the old ones, then terminating their EC2 instances one by one), some (out of many) volumes ended up “attached” to a node that no longer exists. This caused a StatefulSet pod to get stuck after being scheduled to a new node. The “Down” node also couldn’t be deleted from the node pool until the volume was manually detached.

To Reproduce Not sure if this reproduces reliably, but I’ve seen this twice now.

  1. Deploy a number of StatefulSets with PVCs on a k8s node group deployed as an AWS auto-scaling group
  2. Update the ASG launch template
  3. Roll the group with https://github.com/hellofresh/eks-rolling-update
  4. Observe that some pods cannot start because volume attachment fails

Expected behavior Volumes should be detached automatically from nodes that are being drained, and especially from nodes that no longer exist.

Log Failing pod event:

AttachVolume.Attach failed for volume "pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c" : rpc error: code = FailedPrecondition desc = The volume pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c cannot be attached to the node ip-10-0-197-215.eu-west-2.compute.internal since it is already attached to the node ip-10-0-110-63.eu-west-2.compute.internal

Longhorn volume event:

Error stopping pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-r-13878a7f: Operation cannot be fulfilled on instancemanagers.longhorn.io "instance-manager-r-9672e5cb": the object has been modified; please apply your changes to the latest version and try again

Many errors like this in one of the longhorn-manager logs:

1129:2020-07-22T21:02:48.974414429Z E0722 21:02:48.974300       1 engine_controller.go:668] fail to update status for engine pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-e-50a061ed: failed to list replicas from controller 'pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c': Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.0.0/longhorn [--url 10.0.88.237:10001 ls], output , stderr, time="2020-07-22T21:02:48Z" level=fatal msg="Error running ls command: failed to list replicas for volume 10.0.88.237:10001: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.88.237:10001: connect: no route to host\""

Environment:

  • Longhorn version: 1.0.0
  • Kubernetes version: EKS 1.17.6
  • Node OS type and version: EKS-optimised Amazon Linux 2

Additional context This happened in a setup where the workload nodes host some volumes and there is a separate node group for Longhorn volumes.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 31 (16 by maintainers)

Most upvoted comments

We get the same problem occasionally when we drain our nodes. As a workaround, we deploy a helper program on every node via a DaemonSet. The helper program blocks deletion of the Longhorn Instance Manager pods until all pods which mount Longhorn volumes have finished detaching their volumes during kubectl drain. It works well in our situation, so we would like to share what we’ve learned.

Environment

  • Kubernetes v1.18.6
  • Longhorn v1.0.2 with fix for node.session.scan (custom build)
  • Ubuntu 18.04
  • Set node.session.scan = manual in iscsid.conf
  • Our clusters are recreated every day for testing purposes in a dev environment.
    • Recreate clusters by creating and deleting nodes one by one.
    • Every node is drained by kubectl drain before deletion.

Problem

We had the following problems when we recreated clusters by draining nodes:

  • A volume occasionally remains stuck attached after draining (same as this issue)
    • Instance Manager (Engine and Replica) pods are not managed by a DaemonSet, so when a node is drained, the Instance Manager pods and the application pods which mount Longhorn volumes are deleted simultaneously. Therefore the volumes of application pods cannot be detached properly during kubectl drain.
    • DaemonSet managed pods are not deleted by kubectl drain. We use --ignore-daemonsets option to ignore them.
  • All replicas of a volume are lost if nodes are drained too quickly.
  • Frequent deletion of replicas generates many old snapshots.

Workaround

As a workaround, we deploy the following helper program on every node via a DaemonSet (we call it longhorn-evictor). With this helper, we no longer see the problems above.

Source code: https://gist.github.com/tksm/667c0562009df7c57a8cc1126d68fc52#file-main-go (not for production use)

Basically it works as follows on every node (a rough shell sketch of this flow is included after the example PDB below).

  • Create a PodDisruptionBudget for the Longhorn Instance Manager on the node to block its deletion during kubectl drain.
  • Wait for the node to become Unschedulable.
    • Consider it as kubectl drain start.
  • Wait for all longhorn volumes on the node to be detached.
    • Ensure there are no longhorn volumes whose currentNodeID or ownerID is assigned this node.
  • Evict Longhorn replicas on the node by deleting replicas.longhorn.io objects.
  • Delete the PodDisruptionBudget for the Longhorn Instance Manager on the node.
    • Longhorn Instance Manager pods will be safely deleted.

The PodDisruptionBudget for the Longhorn Instance Manager (Engine) looks like this. maxUnavailable: 0 blocks eviction of the pods.

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  labels:
    stateful.zlab.co.jp/app: longhorn-evictor
  name: longhorn-engine-demo-tkusumi118-worker-default-9d5e2007-9k8hb
  namespace: longhorn-system
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-type: engine
      longhorn.io/node: demo-tkusumi118-worker-default-9d5e2007-9k8hb
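
For reference, here is the rough shell sketch of the per-node flow above. This is not the actual Go helper from the gist; the PDB name is made up for illustration and the volume/replica field names follow this thread’s description, so treat it as a sketch under those assumptions, not production code.

#!/bin/bash
# Sketch of the longhorn-evictor flow described above, run once per node.
# Assumes kubectl and jq are available; NODE is the node this runs on.
set -euo pipefail
NODE="$1"

# 1. Create the PDB so the engine instance-manager pod cannot be evicted.
#    (Same shape as the example PDB above; the name is a placeholder.)
kubectl apply -f - <<EOF
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: longhorn-engine-${NODE}
  namespace: longhorn-system
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-type: engine
      longhorn.io/node: ${NODE}
EOF

# 2. Wait for the node to become Unschedulable (kubectl drain has started).
until [[ "$(kubectl get node "$NODE" -o 'jsonpath={.spec.unschedulable}')" == "true" ]]; do
  sleep 5
done

# 3. Wait until no Longhorn volume has this node as currentNodeID or ownerID
#    (field names as described earlier in this comment).
while kubectl -n longhorn-system get volumes.longhorn.io -o json \
  | jq -e --arg n "$NODE" \
      '.items[] | select(.status.currentNodeID == $n or .status.ownerID == $n)' > /dev/null; do
  echo "waiting for volumes on ${NODE} to detach..."
  sleep 5
done

# 4. Evict Longhorn replicas on this node by deleting the replica objects.
for r in $(kubectl -n longhorn-system get replicas.longhorn.io -o json \
  | jq -r --arg n "$NODE" '.items[] | select(.spec.nodeID == $n) | .metadata.name'); do
  kubectl -n longhorn-system delete replicas.longhorn.io "$r"
done

# 5. Remove the PDB so the engine instance-manager pod can finally be evicted.
kubectl -n longhorn-system delete pdb "longhorn-engine-${NODE}" --ignore-not-found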

@tksm Using PodDisruptionBudget is a genius idea! We will see if we can use your idea to fix #1286 and this issue in v1.1.0.

@yasker Just wanted to mention that we tried out this PDB when draining nodes and it worked flawlessly! Every workload that uses volumes migrated in a matter of seconds with nothing getting stuck, improving general service stability a lot during maintenance. Thanks for this @tksm!

@tksm Thanks for the details.

For others who are using pre-1.1.0 versions, the following script will help:

NODE="$1"   # the node to drain

# Drain everything except the Longhorn instance-manager pods.
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data \
  --pod-selector='longhorn.io/component!=instance-manager'

# Wait until every Longhorn volume reports "detached".
for vol in $(kubectl -n longhorn-system get volumes.longhorn.io -o json | jq -r '.items[].metadata.name'); do
  while [[ $(kubectl -n longhorn-system get volumes.longhorn.io "$vol" -o 'jsonpath={.status.state}') != "detached" ]]; do
    echo "waiting for volume $vol to detach..." && sleep 1
  done
done

@PhanLe1010 I understand #298 is for manual eviction, but that will not help on OKD/OpenShift and other platforms with automated rolling upgrades. An OpenShift cluster upgrade is completely automated and drains/upgrades/reboots nodes one at a time (or a specified percentage of nodes).

Protecting replicas with a PDB would be nice; you should be unable to “drain --force” a node if there is no other ready/in-sync replica somewhere else.
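
For illustration, the replica-side protection being asked for could start from a PDB analogous to the engine one above, with the instance-manager type label switched to replica (the node name is a placeholder). Note that a static maxUnavailable: 0 blocks eviction unconditionally, so the “only if another healthy replica exists” condition would still need something to create and remove the PDB at the right time.

# Sketch only: per-node PDB for the replica instance manager, by analogy
# with the engine PDB shown earlier in this thread.
NODE="<node-name>"   # placeholder: the node to protect
kubectl apply -f - <<EOF
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: longhorn-replica-${NODE}
  namespace: longhorn-system
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-type: replica
      longhorn.io/node: ${NODE}
EOF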

I am getting the same type of errors after kured ran across my nodes today…

k3s 1.18.8 longhorn 1.0.2

longhorn-support-bundle_8e0655a7-95cd-4269-840f-89a99f866561_2020-09-03T17-41-55Z.zip

Here may be the root cause. Since some logs are missing from the support bundle, I cannot guarantee the analysis is accurate:

  1. During the node drain, the related engine manager pod is evicted, hence the volume on the node crashes unexpectedly. Longhorn then automatically detaches and reattaches the volume as part of its auto-recovery mechanism. (Cannot find any log for this.)
  2. During the auto detachment, the workload pod starts terminating and Kubernetes sends a detach request to the Longhorn CSI plugin.
  3. The CSI plugin finds the volume is already detaching/detached, hence it blindly reports success. But this detachment was not controlled/issued by Kubernetes, and an auto reattachment will follow it; the detached state is only transient for the volume.
2020-07-24T15:16:05.012899952Z time="2020-07-24T15:16:05Z" level=info msg="GRPC request: {\"node_id\":\"ip-10-0-123-229.eu-west-2.compute.internal\",\"volume_id\":\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\"}"
2020-07-24T15:16:05.012960414Z time="2020-07-24T15:16:05Z" level=info msg="ControllerServer ControllerUnpublishVolume req: volume_id:\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\" node_id:\"ip-10-0-123-229.eu-west-2.compute.internal\" "
2020-07-24T15:16:05.015096193Z time="2020-07-24T15:16:05Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7 is detaching"

......

2020-07-24T15:17:08.079802047Z time="2020-07-24T15:17:08Z" level=info msg="GRPC call: /csi.v1.Controller/ControllerUnpublishVolume"
2020-07-24T15:17:08.080826605Z time="2020-07-24T15:17:08Z" level=info msg="GRPC request: {\"node_id\":\"ip-10-0-123-229.eu-west-2.compute.internal\",\"volume_id\":\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\"}"
2020-07-24T15:17:08.082716697Z time="2020-07-24T15:17:08Z" level=info msg="ControllerServer ControllerUnpublishVolume req: volume_id:\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\" node_id:\"ip-10-0-123-229.eu-west-2.compute.internal\" "
2020-07-24T15:17:08.084673822Z time="2020-07-24T15:17:08Z" level=info msg="don't need to detach Volume pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7 since we are already detached from node ip-10-0-123-229.eu-west-2.compute.internal"
  4. Kubernetes is informed that the volume is detached, so it launches a new workload pod that uses the volume. But the volume has already been automatically reattached to the draining node (before the node is gone). (Cannot find any logs for this reattachment.) The new workload then gets stuck.
2020-07-24T15:17:14.212174175Z time="2020-07-24T15:17:14Z" level=info msg="ControllerServer ControllerPublishVolume req: volume_id:\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\" node_id:\"ip-10-0-235-9.eu-west-2.compute.internal\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"baseImage\" value:\"\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"numberOfReplicas\" value:\"3\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1595117863611-8081-driver.longhorn.io\" > "
2020-07-24T15:17:14.216267097Z time="2020-07-24T15:17:14Z" level=error msg="GRPC error: rpc error: code = Aborted desc = The volume pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c is attaching"

......

2020-07-24T15:17:29.983502391Z time="2020-07-24T15:17:29Z" level=info msg="ControllerServer ControllerPublishVolume req: volume_id:\"pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7\" node_id:\"ip-10-0-235-9.eu-west-2.compute.internal\" volume_capability:<mount:<fs_type:\"ext4\" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:\"baseImage\" value:\"\" > volume_context:<key:\"fromBackup\" value:\"\" > volume_context:<key:\"numberOfReplicas\" value:\"3\" > volume_context:<key:\"staleReplicaTimeout\" value:\"30\" > volume_context:<key:\"storage.kubernetes.io/csiProvisionerIdentity\" value:\"1595117863611-8081-driver.longhorn.io\" > "
2020-07-24T15:17:29.987886364Z time="2020-07-24T15:17:29Z" level=error msg="GRPC error: rpc error: code = FailedPrecondition desc = The volume pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7 cannot be attached to the node ip-10-0-235-9.eu-west-2.compute.internal since it is already attached to the node ip-10-0-123-229.eu-west-2.compute.internal"
  5. The node is gone, and the volume and the related engine manager pod are then considered Unknown.
- typemeta:
    kind: ""
    apiversion: ""
  objectmeta:
    name: instance-manager-e-4aef6e75
    generatename: ""
    namespace: longhorn-system
    selflink: /apis/longhorn.io/v1beta1/namespaces/longhorn-system/instancemanagers/instance-manager-e-4aef6e75
    uid: b66e5a91-c2d6-499f-808f-57447d910973
    resourceversion: "11444629"
    generation: 1
    creationtimestamp: "2020-07-24T13:05:53Z"
    deletiontimestamp: null
    deletiongraceperiodseconds: null
    labels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-image: longhornio-longhorn-instance-manager-v1_20200514
      longhorn.io/instance-manager-type: engine
      longhorn.io/node: ip-10-0-123-229.eu-west-2.compute.internal
    annotations: {}
    ownerreferences:
    - apiversion: longhorn.io/v1beta1
      kind: Node
      name: ip-10-0-123-229.eu-west-2.compute.internal
      uid: fa0df2e5-c451-4d1d-bdaf-877a8f47ba0f
      controller: null
      blockownerdeletion: true
    initializers: null
    finalizers:
    - longhorn.io
    clustername: ""
    managedfields: []
  spec:
    image: longhornio/longhorn-instance-manager:v1_20200514
    nodeid: ip-10-0-123-229.eu-west-2.compute.internal
    type: engine
    engineimage: ""
  status:
    ownerid: ip-10-0-218-139.eu-west-2.compute.internal
    currentstate: unknown
    instances:
      pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7-e-bfe75919:
        spec:
          name: pvc-3d6fe7af-7536-4367-8a33-249d8c2550d7-e-bfe75919
        status:
          endpoint: ""
          errormsg: ""
          listen: ""
          portend: 10009
          portstart: 10009
          state: running
          type: ""
          resourceversion: 0
      pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-e-50a061ed:
        spec:
          name: pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-e-50a061ed
        status:
          endpoint: ""
          errormsg: ""
          listen: ""
          portend: 10008
          portstart: 10008
          state: running
          type: ""
          resourceversion: 0
    ip: 10.0.114.231
    apiminversion: 1
    apiversion: 1

The simplest workaround is manually detaching the volume. @yasker To solve this issue, we may need to distinguish the auto detach & reattach flow from a regular detach and attach.
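
For the manual detach mentioned above, something like the following may work, assuming this Longhorn version drives attachment through the Volume CR’s spec.nodeID field (detaching from the Longhorn UI is the safer, supported route; the volume name is a placeholder):

# Assumption: clearing spec.nodeID on the Volume CR asks longhorn-manager to
# detach the volume, which is what the Longhorn UI's Detach action does.
VOL="<stuck-volume-name>"   # placeholder, e.g. one of the pvc-... volumes above
kubectl -n longhorn-system patch volumes.longhorn.io "$VOL" \
  --type=merge -p '{"spec":{"nodeID":""}}'

# Then confirm the volume reports "detached" before rescheduling the workload.
kubectl -n longhorn-system get volumes.longhorn.io "$VOL" -o 'jsonpath={.status.state}'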

BTW, according to the above analysis, this issue is caused by a race condition. I am not sure why it is not triggered without setting PDBs.

Without the autoscaling group, I was not able to repro the issue. After draining a node, the pod moves to another node and the volume is automatically detached first and then attached to the new node. I’ll try to repro this with autoscaling and update.