longhorn: [BUG] Improve Kubernetes node drain support
Describe the bug
After rolling k8s nodes with StatefulSet workloads in an AWS auto-scaling group (first creating new nodes, then cordoning and draining the old nodes, then terminating the EC2 instances one by one), some (out of many) volumes ended up “attached” to a non-existent node. This caused a StatefulSet pod to get stuck after being scheduled to a new node. The “Down” node couldn’t be deleted from the pool either until the volume was manually detached.
To Reproduce
Not sure if this reproduces reliably, but I’ve seen this twice now.
- Deploy a number of StatefulSets with PVCs on a k8s node group deployed as an AWS auto-scaling group
- Update the ASG launch template
- Roll the group with https://github.com/hellofresh/eks-rolling-update
- Observe that some pods are unable to be scheduled due to a failure in volume attachment (the per-node cordon/drain step is sketched below)
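For context, the per-node cordon-and-drain step that the rolling tool performs is, assuming it uses the standard Kubernetes eviction flow with typical options, roughly equivalent to the following (the node name is only an example taken from the logs below):

```sh
# Rough sketch of the per-node roll; the exact flags used by eks-rolling-update are an
# assumption, and the node name is just an example.
kubectl cordon ip-10-0-110-63.eu-west-2.compute.internal
kubectl drain ip-10-0-110-63.eu-west-2.compute.internal \
  --ignore-daemonsets \
  --delete-local-data \
  --timeout=300s
# The old EC2 instance is then terminated through the auto-scaling group.
```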
Expected behavior
Volumes should be detached automatically from nodes that are being drained, and especially from nodes that no longer exist.
Log
Failing pod event:
AttachVolume.Attach failed for volume "pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c" : rpc error: code = FailedPrecondition desc = The volume pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c cannot be attached to the node ip-10-0-197-215.eu-west-2.compute.internal since it is already attached to the node ip-10-0-110-63.eu-west-2.compute.internal
Longhorn volume event:
Error stopping pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-r-13878a7f: Operation cannot be fulfilled on instancemanagers.longhorn.io "instance-manager-r-9672e5cb": the object has been modified; please apply your changes to the latest version and try again
Many errors like this in one of the longhorn-manager logs:
1129:2020-07-22T21:02:48.974414429Z E0722 21:02:48.974300 1 engine_controller.go:668] fail to update status for engine pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c-e-50a061ed: failed to list replicas from controller 'pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c': Failed to execute: /var/lib/longhorn/engine-binaries/longhornio-longhorn-engine-v1.0.0/longhorn [--url 10.0.88.237:10001 ls], output , stderr, time="2020-07-22T21:02:48Z" level=fatal msg="Error running ls command: failed to list replicas for volume 10.0.88.237:10001: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.88.237:10001: connect: no route to host\""
Environment:
- Longhorn version: 1.0.0
- Kubernetes version: EKS 1.17.6
- Node OS type and version: EKS-optimised Amazon Linux 2
Additional context
This happened when the workload nodes were hosting some volumes and there was a separate node group for Longhorn volumes.
About this issue
- State: closed
- Created 4 years ago
- Comments: 31 (16 by maintainers)
We get the same problem occasionally when we drain our nodes. As a workaround, we deploy a helper program on every node by DaemonSet. The helper program blocks deletion of Longhorn Instance Manager pods until all pods which mount Longhorn volumes finish detaching during kubectl drain. It works fine for our situation, so we would like to share something we’ve learned.

Environment
- node.session.scan = manual is set in iscsid.conf
- Nodes are drained with kubectl drain before deletion.

Problem
We had the following problems when we recreated clusters by draining nodes:
- Pods that mount Longhorn volumes did not always finish detaching their volumes during kubectl drain.
- Instance Manager pods could be deleted before those volumes finished detaching during kubectl drain. (DaemonSet pods are skipped; we use the --ignore-daemonsets option to ignore them.)

Workaround
As a workaround, we deploy the following helper program on every node by DaemonSet (we call it longhorn-evictor). With this helper, we do not see the problems above.

Source code: https://gist.github.com/tksm/667c0562009df7c57a8cc1126d68fc52#file-main-go (not for production use)

Basically, it works as follows on every node:
- It protects the Instance Manager pods on its node with a PodDisruptionBudget before kubectl drain.
- It watches its node and treats the node becoming Unschedulable as the signal that kubectl drain has started.
- It then waits until no volume whose currentNodeID or ownerID is assigned to this node remains, and until there are no more replicas.longhorn.io objects on the node.
- Once everything is gone, it removes the protection so the Instance Manager pods can be evicted and the drain can complete.

The PodDisruptionBudget for the Longhorn Instance Manager (Engine) looks like this.
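The manifest itself is not preserved in this excerpt. A minimal sketch, assuming the Engine Instance Manager pods live in the longhorn-system namespace and carry the labels shown below (label names can differ between Longhorn versions), could be applied like this:

```sh
# Sketch only: create a PDB that blocks eviction of the Engine Instance Manager pods.
# The label selector is an assumption and may need adjusting for your cluster.
kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1        # use policy/v1 on Kubernetes >= 1.21
kind: PodDisruptionBudget
metadata:
  name: longhorn-instance-manager-engine
  namespace: longhorn-system
spec:
  maxUnavailable: 0               # kubectl drain cannot evict the selected pods
  selector:
    matchLabels:
      longhorn.io/component: instance-manager
      longhorn.io/instance-manager-type: engine
EOF
```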
maxUnavailable: 0 blocks deletion of the pods.

@tksm Using PodDisruptionBudget is a genius idea! We will see if we can use your idea to fix #1286 and this issue in v1.1.0.

@yasker Just wanted to mention that we tried out this PDB when draining nodes and it worked flawlessly! Every workload that uses volumes migrated in a matter of seconds with nothing getting stuck, improving general service stability a lot during maintenance. Thanks for this @tksm!
@tksm Thanks for the details.
For others who are using a version before 1.1.0, the following script will help.
@PhanLe1010 I understand #298 is for manual eviction, but that will not help in OKD/OpenShift and other solutions with automated rolling upgrades. An OpenShift cluster upgrade is completely automated and drains/upgrades/reboots nodes one at a time (or a specified percentage of nodes).
Protecting replicas with a PDB would be nice: you should be unable to “drain --force” a node if there is no other ready/synced replica somewhere else.
I am getting all the same kinds of errors after kured ran across my nodes today…
k3s 1.18.8, Longhorn 1.0.2
longhorn-support-bundle_8e0655a7-95cd-4269-840f-89a99f866561_2020-09-03T17-41-55Z.zip
Here may be the root cause. Since some logs are missing from the support bundle, I cannot guarantee the analysis is accurate:
The volume appears to have been left in an Unknown state.

The simplest workaround is manually detaching the volume. @yasker To solve this issue, we may need to distinguish auto detach & reattach from regular detach and attach.
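For reference, one way to do that manual detach from the command line, assuming this Longhorn version drives attachment through the spec.nodeID field of the Volume custom resource (the usual route is the Detach action in the Longhorn UI), might look like:

```sh
# Sketch only: list the Longhorn Volume CRs, then clear the node assignment of the stuck
# volume. Assumption: clearing spec.nodeID triggers a detach in this Longhorn version;
# the volume name is the one from the logs above.
kubectl -n longhorn-system get volumes.longhorn.io
kubectl -n longhorn-system patch volumes.longhorn.io pvc-d937c093-5ba2-4ee7-9abe-563cb94d215c \
  --type=merge -p '{"spec":{"nodeID":""}}'
```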
BTW, according to the above analysis, this issue is caused by a race condition. I am not sure why it is not triggered without setting PDBs.
Without the autoscaling group, I was not able to reproduce the issue. After draining a node, the pod moves to another node and the volume automatically gets detached first and then attached to the new node. I’ll try to reproduce this with autoscaling and update.