longhorn: [BUG] After upgrading Longhorn from v1.0.0 to v1.0.1-rc1, longhorn-driver-deployer is in a crash loop.

Describe the bug After upgrading Longhorn from v1.0.0 to v1.0.1-rc1, longhorn-driver-deployer is in a crash loop. See the logs at the end of this issue. Aside from that, I am still able to use the StorageClass, but I am not sure whether an issue with longhorn-driver-deployer can cause more problems later.

To Reproduce Steps to reproduce the behavior:

  1. Install Longhorn v1.0.0 from the Helm chart
  2. Upgrade Longhorn to v1.0.1-rc1 with Helm (example commands below)
  3. See the error in the longhorn-driver-deployer logs
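For reference, a minimal sketch of what the Helm upgrade looks like, assuming the chart comes from the official Longhorn Helm repository and the release is named "longhorn" in the longhorn-system namespace (the release name, repository URL, and exact chart version string are assumptions, not copied from the report):

helm repo add longhorn https://charts.longhorn.io
helm repo update
# Upgrade the existing release to the 1.0.1 release candidate
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --version 1.0.1-rc1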

Expected behavior longhorn-driver-deployer finishes deploying the CSI driver and does not crash loop.

Log

kubectl logs  -n longhorn-system  longhorn-driver-deployer-8547759f46-6r2qd -p
time="2020-07-17T06:55:42Z" level=debug msg="Deploying CSI driver"
time="2020-07-17T06:55:44Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:45Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:47Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:48Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:49Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:50Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:51Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:52Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:53Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:54Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:55Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:57Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:58Z" level=info msg="Proc found: kubelet"
time="2020-07-17T06:55:58Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=10.10.10.2 --hostname-override=node2 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=k8s.gcr.io/pause:3.1 --dynamic-config-dir=/etc/kubernetes/dynamic_kubelet_dir --runtime-cgroups=/systemd/system.slice --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin ]"
time="2020-07-17T06:55:58Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/local/bin/kubelet\x00--logtostderr=true\x00--v=2\x00--node-ip=10.10.10.2\x00--hostname-override=node2\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--config=/etc/kubernetes/kubelet-config.yaml\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--pod-infra-container-image=k8s.gcr.io/pause:3.1\x00--dynamic-config-dir=/etc/kubernetes/dynamic_kubelet_dir\x00--runtime-cgroups=/systemd/system.slice\x00--network-plugin=cni\x00--cni-conf-dir=/etc/cni/net.d\x00--cni-bin-dir=/opt/cni/bin\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2020-07-17T06:55:58Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2020-07-17T06:55:58Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2020-07-17T06:55:58Z" level=debug msg="Detected CSI Driver driver.longhorn.io CSI version v1.0.1-rc1 Kubernetes version v1.18.4 has already been deployed"
time="2020-07-17T06:55:59Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2020-07-17T06:58:13Z" level=fatal msg="Error deploying driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"

Environment:

  • Longhorn version: v1.0.1-rc1
  • Kubernetes version: v1.18.4
  • Node OS type and version: Ubuntu 18.04.1 LTS (GNU/Linux 5.3.0-61-generic x86_64)

Additional context I also upgraded Kubernetes from v1.17.6 to v1.18.4 with Kubespray, but that does not seem related to the current issue…

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

@PhanLe1010 Long story short, I was upgrading node1 with a drain, but I could not complete my upgrade smoothly using Kubespray. With Kubespray, the recommended way to upgrade (the graceful upgrade according to the guide) is to drain each node. [1]
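For context, draining node1 roughly corresponds to the following kubectl commands (the flags are an assumption based on what kubectl accepted at that time, not the exact command Kubespray runs):

# Evict workloads from node1 before upgrading it
kubectl drain node1 --ignore-daemonsets --delete-local-data
# After the upgrade, allow scheduling on node1 again
kubectl uncordon node1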

I reported an issue with draining node with Longhorn : https://github.com/longhorn/longhorn/issues/1577

> At the time when you upgraded Longhorn, I see that all 3 manager pods participated in the upgrade process, with the manager on node3 becoming the first leader, then node1, then node2. This is odd because usually only one manager pod leads the upgrade process.

That is because I installed Longhorn v1.0.0 before the Kubernetes upgrade from 1.17.6 to 1.18.4. Afterwards, I upgraded Longhorn with Helm using the v1.0.1-rc tag (1 or 2…). Logically, before my failing Kubernetes upgrade, the leader was node1, as you reported.

> As a result, k8s starts killing Longhorn components on node1. This leads to a lot of communication issues and deadlocks, as we can see in the logs of manager, csi-attacher, csi-resizer, and csi-provisioner. This also makes longhorn-driver-deployer get into a crash loop because it cannot delete/create CSI services/pods on node1.

Node1 was not recovered. I think it would be good to have a timeout for retiring a node from the cluster, so that Longhorn maintenance is not penalized while the cluster itself is under maintenance. We can imagine how complex this issue becomes on a large cluster.

> Finally, my suggestion is simply to wait until the official v1.0.1 releases tomorrow (Monday, US PST time) and do another upgrade. Everything should be back to normal because this time there are no Longhorn components on node1 to begin with. This is a safe option because you would have to upgrade to v1.0.1 from the current RC release anyway, so there is no risk here.

That’s fine. I will wait for the release!

[1] The Kubespray documentation on Kubernetes upgrades: https://github.com/kubernetes-sigs/kubespray/blob/master/docs/upgrades.md

PS: I was afraid that creating a ticket, requesting help, and then saying I finally reset my setup would be annoying, hehe… Thank you again for your patience. If there are any other issues, it will be a pleasure for me to report them and to contribute when possible!

Hello @mikefaille! I am so sorry for the delay, but I have spent quite some time diving into this problem. This is my best guess so far; please correct me if I am wrong. (Note: when I refer to the code, I am referring to the support bundle which you sent us.)

  1. First, I notice there are 3 nodes in this cluster: node1, node2, and node3
  2. At the time when you upgraded Longhorn, I see that all 3 manager pods participated in the upgrade process, with the manager on node3 becoming the first leader, then node1, then node2. This is odd because usually only one manager pod leads the upgrade process.
  3. Next, as we can see from the support bundle, there are now only logs for pods on node2 and node3. On the other hand, if we look into nodes.yaml:line 51, 55, we can see that you applied 2 taints to node1 with the effects NoSchedule and NoExecute.
  4. This leads me to the conclusion that before the upgrade you used Longhorn on 3 nodes. Then you upgraded Longhorn using the chart v1.0.1-rc1. Right after upgrading, maybe you realized that you only wanted 2 replicas and did not want Longhorn components to run on node1, so you immediately applied 2 taints to node1 with the effects NoSchedule and NoExecute (a sketch of how such taints can be inspected and removed follows this list).
  5. As a result, k8s starts killing Longhorn components on node1. This leads to a lot of communication issues and deadlocks, as we can see in the logs of manager, csi-attacher, csi-resizer, and csi-provisioner. This also makes longhorn-driver-deployer get into a crash loop because it cannot delete/create CSI services/pods on node1.
  6. Finally, my suggestion is simply to wait until the official v1.0.1 releases tomorrow (Monday, US PST time) and do another upgrade. Everything should be back to normal because this time there are no Longhorn components on node1 to begin with. This is a safe option because you would have to upgrade to v1.0.1 from the current RC release anyway, so there is no risk here.
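As a rough illustration of points 3 and 4, this is how such taints can be inspected and, if they were applied by mistake, removed again. The taint key "example-key" below is a placeholder, since the real keys are only visible in the support bundle:

# Show the taints currently applied to node1
kubectl describe node node1 | grep -A 2 Taints
# Remove a taint by repeating its key and effect with a trailing "-"
kubectl taint nodes node1 example-key:NoSchedule-
kubectl taint nodes node1 example-key:NoExecute-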

But sure, creating a brand new Longhorn setup is also a good solution.

Thank you for using and reporting again!

@mikefaille you can send the support bundle to longhorn-support-bundle@rancher.com.

Sure, I would like to take a deeper look at the problem. Could you please generate a support bundle using the link at the bottom left of the Longhorn UI and send it to us?

[Screenshot: the support bundle link at the bottom left of the Longhorn UI]