longhorn: [BUG] After an upgrade of Longhorn from v1.0.0 to v1.0.1-rc1, longhorn-driver-deployer is in a crash loop.
Describe the bug
After upgrading Longhorn from v1.0.0 to v1.0.1-rc1, longhorn-driver-deployer is in a crash loop. See the logs at the end of this write-up.
Aside from that, I am still able to use the StorageClass. I admit I am not sure whether an issue with longhorn-driver-deployer can cause more problems later.
To Reproduce
Steps to reproduce the behavior:
- Install Longhorn v1.0.0 with Helm
- Upgrade Longhorn to v1.0.1-rc1 with Helm
- See the error in the longhorn-driver-deployer logs (see the sketch after this list)
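For reference, this is roughly what the install/upgrade path looks like with Helm 3 and the public Longhorn chart. The repo alias, namespace, and availability of the RC chart version are assumptions; adjust them to your environment.
# Assumed: Helm 3 and the public Longhorn chart repository
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn --namespace longhorn-system --version 1.0.0
# Later, upgrade in place to the release candidate (--devel allows pre-release chart versions)
helm upgrade longhorn longhorn/longhorn --namespace longhorn-system --version 1.0.1-rc1 --devel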
Expected behavior
Log
kubectl logs -n longhorn-system longhorn-driver-deployer-8547759f46-6r2qd -p
time="2020-07-17T06:55:42Z" level=debug msg="Deploying CSI driver"
time="2020-07-17T06:55:44Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:45Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:47Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:48Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:49Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:50Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:51Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Pending"
time="2020-07-17T06:55:52Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:53Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:54Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:55Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:57Z" level=debug msg="proc cmdline detection pod discover-proc-kubelet-cmdline in phase: Running"
time="2020-07-17T06:55:58Z" level=info msg="Proc found: kubelet"
time="2020-07-17T06:55:58Z" level=info msg="Try to find arg [--root-dir] in cmdline: [/usr/local/bin/kubelet --logtostderr=true --v=2 --node-ip=10.10.10.2 --hostname-override=node2 --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --config=/etc/kubernetes/kubelet-config.yaml --kubeconfig=/etc/kubernetes/kubelet.conf --pod-infra-container-image=k8s.gcr.io/pause:3.1 --dynamic-config-dir=/etc/kubernetes/dynamic_kubelet_dir --runtime-cgroups=/systemd/system.slice --network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin ]"
time="2020-07-17T06:55:58Z" level=warning msg="Cmdline of proc kubelet found: \"/usr/local/bin/kubelet\x00--logtostderr=true\x00--v=2\x00--node-ip=10.10.10.2\x00--hostname-override=node2\x00--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf\x00--config=/etc/kubernetes/kubelet-config.yaml\x00--kubeconfig=/etc/kubernetes/kubelet.conf\x00--pod-infra-container-image=k8s.gcr.io/pause:3.1\x00--dynamic-config-dir=/etc/kubernetes/dynamic_kubelet_dir\x00--runtime-cgroups=/systemd/system.slice\x00--network-plugin=cni\x00--cni-conf-dir=/etc/cni/net.d\x00--cni-bin-dir=/opt/cni/bin\x00\". But arg \"--root-dir\" not found. Hence default value will be used: \"/var/lib/kubelet\""
time="2020-07-17T06:55:58Z" level=info msg="Detected root dir path: /var/lib/kubelet"
time="2020-07-17T06:55:58Z" level=info msg="Upgrading Longhorn related components for CSI v1.1.0"
time="2020-07-17T06:55:58Z" level=debug msg="Detected CSI Driver driver.longhorn.io CSI version v1.0.1-rc1 Kubernetes version v1.18.4 has already been deployed"
time="2020-07-17T06:55:59Z" level=debug msg="Waiting for foreground deletion of service csi-attacher"
time="2020-07-17T06:58:13Z" level=fatal msg="Error deploying driver: failed to deploy service csi-attacher: failed to cleanup service csi-attacher: Foreground deletion of service csi-attacher timed out"
Environment:
- Longhorn version: v1.0.1-rc1
- Kubernetes version: v1.18.4
- Node OS type and version:
Ubuntu 18.04.1 LTS (GNU/Linux 5.3.0-61-generic x86_64)
Additional context
I also upgraded Kubernetes from 1.17.6 to 1.18.4 with Kubespray, but the current issue does not seem related…
@PhanLe1010 Long story short, I was upgrading node1 with a drain, but I could not continue the upgrade flawlessly using Kubespray. With Kubespray, the recommended way to upgrade (the graceful upgrade according to the guide) is to drain the node. [1]
I reported an issue about draining a node with Longhorn: https://github.com/longhorn/longhorn/issues/1577
It’s because I installed Longhorn v1.0.0 before the Kubernetes upgrade from 1.17.6 to 1.18.4. Afterwards, I upgraded Longhorn with Helm using the tag v1.0.1-rc(1 or 2…). Logically, before my failing Kubernetes upgrade, the leader was node1, as you reported.
node1 was never recovered. I think it would be good to have a timeout to retire a node from the cluster, so that Longhorn maintenance is not penalized while the cluster itself is under maintenance. You can imagine how complex this issue becomes on a large cluster.
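For context, this is roughly the drain/recover cycle that a per-node upgrade performs. The flags match the Kubernetes v1.18 era, and node1 is simply the node from this report, so treat it as a sketch rather than a recommended procedure.
# Cordon the node and evict workloads before upgrading it (v1.18-era flags)
kubectl drain node1 --ignore-daemonsets --delete-local-data --timeout=300s
# ... upgrade the node ...
# Let workloads (including Longhorn pods) come back once the node is healthy again
kubectl uncordon node1
kubectl get nodes -o wide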
That’s fine. I will wait for the release!
[1] The Kubespray doc on Kubernetes upgrades: https://github.com/kubernetes-sigs/kubespray/blob/master/docs/upgrades.md
PS: I feared that creating a ticket, asking for help, and then finally saying I reset my setup would be annoying. hehe… Thank you again for your patience. If there are any other issues, it will be a pleasure for me to report them and also contribute when possible!
Hello @mikefaille! I am so sorry for the delay, but I have spent quite some time digging into this problem. This is my best guess so far; please correct me if I am wrong. (Note: when I refer to code, I am referring to the support bundle which you sent us.)
- Nodes: node1, node2, and node3. node3 became the first leader, then node1 became leader, then node2. It is weird bc usually only 1 manager pod leads the upgrading process.
- … node2 and node3 now. On the other hand, if we look into the nodes.yaml (lines 51 and 55), we can see that you applied 2 taints to node1 with the effects NoSchedule and NoExecute.
- … v1.0.1-rc1. Right after upgrading, maybe you realized that you only wanted 2 replicas and did not want Longhorn components to run on node1. Therefore, you immediately applied 2 taints to node1 with the effects NoSchedule and NoExecute (see the sketch after this list).
- … node1. This leads to a lot of communication issues and deadlocks, as we can see in the logs of manager, csi-attacher, csi-resizer, and csi-provisioner. This also makes longhorn-driver-deployer get into a crash loop because it cannot delete/create CSI services/pods on node1.
- … node1 to begin with. This is a safe option because you will have to upgrade to v1.0.1 from the current RC release anyway, so there is no risk here. But sure, creating a brand new Longhorn setup is also a good solution.
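If the taints on node1 are indeed what is blocking the CSI components, they can be inspected and, if appropriate, removed. The key name "dedicated" below is purely a placeholder, since the real key is only visible in the support bundle's nodes.yaml.
# Show the current taints on node1
kubectl describe node node1 | grep -A3 -i taints
# Remove a taint by repeating key:effect with a trailing '-' ("dedicated" is a placeholder key)
kubectl taint nodes node1 dedicated:NoSchedule-
kubectl taint nodes node1 dedicated:NoExecute-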
Thank you again for using Longhorn and reporting!
@mikefaille you can send the support bundle to longhorn-support-bundle@rancher.com.
Sure, I would like to take a deeper look at the problem. Could you please generate a support bundle using the link at the bottom left of the Longhorn UI and send it to us?