longhorn: [BUG] CSI plugin fails to initialize api client for longhorn-backend

Describe the bug: I just added a second node to my k3s cluster and can't run all the Longhorn containers on it. The CSI plugin pod constantly times out and restarts. The first k3s node is working fine with the same pod.

kubectl get pods -o wide -n longhorn-system
NAME                                       READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
csi-attacher-75588bff58-5pjfk              1/1     Running       1          25m   10.42.0.52   n1       <none>           <none>
csi-attacher-75588bff58-f67kd              1/1     Running       1          25m   10.42.0.51   n1       <none>           <none>
csi-attacher-75588bff58-q6nx4              1/1     Running       1          25m   10.42.0.53   n1       <none>           <none>
csi-provisioner-6968cf94f9-dcp7b           1/1     Running       1          25m   10.42.0.54   n1       <none>           <none>
csi-provisioner-6968cf94f9-j5cmd           1/1     Running       1          25m   10.42.0.55   n1       <none>           <none>
csi-provisioner-6968cf94f9-nbkq5           1/1     Running       1          25m   10.42.0.56   n1       <none>           <none>
csi-resizer-5c88bfd4cf-hlqwh               1/1     Running       0          25m   10.42.0.59   n1       <none>           <none>
csi-resizer-5c88bfd4cf-mbpnw               1/1     Running       0          25m   10.42.0.57   n1       <none>           <none>
csi-resizer-5c88bfd4cf-p6lnf               1/1     Running       0          25m   10.42.0.58   n1       <none>           <none>
csi-snapshotter-69f8bc8dcf-7fz5t           1/1     Running       0          25m   10.42.0.61   n1       <none>           <none>
csi-snapshotter-69f8bc8dcf-dw7fw           1/1     Running       0          25m   10.42.0.62   n1       <none>           <none>
csi-snapshotter-69f8bc8dcf-ltmm8           1/1     Running       0          25m   10.42.0.60   n1       <none>           <none>
engine-image-ei-0f7c4304-pbf69             1/1     Running       0          12m   10.42.1.4    n2       <none>           <none>
engine-image-ei-0f7c4304-v5nnf             1/1     Running       0          37m   10.42.0.42   n1       <none>           <none>
engine-image-ei-a5a44787-mws54             1/1     Running       0          12m   10.42.1.2    n2       <none>           <none>
engine-image-ei-a5a44787-nbmx8             1/1     Running       1          50m   10.42.0.16   n1       <none>           <none>
instance-manager-e-49055f6a                1/1     Running       0          40m   10.42.0.39   n1       <none>           <none>
instance-manager-e-545699dc                1/1     Running       0          37m   10.42.0.43   n1       <none>           <none>
instance-manager-e-8fc14ec1                1/1     Running       0          12m   10.42.1.6    n2       <none>           <none>
instance-manager-r-255c90fc                1/1     Running       0          37m   10.42.0.44   n1       <none>           <none>
instance-manager-r-d3ee9ae1                1/1     Running       0          12m   10.42.1.7    n2       <none>           <none>
instance-manager-r-f2cc20cc                1/1     Running       0          40m   10.42.0.35   n1       <none>           <none>
longhorn-csi-plugin-9g5rn                  2/2     CrashLoopBackOff 7       12m   10.42.1.3    n2       <none>           <none>
longhorn-csi-plugin-d49p5                  2/2     Running       0          21m   10.42.0.65   n1       <none>           <none>
longhorn-driver-deployer-5f47f8c9c-lgb5k   1/1     Running       2          28m   10.42.0.45   n1       <none>           <none>
longhorn-manager-k9ccq                     1/1     Running       0          12m   10.42.1.5    n2       <none>           <none>
longhorn-manager-ksv96                     1/1     Running       0          37m   10.42.0.41   n1       <none>           <none>
longhorn-ui-7545689d69-bc8gb               1/1     Running       0          28m   10.42.0.48   n1       <none>           <none>

Logs from containers:

kubectl logs -n longhorn-system longhorn-csi-plugin-9g5rn -c node-driver-registrar 
W1005 09:22:29.575542   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:39.575858   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:49.575449   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:59.574633   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:09.575168   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:19.575852   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:29.575340   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:39.575800   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:49.574964   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:59.575456   70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:24:09.574841   70305 connection.go:173] Still connecting to unix:///csi/csi.sock

And the other one:

kubectl logs -n longhorn-system longhorn-csi-plugin-9g5rn longhorn-csi-plugin                                        
2021/10/05 09:27:24 proto: duplicate proto type registered: VersionResponse
time="2021-10-05T09:27:24Z" level=info msg="CSI Driver: driver.longhorn.io csiVersion: 0.3.0, manager URL http://longhorn-backend:9500/v1"
time="2021-10-05T09:27:34Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get \"http://longhorn-backend:9500/v1\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"

To Reproduce: This is a plain install with two k3s nodes (both v1.21.5 on x86). Longhorn worked perfectly on one node. The second node cannot connect and throws timeouts in the logs.

Expected behavior: All pods should run without errors.

Log

Environment:

  • Longhorn version: 1.2 (I tried 1.1.2 earlier)
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
    • Number of management nodes in the cluster: 1
    • Number of worker nodes in the cluster: 1
  • Node config
    • OS type and version: Debian bullseye on worker, Debian buster on master
    • CPU per node: 4
    • Memory per node: 8GB
    • Disk type (e.g. SSD/NVMe): SSD/NVMe
    • Network bandwidth between the nodes: 1GBps
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
  • Number of Longhorn volumes in the cluster: 2

Additional context: This seems to be similar to #2225 (I checked DNS and it is the same on both computers). #2619 also seems to be the same issue (nothing has been done there), and #2647 describes the same issue.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 23 (10 by maintainers)

Most upvoted comments

I upgraded Longhorn to 1.2.1 and still had no luck; additionally, the UI pod jumped from one node to another (not the master) and it also failed to resolve longhorn-backend. I logged into the pods on the first node and into the same ones on the second node, and it was obvious that something was wrong with name resolution on the new node. Then I checked everything from your link about domains, and that helped me resolve this issue. I needed to add K3S_RESOLV_CONF=/etc/resolv.conf to the k3s env files on both nodes. After bringing k3s up and restarting it, everything worked. It was not a Longhorn bug, rather a k3s one.
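A minimal sketch of that change, assuming the standard systemd layout created by the k3s install script (the env file paths are my assumption; adjust them if your install differs):

echo 'K3S_RESOLV_CONF=/etc/resolv.conf' >> /etc/systemd/system/k3s.service.env         # on the server node
echo 'K3S_RESOLV_CONF=/etc/resolv.conf' >> /etc/systemd/system/k3s-agent.service.env   # on agent/worker nodes
systemctl restart k3s          # or k3s-agent on the workers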

The OSes are Debian buster (master) and bullseye (worker). I will add more bullseye nodes in the next few days and see if that flag is indeed needed. There is almost nothing except k3s and the iscsi package on those hosts.

Thank you, I think we should add the above to our knowledge base doc since lots of issues are related to DNS.

I was surprised how much still works without correct DNS; in fact, most services just don't use it. The problem is not that visible, and you can see from my example that even when you try to test everything, you can still miss such an obvious thing 😉

Also, my use case seems to be common. Most people will configure Longhorn at cluster launch, then add charts, and then nodes (when resources run out). In my case it worked until the Longhorn update (one pod failed, but everything else worked). On restart, disaster came and none of the services with PVCs came up. My cluster was not production, so this was not a big problem, but some people will be frustrated.

My lesson from this is that I need a way to test DNS on each node and make sure it is working correctly. If this depends on the system, firewall, or even the network, then I need to check and report it periodically. I planned to rely on the k3s config check via AWX, but that is not enough; maybe it is better to run a check as a DaemonSet, or at least per node as sketched below. If Longhorn relies on naming, then it should fail immediately with a good error message, or at least give a clue for the docs. Maybe an initContainer or a health check would be better? Still, it would be best to avoid the problem with env variables, or at least use them as a failover, so naming problems would not affect Longhorn at all. For such a critical service that would be a wise choice 😃 I will report the iptables problem on Debian 11 in the Rancher repo and hope they will add a warning about it to that check. This may help more people avoid DNS issues.
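A rough sketch of such a per-node check, done with a simple shell loop rather than a DaemonSet (the node names, namespace, and service FQDN are placeholders taken from this cluster, not anything official):

for node in n1 n2; do
  kubectl run dns-check-$node -n longhorn-system --image=busybox --restart=Never --rm -i \
    --overrides="{\"apiVersion\":\"v1\",\"spec\":{\"nodeName\":\"$node\"}}" \
    -- nslookup longhorn-backend.longhorn-system.svc.cluster.local
done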

Hello, I just realized that my issue was not actually fixed by the K3S_RESOLV_CONF change. Probably when I made that change and restarted k3s, it just moved the Longhorn pod to a different node where DNS was OK, and everything came up with no problem (because the longhorn-backend name was resolvable there).

When I restarted the cluster yesterday, it recreated the Longhorn pods on the problematic node, and that caused PV problems for everything that uses Longhorn (every app just failed without its PVs). So I needed to check name resolution again. I quickly found out that neither internal nor external DNS was working on the worker node. I ran the k3s config check and it said everything should be OK. I decided to upgrade Debian from 10 to 11 on the master (I had already planned that) and compared the k3s config check output; both were the same and OK (now both nodes were on Debian 11 bullseye).

[screenshot: k3s config check output]
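For reference, the built-in check referred to here is, I assume, this command (my guess at what produced the screenshot above):

k3s check-config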

Something was blocking packets from the worker's pods to the master, including DNS queries. To see that, I ran busybox:

kubectl run busybox-debug -n longhorn-system -it --image=busybox --restart=Never --attach --overrides='{ "spec": { "nodeName": "my-worker-node" } }' --rm

(replace my-worker-node with the real node name) Then I ran nslookup against both longhorn-backend and something public like google.com; both timed out. Traffic seemed to be blocked, so I needed to check iptables next.
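Inside the busybox shell, the lookups looked roughly like this (the FQDN form is an extra variant I would add; the short name only resolves when the pod's DNS search domains are correct):

nslookup longhorn-backend
nslookup longhorn-backend.longhorn-system.svc.cluster.local
nslookup google.com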

The k3s check script reported that iptables is 1.8.7 and legacy, and that was the out-of-the-box setup on Debian 11. On Debian 10 that needed to be changed and the script warned about it. I installed both the arptables and ebtables packages (apt-get) and ran:

update-alternatives --set iptables /usr/sbin/iptables-legacy
update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
update-alternatives --set arptables /usr/sbin/arptables-legacy
update-alternatives --set ebtables /usr/sbin/ebtables-legacy

After that change and a k3s restart, it worked!
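To confirm which backend is actually active after the switch (a quick check, not from the original thread):

iptables --version                        # should now print something like "iptables v1.8.7 (legacy)"
update-alternatives --display iptables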

My conclusion is that:

  • on Debian 11 you still need to switch all the iptables variants to legacy; I knew that for Debian 10
  • the k3s check script should report the original configuration as faulty
  • Longhorn should not rely on DNS service names if possible. Such a problem is not immediately visible, and after a restart every application with a PVC fails. Most services rely on env variables, and Longhorn has them too:

[screenshot: longhorn-backend service environment variables]

It's not that hard to use those vars; any YAML, app, or service can easily pick them up and use them. A cluster without DNS is not a healthy thing, but Longhorn is the one that fails immediately on a broken node. Interestingly, nothing else relied on name resolution in this cluster.
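For illustration, Kubernetes injects service environment variables (SERVICE_HOST/SERVICE_PORT, with the service name upper-cased and dashes turned into underscores) into pods started after the service exists in the same namespace, so inside such a pod the backend could be reached without DNS roughly like this (values are examples):

echo "$LONGHORN_BACKEND_SERVICE_HOST"     # ClusterIP of the longhorn-backend service, e.g. 10.43.x.x
echo "$LONGHORN_BACKEND_SERVICE_PORT"     # 9500
# i.e. the manager URL would be http://$LONGHORN_BACKEND_SERVICE_HOST:$LONGHORN_BACKEND_SERVICE_PORT/v1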

I’ll reopen it until we add the knowledge base doc for this.