longhorn: [BUG] CSI plugin fails to initialize api client for longhorn-backend
Describe the bug
I just added a second node to my k3s cluster and can't get all of the Longhorn containers running. The CSI plugin pod constantly times out and restarts. The same pod works fine on the first k3s node.
kubectl get pods -o wide -n longhorn-system NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-attacher-75588bff58-5pjfk 1/1 Running 1 25m 10.42.0.52 n1 <none> <none>
csi-attacher-75588bff58-f67kd 1/1 Running 1 25m 10.42.0.51 n1 <none> <none>
csi-attacher-75588bff58-q6nx4 1/1 Running 1 25m 10.42.0.53 n1 <none> <none>
csi-provisioner-6968cf94f9-dcp7b 1/1 Running 1 25m 10.42.0.54 n1 <none> <none>
csi-provisioner-6968cf94f9-j5cmd 1/1 Running 1 25m 10.42.0.55 n1 <none> <none>
csi-provisioner-6968cf94f9-nbkq5 1/1 Running 1 25m 10.42.0.56 n1 <none> <none>
csi-resizer-5c88bfd4cf-hlqwh 1/1 Running 0 25m 10.42.0.59 n1 <none> <none>
csi-resizer-5c88bfd4cf-mbpnw 1/1 Running 0 25m 10.42.0.57 n1 <none> <none>
csi-resizer-5c88bfd4cf-p6lnf 1/1 Running 0 25m 10.42.0.58 n1 <none> <none>
csi-snapshotter-69f8bc8dcf-7fz5t 1/1 Running 0 25m 10.42.0.61 n1 <none> <none>
csi-snapshotter-69f8bc8dcf-dw7fw 1/1 Running 0 25m 10.42.0.62 n1 <none> <none>
csi-snapshotter-69f8bc8dcf-ltmm8 1/1 Running 0 25m 10.42.0.60 n1 <none> <none>
engine-image-ei-0f7c4304-pbf69 1/1 Running 0 12m 10.42.1.4 n2 <none> <none>
engine-image-ei-0f7c4304-v5nnf 1/1 Running 0 37m 10.42.0.42 n1 <none> <none>
engine-image-ei-a5a44787-mws54 1/1 Running 0 12m 10.42.1.2 n2 <none> <none>
engine-image-ei-a5a44787-nbmx8 1/1 Running 1 50m 10.42.0.16 n1 <none> <none>
instance-manager-e-49055f6a 1/1 Running 0 40m 10.42.0.39 n1 <none> <none>
instance-manager-e-545699dc 1/1 Running 0 37m 10.42.0.43 n1 <none> <none>
instance-manager-e-8fc14ec1 1/1 Running 0 12m 10.42.1.6 n2 <none> <none>
instance-manager-r-255c90fc 1/1 Running 0 37m 10.42.0.44 n1 <none> <none>
instance-manager-r-d3ee9ae1 1/1 Running 0 12m 10.42.1.7 n2 <none> <none>
instance-manager-r-f2cc20cc 1/1 Running 0 40m 10.42.0.35 n1 <none> <none>
longhorn-csi-plugin-9g5rn 2/2 CrashLoopBackOff 7 12m 10.42.1.3 n2 <none> <none>
longhorn-csi-plugin-d49p5 2/2 Running 0 21m 10.42.0.65 n1 <none> <none>
longhorn-driver-deployer-5f47f8c9c-lgb5k 1/1 Running 2 28m 10.42.0.45 n1 <none> <none>
longhorn-manager-k9ccq 1/1 Running 0 12m 10.42.1.5 n2 <none> <none>
longhorn-manager-ksv96 1/1 Running 0 37m 10.42.0.41 n1 <none> <none>
longhorn-ui-7545689d69-bc8gb 1/1 Running 0 28m 10.42.0.48 n1 <none> <none>
Logs from containers:
kubectl logs -n longhorn-system longhorn-csi-plugin-9g5rn -c node-driver-registrar
W1005 09:22:29.575542 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:39.575858 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:49.575449 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:22:59.574633 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:09.575168 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:19.575852 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:29.575340 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:39.575800 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:49.574964 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:23:59.575456 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
W1005 09:24:09.574841 70305 connection.go:173] Still connecting to unix:///csi/csi.sock
And the other one:
kubectl logs -n longhorn-system longhorn-csi-plugin-9g5rn longhorn-csi-plugin
2021/10/05 09:27:24 proto: duplicate proto type registered: VersionResponse
time="2021-10-05T09:27:24Z" level=info msg="CSI Driver: driver.longhorn.io csiVersion: 0.3.0, manager URL http://longhorn-backend:9500/v1"
time="2021-10-05T09:27:34Z" level=fatal msg="Error starting CSI manager: Failed to initialize Longhorn API client: Get \"http://longhorn-backend:9500/v1\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
To Reproduce
This is a plain install with two k3s nodes (both v1.21.5 on x86). Longhorn worked perfectly on one node. The second node cannot connect and throws timeouts in the logs.
Expected behavior
All pods should run without errors.
Environment:
- Longhorn version: 1.2 (I tried 1.1.2 earlier)
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: k3s
- Number of management node in the cluster: 1
- Number of worker node in the cluster: 1
- Node config
- OS type and version: debian bullseye on worker, debian buster on master
- CPU per node: 4
- Memory per node: 8GB
- Disk type(e.g. SSD/NVMe): ssd/nvme
- Network bandwidth between the nodes: 1GBps
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 2
Additional context
This seems to be similar to #2225 (I checked DNS and it is the same on both machines). #2619 also looks like the same issue (nothing has been done there), and #2647 describes the same issue.
I upgraded Longhorn to 1.2.1 and still no luck; additionally, the UI pod jumped from one node to another (not the master) and it failed to resolve longhorn-backend too. I logged into pods on the first node and into the same ones on the second node, and it was obvious that something is wrong with name resolution on the new node. Then I checked all the things from your link about domains, and that helped me resolve this issue. I needed to add
K3S_RESOLV_CONF=/etc/resolv.conf to the k3s env files on both nodes. After bringing k3s up and restarting it worked; it was not a Longhorn bug but rather a k3s one. The OSes are Debian buster (master) and bullseye (worker). I will add more bullseye nodes in the next few days and see if that flag is indeed needed. There is almost nothing except k3s and the iscsi package on those hosts.
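For reference, a minimal sketch of what that change can look like on a systemd-based k3s install; the env file paths below are the defaults created by the k3s install script and are an assumption, not quoted from the comment:
# Server node
echo 'K3S_RESOLV_CONF=/etc/resolv.conf' | sudo tee -a /etc/systemd/system/k3s.service.env
sudo systemctl restart k3s
# Agent/worker node
echo 'K3S_RESOLV_CONF=/etc/resolv.conf' | sudo tee -a /etc/systemd/system/k3s-agent.service.env
sudo systemctl restart k3s-agent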
Thank you, I think we should add the above to our knowledge base doc since lots of issues are related to DNS.
I was surprised how everything works without proper DNS; in fact, most services just don't use it. The problem is not that visible, and you can see from my example that even when you try to test everything, you can miss such an obvious thing 😉
Also, my use case seems to be common. Most people will configure Longhorn at cluster launch, then add charts, and then nodes (when resources are running out). In my case it worked until the Longhorn update (one pod failed, but everything else worked). On restart disaster came and none of the services with PVCs came up. My cluster was not production, so this was not a big problem, but some people will be frustrated.
My lesson from this is that I need to find a way to test DNS on each node and make sure it is working correctly. If this depends on the system, firewall or even the network, then I need to check and report on it periodically. I planned to rely on the k3s config check with AWX, but that's not enough; maybe it's better to run the check from some DaemonSet (see the sketch below). If Longhorn relies on name resolution then it should fail immediately with a good error message, or at least give some clue in the docs. Maybe an initContainer or health check would be better? Still, it would be better to avoid the problem with env vars, or use them as a fallback, so that naming problems would not affect Longhorn at all. For such a critical service that is a wise choice 😃 I will report the iptables problem on Debian 11 in the Rancher repo and hope they will add a warning about it to that check. This may help more people avoid DNS issues.
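A rough sketch of such a per-node check, assuming a plain busybox image; the DaemonSet name dns-check is made up for illustration, and failures would simply show up in each pod's logs:
kubectl apply -n longhorn-system -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dns-check
spec:
  selector:
    matchLabels:
      app: dns-check
  template:
    metadata:
      labels:
        app: dns-check
    spec:
      containers:
      - name: dns-check
        image: busybox
        # Resolve the in-cluster service and an external name every 5 minutes;
        # a node with broken DNS shows the failures in this pod's logs.
        command: ["sh", "-c", "while true; do date; nslookup longhorn-backend; nslookup google.com; sleep 300; done"]
EOF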
Hello, I just realized that my issue was not fixed by the K3S_RESOLV_CONF change; probably when I made that change and restarted k3s, it just moved the Longhorn pod to a different node where DNS was OK and everything came up with no problem (because the longhorn-backend name was resolvable there).
When I restarted the cluster yesterday it recreated the Longhorn pods on the problematic node, and that caused PV problems for everything that uses Longhorn (every app just failed without its PVs). So I needed to check name resolution again. I quickly found out that neither internal nor external DNS was working on the worker node. I ran the k3s config check and it said that everything should be OK. I decided to upgrade Debian from 10 to 11 on the master (I had already planned that) and compared the k3s config check output; both were the same and OK (now both nodes were on Debian 11 bullseye).
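For reference, assuming the built-in k3s check is what is meant here, it can be rerun on any node with:
sudo k3s check-config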
Something was blocking packets from the worker pods to the master, including the DNS queries. To see that I ran busybox:
kubectl run busybox-debug -n longhorn-system -it --image=busybox --restart=Never --attach --overrides='{ "spec": { "nodeName": "my-worker-node" } }' --rm
(replace my-worker-node with the real name). Then I ran nslookup with both longhorn-backend and something public like google.com - both gave timeouts. Traffic seemed to be blocked, so I needed to check iptables next. The k3s check script reported iptables 1.8.7 in legacy mode, and that was out of the box on Debian 11. On Debian 10 that needed to be changed and the script warned about it. I installed both the arptables and ebtables packages (apt-get) and ran:
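(The exact commands are not quoted in the comment; what the k3s docs usually suggest for switching Debian to the legacy iptables backend, given here as an assumption about what was run, is:)
sudo apt-get install -y iptables arptables ebtables
# Point the iptables family of tools at their legacy variants
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo update-alternatives --set arptables /usr/sbin/arptables-legacy
sudo update-alternatives --set ebtables /usr/sbin/ebtables-legacy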
After that change and k3s restart it worked!
My conclusion is that:
It's not that hard to use those vars; any YAML, app or service can easily pick them up and use them. A cluster without DNS is not a healthy thing, but Longhorn is the one component that fails immediately on a broken node. Interestingly, nothing else relied on name resolution in this cluster.
I’ll reopen it until we add the knowledge base doc for this.