longhorn: [BUG] 1.5.0: AttachVolume.Attach failed for volume, the volume is currently attached to different node

Discussed in https://github.com/longhorn/longhorn/discussions/6281

Originally posted by PatrickHuetter on July 10, 2023: After upgrading to 1.5.0 it seems that I am seeing an increasing number of “AttachVolume.Attach failed for volume […], the volume is currently attached to different node […]” errors. Anybody else seeing that?

The longhorn-manager log shows:

2023-07-11T00:58:21.954040163+02:00 W0710 22:58:21.953984       1 logging.go:59] [core] [Channel #436613 SubChannel #436614] grpc: addrConn.createTransport failed to connect to {
2023-07-11T00:58:21.954057356+02:00   "Addr": "10.42.1.96:8502",
2023-07-11T00:58:21.954059801+02:00   "ServerName": "10.42.1.96:8502",
2023-07-11T00:58:21.954061814+02:00   "Attributes": null,
2023-07-11T00:58:21.954063457+02:00   "BalancerAttributes": null,
2023-07-11T00:58:21.954065702+02:00   "Type": 0,
2023-07-11T00:58:21.954067385+02:00   "Metadata": null
2023-07-11T00:58:21.954069028+02:00 }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address"
2023-07-11T00:58:21.954202480+02:00 time="2023-07-10T22:58:21Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19 error="failed to sync engine for longhorn-system/pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19: failed to get Version of Instance Manager Disk Service Client for instance-manager-daec900e651b283e9d7a1c3e9e697855, state: running, IP: 10.42.1.96, TLS: false: failed to get disk service version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address\"" node=en360-k8s-ax2
2023-07-11T00:58:21.954257114+02:00 time="2023-07-10T22:58:21Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19\", UID:\"cd4312cc-1618-49d2-b3e6-3b78e6cd7dab\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"606703948\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedStarting' Error starting pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19: failed to get Version of Instance Manager Disk Service Client for instance-manager-daec900e651b283e9d7a1c3e9e697855, state: running, IP: 10.42.1.96, TLS: false: failed to get disk service version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address\""
2023-07-11T00:58:21.973377802+02:00 W0710 22:58:21.973322       1 logging.go:59] [core] [Channel #436621 SubChannel #436622] grpc: addrConn.createTransport failed to connect to {
2023-07-11T00:58:21.973395685+02:00   "Addr": "10.42.1.96:8502",
2023-07-11T00:58:21.973399182+02:00   "ServerName": "10.42.1.96:8502",
2023-07-11T00:58:21.973402158+02:00   "Attributes": null,
2023-07-11T00:58:21.973404422+02:00   "BalancerAttributes": null,

Workaround

The workaround is to restart the longhorn-manager pod that failed to attach/detach the volume.
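
For example, a minimal sketch of the restart (assuming the standard app=longhorn-manager label and a placeholder node name; the DaemonSet recreates the pod automatically):

# Restart the longhorn-manager pod on the affected node (node name is a placeholder).
$ kubectl -n longhorn-system delete pod -l app=longhorn-manager \
    --field-selector spec.nodeName=<affected-node>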

About this issue

  • State: closed
  • Created a year ago
  • Comments: 29 (23 by maintainers)

Most upvoted comments

@innobead I think I was wrong in this comment https://github.com/longhorn/longhorn/issues/6287#issuecomment-1630102345. It should be:

The client side (longhorn-manager pod) runs out of ports and cannot open a new connection to the server side (disk service) on the instance-manager pod. Either restarting the longhorn-manager or the instance-manager pod will close the orphan connections, but restarting the longhorn-manager pod is preferred since it doesn’t crash the replicas.
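
As an illustration only, a hedged way to gauge the client-side connection count from one longhorn-manager pod (assumes the app=longhorn-manager label and that ss is available in the container image; otherwise /proc/net/tcp can be inspected instead):

# Count TCP sockets involving the disk service port 8502, seen from a longhorn-manager pod.
$ POD=$(kubectl -n longhorn-system get pods -l app=longhorn-manager \
    -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n longhorn-system exec "$POD" -- ss -tan | grep -c ':8502'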

The server side uses a fixed port (8502) for the TCP connections, so it cannot run out of ports. However, the server side allocates a Transmission Control Block (which contains the server IP, server port, client IP, and client port) for each connection. Over time, as the number of connections increases, the number of Transmission Control Blocks increases and server-side memory usage grows. This could cause high memory usage on instance-manager pods.
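
A minimal sketch for spotting that symptom, assuming metrics-server is installed and that instance-manager pods carry the longhorn.io/component=instance-manager label:

# Watch memory usage of the instance-manager pods over time.
$ kubectl -n longhorn-system top pod -l longhorn.io/component=instance-manager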

It makes sense, so it does mean users would eventually encounter this issue over time. So any engine operation could potentially run into it.

Yes, over time longhorn-manager will run out of ports and instance-manager will have high memory usage.

No, as I told you before, I really don’t see any of this! The following returns nothing:

$ kubectl -n longhorn-system get volumeattachments -o yaml | grep ticket -i

Here is what I get when I look at one of the volumeAttachments:

$ kubectl -n longhorn-system get volumeattachments csi-ffeaa30282a6d8a11b3e31ed368a6f71685b88e18c50d76a3399447f71e65e8f -o yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  annotations:
    csi.alpha.kubernetes.io/node-id: gamora-node03.xxxx.net
  creationTimestamp: "2023-09-06T01:08:00Z"
  finalizers:
  - external-attacher/driver-longhorn-io
  name: csi-ffeaa30282a6d8a11b3e31ed368a6f71685b88e18c50d76a3399447f71e65e8f
  resourceVersion: "570706987"
  uid: f5f799b2-9903-4b93-8e0c-12ebfb2abb34
spec:
  attacher: driver.longhorn.io
  nodeName: gamora-node03.xxxx.net
  source:
    persistentVolumeName: pvc-eec020cb-9e94-462c-9a83-30cbab4d3a58
status:
  attached: true

Oh! OK. I should use volumeattachments.longhorn.io (Longhorn’s VolumeAttachment) rather than volumeattachments (the Kubernetes VolumeAttachment):

kubectl -n longhorn-system get volumeattachments.longhorn.io -o yaml
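
For example, repeating the earlier ticket check, but against the Longhorn custom resource this time:

$ kubectl -n longhorn-system get volumeattachments.longhorn.io -o yaml | grep -i ticket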

From my side, yes.

@derekbit Can you describe clearly where to check the port usage (as per the conversation in the source issue, it should be the longhorn-manager pod), so @longhorn/qa can help with the testing?

Updated. The check can be done in longhorn-manager or instance-manager pods.
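
For the instance-manager side, a similar hedged sketch (again assuming the longhorn.io/component=instance-manager label and that ss is present in the image):

# Count TCP sockets involving the disk service port 8502 from inside an instance-manager pod.
$ IM_POD=$(kubectl -n longhorn-system get pods -l longhorn.io/component=instance-manager \
    -o jsonpath='{.items[0].metadata.name}')
$ kubectl -n longhorn-system exec "$IM_POD" -- ss -tan | grep -c ':8502'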

I think the server side (disk service) on the instance-manager pod runs out of ports, so the client (longhorn-manager) cannot open a new connection to it. Either restarting the longhorn-manager or the instance-manager pod will close the orphan connections, but restarting the longhorn-manager pod is preferred since it doesn’t crash the replicas.

It makes sense, so it does mean users would eventually encounter this issue over time. So any engine operation could potentially run into it.

I think the server side (disk service) on the instance-manager pod runs out of ports, so the client (longhorn-manager) cannot open a new connection to it. Either restarting the longhorn-manager or the instance-manager pod will close the orphan connections, but restarting the longhorn-manager pod is preferred since it doesn’t crash the replicas.

@derekbit Please add this to the outstanding issue WIKI page for 1.5.0.

Already added.