longhorn: [BUG] 1.5.0: AttachVolume.Attach failed for volume, the volume is currently attached to different node
Discussed in https://github.com/longhorn/longhorn/discussions/6281
Originally posted by PatrickHuetter, July 10, 2023:

> After upgrading to 1.5.0 it seems that I am seeing an increasing number of “AttachVolume.Attach failed for volume […]: the volume is currently attached to different node […]” errors. Anybody else seeing that?
The longhorn-manager log shows:
```
2023-07-11T00:58:21.954040163+02:00 W0710 22:58:21.953984 1 logging.go:59] [core] [Channel #436613 SubChannel #436614] grpc: addrConn.createTransport failed to connect to {
2023-07-11T00:58:21.954057356+02:00 "Addr": "10.42.1.96:8502",
2023-07-11T00:58:21.954059801+02:00 "ServerName": "10.42.1.96:8502",
2023-07-11T00:58:21.954061814+02:00 "Attributes": null,
2023-07-11T00:58:21.954063457+02:00 "BalancerAttributes": null,
2023-07-11T00:58:21.954065702+02:00 "Type": 0,
2023-07-11T00:58:21.954067385+02:00 "Metadata": null
2023-07-11T00:58:21.954069028+02:00 }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address"
2023-07-11T00:58:21.954202480+02:00 time="2023-07-10T22:58:21Z" level=error msg="Error syncing Longhorn engine" controller=longhorn-engine engine=longhorn-system/pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19 error="failed to sync engine for longhorn-system/pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19: failed to get Version of Instance Manager Disk Service Client for instance-manager-daec900e651b283e9d7a1c3e9e697855, state: running, IP: 10.42.1.96, TLS: false: failed to get disk service version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address\"" node=en360-k8s-ax2
2023-07-11T00:58:21.954257114+02:00 time="2023-07-10T22:58:21Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19\", UID:\"cd4312cc-1618-49d2-b3e6-3b78e6cd7dab\", APIVersion:\"longhorn.io/v1beta2\", ResourceVersion:\"606703948\", FieldPath:\"\"}): type: 'Warning' reason: 'FailedStarting' Error starting pvc-691c523a-20c7-4f90-a0a3-3a369dfeecb0-e-a3394e19: failed to get Version of Instance Manager Disk Service Client for instance-manager-daec900e651b283e9d7a1c3e9e697855, state: running, IP: 10.42.1.96, TLS: false: failed to get disk service version: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.42.1.96:8502: connect: cannot assign requested address\""
2023-07-11T00:58:21.973377802+02:00 W0710 22:58:21.973322 1 logging.go:59] [core] [Channel #436621 SubChannel #436622] grpc: addrConn.createTransport failed to connect to {
2023-07-11T00:58:21.973395685+02:00 "Addr": "10.42.1.96:8502",
2023-07-11T00:58:21.973399182+02:00 "ServerName": "10.42.1.96:8502",
2023-07-11T00:58:21.973402158+02:00 "Attributes": null,
2023-07-11T00:58:21.973404422+02:00 "BalancerAttributes": null,
```
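The `connect: cannot assign requested address` failure in these lines is `EADDRNOTAVAIL` from `connect(2)`: on the client side it typically means no free local ephemeral port was available for the outgoing TCP connection, which matches the port-exhaustion diagnosis further down.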
Workaround
The workaround is to restart the longhorn-manager pod that failed to attach/detach the volume.
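Since longhorn-manager runs as a DaemonSet, deleting the affected pod is enough; the DaemonSet controller recreates it on the same node. Assuming the pod name has already been identified from the events, something like `kubectl -n longhorn-system delete pod <longhorn-manager-pod-name>` does it.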
About this issue
- State: closed
- Created a year ago
- Comments: 29 (23 by maintainers)
@innobead I think I was wrong in this comment https://github.com/longhorn/longhorn/issues/6287#issuecomment-1630102345. It should be:
The client side (longhorn-manager pod) runs out of ports and cannot open a new connection to the server side (disk service) on the instance-manager pod. Either restarting the longhorn-manager or the instance-manager pod will close the orphaned connections, but restarting the longhorn-manager pod is preferred since it doesn’t crash the replicas.
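For illustration only (this is not Longhorn’s actual code): the failure mode described here is the classic pattern of dialing a new gRPC connection per request without ever closing it. A minimal Go sketch, with hypothetical function names and the address taken from the logs above:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// diskServiceAddr mirrors the endpoint seen in the logs above; it is a
// placeholder, not something resolved from a real cluster.
const diskServiceAddr = "10.42.1.96:8502"

// leakyVersionCheck dials a fresh connection on every call and never closes
// it. Once established, each leaked ClientConn pins one client-side
// ephemeral port, so after tens of thousands of calls connect(2) fails with
// EADDRNOTAVAIL ("cannot assign requested address"), as in the logs above.
func leakyVersionCheck() (*grpc.ClientConn, error) {
	return grpc.Dial(diskServiceAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
}

// versionCheck closes the connection when done, returning the port to the OS.
func versionCheck() error {
	conn, err := grpc.Dial(diskServiceAddr,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return err
	}
	defer conn.Close() // the missing piece in the leaky variant
	// ... issue the disk-service RPCs on conn here ...
	return nil
}

func main() {
	if err := versionCheck(); err != nil {
		log.Println(err)
	}
}
```

Reusing one long-lived `ClientConn` per endpoint (gRPC multiplexes RPCs over it) avoids the problem entirely; closing per-call connections is the minimal fix.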
The server side uses a fixed port (8502) for the TCP connection, so it cannot run out of ports. However, the server side allocates a Transmission Control Block (TCB), which contains the server IP, server port, client IP, and client port, for each connection. Over time, as the number of connections grows, the number of TCBs grows with it, and the server side’s memory usage increases. This could cause high memory usage on instance-manager pods.
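To confirm this from inside the pods, and assuming the image ships the `ss` utility (it may not), the established connections to the disk service can be counted with `ss -tan state established '( dport = :8502 )' | wc -l` on the longhorn-manager side, or with `'( sport = :8502 )'` on the instance-manager side; a count that only ever grows points at leaked connections.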
Yes, over time, longhorn-manager will run out of ports and instance-manager will have high memory usage.
Oh! OK. Should use `volumeattachments.longhorn.io` (Longhorn’s VolumeAttachment) rather than `volumeattachments` (Kubernetes VolumeAttachments).

From my side, yes.
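For reference, the two resources live in different API groups and can be listed separately: `kubectl -n longhorn-system get volumeattachments.longhorn.io` for Longhorn’s namespaced CRD versus `kubectl get volumeattachments.storage.k8s.io` for the cluster-scoped Kubernetes resource.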
Updated. The check can be done in longhorn-manager or instance-manager pods.
It makes sense, so it does mean users would eventually encounter this issue over time, and any engine operation would then potentially fail.
I think the server side (disk service) on the instance-manager pod runs out of ports, so the client (longhorn-manager) cannot open a new connection to it. Either restarting the longhorn-manager or the instance-manager pod will close the orphaned connections, but restarting the longhorn-manager pod is preferred since it doesn’t crash the replicas.
Already added.