longhorn: [BUG] High CPU usage when instance-manager-engine tries to shut down frontend device
Describe the bug
Using Longhorn (with a 200 GB volume provisioned through a storage class with 2 replicas), the instance-manager-e-xxxx pod on the main node tries to shut down the frontend device but fails, and this appears to create a CPU hog (instance-manager-engine sits at a constant 2 CPUs on a 4-CPU instance).
But the volume keeps working.
To Reproduce
Steps to reproduce the behavior:
- Fresh install of Longhorn
- Have 3 Longhorn dedicated nodes to handle Longhorn volumes
- Label nodes with tags storage and slow, then disks with tag slow
- Create a new StorageClass (see below for details)
- Provision a volume
- Wait
Expected behavior
Low CPU usage from the instance-manager-e-xxxx pod (well below 2 CPUs)
Log
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=error msg="Error when shutting down frontend:device pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78: fail to stop SCSI device: Fail to logout target: Failed to execute: nsenter [--mount=/host/proc/1889/ns/mnt --net=/host/proc/1889/ns/net iscsiadm -m node -o delete -p 10.42.11.30 -T iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78], output , stderr, iscsiadm: This command will remove the record [iface: default, target: iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78, portal: 10.42.11.30,3260], but a session is using it. Logout session then rerun command to remove record.\niscsiadm: Could not execute operation on all records: session exists\n, error exit status 15"
time="2020-06-29T06:49:51Z" level=info msg="Closing: 10.42.9.32:10015"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=info msg="Closing: 10.42.10.27:10000"
[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=warning msg="Failed to execute hook github.com/longhorn/longhorn-engine/app/cmd.startController.func1: errors when shutting down controller: frontend: device pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78: fail to stop SCSI device: Fail to logout target: Failed to execute: nsenter [--mount=/host/proc/1889/ns/mnt --net=/host/proc/1889/ns/net iscsiadm -m node -o delete -p 10.42.11.30 -T iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78], output , stderr, iscsiadm: This command will remove the record [iface: default, target: iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78, portal: 10.42.11.30,3260], but a session is using it. Logout session then rerun command to remove record.\niscsiadm: Could not execute operation on all records: session exists\n, error exit status 15 backend: <nil>"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="Process Manager: process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 error out, error msg: exit status 1"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process update: pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07: state error: Error: exit status 1"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="Process Manager: successfully unregistered process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07"
[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process update: pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07: state error: Error: exit status 1"
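The iscsiadm message in the log says to log out of the session and then rerun the delete. As a rough sketch of that order, the Python helper below pulls the portal and target out of such a log line and prints the two commands. The function name `recovery_commands` is hypothetical, the script only builds command strings and never runs them, and it assumes the commands would be executed in the host namespaces (for example through the same nsenter wrapper shown in the log). Whether logging out is safe while the volume is still attached is not established in this report.

```python
import re
import shlex

# Matches the failing command as it appears in the engine log:
#   iscsiadm -m node -o delete -p <portal> -T <target>
DELETE_CMD = re.compile(
    r"iscsiadm -m node -o delete -p (?P<portal>\S+) -T (?P<target>[^\s\]]+)"
)


def recovery_commands(log_line):
    """Return logout-then-delete iscsiadm commands for a failed delete.

    Returns an empty list when the line does not report "session exists"
    or does not contain the delete command.
    """
    if "session exists" not in log_line:
        return []
    match = DELETE_CMD.search(log_line)
    if match is None:
        return []
    portal, target = match.group("portal"), match.group("target")
    node = ["iscsiadm", "-m", "node", "-p", portal, "-T", target]
    return [
        shlex.join(node + ["--logout"]),      # close the open session first
        shlex.join(node + ["-o", "delete"]),  # then remove the node record
    ]


if __name__ == "__main__":
    # Abridged copy of the error line from the log above.
    sample = (
        "Failed to execute: nsenter [--mount=/host/proc/1889/ns/mnt "
        "--net=/host/proc/1889/ns/net iscsiadm -m node -o delete "
        "-p 10.42.11.30 "
        "-T iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78], "
        "output , stderr, iscsiadm: Could not execute operation on all "
        "records: session exists"
    )
    for command in recovery_commands(sample):
        print(command)
```

Printing the commands rather than running them keeps the sketch side-effect free; an operator would still need to confirm that nothing is using the device before logging out.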
Environment:
- Longhorn version: 1.0.0
- Kubernetes version: 1.14.6
- Node OS type and version: RancherOS 1.5.4, with Ubuntu console
Additional context
Storage Class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "false"
  creationTimestamp: "2020-06-27T21:08:13Z"
  name: longhorn-slow
  resourceVersion: "184546402"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/longhorn-slow
  uid: 501789a2-b8ba-11ea-a8a5-06aa83693c38
allowVolumeExpansion: true
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  baseImage: ""
  diskSelector: slow
  fromBackup: ""
  nodeSelector: storage,slow
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
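For the "Provision a volume" step, a claim against this class might look like the output of the sketch below. The report does not include the actual claim, so the claim name `slow-data`, the `ReadWriteOnce` access mode, and the exact `200Gi` size are assumptions for illustration. The script prints JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def pvc_manifest(name, storage_class, size):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # assumed, not stated in the report
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # "slow-data" is a hypothetical claim name; "longhorn-slow" is the
    # StorageClass shown above; 200Gi approximates the reported volume size.
    manifest = pvc_manifest("slow-data", "longhorn-slow", "200Gi")
    print(json.dumps(manifest, indent=2))
```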
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (6 by maintainers)
I’m having this same issue with the following setup:
- Longhorn version: 1.2.3
- Kubernetes version: 1.23.1
- OS: Ubuntu 20.04
When I look at the node with high CPU I see a tgtd process using 100% CPU, with errors very similar to those found here: https://github.com/longhorn/longhorn/issues/1533#issuecomment-651522821
@shuo-wu I’ve created a new issue https://github.com/longhorn/longhorn/issues/3636#issue-1141651746 to track this
cc @keithalucas @derekbit
@Wykiki Thanks for the suggestion. Now we’ve added a section to call out unsupported OS in the doc: https://longhorn.io/docs/1.0.2/best-practices/#oses-arent-supported-by-longhorn .
Ok thanks !
Anyway, I didn’t manage to reproduce the bug described in this thread, so I guess this issue can be closed. I’ll open a new one if I encounter the problem again.
About RancherOS, I think it would be nice to mention that Longhorn does not work correctly on this OS, maybe on this page: https://longhorn.io/docs/1.0.2/best-practices/ Even though you already list Ubuntu and CentOS as recommended OSes, Rancher supports both RancherOS and Longhorn, and the Longhorn documentation does not explicitly state that RancherOS is unsupported, which is how this kind of issue comes up.