longhorn: [BUG] High CPU usage when instance-manager-engine tries to shut down frontend device

Describe the bug Using Longhorn (a 200 GB volume provisioned via a storage class with 2 replicas), the instance-manager-e-xxxx pod on the main node tries to shut down the frontend device but fails, and it looks like this creates a CPU hog (instance-manager-engine reaching a constant 2 CPUs of usage on a 4-CPU instance). The volume keeps working, though.

To Reproduce Steps to reproduce the behavior:

  1. Fresh install of Longhorn
  2. Have 3 Longhorn dedicated nodes to handle Longhorn volumes
  3. Label nodes with the tags storage and slow, then the disks with the tag slow (see the sketch after this list)
  4. Create a new StorageClass (see below for details)
  5. Provision a volume
  6. Wait
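A rough sketch of the tagging in step 3, using kubectl against the Longhorn Node CR (node-1 and default-disk-xxxx are placeholders, and the default longhorn-system namespace is assumed; the same tags can also be set from the Longhorn UI):

# Placeholder node name node-1; sets the node tags "storage" and "slow"
kubectl -n longhorn-system patch nodes.longhorn.io node-1 --type merge \
  -p '{"spec":{"tags":["storage","slow"]}}'
# Placeholder disk name default-disk-xxxx; sets the disk tag "slow" on that disk
kubectl -n longhorn-system patch nodes.longhorn.io node-1 --type merge \
  -p '{"spec":{"disks":{"default-disk-xxxx":{"tags":["slow"]}}}}'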

Expected behavior Low CPU usage from the instance-manager-e-xxxx pod (well under 2 CPUs)

Log

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=error msg="Error when shutting down frontend:device pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78: fail to stop SCSI device: Fail to logout target: Failed to execute: nsenter [--mount=/host/proc/1889/ns/mnt --net=/host/proc/1889/ns/net iscsiadm -m node -o delete -p 10.42.11.30 -T iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78], output , stderr, iscsiadm: This command will remove the record [iface: default, target: iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78, portal: 10.42.11.30,3260], but a session is using it. Logout session then rerun command to remove record.\niscsiadm: Could not execute operation on all records: session exists\n, error exit status 15"

time="2020-06-29T06:49:51Z" level=info msg="Closing: 10.42.9.32:10015"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=info msg="Closing: 10.42.10.27:10000"

[pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07] time="2020-06-29T06:49:51Z" level=warning msg="Failed to execute hook github.com/longhorn/longhorn-engine/app/cmd.startController.func1: errors when shutting down controller: frontend: device pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78: fail to stop SCSI device: Fail to logout target: Failed to execute: nsenter [--mount=/host/proc/1889/ns/mnt --net=/host/proc/1889/ns/net iscsiadm -m node -o delete -p 10.42.11.30 -T iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78], output , stderr, iscsiadm: This command will remove the record [iface: default, target: iqn.2019-10.io.longhorn:pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78, portal: 10.42.11.30,3260], but a session is using it. Logout session then rerun command to remove record.\niscsiadm: Could not execute operation on all records: session exists\n, error exit status 15 backend: <nil>"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process Manager: wait for process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 to shutdown before unregistering process"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="Process Manager: process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07 error out, error msg: exit status 1"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process update: pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07: state error: Error: exit status 1"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=info msg="Process Manager: successfully unregistered process pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07"

[longhorn-instance-manager] time="2020-06-29T06:49:51Z" level=debug msg="Process update: pvc-4b94793f-b98e-11ea-a3d9-06bcb1d17c78-e-bbfedf07: state error: Error: exit status 1"

Environment:

  • Longhorn version: 1.0.0
  • Kubernetes version: 1.14.6
  • Node OS type and version: RancherOS 1.5.4, with Ubuntu console

Additional context

Storage Class

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.beta.kubernetes.io/is-default-class: "false"
  creationTimestamp: "2020-06-27T21:08:13Z"
  name: longhorn-slow
  resourceVersion: "184546402"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/longhorn-slow
  uid: 501789a2-b8ba-11ea-a8a5-06aa83693c38
allowVolumeExpansion: true
provisioner: driver.longhorn.io
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  baseImage: ""
  diskSelector: slow
  fromBackup: ""
  nodeSelector: storage,slow
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
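For step 5, a minimal PVC sketch that provisions a volume through this class (the claim name and size are just placeholders):

# Hypothetical claim name and size; storageClassName matches the class above
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-slow
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-slow
  resources:
    requests:
      storage: 200Gi
EOF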

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 24 (6 by maintainers)

Most upvoted comments

I’m having this same issue with the following setup:

Longhorn version: 1.2.3
Kubernetes version: 1.23.1
OS: Ubuntu 20.04

When I look at the node with high CPU, I see a tgtd process utilizing 100%, with errors very similar to those found here: https://github.com/longhorn/longhorn/issues/1533#issuecomment-651522821
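In case it helps anyone confirm the same symptom, a quick way to see which process is eating the CPU on the node (plain ps, nothing Longhorn-specific):

# Show the top CPU consumers; tgtd near 100% matches this report
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 10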

I have the same issue with 100% CPU usage by tgtd process.

Longhorn version: 1.2.3
Kubernetes version: 1.21.8
OS: Ubuntu 20.04.3

cc @keithalucas @derekbit

@Wykiki Thanks for the suggestion. We’ve now added a section to the docs calling out unsupported OSes: https://longhorn.io/docs/1.0.2/best-practices/#oses-arent-supported-by-longhorn .

Ok, thanks!

Anyway, I wasn’t able to reproduce the same bug as mentioned in the thread, so I guess this issue can be closed; I’ll open a new one if I run into this problem again.

About RancherOS: it might be nice to mention that Longhorn does not work correctly on this OS, maybe on this page? https://longhorn.io/docs/1.0.2/best-practices/ Even though Ubuntu and CentOS are already mentioned as recommended OSes, the fact that Rancher supports both RancherOS and Longhorn, while the Longhorn documentation never explicitly states that RancherOS is unsupported, is what makes this kind of issue appear.