longhorn: [BUG] Instance Manager is running out of pthread resources
Describe the bug
The instance manager fails to start a new replica process, reporting: runtime/cgo: pthread_create failed: Resource temporarily unavailable.
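In this thread the culprit turned out to be the container runtime's PID limit (each pthread counts against the pids cgroup). A minimal diagnostic sketch to compare the live thread count against the cap from inside the pod — assuming cgroup v1 with the pids controller at its standard path; on cgroup v2 the same files sit directly under /sys/fs/cgroup:

// checkpids.go — minimal diagnostic sketch, not part of Longhorn.
// Assumes cgroup v1 paths inside the container.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	for _, name := range []string{"pids.max", "pids.current"} {
		// pids.max is the cap enforced by the container runtime (e.g.
		// CRI-O's pids_limit); pids.current is the live thread/process count.
		data, err := os.ReadFile("/sys/fs/cgroup/pids/" + name)
		if err != nil {
			fmt.Fprintf(os.Stderr, "read %s: %v\n", name, err)
			continue
		}
		fmt.Printf("%s = %s\n", name, strings.TrimSpace(string(data)))
	}
}

If pids.current is at or near pids.max when the error appears, the limit is being hit.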
To Reproduce
After a while in a normal operating state, create a PVC in a Longhorn-backed storage class. The volume is successfully created and attached but remains in the Degraded state, with replicas that won't initialize.
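For reference, a minimal PVC manifest of the kind used to reproduce — a sketch only; the storage class name "longhorn" and the size are placeholders for the actual values:

# pvc.yaml — sketch; storageClassName is an assumption, use the real class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi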
Expected behavior
The volume is in Healthy state.
Log
This is the error from one of the instance-manager-r pods (the one on the node that did not initialize the replica); it is also reported in longhorn-manager:
[longhorn-instance-manager] time="2021-10-22T06:14:42Z" level=info msg="wait for gRPC service of process pvc-f04220b3-0f93-458e-a17a-f4137f3c5d62-r-f797a9f2 to start at localhost:10570"
[pvc-f04220b3-0f93-458e-a17a-f4137f3c5d62-r-f797a9f2] runtime/cgo: pthread_create failed: Resource temporarily unavailable
[pvc-f04220b3-0f93-458e-a17a-f4137f3c5d62-r-f797a9f2] SIGABRT: abort
PC=0xbfdd0b m=0 sigcode=18446744073709551610
goroutine 0 [idle]:
runtime: unknown pc 0xbfdd0b
stack: frame={sp:0x7ffe37bcd990, fp:0x0} stack=[0x7ffe373cef18,0x7ffe37bcdf50)
00007ffe37bcd890: 0000000000210808 0000002200000003
00007ffe37bcd8a0: 00000000ffffffff 00007fe7d7872000
00007ffe37bcd8b0: 00007ffe37bcd8f8 00007ffe37bcd8e0
00007ffe37bcd8c0: 00007ffe37bcd8f0 0000000000bb775c
00007ffe37bcd8d0: 0000000000000000 00000000004680ce <runtime.callCgoMmap+62>
00007ffe37bcd8e0: 00007ffe37bcd8e0 0000000000000000
00007ffe37bcd8f0: 00007ffe37bcd930 000000000045fe38 <runtime.mmap.func1+88>
00007ffe37bcd900: 00007fe7e9c03000 0000000000001000
00007ffe37bcd910: 0000003200000003 00000000ffffffff
00007ffe37bcd920: 00007fe7e9c03000 00007ffe37bcd970
00007ffe37bcd930: 00007ffe37bcd9a8 0000000000404d3e <runtime.mmap+158>
00007ffe37bcd940: 00007ffe37bcd978 00007ffe37bcd978
00007ffe37bcd950: 00007ffe37bcd988 0000000000bb775c
00007ffe37bcd960: 00007fe7d7872000 00007ffe37bcd998
00007ffe37bcd970: 00007ffe37bcd9a8 0000000000bb775c
00007ffe37bcd980: 00007fe7e9c03000 00000000004680ce <runtime.callCgoMmap+62>
00007ffe37bcd990: <0000000000000000 0000000000000000
00007ffe37bcd9a0: 0000000000100000 00007ffe37bcd9d0
00007ffe37bcd9b0: 00007ffe37bcd9e0 00007ffe37bcda80
00007ffe37bcd9c0: 000000000042a48c <runtime.(*pageAlloc).update+604> 00007ffe37bcda90
00007ffe37bcd9d0: 000000000042a48c <runtime.(*pageAlloc).update+604> 00007fe7fe303c00
00007ffe37bcd9e0: 0000000000000008 000000000000fe80
00007ffe37bcd9f0: 0000000000000012 000000003b600000
00007ffe37bcda00: 000000003c000000 000780003c000000
00007ffe37bcda10: fffffffe7fffffff ffffffffffffffff
00007ffe37bcda20: ffffffffffffffff ffffffffffffffff
00007ffe37bcda30: ffffffffffffffff ffffffffffffffff
00007ffe37bcda40: ffffffffffffffff ffffffffffffffff
00007ffe37bcda50: ffffffffffffffff ffffffffffffffff
00007ffe37bcda60: ffffffffffffffff ffffffffffffffff
00007ffe37bcda70: ffffffffffffffff ffffffffffffffff
00007ffe37bcda80: ffffffffffffffff ffffffffffffffff
runtime: unknown pc 0xbfdd0b
stack: frame={sp:0x7ffe37bcd990, fp:0x0} stack=[0x7ffe373cef18,0x7ffe37bcdf50)
00007ffe37bcd890: 0000000000210808 0000002200000003
00007ffe37bcd8a0: 00000000ffffffff 00007fe7d7872000
00007ffe37bcd8b0: 00007ffe37bcd8f8 00007ffe37bcd8e0
00007ffe37bcd8c0: 00007ffe37bcd8f0 0000000000bb775c
00007ffe37bcd8d0: 0000000000000000 00000000004680ce <runtime.callCgoMmap+62>
00007ffe37bcd8e0: 00007ffe37bcd8e0 0000000000000000
00007ffe37bcd8f0: 00007ffe37bcd930 000000000045fe38 <runtime.mmap.func1+88>
00007ffe37bcd900: 00007fe7e9c03000 0000000000001000
00007ffe37bcd910: 0000003200000003 00000000ffffffff
00007ffe37bcd920: 00007fe7e9c03000 00007ffe37bcd970
00007ffe37bcd930: 00007ffe37bcd9a8 0000000000404d3e <runtime.mmap+158>
00007ffe37bcd940: 00007ffe37bcd978 00007ffe37bcd978
00007ffe37bcd950: 00007ffe37bcd988 0000000000bb775c
00007ffe37bcd960: 00007fe7d7872000 00007ffe37bcd998
00007ffe37bcd970: 00007ffe37bcd9a8 0000000000bb775c
00007ffe37bcd980: 00007fe7e9c03000 00000000004680ce <runtime.callCgoMmap+62>
00007ffe37bcd990: <0000000000000000 0000000000000000
00007ffe37bcd9a0: 0000000000100000 00007ffe37bcd9d0
00007ffe37bcd9b0: 00007ffe37bcd9e0 00007ffe37bcda80
00007ffe37bcd9c0: 000000000042a48c <runtime.(*pageAlloc).update+604> 00007ffe37bcda90
00007ffe37bcd9d0: 000000000042a48c <runtime.(*pageAlloc).update+604> 00007fe7fe303c00
00007ffe37bcd9e0: 0000000000000008 000000000000fe80
00007ffe37bcd9f0: 0000000000000012 000000003b600000
00007ffe37bcda00: 000000003c000000 000780003c000000
00007ffe37bcda10: fffffffe7fffffff ffffffffffffffff
00007ffe37bcda20: ffffffffffffffff ffffffffffffffff
00007ffe37bcda30: ffffffffffffffff ffffffffffffffff
00007ffe37bcda40: ffffffffffffffff ffffffffffffffff
00007ffe37bcda50: ffffffffffffffff ffffffffffffffff
00007ffe37bcda60: ffffffffffffffff ffffffffffffffff
00007ffe37bcda70: ffffffffffffffff ffffffffffffffff
00007ffe37bcda80: ffffffffffffffff ffffffffffffffff
goroutine 1 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:330 fp=0xc00004e788 sp=0xc00004e780 pc=0x463f30
runtime.main()
	/usr/local/go/src/runtime/proc.go:133 +0x70 fp=0xc00004e7e0 sp=0xc00004e788 pc=0x437910
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00004e7e8 sp=0xc00004e7e0 pc=0x466041
rax 0x0
rbx 0x242a880
rcx 0xbfdd0b
rdx 0x0
rdi 0x2
rsi 0x7ffe37bcd990
rbp 0x10c3fdb
rsp 0x7ffe37bcd990
r8 0x0
r9 0x7ffe37bcd990
r10 0x8
r11 0x246
r12 0x242bbf0
r13 0x0
r14 0x105f81c
r15 0x0
rip 0xbfdd0b
rflags 0x246
cs 0x33
fs 0x0
gs 0x0
[longhorn-instance-manager] time="2021-10-22T06:14:42Z" level=info msg="Process Manager: process pvc-f04220b3-0f93-458e-a17a-f4137f3c5d62-r-f797a9f2 error out, error msg: exit status 2"
[longhorn-instance-manager] time="2021-10-22T06:14:42Z" level=debug msg="Process update: pvc-f04220b3-0f93-458e-a17a-f4137f3c5d62-r-f797a9f2: state error: Error: exit status 2"
Environment:
- Longhorn version: v1.2.2
- Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
- Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: Vanilla
- Number of management nodes in the cluster: 1
- Number of worker nodes in the cluster: 5 (but only 3 for Longhorn)
- Node config
- OS type and version: Arch Linux with kernel 5.14.11 / glibc 2.33
- Container runtime: CRI-O 1.22 / runc 1.0.2
- CPU per node: 4
- Memory per node: 32 GB
- Disk type (e.g. SSD/NVMe): Mixed NVMe and HDD, with two storage classes using disk labels.
- Network bandwidth between the nodes: Gigabit Ethernet (1000BASE-T)
- Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
- Number of Longhorn volumes in the cluster: 34
Additional context
This is a small cluster (3 bare-metal hosts with 4-core CPUs / 32 GB RAM / 1 TB storage) hosting 34 volumes. All systems were OK for 7 days; then, upon creating a new volume, the volume entered a Degraded state and one of the replicas could not be scheduled. The logs of the corresponding instance-manager-r pod show the error above. When the affected pod was restarted, all volumes went into the Degraded state and a full resync was triggered, but at the end of the process another (not the same) instance-manager-r reported the same error. After waiting for the sync to go as far as it could and restarting all instance-manager-r pods, the cluster went back to a stable state.
Could this be a goroutine leak or a race condition?
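One way to check for a leak of this kind — a sketch, not part of Longhorn, assuming access to /proc on the node — is to watch the thread count of the instance-manager process over time:

// threadwatch.go — sketch: print the thread count of a PID every 10 s.
// Usage: go run threadwatch.go <pid>. Assumes a Linux /proc filesystem.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: threadwatch <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]
	for {
		// Each entry under /proc/<pid>/task is one thread; a count that
		// only ever grows suggests a thread (pthread) leak.
		tasks, err := os.ReadDir("/proc/" + pid + "/task")
		if err != nil {
			fmt.Fprintf(os.Stderr, "read tasks: %v\n", err)
			os.Exit(1)
		}
		fmt.Printf("%s threads=%d\n", time.Now().Format(time.RFC3339), len(tasks))
		time.Sleep(10 * time.Second)
	}
}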
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 1
- Comments: 16 (6 by maintainers)
I just hit the problem when scaling my cluster up to 5 nodes with 23 volumes at 3 replicas each, using CRI-O too.
Restarting the controllers fixed the problem on my side.
So, more than 20 days without any new incident. It seems to be stable now with Longhorn v1.2.3 and a correct pids_limit. I don't know what the correct value should be, but 2048 seems to be working for @n0rad.
Closing now. Thanks all for your help.
Same here (still with 2048) after restarting all nodes to upgrade Kubernetes, now with 14 to 16 days of uptime. Maybe restarting only the controllers was just not enough.
No incident for 17 days now using:
I will now upgrade to 1.2.3, which will reset this counter 😛
I think I shouted victory too fast. After 24 hours, the problem came back on multiple nodes 😞
Here are the results from the affected node epimethee today:
https://github.com/fkocik/lh-support/blob/main/lh-trace.tar.gz?raw=true
It’s a bare-metal setup, so I do not have enough resources to reproduce on another Linux distro. I can try downgrading the kernel to switch to the LTS stream; I don’t know if that would help.
Another interesting result I found is this parameter of the container runtime (CRI-O):
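Presumably this refers to pids_limit in /etc/crio/crio.conf; a sketch of the relevant stanza, using 2048 — the value reported to work elsewhere in this thread — rather than any verified default:

# /etc/crio/crio.conf — sketch; exact location and defaults vary by CRI-O version
[crio.runtime]
# Maximum number of processes (and threads) allowed in a container.
pids_limit = 2048

If I understand correctly, CRI-O must be restarted for the change to apply, and existing pods keep their old limit until they are recreated.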
Do you think it could be the root cause?