longhorn: [BUG] fatal error: concurrent map writes
Describe the bug
During normal execution, a longhorn-manager instance throws a go panic and exits with the following message and a huge stack trace: fatal error: concurrent map writes.
This has occurred 6 times in the last 24 hours on my cluster, on random nodes.
To Reproduce
Steps to reproduce the behavior:
- Create a 1.2.0 Longhorn cluster with volumes
- Watch the logs of longhorn-manager containers
- Wait 6 hours
- Observe the error occurs on a random node
Expected behavior
Longhorn-manager should not throw Go panics and the container should not restart unexpectedly.
Log
There are a large number of goroutines listed in the panic, I included only the first.
fatal error: concurrent map writes
fatal error: concurrent map writes
goroutine 1215482 [running]:
runtime.throw(0x1b8963e, 0x15)
/usr/local/go/src/runtime/panic.go:1114 +0x72 fp=0xc001623760 sp=0xc001623730 pc=0x4348e2
runtime.mapassign_faststr(0x19023e0, 0xc0030a99b0, 0x1b79e12, 0xa, 0x24)
/usr/local/go/src/runtime/map_faststr.go:291 +0x3de fp=0xc0016237c8 sp=0xc001623760 pc=0x41359e
github.com/rancher/go-rancher/api.populateSchema(0xc002e80440, 0xc0030c7a20)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:101 +0xdf fp=0xc001623818 sp=0xc0016237c8 pc=0x138a79f
github.com/rancher/go-rancher/api.SchemasHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:93 +0xf6 fp=0xc001623878 sp=0xc001623818 pc=0x138ec06
net/http.HandlerFunc.ServeHTTP(0xc00127dd90, 0x1e920a0, 0xc000859880, 0xc00356cd00)
/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc0016238a0 sp=0xc001623878 pc=0x7e7284
github.com/rancher/go-rancher/api.ApiHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:26 +0x13b fp=0xc0016238f8 sp=0xc0016238a0 pc=0x138e27b
net/http.HandlerFunc.ServeHTTP(0xc0030add80, 0x1e920a0, 0xc000859880, 0xc00356cd00)
/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623920 sp=0xc0016238f8 pc=0x7e7284
github.com/gorilla/context.ClearHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/context/context.go:141 +0x74 fp=0xc001623968 sp=0xc001623920 pc=0x1388b74
net/http.HandlerFunc.ServeHTTP(0xc0030adda0, 0x1e920a0, 0xc000859880, 0xc00356cd00)
/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623990 sp=0xc001623968 pc=0x7e7284
github.com/gorilla/mux.(*Router).ServeHTTP(0xc00030d380, 0x1e920a0, 0xc000859880, 0xc00356cb00)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/mux/mux.go:212 +0xe2 fp=0xc001623ac0 sp=0xc001623990 pc=0x137d9c2
github.com/longhorn/longhorn-manager/util.filteredLoggingHandler.ServeHTTP(0xc0033a5a40, 0x1e4ca60, 0xc00030d380, 0x1e4f040, 0xc0033a5a70, 0x1e920a0, 0xc000859880, 0xc00356cb00)
/go/src/github.com/longhorn/longhorn-manager/util/util.go:550 +0xe4 fp=0xc001623b00 sp=0xc001623ac0 pc=0x12c39b4
github.com/longhorn/longhorn-manager/util.(*filteredLoggingHandler).ServeHTTP(0xc0033a5aa0, 0x1e920a0, 0xc000859880, 0xc00356cb00)
<autogenerated>:1 +0x9b fp=0xc001623b78 sp=0xc001623b00 pc=0x12c824b
github.com/gorilla/handlers.ProxyHeaders.func1(0x1e920a0, 0xc000859880, 0xc00356cb00)
/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/handlers/proxy_headers.go:59 +0x10c fp=0xc001623bc0 sp=0xc001623b78 pc=0x12b39bc
net/http.HandlerFunc.ServeHTTP(0xc0030acac0, 0x1e920a0, 0xc000859880, 0xc00356cb00)
/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623be8 sp=0xc001623bc0 pc=0x7e7284
net/http.serverHandler.ServeHTTP(0xc0002840e0, 0x1e920a0, 0xc000859880, 0xc00356cb00)
/usr/local/go/src/net/http/server.go:2807 +0xa3 fp=0xc001623c18 sp=0xc001623be8 pc=0x7ea613
net/http.(*conn).serve(0xc001fd6460, 0x1e96c60, 0xc002e80300)
/usr/local/go/src/net/http/server.go:1895 +0x86c fp=0xc001623fc8 sp=0xc001623c18 pc=0x7e608c
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc001623fd0 sp=0xc001623fc8 pc=0x466dc1
created by net/http.(*Server).Serve
/usr/local/go/src/net/http/server.go:2933 +0x35c
Support Bundle
longhorn-support-bundle_5be709ef-a54d-46ea-971f-a9f7c4fb4b26_2021-09-28T04-03-38Z.zip
Environment:
Longhorn version: 1.2.0
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm
Kubernetes version: 1.20.10
Number of management node in the cluster: 3
Number of worker node in the cluster: 6
Node config
OS type and version: Ubuntu 20.04.1 LTS
CPU per node: 8
Memory per node: 16GB (plus 1x32GB)
Disk type(e.g. SSD/NVMe): NVMe
Network bandwidth between the nodes: 2.5Gbit
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 22
Additional context
N/A
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 4
- Comments: 15 (11 by maintainers)
Root cause
Concurrent map writes happened in SchemasHandler -> populateSchema:
https://github.com/rancher/go-rancher/blob/master/api/handler.go#L82
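For context, here is a minimal, self-contained sketch of the failure mode (illustrative only, not the go-rancher code): Go maps are not safe for concurrent writes, and net/http serves each connection on its own goroutine, so a handler that writes to a shared map without synchronization can abort the whole process exactly as seen in the trace above.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Shared map written by every request. This mirrors the pattern in
// populateSchema, which mutates links on a schema shared across requests
// (the exact go-rancher fields are simplified here).
var links = map[string]string{}

func schemasHandler(w http.ResponseWriter, r *http.Request) {
	// Unsynchronized write: net/http runs each connection on its own
	// goroutine, so two in-flight requests can execute this line at the
	// same time and crash the process with
	// "fatal error: concurrent map writes".
	links["self"] = "http://" + r.Host + r.URL.Path
	fmt.Fprintln(w, links["self"])
}

func main() {
	http.HandleFunc("/v1/schemas", schemasHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```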
Reproduce steps
- crictl logs --follow <longhorn-manager-container-name>
- bash run.sh &
Then, the fatal error is shown in the container log.
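The contents of run.sh are not included in this issue; based on the stack trace (the panic originates in SchemasHandler), the trigger appears to be concurrent requests hitting the manager's schemas endpoint. A hypothetical load generator along those lines (the endpoint address below is an assumption, not taken from the issue):

```go
package main

import (
	"net/http"
	"sync"
)

func main() {
	// Assumed address: substitute whatever URL reaches the
	// longhorn-manager API in your cluster.
	const url = "http://localhost:9500/v1/schemas"

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Hammer the schemas endpoint so many handler goroutines
			// run populateSchema concurrently.
			for j := 0; j < 200; j++ {
				resp, err := http.Get(url)
				if err != nil {
					continue
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```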
Solution
One solution is deep copying the schemas and schema first to avoid the concurrent writes.
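A sketch of that idea, with a schema shape loosely modeled on go-rancher's client.Schema (field names are illustrative, this is not the actual patch): each request copies the schema, including its Links map, before writing request-specific links, so no two goroutines ever write to the same map.

```go
package main

import "fmt"

// Schema is a simplified stand-in for the shared schema object.
type Schema struct {
	ID    string
	Links map[string]string
}

// copySchema allocates a fresh Links map so a request can write to it
// without touching the shared original.
func copySchema(s *Schema) *Schema {
	c := &Schema{ID: s.ID, Links: make(map[string]string, len(s.Links)+2)}
	for k, v := range s.Links {
		c.Links[k] = v
	}
	return c
}

// populateSchemaCopy shows what a per-request populateSchema could do:
// copy first, then set the request-specific links on the private copy.
func populateSchemaCopy(shared *Schema, selfURL string) *Schema {
	s := copySchema(shared)
	s.Links["self"] = selfURL
	s.Links["collection"] = selfURL
	return s
}

func main() {
	shared := &Schema{ID: "volume", Links: map[string]string{}}
	fmt.Println(populateSchemaCopy(shared, "http://example/v1/schemas/volume").Links)
}
```

The trade-off is a small per-request allocation in exchange for removing the shared mutable state; guarding the writes with a sync.Mutex would also work, but it serializes all schema requests.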
Hey, we've run into this issue recently after upgrading Longhorn to v1.2.4, happening on multiple clusters at the same time, specifically during backup operations. I've also sent the bundle just in case you need it.
Validate on master-head 20220426. Result: Pass.
Using the test steps below can reproduce fatal error: concurrent map writes on v1.2.4, but it is not observed on master-head:
kubectl logs <longhorn manager pod name> -n longhorn-system -f
Result: On master-head, no fatal error: concurrent map writes happened on longhorn-manager, and no restart happened on longhorn-manager.
@joshimoo https://gist.github.com/tyzbit/deedc6c8ea667418330692942ec841e6
@jenting Yes, this issue occurs without any intervention or action on my part.
I also checked syslog on one of the hosts around the time the issue happened and did not see any logs except the normal kubelet logs handling the container restarting.