longhorn: [BUG] fatal error: concurrent map writes

Describe the bug During normal execution, a longhorn-manager instance throws a Go panic and exits with the message fatal error: concurrent map writes, followed by a huge stack trace.

This has occurred 6 times in the last 24 hours on my cluster, on random nodes.

To Reproduce Steps to reproduce the behavior:

  1. Create a 1.2.0 Longhorn cluster with volumes
  2. Watch the logs of the longhorn-manager containers
  3. Wait 6 hours
  4. Observe that the error occurs on a random node

Expected behavior Longhorn-manager should not throw Go panics, and the container should not restart unexpectedly.

Log

There are a large number of goroutines listed in the panic; I have included only the first.

fatal error: concurrent map writes 
fatal error: concurrent map writes 

goroutine 1215482 [running]: 
runtime.throw(0x1b8963e, 0x15)
	/usr/local/go/src/runtime/panic.go:1114 +0x72 fp=0xc001623760 sp=0xc001623730 pc=0x4348e2
runtime.mapassign_faststr(0x19023e0, 0xc0030a99b0, 0x1b79e12, 0xa, 0x24)
	/usr/local/go/src/runtime/map_faststr.go:291 +0x3de fp=0xc0016237c8 sp=0xc001623760 pc=0x41359e
github.com/rancher/go-rancher/api.populateSchema(0xc002e80440, 0xc0030c7a20)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:101 +0xdf fp=0xc001623818 sp=0xc0016237c8 pc=0x138a79f
github.com/rancher/go-rancher/api.SchemasHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00)
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:93 +0xf6 fp=0xc001623878 sp=0xc001623818 pc=0x138ec06
net/http.HandlerFunc.ServeHTTP(0xc00127dd90, 0x1e920a0, 0xc000859880, 0xc00356cd00)
	/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc0016238a0 sp=0xc001623878 pc=0x7e7284
github.com/rancher/go-rancher/api.ApiHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00) 
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/rancher/go-rancher/api/handler.go:26 +0x13b fp=0xc0016238f8 sp=0xc0016238a0 pc=0x138e27b
net/http.HandlerFunc.ServeHTTP(0xc0030add80, 0x1e920a0, 0xc000859880, 0xc00356cd00) 
	/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623920 sp=0xc0016238f8 pc=0x7e7284
github.com/gorilla/context.ClearHandler.func1(0x1e920a0, 0xc000859880, 0xc00356cd00) 
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/context/context.go:141 +0x74 fp=0xc001623968 sp=0xc001623920 pc=0x1388b74
net/http.HandlerFunc.ServeHTTP(0xc0030adda0, 0x1e920a0, 0xc000859880, 0xc00356cd00) 
	/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623990 sp=0xc001623968 pc=0x7e7284
github.com/gorilla/mux.(*Router).ServeHTTP(0xc00030d380, 0x1e920a0, 0xc000859880, 0xc00356cb00) 
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/mux/mux.go:212 +0xe2 fp=0xc001623ac0 sp=0xc001623990 pc=0x137d9c2
github.com/longhorn/longhorn-manager/util.filteredLoggingHandler.ServeHTTP(0xc0033a5a40, 0x1e4ca60, 0xc00030d380, 0x1e4f040, 0xc0033a5a70, 0x1e920a0, 0xc000859880, 0xc00356cb00)
	/go/src/github.com/longhorn/longhorn-manager/util/util.go:550 +0xe4 fp=0xc001623b00 sp=0xc001623ac0 pc=0x12c39b4
github.com/longhorn/longhorn-manager/util.(*filteredLoggingHandler).ServeHTTP(0xc0033a5aa0, 0x1e920a0, 0xc000859880, 0xc00356cb00)
	<autogenerated>:1 +0x9b fp=0xc001623b78 sp=0xc001623b00 pc=0x12c824b
github.com/gorilla/handlers.ProxyHeaders.func1(0x1e920a0, 0xc000859880, 0xc00356cb00) 
	/go/src/github.com/longhorn/longhorn-manager/vendor/github.com/gorilla/handlers/proxy_headers.go:59 +0x10c fp=0xc001623bc0 sp=0xc001623b78 pc=0x12b39bc
net/http.HandlerFunc.ServeHTTP(0xc0030acac0, 0x1e920a0, 0xc000859880, 0xc00356cb00)
	/usr/local/go/src/net/http/server.go:2012 +0x44 fp=0xc001623be8 sp=0xc001623bc0 pc=0x7e7284
net/http.serverHandler.ServeHTTP(0xc0002840e0, 0x1e920a0, 0xc000859880, 0xc00356cb00) 
	/usr/local/go/src/net/http/server.go:2807 +0xa3 fp=0xc001623c18 sp=0xc001623be8 pc=0x7ea613 
net/http.(*conn).serve(0xc001fd6460, 0x1e96c60, 0xc002e80300) 
	 /usr/local/go/src/net/http/server.go:1895 +0x86c fp=0xc001623fc8 sp=0xc001623c18 pc=0x7e608c 
runtime.goexit() 
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc001623fd0 sp=0xc001623fc8 pc=0x466dc1
created by net/http.(*Server).Serve 
	/usr/local/go/src/net/http/server.go:2933 +0x35c 

Support Bundle

longhorn-support-bundle_5be709ef-a54d-46ea-971f-a9f7c4fb4b26_2021-09-28T04-03-38Z.zip

Environment:

Longhorn version: 1.2.0
Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Helm
Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: kubeadm
    Kubernetes version: 1.20.10
    Number of management nodes in the cluster: 3
    Number of worker nodes in the cluster: 6
Node config
    OS type and version: Ubuntu 20.04.1 LTS
    CPU per node: 8
    Memory per node: 16GB (plus 1x32GB)
    Disk type(e.g. SSD/NVMe): NVMe
    Network bandwidth between the nodes: 2.5Gbit
Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): Baremetal
Number of Longhorn volumes in the cluster: 22

Additional context N/A

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Root cause

Concurrent map writes happened in both code paths that call populateSchema (a minimal sketch of the failure mode follows the link below):

  • SchemasHandler -> populateSchema (the /v1/schemas collection endpoint).
  • SchemaHandler -> populateSchema (the /v1/schemas/<name> endpoint).

https://github.com/rancher/go-rancher/blob/master/api/handler.go#L82
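
To illustrate the failure mode only, here is a minimal, self-contained Go sketch (not the go-rancher code; the route and map layout are made up): every request goroutine writes to the same package-level map, and two concurrent requests are enough for the runtime to abort with fatal error: concurrent map writes.

package main

import (
	"log"
	"net/http"
)

// shared stands in for the schema map that populateSchema mutates on every request.
var shared = map[string]string{}

func main() {
	http.HandleFunc("/v1/schemas", func(w http.ResponseWriter, r *http.Request) {
		// Unsynchronized write to shared state from concurrent request goroutines;
		// this is exactly what the Go runtime reports as "concurrent map writes".
		shared["self"] = r.Host
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}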

Reproduce steps

  1. Create a one-node LH cluster
  2. Trace the longhorn-manager container log with crictl logs --follow <longhorn-manager-container-name>
  3. Run one of the scripts below twice with bash run.sh &
#!/bin/bash
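# Fire ~1000 concurrent GET requests at the schemas collection endpoint (/v1/schemas).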

for i in {0..1000}
do
        curl -Ls http://{LH-IP}/v1/schemas > /dev/null &
done

or

#!/bin/bash
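# Fire ~1000 concurrent GET requests at a single schema endpoint (/v1/schemas/backupTarget).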

for i in {0..1000}
do
        curl -Ls http://{LH-IP}/v1/schemas/backupTarget > /dev/null &
done

Then, the fatal error is shown in the container log.

Solution

One solution is to deep copy the schemas and schema objects first, so that each request mutates only its own copy and the concurrent writes are avoided; see the rough sketch below.
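
A rough sketch of that idea, using a hypothetical, simplified Schema type rather than the real go-rancher structures: the shared schema is built once, and each request deep copies it before filling in request-specific links, so no goroutine ever writes to the shared map.

package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Schema is a simplified stand-in for the shared schema object whose map
// fields were being written concurrently by SchemasHandler/SchemaHandler.
type Schema struct {
	Links map[string]string `json:"links"`
}

// shared is built once at startup and is never mutated per request.
var shared = &Schema{Links: map[string]string{"collection": "/v1/schemas"}}

// deepCopy gives each request its own Schema so request-specific links can be
// filled in without racing other request goroutines.
func deepCopy(s *Schema) *Schema {
	cp := &Schema{Links: make(map[string]string, len(s.Links))}
	for k, v := range s.Links {
		cp.Links[k] = v
	}
	return cp
}

func schemasHandler(w http.ResponseWriter, r *http.Request) {
	local := deepCopy(shared)
	local.Links["self"] = "http://" + r.Host + r.URL.Path // write only to the copy
	_ = json.NewEncoder(w).Encode(local)
}

func main() {
	http.HandleFunc("/v1/schemas", schemasHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}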

Hey, we’ve run into this issue recently after upgrading Longhorn to v1.2.4. It is happening on multiple clusters at the same time, specifically during backup operations. I’ve also sent the bundle in case you need it.

Validated on master-head 20220426. Result: Pass

The test steps below reproduce fatal error: concurrent map writes on v1.2.4, but it is not observed on master-head:

  1. Create a one-node LH cluster
  2. Trace the longhorn-manager pod log with kubectl logs <longhorn manager pod name> -n longhorn-system -f
  3. Run the shell script mentioned in step 3 of the comment above

Result: On master-head, no fatal error: concurrent map writes occurred in longhorn-manager, and no longhorn-manager restarts were observed.

@joshimoo https://gist.github.com/tyzbit/deedc6c8ea667418330692942ec841e6

@jenting Yes, this issue occurs without any intervention or action on my part.

I also checked syslog on one of the hosts around the time the issue happened and did not see anything except the normal kubelet logs from handling the container restart.