go: runtime: gentraceback() dead loop on arm64 causes the process to hang
What version of Go are you using (go version)?
$ go version
go version go1.18 linux/arm64
Does this issue reproduce with the latest release?
Yes, it reproduces with the latest Go 1.18.
What operating system and processor architecture are you using (go env)?
go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1414115226=/tmp/go-build -gno-record-gcc-switches"
What did you do?
We ran a benchmark workload against our database (https://github.com/pingcap/tidb/issues/31477). After we recently upgraded the toolchain to Go 1.18, the problem reproduces reliably: after 6~8h (at most 24h), the tidb-server hangs. When that happens, tidb-server CPU usage is 100% (one thread fully occupied) and it stops serving any requests.
I used gdb and dlv to debug it, and found that one thread is stuck in a dead loop in the gentraceback() function (which also appears to hold a lock), while all the other threads are blocked in futex() waiting on that lock.
I don’t know how to reproduce it with a minimal code snippet. I could provide a core dump of the process, but it would be very large.
Some more information I can provide:
- This bug only reproduces on arm64; we have never encountered it on x86.
- It was reportedly first observed in TiDB 5.4, which was built with Go 1.16.3 or Go 1.16.4 (I don’t remember exactly), also on arm64.
- Go 1.18 seems to trigger the bug more easily: TiDB v5.1.4, built with Go 1.16.4, did not reproduce it after more than 36h.
The phenomenon looks like another issue, https://github.com/golang/go/issues/50772, but I’ve checked that that fix is included in Go 1.18, so this may be a new case.
More details: the code doesn’t enter this block: https://github.com/golang/go/blob/4aa1efed4853ea067d665a952eee77c52faac774/src/runtime/traceback.go#L357-L377
It then runs into this branch: https://github.com/golang/go/blob/4aa1efed4853ea067d665a952eee77c52faac774/src/runtime/traceback.go#L379-L380
funcID is funcID_wrapper. Note that after the n-- and the n++, the value of n never changes, so the n < max condition can never break the loop.
This code block sets frame to its caller frame, but frame.fn and flr are the same function! Thus it results in a dead loop.
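To make the failure mode concrete, here is a minimal, self-contained sketch (my own illustration in plain Go, not the real gentraceback code): when a wrapper frame is skipped, n-- cancels the later n++, and if the resolved caller (flr) is the same frame we are standing on, neither n nor frame changes between iterations, so n < max can never terminate the loop.

package main

import "fmt"

// frame is a toy model of an unwinder frame, not the runtime's stkframe:
// it records the function name, whether it is a wrapper, and the caller
// that the unwinder resolved for it.
type frame struct {
	fn      string
	wrapper bool
	flr     *frame // resolved caller frame
}

func unwind(f *frame, max int) {
	n := 0
	for iter := 0; n < max && f != nil; iter++ {
		if f.wrapper {
			n-- // mirrors the n-- for funcID_wrapper frames
		}
		n++
		prev := f
		f = f.flr // "set frame to its upper frame"
		if f == prev {
			// The real unwinder has no such escape hatch: it just spins.
			fmt.Printf("stuck after %d iterations: fn=%s, n=%d\n", iter+1, f.fn, n)
			return
		}
	}
	fmt.Printf("unwound %d frames\n", n)
}

func main() {
	self := &frame{fn: "example wrapper", wrapper: true}
	self.flr = self // corrupted unwind state: the caller resolves to itself
	unwind(self, 64)
}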
I can work around this bug with this patch: https://github.com/tiancaiamao/go/commit/5d1aea43625ff3575da04de6b9644bb84eb45517. But I still can’t figure out the root cause of the bug.
The stack printed while debugging in dlv:
(dlv) thread 1945
Switched from 1925 to 1945
(dlv) bt
0 0x00000000013211d8 in runtime.funcdata
at /usr/local/go1.18/src/runtime/symtab.go:1097
1 0x00000000013211d8 in runtime.gentraceback
at /usr/local/go1.18/src/runtime/traceback.go:357
2 0x00000000012caa8c in runtime.makechan
at :0
3 0x000000000135a19c in context.(*cancelCtx).Done
at /usr/local/go1.18/src/context/context.go:367
4 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
5 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
6 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
7 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
8 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
9 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
10 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
11 0x000000000135bcd4 in context.(*valueCtx).Done
at <autogenerated>:1
12 0x0000000001359878 in context.propagateCancel
at /usr/local/go1.18/src/context/context.go:251
13 0x0000000001359744 in context.WithCancel
at /usr/local/go1.18/src/context/context.go:237
14 0x0000000001bdb318 in github.com/tikv/client-go/v2/internal/retry.(*Backoffer).Fork
at /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/internal/retry/backoff.go:259
15 0x0000000001cbc258 in github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1
at /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1992
16 0x000000000132eaa4 in runtime.goexit
at /usr/local/go1.18/src/runtime/asm_arm64.s:1259
(dlv) gr
Thread 1945 at /usr/local/go1.18/src/runtime/symtab.go:845
Goroutine 1703734534:
Runtime: /usr/local/go1.18/src/runtime/symtab.go:845 runtime.findfunc (0x1319340)
User: /usr/local/go1.18/src/context/context.go:367 context.(*cancelCtx).Done (0x135a19c)
Go: /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1978 github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker (0x1cbbf80)
Start: /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1978 github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1 (0x1cbc1c0)
Labels: "plan_digest":"�a-g+\x00c��+B��Xl�\"]��\x1e3ߩ�\x11i.]���", "sql_digest":"1�:,U[�pb\vҔ�\x1a�\x0f��\x1a\x01DJ-�b���_XM�"
In gdb:
(gdb) bt
#0 0x0000000001321c08 in runtime.gentraceback (pc0=20059304, sp0=281472080470624, lr0=<optimized out>, gp=0x40009081a0, skip=0, pcbuf=0x4000042968, max=64, callback={void (runtime.stkframe *, void *, bool *)} 0x4000042940, v=0x0, flags=6, ~r0=<optimized out>)
at /usr/local/go1.18/src/runtime/traceback.go:379
#1 0x00000000013072d0 in runtime.sigprof (pc=6, sp=<optimized out>, lr=<optimized out>, gp=0x15, mp=0x4000600c00) at /usr/local/go1.18/src/runtime/proc.go:4507
#2 0x0000000001311e64 in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0x15) at /usr/local/go1.18/src/runtime/signal_unix.go:613
#3 0x0000000001311614 in runtime.sigtrampgo (sig=27, info=0x4000042da0, ctx=0x4000042e20) at /usr/local/go1.18/src/runtime/signal_unix.go:477
#4 0x00000000013302ac in runtime.sigtrampgo (sig=27, info=0x4000042da0, ctx=0x4000042e20) at <autogenerated>:1
#5 0x000000000132fbc4 in runtime.sigtramp () at /usr/local/go1.18/src/runtime/sys_linux_arm64.s:489
#6 <signal handler called>
#7 0x00000000013214a8 in runtime.gentraceback (pc0=19704460, sp0=276041063296, lr0=<optimized out>, gp=0x40562c0ea0, skip=0, pcbuf=0x0, max=2147483647, callback={void (runtime.stkframe *, void *, bool *)} 0xffff535ee5c0, v=0xffff535ee618, flags=0, ~r0=<optimized out>)
at /usr/local/go1.18/src/runtime/traceback.go:331
#8 0x0000000001315550 in runtime.copystack (gp=0x40562c0ea0, newsize=4096) at /usr/local/go1.18/src/runtime/stack.go:930
#9 0x00000000013159ac in runtime.newstack () at /usr/local/go1.18/src/runtime/stack.go:1110
#10 0x000000000132c4a0 in runtime.morestack () at /usr/local/go1.18/src/runtime/asm_arm64.s:310
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
The stacks in dlv and gdb don’t look exactly the same, although the innermost frame is in gentraceback (or a function it calls) in both. I tried to read the stack information from pcbuf to reason about the real stack; the pc-address-to-symbol mapping below is just my guess.
(gdb) p pcbuf[0]
$2 = 20059305
13214a9 = runtime.gentraceback?
(gdb) p pcbuf[1]
$3 = 20010320
1315550 = runtime.copystack?
(gdb) p pcbuf[2]
$4 = 20011436
13159ac = runtime.newstack?
(gdb) p pcbuf[3]
$5 = 19704460
12caa8c = runtime.makechan?
(gdb) p pcbuf[4]
$6 = 19912820
12fd874
(gdb) p pcbuf[5]
$7 = 19912865
12fd8a1
(gdb) p pcbuf[6]
$8 = 0
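One way to cross-check these guesses is to resolve the addresses against the same binary with go tool addr2line, which reads hex addresses on stdin and prints the containing function and its file:line. The binary name below is just a placeholder, and the expected output for 0x1315550 is inferred from gdb frame #8 above rather than re-run here:
$ go tool addr2line ./tidb-server
0x1315550
runtime.copystack
/usr/local/go1.18/src/runtime/stack.go:930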
Is there anything else I can provide to help debug this?
What did you expect to see?
No dead loop in gentraceback() causing the server to hang.
What did you see instead?
A gentraceback() dead loop on arm64 caused the process to hang.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 20 (6 by maintainers)
Commits related to this issue
- runtime: use saved LR when unwinding through morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... saved LR = return PC in F ... — committed to tiancaiamao/go by tiancaiamao 2 years ago
- runtime: use saved LR when unwinding through morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... saved LR = return PC in F... — committed to DataDog/go by cherrymui 2 years ago
- runtime: use saved LR when unwinding through morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... saved LR = return PC in F... — committed to DataDog/go by cherrymui 2 years ago
- [release-branch.go1.17] runtime: use saved LR when unwinding through morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... s... — committed to golang/go by cherrymui 2 years ago
- [release-branch.go1.18] runtime: use saved LR when unwinding through morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... s... — committed to golang/go by cherrymui 2 years ago
We’re also experiencing this issue and would appreciate a backport of this fix for Go 1.18.
Will this fix be cherry-picked to 1.18? And when will we get the first version with this issue fixed? @cherrymui