go: runtime: incorrect frame information in traceback traversal may hang the process.
What version of Go are you using (go version
)?
go version go1.17.0 linux/arm64
Does this issue reproduce with the latest release?
It’s been also been reproduced with go version go1.17.6 linux/arm64
.
What operating system and processor architecture are you using (go env
)?
Linux ARM64 (it could affect other architectures where sys.MinFrameSize > 0
).
What did you do?
Run a go program with profiling enabled on Linux/ARM64. This may eventually (it may take from minutes to days) cause the process to hang. I have a binary and a core dump that showcases the problem (I can share those but they’re rather big -the core dump is 4.2GB-) but we have been unable to reproduce it since with this binary. We have also been able to reproduce in production with private code on linux/arm64 with version go1.17.6 (more on this later). We have tried hard and failed to generate a small reproducible program.
What did you expect to see?
The program should generate the stack traces and keep running normally.
What did you see instead?
The program hangs. This seems to be caused by a combination of an infinite loop in runtime.gentraceback
function while there’s a stopTheWorld
happening. As a result, the single running goroutine keeps running forever.
Here’s the backtrace when the problem occurs:
gef➤ bt
#0 0x000000000046c52c in runtime.pcvalue (f=..., off=0x20830, targetpc=0x48108d, cache=0x401880e6f8, strict=0x1, ~r5=<optimized out>, ~r6=<optimized out>) at runtime/symtab.go:815
#1 0x000000000046ca60 in runtime.funcspdelta (f=..., targetpc=0x48108d, cache=0x401880e6f8, ~r3=<optimized out>) at runtime/symtab.go:957
#2 0x0000000000473c5c in runtime.gentraceback (pc0=0x456078, sp0=0xffc41457b580, lr0=0x0, gp=0x4000602b60, skip=0x0, pcbuf=0x401880e958, max=0x40, callback={void (runtime.stkframe *, void *, bool *)} 0x401880e930, v=0x0, flags=0x6, ~r10=<optimized out>) at runtime/traceback.go:207
#3 0x000000000045941c in runtime.sigprof (pc=0xffffab3f84b0, sp=0xffc41457b560, lr=0x482f7c, gp=0x4000602b60, mp=0x4000580c00) at runtime/proc.go:4755
#4 0x00000000004648a0 in runtime.sighandler (sig=0x1b, info=0x401880eda0, ctxt=0x401880ee20, gp=0x4000602b60) at runtime/signal_unix.go:542
#5 0x0000000000464058 in runtime.sigtrampgo (sig=0x1b, info=0x401880eda0, ctx=0x401880ee20) at runtime/signal_unix.go:460
#6 0x0000000000483114 in runtime.sigtramp () at runtime/sys_linux_arm64.s:458
#7 <signal handler called>
#8 0x0000ffffab3f84b0 in __kernel_clock_gettime ()
#9 0x0000000000482f7c in runtime.nanotime1 () at runtime/sys_linux_arm64.s:350
#10 0x000000000048108d in runtime.call2097152 () at runtime/asm_arm64.s:431
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Inside sigprof
gentraceback
is called twice, the first time with the PC and SP arguments in sigprof
which fails, the second time with the vSDO PC and SP.
This seems to be the problem. If we take a look into the stack at the time:
gef➤ x/40gx 0xffc41457b560-0x8
0xffc41457b558: 0x0000ffc41457b5a8 0x000000000048108d
0xffc41457b568: 0x000000003178c8b9 0x000000000045a8d8
0xffc41457b578: 0x0000ffc41457b5a8 0x0000000000456078
0xffc41457b588: 0x0000000000000000 0x0000000000000000
0xffc41457b598: 0xffffffffffffffff 0x0000000000000000
0xffc41457b5a8: 0x0000ffc41457b5f8 0x0000000000454178
0xffc41457b5b8: 0x0000ffc41457b5f8 0x000000000045423c
0xffc41457b5c8: 0x0000ffc41457b628 0x000000000041f424
0xffc41457b5d8: 0x0010c763826714a8 0x000000000045425c
0xffc41457b5e8: 0x0010c763825a0ab9 0x0000004000001040
0xffc41457b5f8: 0x0000ffc41457b6e8 0x0000000000455dec
0xffc41457b608: 0x0000004000044000 0x0000000000000000
0xffc41457b618: 0x0000004000001040 0x000000000048108d
0xffc41457b628: 0x000000003178c90c 0x0000000000453570
0xffc41457b638: 0x0000ffc41457b668 0x00000000004509d8
0xffc41457b648: 0x0000000000000000 0x000000000048108d
0xffc41457b658: 0x000000003178cb6b 0x0010c763826714a8
0xffc41457b668: 0x0000ffc41457b698 0x0000000000456078
0xffc41457b678: 0x0000000000000000 0x0000000000000000
0xffc41457b688: 0x0000000000000001 0x0000004000580c00
We can see how the FP in both SP (0xffc41457b560) and vDSO SP (0xffc41457b580) point to the same next frame (0x0000ffc41457b5a8). Following the FP would result in the correct stack trace. From SP: PC=0x48108d (runtime.call2097152), SP=0xffc41457b560 PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0 PC=0x455dec (runtime.schedule), SP=0xffc41457b600 PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0 PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750 PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790
From vDSO SP: PC=0x456078 (runtime.checkTimers), SP=0xffc41457b580 PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0 PC=0x455dec (runtime.schedule), SP=0xffc41457b600 PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0 PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750 PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790
Instead, funcspdelta
is used which relays on pcsp
to retrieve the stack frame size.
For PC=0x456078 this delta is 0x50, which is used to jump to PC=0x41f424 (runtime.notewakeup
). I believe this is incorrect and the root cause of the bug since this results in an incorrect stack frame.
Looking at gentraceback
’s pcbuf
contents we can see the following information:
gef➤ x/8gx 0x401880e958
0x401880e958: 0x0000000000456079 0x0000000000456071
0x401880e968: 0x0000000000456078 0x0000000000456071
0x401880e978: 0x000000000041f424 0x0000000000000000
0x401880e988: 0x0000000000000000 0x0000000000000000
I’m unsure why runtime.checkTimers
appears 4 times before runtime.notewakeup
, but it seems to confirm the stack trace corruption.
After runtime.notewakeup
the next frame has PC=0x48108d (runtime.call2097152
).
At this point, funcspdelta
returns 0, so in the traceback traversal SP = FP from this point forward (in x86, AMD64 and WASM this FP is increased just afterward, and thus the FP = SP invariant doesn’t hold there and there’s no infinite loop in these architectures).
Additionally, when the frame being traversed is the frame of a wrapper function (funcID == funcID_wrapper
), as it’s in the case we are analyzing, the loop index n
doesn’t get incremented. Combining these two (SP = FP, n not incremented) results in an infinite loop.
This bug has also happened in a production instance with go version 1.17.6 (also on linux/ARM64). We modified the runtime to:
- print some debugging information when we enter the infinite loop.
- break the loop to prevent the process from hanging.
With these changes we have been able to confirm the pattern described above (but failed to create a small reproducible case):
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228e00, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291c5, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228eea, name=crypto/x509.(*PublicKeyAlgorithm).String, entry=0x228e70, nameoff=0x28e91, args=24, deferredreturn=0x0, pcsp=0x85325, pcfile=0x85332, pcln=0x8533f, npcdata=0x3, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x22957f, name=crypto/x509.(*SystemRootsError).Unwrap, entry=0x229510, nameoff=0x28fda, args=24, deferredreturn=0x0, pcsp=0x21906, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2292d9, name=crypto/x509.(*HostnameError).Error, entry=0x2292c0, nameoff=0x28f65, args=24, deferredreturn=0x0, pcsp=0x2bcff, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291bc, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2296e7, name=crypto/x509.(*UnknownAuthorityError).Error, entry=0x2296b0, nameoff=0x2905c, args=24, deferredreturn=0x0, pcsp=0x217d9, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228df9, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 3
- Comments: 17 (9 by maintainers)
Commits related to this issue
- [release-branch.go1.16] runtime: set vdsoSP to caller's SP consistently m.vdsoSP should be set to the SP of the caller of nanotime1, instead of the SP of nanotime1 itself, which matches m.vdsoPC. Oth... — committed to golang/go by cherrymui 3 years ago
- [release-branch.go1.17] runtime: set vdsoSP to caller's SP consistently m.vdsoSP should be set to the SP of the caller of nanotime1, instead of the SP of nanotime1 itself, which matches m.vdsoPC. Oth... — committed to golang/go by cherrymui 3 years ago
I think CL https://go-review.googlesource.com/c/go/+/337590 should fix this. As this is fixed at tip, I think we can close this and open backport issues for previous releases.
In the meantime, you can try Go 1.18beta1 ( https://go.dev/dl/#go1.18beta1 ) or manually apply that patch to your toolchain, and let me know if this it still fails. Thanks.
Confirming that backporting this individual change into our go1.17.6 runtime fixes the process hangs we were observing.
Change https://golang.org/cl/380715 mentions this issue:
[release-branch.go1.17] runtime: set vdsoSP to caller's SP consistently
That is because VDSO SP is wrong. See my comment at https://github.com/golang/go/issues/50772#issuecomment-1020332457