go: runtime/debug: add SetCrashOutput(file *os.File)
Currently there are difficulties with nicely capturing crash output from Go programs. Crashes print to stderr, however, usually both stdout and stderr are used by programs for other purposes. While it’s not wrong to output it to stderr, it mixes two outputs together making later separation more difficult.
Capturing crashes is useful for post-mortem debugging and sending reports. Of course, due to unknown crashed program state using a custom callback is understandably dangerous.
A simpler alternative would be to allow changing file descriptor where to print crashes. In the simplest form it could look like:
package main
import "runtime/debug"
func init(){
debug.SetCrashOutputFD(3)
}
This would allow passing in a fd
for separate file, pipe to another “crash monitor” process or connect to an external server. Of course there’s a slight possibility of write
taking a long time, when passing in a pipe/connection slowing down the crashing of the program.
With regards to naming it could also be SetCrashFD
, SetTracebackFD
or SetTracebackOutputFD
.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 25
- Comments: 124 (117 by maintainers)
Commits related to this issue
- ringlogger: hook into global panic writer This is a grotesque hack, and hopefully upstream Go will provide a nicer way of doing this, but already it seems quite adept at catching panics. See https://... — committed to WireGuard/wireguard-windows by zx2c4 4 years ago
- ringlogger: hook into global panic writer This is a grotesque hack, and hopefully upstream Go will provide a nicer way of doing this, but already it seems quite adept at catching panics. See https://... — committed to WireGuard/wireguard-windows by zx2c4 4 years ago
- ringlogger: hook into global panic writer This is a grotesque hack, and hopefully upstream Go will provide a nicer way of doing this, but already it seems quite adept at catching panics. See https://... — committed to WireGuard/wireguard-windows by zx2c4 4 years ago
- ringlogger: hook into global panic writer This is a grotesque hack, and hopefully upstream Go will provide a nicer way of doing this, but already it seems quite adept at catching panics. See https://... — committed to WireGuard/wireguard-windows by zx2c4 4 years ago
- ringlogger: hook into global panic writer This is a grotesque hack, and hopefully upstream Go will provide a nicer way of doing this, but already it seems quite adept at catching panics. See https://... — committed to WireGuard/wireguard-windows by zx2c4 4 years ago
- logpolicy: only log panics when running under systemd Given that https://github.com/golang/go/issues/42888 is coming, this catches most practical panics without interfering in our development environ... — committed to tailscale/tailscale by crawshaw 3 years ago
- logpolicy: only log panics when running under systemd Given that https://github.com/golang/go/issues/42888 is coming, this catches most practical panics without interfering in our development environ... — committed to tailscale/tailscale by crawshaw 3 years ago
- logpolicy: only log panics when running under systemd Given that https://github.com/golang/go/issues/42888 is coming, this catches most practical panics without interfering in our development environ... — committed to tailscale/tailscale by crawshaw 3 years ago
- runtime: allow builtin write function to be redirected with function pointer The x/sys/windows package currently uses go:linkname for other facilities inside of runtime that are not suitable to be ex... — committed to golang/go by zx2c4 4 years ago
- gopls/internal/lsp/protocol: report panics in RPC goroutines This change adds recover() + bug.Reportf to the generated dispatch routines for the server and client, so that we learn through telemetry ... — committed to golang/tools by adonovan 7 months ago
On further reflection, one important benefit of
*os.File
would be that it would let the runtime/debug package hold on to the underlying file object and keep it from being garbage collected, which in turn would keep the fd from being closed. Changing to *os.File.OK, so it sounds like this is where we are.
We separate out crash output from user-generated and other runtime tracing prints (GODEBUG=gctrace=1). Crash output is eligible to be copied into a separate system log, but it continues to be printed to standard error as well (otherwise a crashing program would exit silently).
On all systems, debug.SetCrashFD(fd uintptr) sets the current redirection file descriptor / handle.
Additionally, x/sys/windows has a linkname backdoor to implement
We discussed having GODEBUG=crashfd=3, but on reflection that seems like a dangerous thing to have child processes inherit (especially if the parent doesn’t know to preserve fd 3 when starting the child, fd 3 will be some unrelated file at crash time). So let’s not do that part.
Any objections to this plan?
Unexported hooks are really problematic from an API standpoint, especially if widely used programs adopt them. That can’t possibly be the right approach here.
And hooks generally have the problem that the runtime may not be in a good enough state to run the code in the hook. It is basically impossible to do anything interesting in a signal handler these days on Unix for the same reasons. Lots of signal handlers seem to work but actually wedge the program in rare cases. I don’t want to repeat that mistake here.
It seems OK for x/sys/windows to do setup, but we really need the critical “can’t do much safely” implementation to be in package runtime, where it can be adjusted and updated as the runtime changes. What is the API that x/sys/windows would need for that?
The issue I see here is just how much we let Go programs attempt to do during a crash, in which case the runtime may be in a terrible state and really not capable of much. In particular it may not be capable of running arbitrary Go code.
Writing to a different fd is easy. Calling a reporting DLL call on Windows with a user-supplied dwEventId also seems easy. In both cases it’s just one system call. I’m not sure we need to (or can reliably) support arbitrary reporting callbacks such as code to write to an mmap’ed ring-buffer file. (Especially if you can already write to a non-ring-buffer file with the fd redirect.)
@alexbrainman says that it’s complicated to explain what the dwEventId is, but we don’t need to do that in the runtime. We just need to accept a dwEventId and pass it along to the Windows API call.
@alexbrainman, so roughly what I’m thinking is the following. However, I might also forget some internal details that could make the following difficult with Windows.
The most basic thing is to send the output to a separate file:
The next step would be to create a watchdog sub-process:
You could also create governors using ExtraFiles:
For windows, I think (named) pipes could also be used:
Like @zx2c4, I’d also be interested in getting the the runtime’s crash
println
statement, as they often provide critical context about a throw, which can’t do formatting.In my view, there are three different types of
println
calls:print
s immediately prior to a throw. In my view, these are effectively part of the throw, providing context about the crash, which may be critical to understanding the crash. To be honest, I’m not sure why one wouldn’t want this, beyond technical complexity.print
s unrelated to crashing (e.g.,gctrace
,schedtrace
). I personally don’t have a preference either way here.print
s. These probably shouldn’t be included, but I don’t think I’ve ever seen one in practice, so if they are sufficiently rare, I don’t think it would be awful to include them.Adding that it’s possible to write programs that work both as a service and as a regular command line tool.
I don’t think it’s a good idea to have the runtime call ReportEvent on its own. Most apps do not use the eventlog. Some services use the eventlog, but not all services use it. The Go api for the eventlog is not in the Go runtime; it’s in a helper auxilary module of x/sys/windows. Finally, my own services do not use the eventlog; in my case, I have a file-backed mmap’d ring buffer for this stuff where I simply copy the bytes.
Hence, I’m learning toward CL278792 as the best general solution, combined with https://github.com/golang/go/issues/42888#issuecomment-763638252 .
Change https://go.dev/cl/548121 mentions this issue:
runtime/debug: add Example use of SetCrashOutput for a watchdog
Change https://go.dev/cl/547978 mentions this issue:
runtime/debug: SetCrashOutput sets the FD for fatal panics
What’s the status on this? Proposal has been accepted but no patch yet, correct? Is anybody working on that? We have a Windows service executable which once in a while crashes, we suspect concurrent map writes, but without the traces we are pretty much lost how and where this happens. So a way to redirect that to a file would be greatly appreciated!
No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group
Instead of
debug.SetCrashFD(fd uintptr)
it would bedebug.SetCrashFile(f *os.File)
, because of the concerns raised by @alexbrainman in https://github.com/golang/go/issues/42888#issuecomment-749352457.Unless there’s a reason we don’t want to use
*os.File
there.Exactly. Indeed, it is limited, but to us there was a whole class of crashes we could now properly inspect.
Although I do think a better way to have a generic way to go about it would be to leverage a native crash reported, such as breakpad or crashpad, but there are some blocking issues for that on ARM64.
@steeve @zx2c4 I believe that https://github.com/znly/go/commit/6af49debb0b604c1bfe209d5801f985afb327c72 should work for all panics, but only panics. It is effectively a global defer, so I can’t think of any panics it wouldn’t work on.
However it would certainly not work for throws, which this issue would like to capture as well. If the runtime is in a corrupt state, I’m not sure it would even be able to do the interface conversion to pass an error to the crash function. The contents of
onPanic
allocating a buffer and calling complex functions likeruntime.Stacks
andfmt.Println
are also examples of the kinds of things that the runtime may not be able to support while throwing.@zx2c4
SetPanicNotifier
is an example of a callback that, as @rsc says, has the problem that the runtime may not be in a good enough state to run the code in the hook.For what is worth, we have implemented it at https://github.com/znly/go/commit/6af49debb0b604c1bfe209d5801f985afb327c72
It works well for us and it’s pretty simple:
@egonelbre I prefer your option 3. It makes new API usage obvious. And I could use approach similar to https://github.com/golang/go/issues/42888#issuecomment-737616766 to write Windows Service crash into a file.
Thank you.
Alex
Sticking that into the Go runtime itself is not a good idea. EventLog is complicated – more so than the Android logger functions – and as such will require configuration. This is the kind of thing best suited to go in x/sys/windows/svc or similar. The x/sys/windows package already uses //go:linkname for necessary faculties exposed by the runtime. Adding one for setting the logger callback function would be very sensible. I’ll send a patch for that.
I wonder why not. StartProcess can pass Files just fine. Do you know what’s different about ExtraFiles? Seems like that’s a bug we should fix.
Oh, you know if all you’re trying to do is
dup2
over stderr on Windows, you can already hack the panic and println output fd like this:That works reasonably well, even inside of services:
Changing the fd isn’t particularly difficult, as you can see there, which is why I keep poking at, “what about running custom logger functions?”, which is the much more interesting goal (to me, at least).
Yes, it could work. The main drawback is that windows services need to respond to control messages. https://pkg.go.dev/golang.org/x/sys/windows/svc#Handler
This would mean that the governor process would need to delegate all those messages to the subprocess, which would introduce an additional complication. Although, there might be a way to make subprocess handle the service messages directly, I haven’t seen it.
Yea, documenting it is probably the wrong thing to do. But having something in there akin to nanotime would be “very nice”…and would allow me to implement something for Go’s Windows service package that uses it to ship things to eventlog automatically.
Anyway, I realize pushing this too hard is an loosing battle, for good reason. But if somehow the implementation happened to be structured in a way that made it coincidentally possible to go:linkname it from elsewhere, I would be very happy.
@zx2c4 I’m not fond of documenting a callback to use with
go:linkname
either. I tend to think that people who operate at that level can customize their Go distribution.Should we just have GODEBUG=debugfd=3?
/cc @aclements
@egonelbre how do you propose to use your API on Windows?
I agree. I wanted this feature to capture crash dumps of a Windows service.
/cc @zx2c4
Alex