go: runtime/debug: add SetCrashOutput(file *os.File)

Currently there are difficulties with nicely capturing crash output from Go programs. Crashes print to stderr, however, usually both stdout and stderr are used by programs for other purposes. While it’s not wrong to output it to stderr, it mixes two outputs together making later separation more difficult.

Capturing crashes is useful for post-mortem debugging and sending reports. Of course, due to unknown crashed program state using a custom callback is understandably dangerous.

A simpler alternative would be to allow changing file descriptor where to print crashes. In the simplest form it could look like:

package main
import "runtime/debug"

func init(){
	debug.SetCrashOutputFD(3)
}

This would allow passing in a fd for separate file, pipe to another “crash monitor” process or connect to an external server. Of course there’s a slight possibility of write taking a long time, when passing in a pipe/connection slowing down the crashing of the program.

With regards to naming it could also be SetCrashFD, SetTracebackFD or SetTracebackOutputFD.

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 25
  • Comments: 124 (117 by maintainers)

Commits related to this issue

Most upvoted comments

On further reflection, one important benefit of *os.File would be that it would let the runtime/debug package hold on to the underlying file object and keep it from being garbage collected, which in turn would keep the fd from being closed. Changing to *os.File.

OK, so it sounds like this is where we are.

We separate out crash output from user-generated and other runtime tracing prints (GODEBUG=gctrace=1). Crash output is eligible to be copied into a separate system log, but it continues to be printed to standard error as well (otherwise a crashing program would exit silently).

On all systems, debug.SetCrashFD(fd uintptr) sets the current redirection file descriptor / handle.

Additionally, x/sys/windows has a linkname backdoor to implement

// SetCrashEvent makes Go runtime output during crashes and panics
// to be delivered to the event log l using the specified event id eid.
func (l *Log) SetCrashEvent(eid uint32) {

We discussed having GODEBUG=crashfd=3, but on reflection that seems like a dangerous thing to have child processes inherit (especially if the parent doesn’t know to preserve fd 3 when starting the child, fd 3 will be some unrelated file at crash time). So let’s not do that part.

Any objections to this plan?

Unexported hooks are really problematic from an API standpoint, especially if widely used programs adopt them. That can’t possibly be the right approach here.

And hooks generally have the problem that the runtime may not be in a good enough state to run the code in the hook. It is basically impossible to do anything interesting in a signal handler these days on Unix for the same reasons. Lots of signal handlers seem to work but actually wedge the program in rare cases. I don’t want to repeat that mistake here.

It seems OK for x/sys/windows to do setup, but we really need the critical “can’t do much safely” implementation to be in package runtime, where it can be adjusted and updated as the runtime changes. What is the API that x/sys/windows would need for that?

The issue I see here is just how much we let Go programs attempt to do during a crash, in which case the runtime may be in a terrible state and really not capable of much. In particular it may not be capable of running arbitrary Go code.

Writing to a different fd is easy. Calling a reporting DLL call on Windows with a user-supplied dwEventId also seems easy. In both cases it’s just one system call. I’m not sure we need to (or can reliably) support arbitrary reporting callbacks such as code to write to an mmap’ed ring-buffer file. (Especially if you can already write to a non-ring-buffer file with the fd redirect.)

@alexbrainman says that it’s complicated to explain what the dwEventId is, but we don’t need to do that in the runtime. We just need to accept a dwEventId and pass it along to the Windows API call.

@alexbrainman, so roughly what I’m thinking is the following. However, I might also forget some internal details that could make the following difficult with Windows.

The most basic thing is to send the output to a separate file:

package main

import (
	"os"
	"runtime/debug"
)

func main() {
	f, err := os.Open("service_log.txt")
	if err != nil {
		os.Exit(1)
		return
	}

	debug.SetCrashFD(f.Fd())

	// rest of the service logic
}

The next step would be to create a watchdog sub-process:

package main

import (
	"os"
	"os/exec"
	"runtime/debug"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		os.Exit(1)
		return
	}

	// this could also self-start, but in a different mode
	// but we'll use a separate binary for now
	cmd := exec.Command("watchdog.exe") 
	cmd.Stdin = r
	// also set other flags here to ensure that the watchdog
	// doesn't immediately close together with the parent.
	cmd.Start()

	debug.SetCrashFD(w.Fd())

	// rest of the service logic
}

You could also create governors using ExtraFiles:

package main

import (
	"os"
	"os/exec"
	"runtime/debug"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		os.Exit(1)
		return
	}

	cmd := exec.Command("watchdog.exe") 
	cmd.ExtraFiles = []*os.File{w}
	cmd.Start()

	// monitor `r`
}

// The other process has `debug.SetCrashFD(3)` in init,
// with some check that `3` actually exists.

For windows, I think (named) pipes could also be used:

f, err := os.Open(`\\.\pipe\randomhexvalue`)
...
debug.SetCrashFD(f.Fd())

Like @zx2c4, I’d also be interested in getting the the runtime’s crash println statement, as they often provide critical context about a throw, which can’t do formatting.

In my view, there are three different types of println calls:

  • prints immediately prior to a throw. In my view, these are effectively part of the throw, providing context about the crash, which may be critical to understanding the crash. To be honest, I’m not sure why one wouldn’t want this, beyond technical complexity.
  • Runtime prints unrelated to crashing (e.g., gctrace, schedtrace). I personally don’t have a preference either way here.
  • User-generated prints. These probably shouldn’t be included, but I don’t think I’ve ever seen one in practice, so if they are sufficiently rare, I don’t think it would be awful to include them.

Adding that it’s possible to write programs that work both as a service and as a regular command line tool.

I don’t think it’s a good idea to have the runtime call ReportEvent on its own. Most apps do not use the eventlog. Some services use the eventlog, but not all services use it. The Go api for the eventlog is not in the Go runtime; it’s in a helper auxilary module of x/sys/windows. Finally, my own services do not use the eventlog; in my case, I have a file-backed mmap’d ring buffer for this stuff where I simply copy the bytes.

Hence, I’m learning toward CL278792 as the best general solution, combined with https://github.com/golang/go/issues/42888#issuecomment-763638252 .

Change https://go.dev/cl/548121 mentions this issue: runtime/debug: add Example use of SetCrashOutput for a watchdog

Change https://go.dev/cl/547978 mentions this issue: runtime/debug: SetCrashOutput sets the FD for fatal panics

What’s the status on this? Proposal has been accepted but no patch yet, correct? Is anybody working on that? We have a Windows service executable which once in a while crashes, we suspect concurrent map writes, but without the traces we are pretty much lost how and where this happens. So a way to redirect that to a file would be greatly appreciated!

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

Instead of debug.SetCrashFD(fd uintptr) it would be debug.SetCrashFile(f *os.File), because of the concerns raised by @alexbrainman in https://github.com/golang/go/issues/42888#issuecomment-749352457.

Unless there’s a reason we don’t want to use *os.File there.

It is effectively a global defer, so I can’t think of any panics it wouldn’t work on.

Exactly. Indeed, it is limited, but to us there was a whole class of crashes we could now properly inspect.

Although I do think a better way to have a generic way to go about it would be to leverage a native crash reported, such as breakpad or crashpad, but there are some blocking issues for that on ARM64.

@steeve @zx2c4 I believe that https://github.com/znly/go/commit/6af49debb0b604c1bfe209d5801f985afb327c72 should work for all panics, but only panics. It is effectively a global defer, so I can’t think of any panics it wouldn’t work on.

However it would certainly not work for throws, which this issue would like to capture as well. If the runtime is in a corrupt state, I’m not sure it would even be able to do the interface conversion to pass an error to the crash function. The contents of onPanic allocating a buffer and calling complex functions like runtime.Stacks and fmt.Println are also examples of the kinds of things that the runtime may not be able to support while throwing.

@zx2c4 SetPanicNotifier is an example of a callback that, as @rsc says, has the problem that the runtime may not be in a good enough state to run the code in the hook.

For what is worth, we have implemented it at https://github.com/znly/go/commit/6af49debb0b604c1bfe209d5801f985afb327c72

It works well for us and it’s pretty simple:

    import (
        "fmt",
        "runtime"
    )

    func onPanic(err interface{}) {
    	   fmt.Println("panic reason:", err)
    	   dump := make([]byte, 1*1024*1024) // 1 MB buffer
    	   runtime.Stack(dump, true)
    	   fmt.Println(string(dump))
    }

    func main() {
    	   runtime.SetGlobalPanicHandler(onPanic)

    	   // provoke a panic from a SIGNAL
    	   var i *int
    	   *i = 42

    	   println("program exit normally")
    }
  1. Change the API to accept *os.File instead of fd.

@egonelbre I prefer your option 3. It makes new API usage obvious. And I could use approach similar to https://github.com/golang/go/issues/42888#issuecomment-737616766 to write Windows Service crash into a file.

Thank you.

Alex

I hate how complex that logic is for such a fundamental operation, but maybe it would make sense to do something similar for Windows services and have their writes go straight to the event log? IMO, in both cases this should be the OS’s job, but it’s not doing it.

Sticking that into the Go runtime itself is not a good idea. EventLog is complicated – more so than the Android logger functions – and as such will require configuration. This is the kind of thing best suited to go in x/sys/windows/svc or similar. The x/sys/windows package already uses //go:linkname for necessary faculties exposed by the runtime. Adding one for setting the logger callback function would be very sensible. I’ll send a patch for that.

Does ExtraFiles work on Windows?

It does not.

I wonder why not. StartProcess can pass Files just fine. Do you know what’s different about ExtraFiles? Seems like that’s a bug we should fix.

Oh, you know if all you’re trying to do is dup2 over stderr on Windows, you can already hack the panic and println output fd like this:

	f, _ := os.Create(`C:\helloworld.txt`)
	windows.SetStdHandle(^uint32(11), windows.Handle(f.Fd()))
	panic("oh nose!")

That works reasonably well, even inside of services:

image


Changing the fd isn’t particularly difficult, as you can see there, which is why I keep poking at, “what about running custom logger functions?”, which is the much more interesting goal (to me, at least).

The first process is the monitor process and opens the FD, then starts the second process with the GODEBUG environment set. Does this not work for Windows services?

Yes, it could work. The main drawback is that windows services need to respond to control messages. https://pkg.go.dev/golang.org/x/sys/windows/svc#Handler

This would mean that the governor process would need to delegate all those messages to the subprocess, which would introduce an additional complication. Although, there might be a way to make subprocess handle the service messages directly, I haven’t seen it.

@zx2c4 I’m not fond of documenting a callback to use with go:linkname either. I tend to think that people who operate at that level can customize their Go distribution.

Yea, documenting it is probably the wrong thing to do. But having something in there akin to nanotime would be “very nice”…and would allow me to implement something for Go’s Windows service package that uses it to ship things to eventlog automatically.

Anyway, I realize pushing this too hard is an loosing battle, for good reason. But if somehow the implementation happened to be structured in a way that made it coincidentally possible to go:linkname it from elsewhere, I would be very happy.

@zx2c4 I’m not fond of documenting a callback to use with go:linkname either. I tend to think that people who operate at that level can customize their Go distribution.

Should we just have GODEBUG=debugfd=3?

/cc @aclements

@egonelbre how do you propose to use your API on Windows?

I agree. I wanted this feature to capture crash dumps of a Windows service.

/cc @zx2c4

Alex