go: runtime: defer is slow

package p

import "testing"

//go:noinline
func defers() (r int) {
	defer func() {
		r = 42
	}()
	return 0
}

func BenchmarkDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		defers()
	}
}

On my system, BenchmarkDefer takes 77.7ns/op. This issue arises from the investigation of #9704: if I remove the “defer endcgo(mp)” and instead place the call at the end of func cgocall, the benchmark in #9704 improves from 144ns/op to 63.7ns/op. (Note: we can’t simply eliminate the defer in func cgocall, as that would break defer/recover in the Go->C->Go scenario.)

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 31
  • Comments: 42 (35 by maintainers)

Most upvoted comments

Change https://golang.org/cl/171758 mentions this issue: cmd/compile,runtime: allocate defer records on the stack

I think this is done, so I am going to close it. We can always make things faster. If there are specific ideas for making defers faster, let’s do them in separate issues.

Thanks to all, especially @danscales.

> If we had more data on this, perhaps we could focus on those cases.

I’m working on a high-throughput HTTPS server which shows the cost of defer in crypto/tls and internal/poll fairly clearly in its CPU profiles. The following five defer sites account for several percent of the application’s CPU usage, split fairly evenly between each.

These particular defers are in place unconditionally, so a minimal PC-range-based compiler change as @dr2chase and @aclements discussed in September 2016 might be enough.

We’ve doubled the performance of defer for Go 1.8, but it could still be faster, so I’m moving this to Go 1.9.

Defer is much faster than it used to be. This issue is still open because there’s more we could do, but we haven’t been resting on our laurels, either.

In what situations is defer “prohibitively expensive”? If we had more data on this, perhaps we could focus on those cases.

Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn’t have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn’t need to call systemstack.

Given the current cost of defers, I think it’s acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don’t think the runtime side of this is very complicated.

This is on the list for 1.8. I’m planning to either do it myself or get someone else to do it. 😃

If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.

@dr2chase and I have been discussing an alternate approach to this that’s somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies.

The idea is to take a page out of the C++ exception handling book: open-code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact in general, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in this table, so each entry would contain at most one defer closure offset and the PC to use to look up the next defer closure offset.

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.

@danscales, the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer’s direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.

For go1.12.12, go1.13.3, and be64a19d99, I counted profile samples that have crypto/tls.(*Conn).Write on the stack (with -focus=tls...Conn..Write) and profile samples that additionally have defer-related functions on the stack (saving the prior results and adding on -focus='^runtime\.(deferproc|deferreturn|jmpdefer)$'). With go1.12.12, about 2.5% of time in crypto/tls.(*Conn).Write is spent on defer. With go1.13.3, it’s about 1.5%. With the development branch at be64a19d99, it’s down to 0.5%.

Zooming out to the application’s total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer “in” a loop in crypto/tls.(*Conn).Write. If that call were outside of the loop—or if the compiler could prove that it’s effectively outside the loop already—then CL 202340 would all but eliminate the CPU cost of defer for the application (down to roughly 0.01%).

Thank you!

@starius, interesting case.

> The speedup of using the lambda version is 1.07. Can the Go compiler detect cases where a defer’s arguments do not change (like the mutex in my code) and introduce a lambda function to get that speedup?

As you point out, it’s important that mutex doesn’t change, or else this transformation isn’t actually sound. Unfortunately, while that sort of thing is easy to detect in SSA, defer is mostly implemented in the front-end, where it’s much harder to tell.

For the defer optimization I’d like to ultimately do, though, I suspect this would fall out naturally, since the defer call would simply be transformed into a function call at the end of the function (plus some panic handling metadata), so SSA would recognize that the argument saved at the defer point is unchanged at the prologue and CSE them.

> It demonstrates almost the same speedup as using lambdas (1.06), but the more important part is that master works slower than 1.8.3. Shouldn’t it have worked faster after https://go-review.googlesource.com/c/29656/ ?

I suspect this is noise. CL 29656 was released in Go 1.8, and the defer implementation hasn’t changed significantly (perhaps at all) between 1.8 and master.

CL https://golang.org/cl/29656 mentions this issue.