go: runtime: defer is slow

package p

import "testing"

//go:noinline
func defers() (r int) {
	defer func() {
		r = 42
	}()
	return 0
}

func BenchmarkDefer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		defers()
	}
}

On my system, BenchmarkDefer takes 77.7ns/op. This issue arises from the investigation of #9704: if I remove the “defer endcgo(mp)” and instead place the call at the end of func cgocall, the benchmark in #9704 improves from 144ns/op to 63.7ns/op. (Note: we can’t simply eliminate the defer in func cgocall, as that would break defer/recover in the Go->C->Go scenario.)

About this issue

  • State: closed
  • Created 8 years ago
  • Reactions: 31
  • Comments: 42 (35 by maintainers)

Most upvoted comments

Change https://golang.org/cl/171758 mentions this issue: cmd/compile,runtime: allocate defer records on the stack

I think this is done, so I am going to close it. We can always make things faster. If there are specific ideas for making defers faster, let’s do them in separate issues.

Thanks to all, especially @danscales.

> If we had more data on this, perhaps we could focus on those cases.

I’m working on a high-throughput HTTPS server which shows the cost of defer in crypto/tls and internal/poll fairly clearly in its CPU profiles. The following five defer sites account for several percent of the application’s CPU usage, split fairly evenly between each.

These particular defers are in place unconditionally, so a minimal PC-range-based compiler change as @dr2chase and @aclements discussed in September 2016 might be enough.

We’ve doubled the performance of defer for Go 1.8, but it could still be faster, so I’m moving this to Go 1.9.

Defer is much faster than it used to be. This issue is still open because there’s more we could do, but we haven’t been resting on our laurels, either.

In what situations is defer “prohibitively expensive”? If we had more data on this, perhaps we could focus on those cases.

Separate from stack allocation, we should also consider special-casing defers with no arguments, as that is a fairly common case (about half of the defer calls in the standard library). Because the no-argument case doesn’t have to worry about the arguments on the stack, it can use a simpler version of deferproc, one that doesn’t need to call systemstack.

Given the current cost of defers, I think it’s acceptable to have two defer allocation mechanisms if that addresses the problem. And I actually don’t think the runtime side of this is very complicated.

This is on the list for 1.8. I’m planning to either do it myself or get someone else to do it. 😃

If it turns out we need to simplify things, we could limit this to defers with at most one (possibly implicit) argument, which would handle the cgo case as well as the mutex unlock case. Another possible simplification would be to only stack-allocate defers in functions where all defers can be stack allocated, which would probably simplify creating an efficient prologue.

@dr2chase and I have been discussing an alternate approach to this that’s somewhat less general than stack-allocated defers but should have essentially zero overhead versus a function call when it applies.

The idea is to take a page out of the C++ exception handling book: open-code the defer execution, and use a PC value table to figure out where the defer code is when handling a panic.

Specifically, if the set of defers can be statically determined at every PC in a function, then the compiler would turn those defers into closures built on the stack and generate code at every exit point to directly call the appropriate defer closures for that exit point. In the common case, then, there would be no deferreturn logic at all and the defer execution would look essentially like hand-expanding the defer (like what CL 29379 did).

To keep this working with panics, the compiler would generate a PC value table for every function where this optimization applies that logically records, for every PC, where in the stack frame to find the set of defer closures to run for that PC. The actual encoding of this table could be quite compact in general, since large runs of PCs will have the same set of defer closures, and we could encode the tree-like structure of the set of defers to run directly in this table, so each entry would contain at most one defer closure offset and the PC to use to look up the next defer closure offset.

When panic walks the stack, it would keep an eye on both this PC value table and the defer stack. A given frame could have either defers on the stack or a defer offset PC value table, but not both. If a frame has a defer offset PC value table, panic would use the table to find the defer closures and call them.

@danscales, the results for https://golang.org/cl/202340 look great: it eliminates about 80% of defer’s direct CPU cost in the application I described. A small change to crypto/tls would fix the remaining case.

For go1.12.12, go1.13.3, and be64a19d99, I counted profile samples that have crypto/tls.(*Conn).Write on the stack (with -focus=tls...Conn..Write) and profile samples that additionally have defer-related functions on the stack (saving the prior results and adding on -focus='^runtime\.(deferproc|deferreturn|jmpdefer)$'). With go1.12.12, about 2.5% of time in crypto/tls.(*Conn).Write is spent on defer. With go1.13.3, it’s about 1.5%. With the development branch at be64a19d99, it’s down to 0.5%.

Zooming out to the application’s total CPU spend on defer-related functions, more than 90% of the samples are caused by a single use of defer “in” a loop in crypto/tls.(*Conn).Write. If that call were outside of the loop—or if the compiler could prove that it’s effectively outside the loop already—then CL 202340 would all but eliminate the CPU cost of defer for the application (down to roughly 0.01%).

Thank you!

@starius, interesting case.

> The speedup of using the lambda version is 1.07. Can the Go compiler detect cases where a defer’s arguments do not change (like the mutex in my code) and introduce a lambda function to get that speedup?

As you point out, it’s important that mutex doesn’t change, or else this transformation isn’t actually sound. Unfortunately, while that sort of thing is easy to detect in SSA, defer is mostly implemented in the front-end, where it’s much harder to tell.

For the defer optimization I’d like to ultimately do, though, I suspect this would fall out naturally, since the defer call would simply be transformed into a function call at the end of the function (plus some panic handling metadata), so SSA would recognize that the argument saved at the defer point is unchanged at the prologue and CSE them.

> It demonstrates almost the same speedup as using lambdas (1.06), but the more important part is that master works slower than 1.8.3. Shouldn’t it have worked faster after https://go-review.googlesource.com/c/29656/ ?

I suspect this is noise. CL 29656 was released in Go 1.8, and the defer implementation hasn’t changed significantly (perhaps at all) between 1.8 and master.

CL https://golang.org/cl/29656 mentions this issue.