go: runtime: select on a shared channel is slow with many Ps

@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem ended up being the addition of net/http.Server.Shutdown, which is currently implemented by closing a channel that all the open connections select on.

(Details in https://github.com/kubernetes/kubernetes/issues/45216 and https://github.com/golang/go/issues/20302)

But the short summary is:

package selbench

import "testing"

func BenchmarkSelectShared(b *testing.B) {
	idleShared := make(chan struct{})
	b.RunParallel(func(pb *testing.PB) {
		ch := make(chan int, 1)
		for pb.Next() {
			select {
			case ch <- 1:
			case <-ch:
			case <-idleShared:
			}
		}
	})
}

func BenchmarkSelectPrivate(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		ch := make(chan int, 1)
		idlePrivate := make(chan struct{})
		for pb.Next() {
			select {
			case ch <- 1:
			case <-ch:
			case <-idlePrivate:
			}
		}
	})
}

Note that the idle channels above are never closed and are never selectable, but the other two cases are, and they stay busy.

But when the channel is private to the goroutine (uncontended), things are fast. When it’s a shared channel, things are slow.

$ go test -v -bench=Select -cpu=1,2,4,8,16,32,64
BenchmarkSelectShared           10000000               194 ns/op
BenchmarkSelectShared-2         10000000               147 ns/op
BenchmarkSelectShared-4          5000000               395 ns/op
BenchmarkSelectShared-8          3000000               449 ns/op
BenchmarkSelectShared-16         5000000               354 ns/op
BenchmarkSelectShared-32         5000000               320 ns/op
BenchmarkSelectShared-64         5000000               296 ns/op
BenchmarkSelectPrivate          10000000               192 ns/op
BenchmarkSelectPrivate-2        20000000                98.0 ns/op
BenchmarkSelectPrivate-4        30000000                49.3 ns/op
BenchmarkSelectPrivate-8        50000000                25.5 ns/op
BenchmarkSelectPrivate-16       100000000               13.8 ns/op
BenchmarkSelectPrivate-32       200000000                7.07 ns/op
BenchmarkSelectPrivate-64       200000000                6.31 ns/op

Are there any optimizations to be done here?

We’ll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.
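
For reference, the shape of one such workaround, as a minimal sketch (the Server/register/unregister names here are hypothetical, and this is not the actual net/http fix): give every connection its own private done channel and have Shutdown close them one by one, so that no select ever blocks on a shared channel.

package srv

import "sync"

// Server tracks one private done channel per connection instead of a
// single shared shutdown channel that every connection selects on.
type Server struct {
	mu    sync.Mutex
	conns map[chan struct{}]struct{}
}

// register creates and records a per-connection done channel.
func (s *Server) register() chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.conns == nil {
		s.conns = make(map[chan struct{}]struct{})
	}
	done := make(chan struct{})
	s.conns[done] = struct{}{}
	return done
}

// unregister forgets a connection's channel once the connection exits.
func (s *Server) unregister(done chan struct{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.conns, done)
}

// Shutdown closes each private channel; every connection's select then
// fires on a channel that no other goroutine is selecting on.
func (s *Server) Shutdown() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for done := range s.conns {
		close(done)
	}
	s.conns = nil
}

A connection’s serve loop would do done := s.register(), defer s.unregister(done), and include case <-done: in its select; each select then only touches channels private to that goroutine.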

/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky

Most upvoted comments

@dvyukov, see also the original benchmark at the top. But ignoring specific benchmarks (flawed or not), do you have any thoughts?

I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared, a simple memory load can take up to 100ns, and the mutex makes it all worse.

The numbers need to be interpreted carefully. We divide the work across multiple threads, so this:

BenchmarkSelectContended/workers=1000         	  200000	       583 ns/op
BenchmarkSelectContended/workers=1000-64      	  200000	       708 ns/op

rescaled to CPU time per op (multiplying ns/op by the core count, and assuming that the machine does have 64 cores) is actually:

BenchmarkSelectContended/workers=1000         	  200000	       583 ns/op
BenchmarkSelectContended/workers=1000-64      	  200000	     45312 ns/op

so in some cases we see up to a 100x slowdown.

There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory, plus some logic to detect when a chan actually falls into this case and dynamically switch the algorithm to a different scheme, and then switch it back if we see that we misguessed (e.g. sends appear on what we thought was a close-notification chan). I don’t think it’s worth it. At least we have lots of lower-hanging fruit.

Now, what I think we should do to improve things to some degree is:

  1. Take the finer-grained locking in select logic from https://codereview.appspot.com/12544043 and apply it. It’s orthogonal to the lock-free stuff. It will considerably reduce the duration of the critical section.
  2. Optimize runtime mutex more. Off the top of my head: apply the so-called “thundering herd” optimization; don’t call futex wake when there is nobody to wake; tune the spinning logic; maybe don’t wake waiters when there are spinning threads.
  3. We may try the so-called combining locks (https://software.intel.com/en-us/blogs/2013/02/22/combineraggregator-synchronization-primitive); see the sketch after this list. Sometimes they are useful to somewhat improve the behavior of highly contended data structures.
  4. Do some spot optimizations in the select, especially inside critical sections.
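
For item 3, a minimal combining-lock sketch in Go (an illustration of the technique from the linked article, with made-up names; nothing like this exists in the runtime): waiters publish their operation to a lock-free stack, and whichever goroutine wins the lock drains the stack and executes every pending operation inside a single critical section.

package combiner

import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

// request is one pending operation published by a waiter.
type request struct {
	op   func()         // operation to execute under the lock
	next unsafe.Pointer // *request; stack link
	done uint32         // set to 1 once op has run
}

// Combiner batches operations from many goroutines: all operations
// pending at the moment the lock is taken run in one critical section.
type Combiner struct {
	head unsafe.Pointer // *request; top of the pending stack
	lock uint32         // 0 = free, 1 = held
}

// Do publishes op and returns once some combiner (possibly this
// goroutine) has executed it.
func (c *Combiner) Do(op func()) {
	r := &request{op: op}
	for { // push r onto the lock-free stack
		old := atomic.LoadPointer(&c.head)
		r.next = old
		if atomic.CompareAndSwapPointer(&c.head, old, unsafe.Pointer(r)) {
			break
		}
	}
	for atomic.LoadUint32(&r.done) == 0 {
		if atomic.CompareAndSwapUint32(&c.lock, 0, 1) {
			// We are the combiner: take the whole stack and run
			// every pending op, including our own.
			q := (*request)(atomic.SwapPointer(&c.head, nil))
			for q != nil {
				next := (*request)(q.next)
				q.op()
				atomic.StoreUint32(&q.done, 1)
				q = next
			}
			atomic.StoreUint32(&c.lock, 0)
		} else {
			runtime.Gosched() // another goroutine is combining
		}
	}
}

The win under contention is that a batch of N pending operations pays for one lock handoff instead of N.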

And a more realistic benchmark for this case would add a bunch of local non-ready chans to the select, because that’s what http2 does, and these additional chans significantly prolong the critical section in select.
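
A sketch of such a benchmark, meant to sit next to the ones at the top (the number and shape of the extra channels are made up, chosen only to mimic http2’s per-connection selects):

func BenchmarkSelectSharedManyCases(b *testing.B) {
	idleShared := make(chan struct{})
	b.RunParallel(func(pb *testing.PB) {
		ch := make(chan int, 1)
		// Local channels that are never ready, standing in for
		// http2's per-stream control channels; each one lengthens
		// the locked phase of the select.
		var quiet [4]chan struct{}
		for i := range quiet {
			quiet[i] = make(chan struct{})
		}
		for pb.Next() {
			select {
			case ch <- 1:
			case <-ch:
			case <-idleShared:
			case <-quiet[0]:
			case <-quiet[1]:
			case <-quiet[2]:
			case <-quiet[3]:
			}
		}
	})
}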

Honestly I don’t see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something.
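
To make the distinction concrete, a toy sketch of that lock-free pre-scan (chanState is a made-up stand-in for the runtime’s real channel structure, and readiness here is only a hint that the locked slow path would still have to confirm):

package pollfirst

import "sync/atomic"

// chanState stands in for the fields a select would peek at.
type chanState struct {
	qcount uint32 // number of buffered elements
	closed uint32 // nonzero once the channel is closed
}

// recvReady reports whether a receive might proceed, using only atomic
// loads; the state can change the instant after the load, so a true
// result must be re-checked under the channel lock.
func recvReady(c *chanState) bool {
	return atomic.LoadUint32(&c.qcount) > 0 ||
		atomic.LoadUint32(&c.closed) != 0
}

// pollSelect scans every case without taking any locks. Only when some
// case looks ready would the caller fall into the expensive
// lock-all-channels path (or block, after enough failed scans).
func pollSelect(cases []*chanState) (int, bool) {
	for i, c := range cases {
		if recvReady(c) {
			return i, true
		}
	}
	return -1, false
}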

The root problem is lock contention in runtime.sellock and runtime.selunlock. One idea is to replace runtime.mutex with a scalable lock, such as an MCS lock.
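
To illustrate that suggestion, a minimal MCS queue-lock sketch, written as ordinary Go for readability (purely illustrative; the real runtime.mutex lives inside the runtime and could not literally use code like this): each waiter spins on a flag in its own queue node, so a lock handoff touches one waiter’s cache line instead of a single word shared by all spinners.

package mcs

import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

// node is the per-waiter queue element; each waiter spins only on its
// own node's locked flag.
type node struct {
	next   unsafe.Pointer // *node, set by our successor
	locked uint32         // 1 while this waiter must keep waiting
}

// Lock is an MCS queue lock: tail points at the last waiter.
type Lock struct {
	tail unsafe.Pointer // *node
}

// Acquire enqueues n and waits until the predecessor hands over the
// lock. The caller must pass the same n to Release.
func (l *Lock) Acquire(n *node) {
	n.next = nil
	n.locked = 1
	// Swap ourselves in as the new tail; prev is our predecessor.
	prev := (*node)(atomic.SwapPointer(&l.tail, unsafe.Pointer(n)))
	if prev == nil {
		return // queue was empty: we own the lock immediately
	}
	atomic.StorePointer(&prev.next, unsafe.Pointer(n))
	for atomic.LoadUint32(&n.locked) == 1 {
		runtime.Gosched() // spin locally on our own node
	}
}

// Release passes the lock to the next waiter, if there is one.
func (l *Lock) Release(n *node) {
	next := (*node)(atomic.LoadPointer(&n.next))
	if next == nil {
		// No visible successor: try to reset the queue to empty.
		if atomic.CompareAndSwapPointer(&l.tail, unsafe.Pointer(n), nil) {
			return
		}
		// A successor swapped tail but hasn't linked itself yet.
		for next == nil {
			runtime.Gosched()
			next = (*node)(atomic.LoadPointer(&n.next))
		}
	}
	atomic.StoreUint32(&next.locked, 0) // hand over the lock
}

A caller wraps a critical section as var n node; l.Acquire(&n); ...; l.Release(&n), passing the same node to both calls.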