go: runtime: select on a shared channel is slow with many Ps
@tombergan and I have been debugging a severe performance regression that Kubernetes observed when switching from Go 1.7 to Go 1.8. The problem ended up being the addition of net/http.Server.Shutdown, which is currently implemented by closing a channel that all of the open connections select on.
(Details in https://github.com/kubernetes/kubernetes/issues/45216 and https://github.com/golang/go/issues/20302)
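For context, here is a minimal sketch of that pattern (illustrative only, not the actual net/http code; serveConn, readCh and doneCh are made-up names): one shared channel is closed once to signal shutdown, and every open connection's serve loop selects on it alongside its per-connection channels.

    package sketch

    // serveConn is a toy stand-in for a per-connection serve loop. Every
    // connection selects on the same shared doneCh, which shutdown closes.
    func serveConn(readCh <-chan int, doneCh <-chan struct{}) {
        for {
            select {
            case <-readCh: // per-connection work
                // handle one request
            case <-doneCh: // shared close-notification channel
                return
            }
        }
    }

Closing doneCh wakes every connection at once, which is convenient, but it also means every select registers on that one write-shared channel.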
But the short summary is:
package selbench

import "testing"

func BenchmarkSelectShared(b *testing.B) {
    // One idle channel shared by all goroutines.
    idleShared := make(chan struct{})
    b.RunParallel(func(pb *testing.PB) {
        ch := make(chan int, 1)
        for pb.Next() {
            select {
            case ch <- 1:
            case <-ch:
            case <-idleShared:
            }
        }
    })
}

func BenchmarkSelectPrivate(b *testing.B) {
    b.RunParallel(func(pb *testing.PB) {
        ch := make(chan int, 1)
        // One idle channel per goroutine.
        idlePrivate := make(chan struct{})
        for pb.Next() {
            select {
            case ch <- 1:
            case <-ch:
            case <-idlePrivate:
            }
        }
    })
}
Note that the idle channels above are never closed and never become selectable; the other two cases are selectable and stay busy. When the idle channel is private to the goroutine (uncontended), things are fast. When it's a shared channel, things are slow.
$ go test -v -bench=Select -cpu=1,2,4,8,16,32,64
BenchmarkSelectShared        10000000     194  ns/op
BenchmarkSelectShared-2      10000000     147  ns/op
BenchmarkSelectShared-4       5000000     395  ns/op
BenchmarkSelectShared-8       3000000     449  ns/op
BenchmarkSelectShared-16      5000000     354  ns/op
BenchmarkSelectShared-32      5000000     320  ns/op
BenchmarkSelectShared-64      5000000     296  ns/op
BenchmarkSelectPrivate       10000000     192  ns/op
BenchmarkSelectPrivate-2     20000000      98.0 ns/op
BenchmarkSelectPrivate-4     30000000      49.3 ns/op
BenchmarkSelectPrivate-8     50000000      25.5 ns/op
BenchmarkSelectPrivate-16   100000000      13.8 ns/op
BenchmarkSelectPrivate-32   200000000       7.07 ns/op
BenchmarkSelectPrivate-64   200000000       6.31 ns/op
Are there any optimizations to be done here?
We’ll work around it in the meantime and generally keep this in mind as an anti-pattern for performance.
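One way to avoid the anti-pattern (purely illustrative, not necessarily how net/http ended up working around it; the server type and its methods here are hypothetical) is to give each connection its own done channel and track them in a set, so no single channel is selected on by every goroutine:

    package sketch

    import "sync"

    // Hypothetical sketch: per-connection done channels instead of one
    // channel shared by every connection's select.
    type server struct {
        mu    sync.Mutex
        conns map[chan struct{}]struct{}
    }

    func (s *server) register() chan struct{} {
        done := make(chan struct{})
        s.mu.Lock()
        if s.conns == nil {
            s.conns = make(map[chan struct{}]struct{})
        }
        s.conns[done] = struct{}{}
        s.mu.Unlock()
        return done // the connection selects only on its own channel
    }

    func (s *server) shutdown() {
        s.mu.Lock()
        for done := range s.conns {
            close(done) // wake each connection individually
        }
        s.conns = nil
        s.mu.Unlock()
    }

The shutdown path does more work, but the hot path (each connection's select) no longer touches a write-shared channel.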
/cc @aclements @randall77 @ianlancetaylor @rsc @josharian @mdempsky
I think this case is inherently hard. Multiple threads are hammering the same complex mutex-protected object. When memory is write-shared, a simple memory load can take up to 100ns, and the mutex makes it all worse.
The numbers need to be interpreted carefully: we divide the work across multiple threads, so the reported ns/op is wall time rather than CPU time. Rescaled to per-core cost (assuming the machine does have 64 cores), in some cases we see up to a 100x slowdown.
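To make that rescaling concrete, here is the arithmetic spelled out for the 64-P numbers above (my own worked example, assuming a 64-core machine):

    package main

    import "fmt"

    func main() {
        const cores = 64
        // The reported ns/op above is wall time; RunParallel splits b.N across
        // GOMAXPROCS workers, so per-core CPU cost per op ≈ ns/op × cores.
        const sharedBaseline = 194.0 // BenchmarkSelectShared at 1 P
        const shared64 = 296.0       // BenchmarkSelectShared-64
        const private64 = 6.31       // BenchmarkSelectPrivate-64

        fmt.Printf("shared-64 per-core cost:  %.0f ns/op\n", shared64*cores)                // ≈ 18944
        fmt.Printf("private-64 per-core cost: %.0f ns/op\n", private64*cores)               // ≈ 404
        fmt.Printf("shared slowdown vs 1 P:   %.0fx\n", shared64*cores/sharedBaseline)      // ≈ 98x
    }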
There is no easy way to fix this entirely. That would require some monstrous distributed construct that consumes considerably more memory, plus logic to detect when a chan actually falls into this case and dynamically switch the algorithm to a different scheme, and then switch back if we see that we misguessed (e.g. sends appear on what we thought was a close-notification chan). I don't think it's worth it; at least we have lots of lower-hanging fruit.
Now, there are some things I think we should do to improve this to some degree.
And a more realistic benchmark for this case would add a bunch of local non-ready chans to the select, because that's what http2 does, and those additional chans significantly prolong the critical section in select.
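A sketch of such a benchmark (my own variant, assuming the same selbench package and import as the code above): each goroutine adds several local channels that are never ready, so every select has to lock and scan more channels, closer to what an http2 connection's select looks like.

    func BenchmarkSelectSharedManyChans(b *testing.B) {
        idleShared := make(chan struct{})
        b.RunParallel(func(pb *testing.PB) {
            ch := make(chan int, 1)
            // Local channels that are never ready; they only lengthen the
            // critical section of each select.
            idle1 := make(chan struct{})
            idle2 := make(chan struct{})
            idle3 := make(chan struct{})
            for pb.Next() {
                select {
                case ch <- 1:
                case <-ch:
                case <-idle1:
                case <-idle2:
                case <-idle3:
                case <-idleShared:
                }
            }
        })
    }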
Honestly I don’t see my suggestion as the same as yours, since I was suggesting looking at all the channels without taking any locks, which is not what your code does. I may well be missing something.
The root problem is lock contention in runtime.sellock and runtime.selunlock. One idea is to replace runtime.mutex with a scalable lock, such as an MCS lock.
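For reference, here is a minimal user-space sketch of the MCS idea (my own illustration, not runtime code; the Acquire/Release names are made up, and it ignores the parking and preemption issues a real runtime.mutex replacement would have to handle): each waiter spins on a flag in its own queue node, so contended acquisition does not have every CPU hammering one shared cache line.

    package mcs

    import (
        "runtime"
        "sync/atomic"
        "unsafe"
    )

    // node is a per-acquirer queue entry; each waiter spins on its own locked flag.
    type node struct {
        next   unsafe.Pointer // *node, the waiter behind us
        locked uint32         // 1 while we must wait
    }

    // Lock holds a pointer to the tail of the waiter queue (nil when free).
    type Lock struct {
        tail unsafe.Pointer // *node
    }

    func (l *Lock) Acquire(n *node) {
        n.next = nil
        atomic.StoreUint32(&n.locked, 1)
        prev := (*node)(atomic.SwapPointer(&l.tail, unsafe.Pointer(n)))
        if prev == nil {
            return // queue was empty: we own the lock
        }
        atomic.StorePointer(&prev.next, unsafe.Pointer(n)) // link behind predecessor
        for atomic.LoadUint32(&n.locked) == 1 {
            runtime.Gosched() // spin only on our own node
        }
    }

    func (l *Lock) Release(n *node) {
        next := (*node)(atomic.LoadPointer(&n.next))
        if next == nil {
            // No visible successor; try to mark the lock free.
            if atomic.CompareAndSwapPointer(&l.tail, unsafe.Pointer(n), nil) {
                return
            }
            // A successor is in the middle of linking itself in; wait for it.
            for next == nil {
                runtime.Gosched()
                next = (*node)(atomic.LoadPointer(&n.next))
            }
        }
        atomic.StoreUint32(&next.locked, 0) // hand the lock to the successor
    }

Each goroutine passes its own node (which can live on its stack) to both Acquire and Release; the handoff is a single store to the successor's private flag rather than a contended operation on one global word.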