go: runtime: performance problem with many Cgo calls
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
go version go1.8 darwin/amd64, though I’ve also verified this happens on Linux.
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build085228252=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
Background
CockroachDB internally uses RocksDB (a fork of LevelDB) for data storage. We access RocksDB via a small Cgo wrapper. All writes to RocksDB are performed using WriteBatches, which are first written to a write-ahead log and then inserted into the database. Internally, RocksDB “groups” multiple concurrent write batches together into a single append to the write-ahead log. While investigating performance, we noticed that the size of the groups was smaller than expected.
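For a sense of what that boundary looks like: each batch commit is one cgo crossing into the RocksDB wrapper. A minimal sketch of the shape of such a wrapper (the C shim and all names here are hypothetical stand-ins, not CockroachDB’s actual code):

```go
package engine

/*
#include <stddef.h>

// Hypothetical C shim; a real wrapper would bind RocksDB's C API and
// call rocksdb_write() here.
typedef struct db db_t;
static int db_apply_batch(db_t *db, const char *data, size_t len) { return 0; }
*/
import "C"

import (
	"errors"
	"unsafe"
)

// DB wraps the C-side database handle.
type DB struct {
	ptr *C.db_t
}

// ApplyBatch commits one serialized WriteBatch. Every call is a full
// cgo crossing, so 100 concurrent writers mean 100 concurrent cgo
// calls into the C++ side.
func (db *DB) ApplyBatch(repr []byte) error {
	if len(repr) == 0 {
		return nil
	}
	rc := C.db_apply_batch(db.ptr, (*C.char)(unsafe.Pointer(&repr[0])), C.size_t(len(repr)))
	if rc != 0 {
		return errors.New("rocksdb: batch commit failed")
	}
	return nil
}
```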
What did you do?
TL;DR? If write batch grouping is performed in Go, performance is good. If it is performed in C++ (either by code that is nearly identical to the Go grouping, or by RocksDB itself), performance is bad.
https://github.com/cockroachdb/cockroach/pull/14138 started as an experiment to replace the grouping of write batches in RocksDB with our own grouping in Go, to try to understand the smaller-than-expected group sizes. In addition to fixing the group sizes, it improved performance on one experiment by 100% while reducing latencies. This was unexpected, as the Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB. To determine whether RocksDB itself was at fault, we reimplemented the grouping of write batches in our Cgo RocksDB wrapper: performance was equivalent to RocksDB performing the grouping.
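The grouping in that PR amounts to a leader/follower commit pipeline: the first writer to arrive becomes the leader, later writers queue behind it, and the leader commits the merged group with a single call into C. A simplified sketch of the pattern (my own distillation, not the actual CockroachDB code):

```go
package engine

import "sync"

// group collects the batches of concurrent writers; the leader commits
// them all at once and broadcasts the result by closing done.
type group struct {
	batches [][]byte
	err     error
	done    chan struct{}
}

// commitPipeline turns N concurrent Commit calls into ~1 commit call.
type commitPipeline struct {
	mu     sync.Mutex
	cur    *group
	commit func(batches [][]byte) error // e.g. one cgo call into RocksDB
}

func (p *commitPipeline) Commit(batch []byte) error {
	p.mu.Lock()
	g := p.cur
	if g == nil {
		g = &group{done: make(chan struct{})}
		p.cur = g
	}
	g.batches = append(g.batches, batch)
	leader := len(g.batches) == 1
	p.mu.Unlock()

	if !leader {
		<-g.done // follower: the leader commits on our behalf
		return g.err
	}

	// Leader: detach the group so later writers start a new one, then
	// commit everything that queued up behind us in a single call.
	p.mu.Lock()
	p.cur = nil
	batches := g.batches
	p.mu.Unlock()

	g.err = p.commit(batches)
	close(g.done)
	return g.err
}
```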
I’ve provided a stripped down reproduction scenario at https://github.com/petermattis/batchtest. Running on my Mac laptop I see:
~ ./batchtest -t go
_elapsed____ops/sec
1s 8029.6
2s 8432.0
3s 8100.1
4s 8296.5
5s 8171.8
6s 8282.0
7s 8040.7
8s 8133.3
9s 8240.4
10s 8221.7
~ ./batchtest -t cgo
_elapsed____ops/sec
1s 5036.4
2s 2242.4
3s 1284.3
4s 1245.5
5s 1254.5
6s 1246.1
7s 1962.5
8s 4291.7
10s 2036.3
By default, batchtest uses 100 concurrent worker threads writing “batches”. My suspicion is that batchtest and CockroachDB are tickling some badness in the Go scheduler. If I set GOMAXPROCS to the number of concurrent workers, the cgo performance gets much closer to the Go performance:
~ GOMAXPROCS=100 ./batchtest -t cgo
_elapsed____ops/sec
1s 6943.9
2s 7252.0
3s 7090.4
4s 6810.8
5s 7326.4
6s 7758.9
7s 7897.7
8s 7893.2
9s 7022.5
10s 6875.9
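For what it’s worth, the same workaround can be applied from inside the process rather than via the environment; a trivial sketch (the value 100 matches batchtest’s default worker count and is an assumption you would tune):

```go
package main

import "runtime"

func init() {
	// Oversubscribe the scheduler so that workers blocked in cgo
	// calls don't starve the remaining runnable goroutines.
	runtime.GOMAXPROCS(100)
}
```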
https://github.com/cockroachdb/cockroach/pull/14138 is an acceptable workaround for committing batches, but it would be great to understand the performance discrepancy here. We’re concerned about other Cgo operations in CockroachDB that don’t have such easy workarounds.
Several projects have been rewritten in C or C++ because of cgo overhead. One example: https://github.com/canonical/dqlite/blob/master/doc/faq.md#why-c When I tried to introduce Go to my team (in 2018), we hit the same problem.
Can somebody try to work on this?
Thank you.
Let me chime in. (I came from https://github.com/golang/go/issues/21827#issuecomment-366332843 and the symptoms are similar to what is described here, so I decided to continue this thread instead of creating a new issue. Please correct me if I am wrong.)
First some context:
Similarly to Cockroach, which uses Cgo wrappers for RocksDB, I observed performance anomalies while using Cgo wrappers for SQLite, where the presence of other goroutines combined with Cgo calls on the “main” one shows big overhead: github.com/mattn/go-sqlite3 uses Cgo and performs several Cgo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrupt if the context is canceled. With Go 1.10, avoiding spawning that interrupt goroutine when we know the context cannot be canceled brings a ~1.5x speedup to Exec (https://github.com/mattn/go-sqlite3/pull/530). However, Query was made only 5% faster, because after 2b283ce database/sql itself always spawns a goroutine to close Rows on context or transaction cancellation. Avoiding spawning the Rows.awaitDone goroutine brings a ~1.5x speedup to Query too (https://github.com/golang/go/issues/23879). The theme here is that Cgo calls show much higher overhead when performed not in a simple one-goroutine-only scenario but in the presence of other goroutines. So let’s do some analysis.
First, on my machine I can observe that under a simple 1-goroutine-only scenario the cost of making one trivial CGo call is ~60ns; it scales linearly with the number of calls, is independent of whether GOMAXPROCS is 1 or not, and is ~35x slower than making a trivial Go call.
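The harness behind this kind of measurement is roughly the following (a sketch in the spirit of the cgo.go/cgo_test.go attachments at the end; the exact attached code may differ):

```go
// cgo.go
package cgobench

/*
static void cnop(void) {}
*/
import "C"

// CNop makes one trivial cgo call into the C no-op above.
func CNop() { C.cnop() }

// GoNop is the pure-Go baseline.
//
//go:noinline
func GoNop() {}
```

```go
// cgo_test.go
package cgobench

import "testing"

// BenchmarkGoCall measures a trivial Go call
// (~60ns/35 ≈ 2ns, given the numbers above).
func BenchmarkGoCall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		GoNop()
	}
}

// BenchmarkCgoCall measures a trivial cgo call (~60ns per the text above).
func BenchmarkCgoCall(b *testing.B) {
	for i := 0; i < b.N; i++ {
		CNop()
	}
}
```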
Then we can check what happens in the presence of other goroutines by simulating client-server requests over a channel, with client and server each running in its own goroutine (so 2 goroutines in total) and the server sequentially handling requests by making one trivial Go or CGo call per request.
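A sketch of that request/response harness, extending the cgobench sketch above (simplified; the Go1/CGo1/Go10/CGo10 names below mirror it, and the -4 suffix is just go test’s GOMAXPROCS=4 suffix):

```go
// benchmarkRTT measures the client->server->client round-trip over
// channels, with the server making ncalls trivial calls per request.
func benchmarkRTT(b *testing.B, ncalls int, call func()) {
	req := make(chan struct{})
	rep := make(chan struct{})
	go func() { // server goroutine
		for range req {
			for i := 0; i < ncalls; i++ {
				call() // GoNop or CNop from the sketch above
			}
			rep <- struct{}{}
		}
	}()
	b.ResetTimer()
	for i := 0; i < b.N; i++ { // the benchmark loop is the client
		req <- struct{}{}
		<-rep
	}
	close(req)
}

func BenchmarkGo1(b *testing.B)   { benchmarkRTT(b, 1, GoNop) }
func BenchmarkCgo1(b *testing.B)  { benchmarkRTT(b, 1, CNop) }
func BenchmarkGo10(b *testing.B)  { benchmarkRTT(b, 10, GoNop) }
func BenchmarkCgo10(b *testing.B) { benchmarkRTT(b, 10, CNop) }
```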
For GOMAXPROCS=1, δ(Go1, CGo1) ≈ t(CGo call); both Go and CGo RTT times are bigger for GOMAXPROCS != 1, with δ(Go1-4, CGo1-4) also growing to ~80ns.
It is somewhat understandable that processing time increases with GOMAXPROCS != 1 even in the Go-only case, because there can now occasionally be inter-OS-thread communication, so in the context of this issue there is probably no problem here (though those 2 goroutines run almost without overlapping each other, so there should be no need to put them on different OS threads). However, the increase of δ(Go1, CGo1) over the GOMAXPROCS=1 case suggests there might be something else going on. 20ns is small, though, and might be all noise, so let’s recheck what happens when the server performs 10 calls in the handler.
Go10, Go10-4 and CGo10 show a time increase proportional to ~9·t(corresponding call). However, CGo10-4 (i.e. the CGo case with GOMAXPROCS != 1) is now 2x slower than CGo10 (the same case with GOMAXPROCS=1), by ~1µs.
The difference between CGo10 and CGo10-4 can be attributed to inter-OS-thread communication: the strace for the CGo10 case is almost empty, containing only 10ms sleeps from sysmon during the active phase.
CGo10-4, however, shows a different picture under strace: during the active phase there is now a constant high-rate churn of futex wait and wake between 2 OS threads, with 1-10 millisecond sleeps from sysmon also showing up occasionally.
This strace suggests that for CGo10-4 the scheduler somehow decided to keep one goroutine on one OS thread and the other goroutine on another OS thread, and the channel communication between them then turns into futex operations, which bring the overhead.
The draft notesleep spinning patch from https://github.com/golang/go/issues/21827#issue-256508518 mitigates the CGo10-4 problem somewhat.
However, it is not clear why making 10 CGo calls in sequence pushed the scheduler into this decision: the individual CGo call time is well under 20µs, the threshold after which (if I understood correctly) sysmon decides that something has been hanging in a CGo call or syscall for too long and removes the current OS thread from its GOMAXPROCS slot. The fact that sysmon is sleeping for >1ms in the strace also confirms that sysmon does not intervene here.
I do not know the Go scheduler’s internals, but clearly making several CGo calls in sequence shakes it enough to bring some disorder and slowdown. For reference, when I was seeing the problem with SQLite, my query time was ~5-6µs when using the Go driver directly; with the standard goroutine spawned around the query (with a ctx that is never canceled), the time increased to ~10µs. In other words, the additional penalty can be not just 1µs but higher.
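To make that concrete, the per-query watcher pattern looks roughly like this, and the Go 1.10-era speedup comes from taking the fast path when the context can never be canceled (a sketch, not go-sqlite3’s actual code; Conn.interrupt stands in for the cgo binding of sqlite3_interrupt):

```go
package sqlitewrap

import "context"

// Conn stands in for a driver connection holding a *C.sqlite3 handle.
type Conn struct{ /* ... */ }

// interrupt would call C.sqlite3_interrupt(c.db) via cgo.
func (c *Conn) interrupt() { /* ... */ }

// execWithInterrupt runs fn and interrupts the C-side query if ctx is
// canceled. When ctx.Done() == nil the context can never be canceled
// (e.g. context.Background()), so the watcher goroutine -- and the
// scheduler churn it causes next to the cgo calls -- is skipped.
func execWithInterrupt(ctx context.Context, c *Conn, fn func() error) error {
	if ctx.Done() == nil {
		return fn() // fast path: no extra goroutine spawned
	}
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			c.interrupt()
		case <-done:
		}
	}()
	return fn()
}
```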
In summary, as @petermattis said, it looks like a sequence of CGo calls triggers some badness in the Go scheduler that would be good to understand and hopefully fix.
Thanks in advance, Kirill
/cc @ianlancetaylor, @dvyukov, @rsc, @aclements, @bcmills, @mattn
Attachments: cgo.go, cgo_test.go