go: runtime: performance problem with many Cgo calls

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.8 darwin/amd64, though I’ve also verified this happens on Linux.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/pmattis/Development/go"
GORACE=""
GOROOT="/Users/pmattis/Development/go-1.8"
GOTOOLDIR="/Users/pmattis/Development/go-1.8/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/qc/fpqpgdqd167c70dtc6840xxh0000gn/T/go-build085228252=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

Background

CockroachDB internally uses RocksDB (a fork of LevelDB) for data storage. We access RocksDB via a small Cgo wrapper. All writes to RocksDB are performed using WriteBatches, which are first written to a write-ahead log and then inserted into the database. Internally, RocksDB “groups” multiple concurrent write batches together into a single append to the write-ahead log. While investigating performance, we noticed that the size of the groups was smaller than expected.

What did you do?

TL;DR? If write batch grouping is performed in Go, performance is good. If it is performed in C++ (either by code that is nearly identical to the Go grouping or by RocksDB itself), performance is bad.

https://github.com/cockroachdb/cockroach/pull/14138 started as an experiment to replace the grouping of write batches in RocksDB with our own grouping in Go to try to understand the smaller-than-expected group sizes. In addition to fixing the group sizes, it improved performance on one experiment by 100% while reducing latencies. This was unexpected, as the Cgo call overheads are negligible in comparison to the cost of committing the batch to RocksDB. In order to point the finger at RocksDB, we reimplemented the grouping of write batches in our Cgo RocksDB wrapper. Performance was equivalent to RocksDB performing the grouping.
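To illustrate the idea, here is a minimal sketch of the leader-style group-commit pattern that the Go-side grouping uses conceptually. This is not the actual code from the PR; Grouper, Commit and commitFn are illustrative names, with commitFn standing in for the single Cgo call that commits a whole group to RocksDB.

package groupcommit

import "sync"

type commit struct {
	batch []byte
	done  chan error // buffered; receives the result of the group commit
}

// Grouper batches concurrent commits so that many callers share one
// expensive commit call (e.g. one Cgo call into RocksDB).
type Grouper struct {
	commitFn func(batches [][]byte) error
	mu       sync.Mutex
	pending  []*commit
	leading  bool // true while some goroutine is acting as the leader
}

func NewGrouper(commitFn func([][]byte) error) *Grouper {
	return &Grouper{commitFn: commitFn}
}

// Commit enqueues batch and blocks until a leader has committed it. The first
// goroutine to arrive while no leader is active becomes the leader and commits
// everything queued behind it in a single call.
func (g *Grouper) Commit(batch []byte) error {
	c := &commit{batch: batch, done: make(chan error, 1)}

	g.mu.Lock()
	g.pending = append(g.pending, c)
	if g.leading {
		g.mu.Unlock()
		return <-c.done // a leader will pick us up
	}
	g.leading = true
	for len(g.pending) > 0 {
		group := g.pending
		g.pending = nil
		g.mu.Unlock()

		batches := make([][]byte, len(group))
		for i, gc := range group {
			batches[i] = gc.batch
		}
		err := g.commitFn(batches) // one call commits the whole group
		for _, gc := range group {
			gc.done <- err
		}

		g.mu.Lock()
	}
	g.leading = false
	g.mu.Unlock()
	return <-c.done // our own result was filled while we were leading
}

With grouping on the Go side, only the group leader crosses the Cgo boundary, so the number of Cgo calls drops roughly by the average group size.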

I’ve provided a stripped down reproduction scenario at https://github.com/petermattis/batchtest. Running on my Mac laptop I see:

~ ./batchtest -t go
_elapsed____ops/sec
      1s     8029.6
      2s     8432.0
      3s     8100.1
      4s     8296.5
      5s     8171.8
      6s     8282.0
      7s     8040.7
      8s     8133.3
      9s     8240.4
     10s     8221.7
~ ./batchtest -t cgo
_elapsed____ops/sec
      1s     5036.4
      2s     2242.4
      3s     1284.3
      4s     1245.5
      5s     1254.5
      6s     1246.1
      7s     1962.5
      8s     4291.7
     10s     2036.3

By default, batchtest uses 100 concurrent worker threads writing “batches”. My suspicion is that batchtest and CockroachDB are tickling some badness in the Go scheduler. If I set GOMAXPROCS to the number of concurrent workers, the cgo performance gets much closer to the Go performance:

~ GOMAXPROCS=100 ./batchtest -t cgo
_elapsed____ops/sec
      1s     6943.9
      2s     7252.0
      3s     7090.4
      4s     6810.8
      5s     7326.4
      6s     7758.9
      7s     7897.7
      8s     7893.2
      9s     7022.5
     10s     6875.9

https://github.com/cockroachdb/cockroach/pull/14138 is an acceptable workaround for committing batches, but it would be great to understand the performance discrepancy here. We’re concerned about other Cgo operations in CockroachDB that don’t have such easy workarounds.

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Reactions: 10
  • Comments: 17 (8 by maintainers)

Most upvoted comments

Several projects have been rewritten in C or C++ because of cgo overhead. One example: https://github.com/canonical/dqlite/blob/master/doc/faq.md#why-c
When I tried to introduce Go to my team (in 2018), we ran into the same problem.

Can somebody try to work on this?

Thank you.

Let me chime in. (I came from https://github.com/golang/go/issues/21827#issuecomment-366332843 and the symptoms are similar to what is described here, so I decided to continue this thread instead of creating a new issue. Please correct me if I am wrong.)

First some context:

Similarly to CockroachDB, which uses Cgo wrappers for RocksDB, I observed performance anomalies while using Cgo wrappers for SQLite, where the presence of other goroutines combined with CGo calls on the “main” one shows big overhead: github.com/mattn/go-sqlite3 uses CGo and performs several CGo calls inside one Query or Exec. There was also an interrupt goroutine spawned for every Query or Exec to call sqlite3_interrupt if the context is canceled.

With Go 1.10, avoiding spawning that interrupt goroutine when we know the context cannot be canceled brings a ~1.5x speedup to Exec (https://github.com/mattn/go-sqlite3/pull/530).

However, Query got only ~5% faster, because since 2b283ce database/sql itself always spawns a goroutine to close Rows on context or transaction cancellation. Avoiding spawning the Rows.awaitDone goroutine brings a ~1.5x speedup to Query too (https://github.com/golang/go/issues/23879).
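To make the optimization concrete, here is a minimal sketch of the pattern (not the actual go-sqlite3 or database/sql code; runWithCancel, execDirect and interrupt are hypothetical stand-ins for the driver's calls): spawn the watcher goroutine only when the context can actually be canceled, which ctx.Done() == nil rules out.

package ctxwatch

import "context"

// runWithCancel spawns a watcher goroutine only when the context can actually
// deliver a cancellation; ctx.Done() == nil means it never will
// (e.g. context.Background()).
func runWithCancel(ctx context.Context, execDirect func() error, interrupt func()) error {
	if ctx.Done() == nil {
		// Fast path: no extra goroutine, so the Cgo calls inside execDirect
		// run without another runnable goroutine competing for the scheduler.
		return execDirect()
	}
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			interrupt() // e.g. sqlite3_interrupt via a Cgo call
		case <-done:
			// work finished normally
		}
	}()
	return execDirect()
}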


The theme here is that CGo calls show much higher overhead when performed not in a simple single-goroutine scenario but in the presence of other goroutines. So let's do some analysis:

First, on my machine I can observe that in a simple 1-goroutine-only scenario the cost of making one trivial CGo call is ~60ns, it scales linearly with the number of calls, it is independent of whether GOMAXPROCS is 1 or not, and it is ~35x slower than making a trivial Go call (benchmark source: cgo.go and cgo_test.go, attached below):

CallGo1              1.71ns ± 0%
CallGo1-4            1.71ns ± 0%
CallGo10             16.0ns ± 0%
CallGo10-4           16.1ns ± 0%
CallCGo1             58.3ns ± 1%
CallCGo1-4           58.6ns ± 1%
CallCGo10             626ns ± 0%
CallCGo10-4           611ns ± 5%

Then we can check what happens in the presence of other goroutines by simulating client-server requests over a channel, with the client and the server each running in its own goroutine and the server handling requests sequentially (so 2 goroutines in total), the handler making one trivial Go or CGo call per request:

RTTSeqServerGo1     396ns ± 0%
RTTSeqServerGo1-4   458ns ± 0%
RTTSeqServerCGo1    454ns ± 0%
RTTSeqServerCGo1-4  536ns ± 1%

What can be seen here is that for GOMAXPROCS=1, δ(Go1, CGo1) = 454ns − 396ns ≈ 58ns ≈ t(CGo call). Both Go and CGo RTT times are bigger for GOMAXPROCS != 1, with δ(Go1-4, CGo1-4) = 536ns − 458ns also growing to ~80ns.

It is somewhat understandable that processing time increases with GOMAXPROCS != 1 even in the Go-only case, because there can now be inter-OS-thread communication, so in the context of this issue there is probably no problem here (though those 2 goroutines run almost without overlapping each other, so there should be no need to put them on different OS threads). However, the increase of δ from ~58ns (GOMAXPROCS=1) to ~80ns (GOMAXPROCS=4) suggests there might be something else going on. That ~20ns is small, though, and might be all noise, so let's recheck what happens when the server performs 10 calls in the handler:

RTTSeqServerGo10      419ns ± 1%
RTTSeqServerGo10-4    486ns ± 1%
RTTSeqServerCGo10    1.00µs ± 0%
RTTSeqServerCGo10-4  1.97µs ± 4%

Go10, Go10-4 and CGo10 show a time increase proportional to ~9·t(corresponding call). However, CGo10-4 (the CGo case with GOMAXPROCS != 1) is now ~1µs, i.e. 2x, slower than CGo10 (the same case with GOMAXPROCS=1).

The difference between CGo10 and CGo10-4 can be attributed to inter-OS-thread communication: the strace for the CGo10 case is almost empty and contains only 10ms sleeps from sysmon during the active phase:

1403  14:41:40.757900 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403  14:41:40.767974 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
1403  14:41:40.778049 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL) = 0 (Timeout)
...

However, CGo10-4 shows a different picture under strace: during the active phase there is now a constant high-rate churn of futex wait and wake between 2 OS threads, with 1-10 millisecond sleeps from sysmon also showing up occasionally:

1447  14:42:41.733626 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452  14:42:41.733631 <... futex resumed> ) = 0
1447  14:42:41.733636 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452  14:42:41.733645 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447  14:42:41.733664 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452  14:42:41.733669 <... futex resumed> ) = 0
1447  14:42:41.733674 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452  14:42:41.733697 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447  14:42:41.733703 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452  14:42:41.733708 <... futex resumed> ) = 0
1447  14:42:41.733713 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452  14:42:41.733722 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447  14:42:41.733728 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452  14:42:41.733733 <... futex resumed> ) = 0
1447  14:42:41.733738 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452  14:42:41.733747 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
1447  14:42:41.733753 futex(0xc420012d48, FUTEX_WAIT, 0, NULL <unfinished ...>
1452  14:42:41.733758 <... futex resumed> ) = 0
1447  14:42:41.733763 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
1452  14:42:41.733772 futex(0xc420012d48, FUTEX_WAKE, 1 <unfinished ...>
...
# seldom
1446  14:42:41.225481 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446  14:42:41.235566 <... pselect6 resumed> ) = 0 (Timeout)
1446  14:42:41.235594 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
1446  14:42:41.245656 <... pselect6 resumed> ) = 0 (Timeout)
1446  14:42:41.245670 pselect6(0, NULL, NULL, NULL, {tv_sec=0, tv_nsec=10000000}, NULL <unfinished ...>
...
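(In case someone wants to reproduce traces like the above: I believe something along these lines works, with the benchmarks attached below compiled into a test binary; the exact file names and strace flags are illustrative.)

go test -c -o cgobench.test
strace -f -tt -o cgo10-gomaxprocs1.trace ./cgobench.test -test.bench 'RTTSeqServerCGo10$' -test.cpu 1
strace -f -tt -o cgo10-gomaxprocs4.trace ./cgobench.test -test.bench 'RTTSeqServerCGo10$' -test.cpu 4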

This strace suggests that for CGo10-4 the scheduler somehow decided to keep one goroutine on one OS thread and the other goroutine on another OS thread, so channel communication between them turns into futex operations, which bring the overhead.

The draft notesleep spinning patch from https://github.com/golang/go/issues/21827#issue-256508518 can mitigate the CGo10-4 problem somewhat:

                     Go1.10       Go1.10+notesleep-spin
name                 old time/op  new time/op  delta
RTTSeqServerGo1       397ns ± 0%   405ns ± 0%   +2.01%  (p=0.000 n=9+10)
RTTSeqServerGo1-4     461ns ± 1%   462ns ± 0%     ~     (p=0.640 n=10+9)
RTTSeqServerCGo1      456ns ± 0%   466ns ± 0%   +2.17%  (p=0.000 n=9+9)
RTTSeqServerCGo1-4    532ns ± 1%   532ns ± 1%     ~     (p=0.751 n=10+10)
RTTSeqServerGo10      419ns ± 1%   424ns ± 0%   +1.17%  (p=0.000 n=10+10)
RTTSeqServerGo10-4    486ns ± 1%   484ns ± 1%   -0.51%  (p=0.018 n=10+10)
RTTSeqServerCGo10    1.00µs ± 0%  1.02µs ± 0%   +1.44%  (p=0.000 n=9+10)
RTTSeqServerCGo10-4  1.97µs ± 4%  1.13µs ± 1%  -42.92%  (p=0.000 n=10+10)
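(For reference: the table above is in benchstat format. A similar comparison can be produced by saving several runs from each toolchain and feeding them to benchstat, roughly along these lines; the file names are illustrative.)

go test -bench RTTSeqServer -count 10 > go1.10.txt
# rebuild the toolchain with the notesleep-spin patch, then:
go test -bench RTTSeqServer -count 10 > notesleep-spin.txt
benchstat go1.10.txt notesleep-spin.txt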

However, it is not clear why making 10 CGo calls in sequence caused the scheduler to make this decision: the individual CGo call time is well under 20µs, which, if I understood correctly, is the time after which sysmon decides that something has been hanging in a CGo call or syscall for too long and removes the current OS thread from its GOMAXPROCS slot. The fact that sysmon is sleeping for >1ms in the strace also confirms that sysmon does not intervene here.

I do not know the Go scheduler internals, but clearly making several CGo calls in sequence shakes it enough to bring some disorder and slowdown. For reference, when I was seeing the problem with SQLite, my query time was ~5-6µs if I used the Go driver directly, and with the standard

done := make(chan struct{})

go func() {
	select {
	case <-ctx.Done():
		// context canceled: call some cancel (e.g. sqlite3_interrupt)

	case <-done:
		// work finished ok
	}
}()

// perform query ...
close(done)

spawned around the query, with ctx never being canceled, the time increased to ~10µs. In other words, the additional penalty can be not just 1µs but higher.

In summary, as @petermattis said, it looks like a sequence of CGo calls triggers some badness in the Go scheduler that would be good to understand and hopefully fix.

Thanks in advance, Kirill

/cc @ianlancetaylor, @dvyukov, @rsc, @aclements, @bcmills, @mattn


(cgo.go)

package cgotest

// int argadjust(int arg) { return 3 + arg; }
import "C"

// cargadjust wraps C.argadjust. It is defined here because test files cannot
// use cgo ("C") directly.
func cargadjust(arg int) int {
	return int(C.argadjust(C.int(arg)))
}

(cgo_test.go)

package cgotest

import "testing"

//go:noinline
func goargadjust(i int) int {
	return i + 3
}

// BenchmarkCallGo1 measures overhead of 1 Go call.
func BenchmarkCallGo1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		x := goargadjust(i)
		if x != i + 3 {
			b.Fatalf("goargadjust -> %d  ; want %d", x, i + 3)
		}
	}
}

// BenchmarkCallGo10 measures overhead of 10 Go calls performed sequentially.
func BenchmarkCallGo10(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j := 0; j < 10; j++ {
			x := goargadjust(i)
			if x != i + 3 {
				b.Fatalf("goargadjust -> %d  ; want %d", x, i + 3)
			}
		}
	}
}


// BenchmarkCallCGo1 measures overhead of 1 CGo call.
func BenchmarkCallCGo1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		x := cargadjust(i)
		if x != i + 3 {
			b.Fatalf("cargadjust -> %d  ; want %d", x, i + 3)
		}
	}
}

// BenchmarkCallCGo10 measures overhead of 10 CGo calls performed sequentially.
func BenchmarkCallCGo10(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j := 0; j < 10; j++ {
			x := cargadjust(i)
			if x != i + 3 {
				b.Fatalf("cargadjust -> %d  ; want %d", x, i + 3)
			}
		}
	}
}


// ---------

type req struct {
	respq chan *resp
	arg   int
}

type resp struct {
	ret int
}

// client performs one request-reply cycle to a server over srvlink.
func client(srvlink chan *req, arg int) int {
	rc := make(chan *resp)
	srvlink <- &req{
		respq: rc,
		arg:   arg,
	}
	ret := <-rc
	return ret.ret
}

// server receives requests over inq, passes each received request to
// handler, and sends the result back over the request's response channel.
//
// server runs until it processes all requests from inq.
type server func(inq chan *req, handler func(int) int)

// seqServer sequentially handles requests received over inq.
func seqServer(inq chan *req, handler func(int) int) {
	for r := range inq {
		r.respq <- &resp{ret: handler(r.arg)}
	}
}

// goServer concurrently handles requests received over inq.
func goServer(inq chan *req, handler func(int) int) {
	for r := range inq {
		r := r
		go func() {
			r.respq <- &resp{ret: handler(r.arg)}
		}()
	}
}

// benchmarkRTT measures client-server round-trip latency for a particular
// server implementation.
func benchmarkRTT(b *testing.B, srv func(chan *req)) {
	inc := make(chan *req)
	go srv(inc)
	for i := 0; i < b.N; i++ {
		client(inc, i)
	}
	close(inc)
}


// --------

// mkServerGo1N returns a function that will run srv with n Go calls per request.
func mkServerGo1N(srv server, n int) func(chan *req) {
	return func(inq chan *req) {
		srv(inq, func(arg int) int {
			for i := 0; i < n; i++ {
				arg = goargadjust(arg)
			}
			return arg
		})
	}
}

// mkServerCGoN returns a function that will run srv with n CGo calls per request.
func mkServerCGoN(srv server, n int) func(chan *req) {
	return func(inq chan *req) {
		srv(inq, func(arg int) int {
			for i := 0; i < n; i++ {
				arg = cargadjust(arg)
			}
			return arg
		})
	}
}

func BenchmarkRTTSeqServerGo1(b *testing.B)	{ benchmarkRTT(b, mkServerGo1N(seqServer, 1)) }
//func BenchmarkRTTGoServerGo1(b *testing.B)	{ benchmarkRTT(b, mkServerGo1N(goServer, 1)) }

func BenchmarkRTTSeqServerCGo1(b *testing.B)	{ benchmarkRTT(b, mkServerCGoN(seqServer, 1)) }
//func BenchmarkRTTGoServerCGo1(b *testing.B)	{ benchmarkRTT(b, mkServerCGoN(goServer,  1)) }

func BenchmarkRTTSeqServerGo10(b *testing.B)	{ benchmarkRTT(b, mkServerGo1N(seqServer, 10)) }
func BenchmarkRTTSeqServerCGo10(b *testing.B)	{ benchmarkRTT(b, mkServerCGoN(seqServer, 10)) }
//func BenchmarkRTTGoServerCGo10(b *testing.B)	{ benchmarkRTT(b, mkServerCGoN(goServer,  10)) }
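
For completeness: the “-4” suffix in the benchmark names above is the GOMAXPROCS value. The numbers were presumably obtained with something like

go test -bench . -count 10 -cpu 1,4

where -cpu tells go test to run each benchmark once per listed GOMAXPROCS value.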