go: runtime: syscalls are ~ 1.4x slower than in C
Update 20170317: part of original 1.7x slowness was found to be due to os.File regression, not syscall itself (https://github.com/golang/go/issues/19563#issuecomment-287296706, #19586). Raw Go syscalls are ~ 1.4x slower than in C.
Please answer these questions before submitting your issue. Thanks!
What did you do?
Hello up there. First of all, let me say right from the beginning that I understand that in Go, due to goroutines, there is more work to make a system call compared to C. Still, please consider the following two programs, which measure the time to do a pread in Go and in C:
```go
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"time"
)

// 12345678
const Niter = 1000000

// xpread is pread(2) but aborts on an error or when fewer bytes were read
func xpread(f *os.File, buf []byte, offset int64) {
	n, err := f.ReadAt(buf, offset)
	if err != nil {
		panic(err)
	}
	if n != len(buf) {
		panic(fmt.Errorf("pread(%v) -> %v", len(buf), n))
	}
}

const BufSize = 4096

var buf [BufSize]byte

func main() {
	flag.Parse()
	if len(flag.Args()) < 1 {
		log.Fatalf("Usage: %s <file>", os.Args[0])
	}
	f, err := os.Open(flag.Args()[0])
	if err != nil {
		log.Fatal(err)
	}

	for size := 0; size <= BufSize; {
		sbuf := buf[:size]

		// warmup
		xpread(f, sbuf, /*offset=*/ 0)

		Tstart := time.Now()
		for i := 0; i < Niter; i++ {
			xpread(f, sbuf, /*offset=*/ 0)
		}
		Tend := time.Now()

		fmt.Printf("%v\t%.1f\n", size, float64(Tend.Sub(Tstart).Nanoseconds())/Niter)

		switch {
		case size < 16:
			size += 1 // detailed start
		case size < 128:
			size += 16
		default:
			size += 128
		}
	}
}
```
(https://play.golang.org/p/eY1WVf7cZV)
```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// 12345678
#define NITER 1000000

// microtime returns current time as double
double microtime() {
	int err;
	struct timeval tv;

	err = gettimeofday(&tv, NULL);
	if (err == -1) {
		perror("gettimeofday");
		abort();
	}

	return tv.tv_sec + 1E-6 * tv.tv_usec;
}

#define BUFSIZE 4096
static char buf[BUFSIZE];

// xpread is pread(2) but aborts on an error or when fewer bytes were read
void xpread(int fd, void *buf, size_t count, off_t offset) {
	ssize_t n;

	n = pread(fd, buf, count, offset);
	if (n == -1) {
		perror("pread");
		abort();
	}
	if ((size_t)n != count) {
		fprintf(stderr, "pread(%zu) -> %zd\n", count, n);
		abort();
	}
}

int main(int argc, char *argv[]) {
	int fd, i;
	size_t size;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <file>\n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open");
		abort();
	}

	for (size = 0; size <= BUFSIZE;) {
		// warmup
		xpread(fd, buf, size, /*offset=*/0);

		double Tstart, Tend;
		Tstart = microtime();
		for (i = 0; i < NITER; i++) {
			xpread(fd, buf, size, /*offset=*/0);
		}
		Tend = microtime();

		printf("%zu\t%.1f\n", size, (Tend - Tstart) / NITER / 1E-9);

		if (size < 16) {
			size += 1; // detailed start
		} else if (size < 128) {
			size += 16;
		} else {
			size += 128;
		}
	}
}
```
Below is a plot of their output when run on a 4K file on tmpfs (`dd if=/dev/urandom of=/dev/shm/x.dat bs=4K count=1`):
From the plot it can be seen that the time to do a pread of n bytes is α + β⋅n. On my computer β ≈ 0.01 ns/byte, and α ≈ 175 ns for the C case and ≈ 325 ns for the Go case. A Go syscall is thus taking ~150 ns more time than the C version. For programs that make many system calls, that can become noticeable.
The profile for the Go version is below:

```
File: t_syspread_go
Type: cpu
Time: Mar 15, 2017 at 10:11pm (MSK)
Duration: 18.42s, Total samples = 183.20ms ( 0.99%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 156.70ms, 85.53% of 183.20ms total
Dropped 8 nodes (cum <= 0.92ms)
Showing top 10 nodes out of 24
      flat  flat%   sum%        cum   cum%
  112.80ms 61.57% 61.57%   141.70ms 77.35%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    9.20ms  5.02% 66.59%     9.20ms  5.02%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
    6.30ms  3.44% 70.03%    12.90ms  7.04%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    4.90ms  2.67% 72.71%     4.90ms  2.67%  internal/poll.(*fdMutex).rwlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.80ms  2.62% 75.33%     4.80ms  2.62%  internal/poll.(*fdMutex).rwunlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.10ms  2.24% 77.57%     5.60ms  3.06%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.90ms  2.13% 79.69%     3.90ms  2.13%  runtime.newdefer /home/kirr/src/tools/go/go/src/runtime/panic.go
    3.80ms  2.07% 81.77%    12.80ms  6.99%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.60ms  1.97% 83.73%   179.80ms 98.14%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
    3.30ms  1.80% 85.53%     6.30ms  3.44%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go
```

(svg version here)
From it, the following could be suggested for investigation:

- the defer usage in the IO poller is quite noticeable. Either making defer faster (#14939, #19466, #18923, …) or not using defer could be a solution.

- runtime·entersyscall and runtime·exitsyscall take a good part of the time. Maybe something could be done about it? (probably related: #17327)

- internal/poll locking: at least on Linux it is ok to issue several preads simultaneously (please correct me if I'm wrong). From the pread man page:

  > The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.

  (http://man7.org/linux/man-pages/man2/pread.2.html)

  Thus no locking is needed to serialize pread vs pread or vs read. Maybe some locking is still needed for internal bookkeeping, but I wonder whether it can be made lighter.

- maybe something else.
Once again, I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead were minimized.
Thanks beforehand, Kirill
P.S. All sources for this benchmark are accessible here: https://lab.nexedi.com/kirr/misc/tree/bfa03189
What did you expect to see?
Making a syscall in Go costs only a bit more than in C.
What did you see instead?
Making pread syscall in Go is ~ 1.7x slower on average than in C for reads not bigger than one page (4K).
Does this issue reproduce with the latest release (go1.8)?
Yes, but with go1.8 it was only ~90 ns slower:

In other words, pread on Go tip is ~60 ns slower than on go1.8.
System details
```
go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/go"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build990167505=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
```
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 3
- Comments: 36 (36 by maintainers)
@aclements thanks for your feedback. I was going to blame myself for doing something stupid last time, because today I tried and could not reproduce my previous results for BenchmarkHTTPClientServer either on today's tip (5a45a157) or on a tree state more-or-less corresponding to the time of my comment that claimed -2.15% on the benchmark.

However, with all that I could not believe myself to blame, because I know I did not cook the numbers, and here they are in raw form for the runs on Mar 20:

So after trying many things and looking here and there, I noticed: the average time per operation is sometimes a bit higher the more iterations the benchmark driver makes. E.g. (from 0.txt):

That means that while the whole time to run one benchmark for 30000 iterations was ~1s (the go test default for -benchtime), the time to run one benchmark for 50000 iterations was ~1.7s - i.e. almost twice as long.
Then I looked at the CPU profile for BenchmarkHTTPClientServer run with `-benchtime 1s` and `-benchtime 10s`, and they are visibly different. Notice in particular:
That led me to thinking that much less garbage collection actually happens when the bench time is only 1s. That turned out to be true even if the benchmark is repeated several times, because the Go benchmark driver explicitly calls GC in between runs, and that time is not accounted to the actual benchmark in question:

https://github.com/golang/go/blob/5a45a157/src/testing/benchmark.go#L141

That means that for benchmarks which do generate garbage, but not enough to systematically trigger a statistically sound number of actual garbage collections within the default benchtime=1s, the default is not appropriate: it does not show the cost associated with garbage collections (and casgstatus is associated 100% with the GC), even when run with `-test.count=N`. And BenchmarkHTTPClientServer happens to be a benchmark of such kind.

So with this in mind, the correct procedure to test the effect of GC-related patches must run the benchmark with benchtime=Ns, where N is a timeframe in which GC actually triggers a statistically sound number of times. This way I once again tested today's tip (5a45a157), without and with the patch, with N=50s, on Linux 4.9.0-2-amd64 with an i7-6600U CPU:
and here is what it gives:
I've also made 3 smaller runs with N=30s and got for them delta = -0.64% (B1), -3.48% (B2) and -1.42% (B4). It seems that to get the real delta with higher probability, bigger N and a higher count are required, but from the data present it is already clear the delta oscillates around -1.5% or -2%.
Raw data for the benchmarks are available here: https://lab.nexedi.com/kirr/misc/tree/d13eb36c/bench/nocasgstatus/5a45a157
Hope this clarifies things.
Thanks again, Kirill
P.S. And last time it seems I was lucky to observe the effect with the default `-benchtime=1s`…

@aclements first of all thanks for your feedback, and for the heads-up that atomics have become not so expensive. Looks like the cache-coherency implementation in x86 has become much more advanced over time. Anyway
On my side with it for pread-related benchmarks I get:
which more or less confirms your findings.
Here comes something to consider: during e.g. a web-serving workload, a lot of system calls are made. If I understand correctly, on e.g. any socket, for Read or Write a direct system call is made first, and only if it returns EAGAIN is the fd registered with the netpoller (or the operation succeeds right away if the fd was already ready). Read and write syscalls are also used later to send/receive data when the fd becomes ready. So there are many system calls involved.
With this in mind I've tried bench/go1's BenchmarkHTTPClientServer without and with your patch, and here is what it gives on my computer:

The effect is quite visible, right? And that is with the go1.3 → go1.4 syscall-related change only partly mitigated.
So what I would suggest is:

All that because `runtime·entersyscall`/`runtime·exitsyscall` are executed during every system call, and programs usually make many of them - at least if we consider web servers or programs that work with files.

Thanks again, Kirill

P.S. Thanks also for the cycles:pp 2 and perflock references.
Btw, I've looked around a bit and noticed: in `runtime·entersyscall`/`runtime·exitsyscall` land, a lot of time is spent on atomics, e.g. in `casgstatus`, and it is known that atomics are quite expensive. Prior to go1.4, `g.status` was updated in a non-atomic way: see commit 0a7c7ac8 (https://golang.org/cl/132960044) - the change was added there to support concurrent GC.

Here is an updated benchmark for the pread syscall for all Go releases:

(updated benchmark, test)

and it indeed shows the go1.3 → go1.4 20 ns slowdown (and another go1.0/go1.1 → go1.2 15 ns slowdown I have not checked the reason for).
So if atomics for e.g. `g.status` updates were introduced to support concurrent GC, maybe what is also possible is to update `g.status` not always atomically, but in a way similar to what happens in `runtime·writebarrierptr()`: there, before updating a pointer, the runtime checks whether a write barrier is needed at all, and if not, it becomes a simple store. And if I understand correctly, write barriers are needed only during a certain phase of garbage collection - in other words "usually not" most of the time. Probably we could have a similar flag for `g.status` + friends updates too.

I do not know the GC details, so please forgive me in advance if the above is dumb in one way or another. It sounds reasonable to me offhand, however.
Thanks beforehand, Kirill
/cc @RLH, @aclements
Your profile is still not even close to CPU-bound: less than 1% of the samples were executing instructions in the user part of the program.
I will also note that 40ns is approximately the cost of an L3 cache miss on modern AMD64 hardware.
Ah, now I understand why the pprof output looks so strange: the program is not even remotely CPU-bound in the application code.
So the pprof output likely looks strange because it’s drawing from a very small sample size: the sampling error swamps out most of the interesting detail.