go: runtime: syscalls are ~ 1.4x slower than in C
Update 20170317: part of original 1.7x slowness was found to be due to os.File regression, not syscall itself (https://github.com/golang/go/issues/19563#issuecomment-287296706, #19586). Raw Go syscalls are ~ 1.4x slower than in C.
Please answer these questions before submitting your issue. Thanks!
What did you do?
Hello up there. First of all, let me say right from the beginning that I understand that in Go, due to goroutines, there is more work to make a system call compared to C. Still, please consider the following two programs, which measure the time to do a pread in Go and in C:
```go
package main

import (
	"flag"
	"fmt"
	"log"
	"os"
	"time"
)

// 12345678
const Niter = 1000000

// xpread is pread(2) but aborts on an error or when fewer bytes were read
func xpread(f *os.File, buf []byte, offset int64) {
	n, err := f.ReadAt(buf, offset)
	if err != nil {
		panic(err)
	}
	if n != len(buf) {
		panic(fmt.Errorf("pread(%v) -> %v", len(buf), n))
	}
}

const BufSize = 4096

var buf [BufSize]byte

func main() {
	flag.Parse()
	if len(flag.Args()) < 1 {
		log.Fatalf("Usage: %s <file>", os.Args[0])
	}
	f, err := os.Open(flag.Args()[0])
	if err != nil {
		log.Fatal(err)
	}

	for size := 0; size <= BufSize; {
		sbuf := buf[:size]

		// warmup
		xpread(f, sbuf, /*offset=*/ 0)

		Tstart := time.Now()
		for i := 0; i < Niter; i++ {
			xpread(f, sbuf, /*offset=*/ 0)
		}
		Tend := time.Now()

		fmt.Printf("%v\t%.1f\n", size, float64(Tend.Sub(Tstart).Nanoseconds())/Niter)

		switch {
		case size < 16:
			size += 1 // detailed start
		case size < 128:
			size += 16
		default:
			size += 128
		}
	}
}
```
(https://play.golang.org/p/eY1WVf7cZV)
```c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

// 12345678
#define NITER 1000000

// microtime returns current time as double
double microtime() {
	int err;
	struct timeval tv;

	err = gettimeofday(&tv, NULL);
	if (err == -1) {
		perror("gettimeofday");
		abort();
	}

	return tv.tv_sec + 1E-6 * tv.tv_usec;
}

#define BUFSIZE 4096
static char buf[BUFSIZE];

// xpread is pread(2) but aborts on an error or when fewer bytes were read
void xpread(int fd, void *buf, size_t count, off_t offset) {
	ssize_t n;

	n = pread(fd, buf, count, offset);
	if (n == -1) {
		perror("pread");
		abort();
	}
	if ((size_t)n != count) {
		fprintf(stderr, "pread(%zu) -> %zd\n", count, n);
		abort();
	}
}

int main(int argc, char *argv[]) {
	int fd, i;
	size_t size;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s <file>\n", argv[0]);
		exit(1);
	}

	fd = open(argv[1], O_RDONLY);
	if (fd == -1) {
		perror("open");
		abort();
	}

	for (size = 0; size <= BUFSIZE;) {
		// warmup
		xpread(fd, buf, size, /*offset=*/0);

		double Tstart, Tend;
		Tstart = microtime();
		for (i = 0; i < NITER; i++) {
			xpread(fd, buf, size, /*offset=*/0);
		}
		Tend = microtime();

		printf("%zu\t%.1f\n", size, (Tend - Tstart) / NITER / 1E-9);

		if (size < 16) {
			size += 1; // detailed start
		} else if (size < 128) {
			size += 16;
		} else {
			size += 128;
		}
	}
}
```
Below is a plot of their output when run on a 4K file on tmpfs (`dd if=/dev/urandom of=/dev/shm/x.dat bs=4K count=1`):
From the plot it can be seen that the time to do a pread of n bytes is α + β⋅n. On my computer β ≈ 0.01 ns/byte, and α ≈ 175 ns for the C case and ≈ 325 ns for the Go case. A Go syscall is thus taking ~150 ns more time than the C version. For programs that make many system calls, that can become noticeable.
The profile for the Go version is below:

```
File: t_syspread_go
Type: cpu
Time: Mar 15, 2017 at 10:11pm (MSK)
Duration: 18.42s, Total samples = 183.20ms ( 0.99%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 156.70ms, 85.53% of 183.20ms total
Dropped 8 nodes (cum <= 0.92ms)
Showing top 10 nodes out of 24
      flat  flat%   sum%        cum   cum%
  112.80ms 61.57% 61.57%   141.70ms 77.35%  syscall.Syscall6 /home/kirr/src/tools/go/go/src/syscall/asm_linux_amd64.s
    9.20ms  5.02% 66.59%     9.20ms  5.02%  runtime.casgstatus /home/kirr/src/tools/go/go/src/runtime/proc.go
    6.30ms  3.44% 70.03%    12.90ms  7.04%  runtime.reentersyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    4.90ms  2.67% 72.71%     4.90ms  2.67%  internal/poll.(*fdMutex).rwlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.80ms  2.62% 75.33%     4.80ms  2.62%  internal/poll.(*fdMutex).rwunlock /home/kirr/src/tools/go/go/src/internal/poll/fd_mutex.go
    4.10ms  2.24% 77.57%     5.60ms  3.06%  runtime.exitsyscallfast /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.90ms  2.13% 79.69%     3.90ms  2.13%  runtime.newdefer /home/kirr/src/tools/go/go/src/runtime/panic.go
    3.80ms  2.07% 81.77%    12.80ms  6.99%  runtime.exitsyscall /home/kirr/src/tools/go/go/src/runtime/proc.go
    3.60ms  1.97% 83.73%   179.80ms 98.14%  os.(*File).ReadAt /home/kirr/src/tools/go/go/src/os/file.go
    3.30ms  1.80% 85.53%     6.30ms  3.44%  runtime.deferreturn /home/kirr/src/tools/go/go/src/runtime/panic.go
```

(svg version here)
From it, the following could be suggested for investigation:

- the defer usage in the IO poller is quite noticeable. Either making defer faster (#14939, #19466, #18923, …) or not using defer could be a solution.

- runtime·entersyscall and runtime·exitsyscall take a good part of the time. Maybe something could be done about it? (probably related: #17327)

- internal/poll locking: at least on Linux it is ok to issue several preads simultaneously (please correct me if I'm wrong). From the pread man page:

  > The pread() and pwrite() system calls are especially useful in multithreaded applications. They allow multiple threads to perform I/O on the same file descriptor without being affected by changes to the file offset by other threads.

  (http://man7.org/linux/man-pages/man2/pread.2.html)

  Thus no locking is needed to serialize pread vs pread or vs read. Maybe some locking is still needed for internal bookkeeping, but I wonder whether it can be made lighter.

- maybe something else.
Once again, I understand that in Go, due to goroutines, there is more work to make a system call compared to C, but it would be nice if the overhead were minimized.
Thanks beforehand, Kirill
P.S. All sources for this benchmark are accessible here: https://lab.nexedi.com/kirr/misc/tree/bfa03189
What did you expect to see?
Making a syscall in Go costs only a bit more than in C.
What did you see instead?
Making pread syscall in Go is ~ 1.7x slower on average than in C for reads not bigger than one page (4K).
Does this issue reproduce with the latest release (go1.8)?
Yes, but with go1.8 it was only ~90 ns slower:

In other words, pread on Go tip is ~60 ns slower than on go1.8.
System details
```
go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/go"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build990167505=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +43afcb5c96 Wed Mar 15 04:27:49 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID: Debian
Description:    Debian GNU/Linux 9.0 (stretch)
Release:        9.0
Codename:       stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
```
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Reactions: 3
- Comments: 36 (36 by maintainers)
@aclements thanks for your feedback. I was going to blame myself for doing something stupid last time, because today I tried and could not reproduce my previous results for BenchmarkHTTPClientServer either on today's tip (5a45a157) or on a tree state more-or-less corresponding to the time of my comment that claimed -2.15% on the benchmark.

However, with all that I could not believe myself to blame, because I know I did not cook the numbers, and here they are in raw form for the runs on Mar 20:

So after trying many things and looking here and there, I noticed: the average time per operation is sometimes a bit higher the more iterations the benchmark driver makes. E.g. (from 0.txt):

That means that while the whole time to run one benchmark for 30000 iterations was ~1s (the go test default for -benchtime), the time to run one benchmark for 50000 iterations was ~1.7s - i.e. almost twice as long.
Then I looked at the CPU profile for BenchmarkHTTPClientServer run with `-benchtime 1s` and `-benchtime 10s`, and they are visibly different. Notice in particular:
That led me to thinking that much less garbage collection actually happens when the bench time is only 1s. That turned out to be true even if the benchmark is repeated several times, because the Go benchmark driver explicitly calls GC in between runs, and that time is not accounted to the actual benchmark in question:

https://github.com/golang/go/blob/5a45a157/src/testing/benchmark.go#L141

That means that for benchmarks which do generate garbage, but not enough to systematically trigger a statistically sound number of actual garbage collections within the default benchtime=1s, the default is not appropriate: it does not show the cost associated with garbage collections (and casgstatus is associated 100% with the GC), even when run with `-test.count=N`. And BenchmarkHTTPClientServer happens to be a benchmark of such kind.

So with this in mind, the correct procedure to test the effect of GC-related patches must run the benchmark with benchtime=Ns, where N is a timeframe in which GC actually triggers a statistically sound number of times. This way I once again tested today's tip (5a45a157), without and with the patch, with N=50s, on Linux 4.9.0-2-amd64 with an i7-6600U CPU:
and here is what it gives:
I've also made 3 smaller runs with N=30s and got for them delta = -0.64% (B1), -3.48% (B2) and -1.42% (B4). It seems that to get the real delta with higher probability, bigger N and a higher count are required, but from the data present it is already clear the delta oscillates around -1.5% or -2%.
Raw data for the benchmarks are available here: https://lab.nexedi.com/kirr/misc/tree/d13eb36c/bench/nocasgstatus/5a45a157
Hope this clarifies things.
Thanks again, Kirill
P.S. And last time it seems I was lucky to observe the effect with the default `-benchtime=1s`…

@aclements first of all thanks for your feedback, and for the heads-up that atomics have become not so expensive. Looks like the cache-coherency implementation in x86 has become much more advanced over time. Anyway
On my side with it for pread-related benchmarks I get:
which more or less confirms your findings.
Here comes something to consider: during e.g. a web-serving workload, a lot of system calls are made. If I understand correctly, on e.g. any socket, for Read or Write a direct system call is made first, and only if it returns EAGAIN is the fd registered with the netpoller (or the operation succeeds right away if the fd was already ready). Read and write syscalls are also used later to send/receive data when the fd becomes ready. So there are many system calls involved.
With this in mind I've tried bench/go1's BenchmarkHTTPClientServer without and with your patch, and here is what it gives on my computer:

The effect is quite visible, right? And that is with the go1.3 → go1.4 syscall-related change only partly mitigated.
So what I would suggest is:

All that because `runtime·entersyscall`/`runtime·exitsyscall` are executed during every system call, and programs usually make many of them - at least if we consider web servers or programs that work with files.

Thanks again, Kirill

P.S. Thanks also for the cycles:pp 2 and perflock references.
Btw, I've looked around a bit and noticed: in `runtime·entersyscall`/`runtime·exitsyscall` land, a lot of time is spent on atomics, e.g. in `casgstatus`, and it is known that atomics are quite expensive. Prior to go1.4, `g.status` was updated in a non-atomic way: see commit 0a7c7ac8 (https://golang.org/cl/132960044) - the change was added there to support concurrent GC.

Here is an updated benchmark for the pread syscall for all Go releases:

(updated benchmark, test)

and it indeed shows the go1.3 → go1.4 20 ns slowdown (and another go1.0/go1.1 → go1.2 15 ns slowdown I have not checked the reason for).
So if atomics for e.g. `g.status` updates were introduced to support concurrent GC, maybe what is also possible is to update `g.status` not always atomically, but in a way similar to what happens in `runtime·writebarrierptr()`: there, before updating a pointer, the runtime checks whether a write barrier is needed at all, and if not, it becomes a simple store. And if I understand correctly, write barriers are needed only during a certain phase of garbage collection - in other words "usually not" most of the time. Probably we could have a similar flag for `g.status` + friends updates too.

I do not know the GC details, so please forgive me in advance if the above is dumb in one way or another. It sounds reasonable to me offhand, however.
Thanks beforehand, Kirill
/cc @RLH, @aclements
Your profile is still not even close to CPU-bound: less than 1% of the samples were executing instructions in the user part of the program.
I will also note that 40ns is approximately the cost of an L3 cache miss on modern AMD64 hardware.
Ah, now I understand why the pprof output looks so strange: the program is not even remotely CPU-bound in the application code.
So the pprof output likely looks strange because it’s drawing from a very small sample size: the sampling error swamps out most of the interesting detail.