go: runtime: working with small maps is 4x-10x slower than in nodejs

Please answer these questions before submitting your issue. Thanks!

What did you do?

Hello up there. Map performance was already discussed some time ago in #3885 and improved a bit. It was also said there that the map algorithm is chosen to work very well with very, very large maps. However, maps are not always very large; imho in many practical cases they are small or medium-sized.

So please consider the following 3 programs:

package main
//import "fmt"

func main() {
    a := make(map[int]int)

    for i := 0; i < 100000000; i++ {
        a[i & 0xffff] = i
        //a[i & 0x7f] = i
        //a[i] = i
    }

    //fmt.Println(a)
}

(https://play.golang.org/p/rPH1pSM1Xk)

#!/usr/bin/env nodejs

function main() {
    var a = {};

    for (var i = 0; i < 100000000; i++) {
        a[i & 0xffff] = i;
        //a[i & 0x7f] = i;
        //a[i] = i;
    }

    //console.log(a)
}

main()

#!/usr/bin/env pypy

def main():
    a = {}

    for i in range(100000000):
        a[i & 0xffff] = i
        #a[i & 0x7f] = i
        #a[i] = i

    #print(a)

if __name__ == '__main__':
    main()

The time it takes to run them on i7-6600U is as follows:

Program   Time (seconds, best of 5)
map.go    3.668
map.js    0.385
map.py    1.988

The Go version is ~9.5x slower than the JavaScript one, and ~1.8x slower than the PyPy one.

If we reduce the actual map size from 64K elements to 128 elements, activating the a[i & 0x7f] = i case via e.g. the following patch:

--- a/map.go.kirr
+++ b/map.go
@@ -5,8 +5,8 @@ func main() {
     a := make(map[int]int)
 
     for i := 0; i < 100000000; i++ {
-       a[i & 0xffff] = i
-       //a[i & 0x7f] = i
+       //a[i & 0xffff] = i
+       a[i & 0x7f] = i
        //a[i] = i
     }

timings become:

Program   Time (seconds, best of 5)
map.go    1.571
map.js    0.377
map.py    0.896

JavaScript becomes only a bit faster here, while Go and PyPy improve ~2.3x and ~2.2x respectively. Still, Go is 4x slower than JavaScript and 1.7x slower than PyPy.

We can also test how it behaves if we do not limit the map size and let it grow on every operation. JavaScript and PyPy are more memory-hungry, and for the original niter=1E8 I get out-of-memory in their cases on my small laptop, so let's test with e.g. niter=1E7 (diff to original program):

--- a/map.go.kirr
+++ b/map.go
@@ -4,10 +4,10 @@ package main
 func main() {
     a := make(map[int]int)
 
-    for i := 0; i < 100000000; i++ {
-       a[i & 0xffff] = i
+    for i := 0; i < 100000000 / 10; i++ {
+       //a[i & 0xffff] = i
        //a[i & 0x7f] = i
-       //a[i] = i
+       a[i] = i
     }
 
     //fmt.Println(a)

timings become:

Program   Time (seconds, best of 5)
map.go    2.877
map.js    0.438
map.py    1.277

So Go is ~6.5x slower than JavaScript and ~2.2x slower than PyPy.

The profile for the original program (a[i & 0xffff] = i) is:

File: map
Type: cpu
Time: Mar 10, 2017 at 7:18pm (MSK)
Duration: 3.70s, Total samples = 36ms ( 0.97%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 36000us, 100% of 36000us total
      flat  flat%   sum%        cum   cum%
   27800us 77.22% 77.22%    33900us 94.17%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/hashmap.go
    3100us  8.61% 85.83%     3100us  8.61%  runtime.aeshash64 /home/kirr/src/tools/go/go/src/runtime/asm_amd64.s
    3000us  8.33% 94.17%     3000us  8.33%  runtime.memequal64 /home/kirr/src/tools/go/go/src/runtime/alg.go
    1700us  4.72% 98.89%    36000us   100%  main.main /home/kirr/tmp/trashme/map/map.go
     400us  1.11%   100%      400us  1.11%  runtime.mapassign /home/kirr/src/tools/go/go/src/runtime/stubs.go
         0     0%   100%    36000us   100%  runtime.main /home/kirr/src/tools/go/go/src/runtime/proc.go

What did you expect to see?

Map operations for small/medium maps being as fast as, or faster than, in Node.js.

What did you see instead?

Map operations are 4x-10x slower than in JavaScript for map sizes that are commonly present in many programs.

Does this issue reproduce with the latest release (go1.8)?

Yes.

System details

go version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/kirr/go"
GORACE=""
GOROOT="/home/kirr/src/tools/go/go"
GOTOOLDIR="/home/kirr/src/tools/go/go/pkg/tool/linux_amd64"
GCCGO="/usr/bin/gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build714926978=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOROOT/bin/go version: go version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 linux/amd64
GOROOT/bin/go tool compile -V: compile version devel +d11a2184fb Fri Mar 10 01:39:09 2017 +0000 X:framepointer
uname -sr: Linux 4.9.0-2-amd64
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.0 (stretch)
Release:	9.0
Codename:	stretch
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Debian GLIBC 2.24-9) stable release version 2.24, by Roland McGrath et al.
gdb --version: GNU gdb (Debian 7.12-6) 7.12.0.20161007-git

Thanks beforehand, Kirill

/cc @rsc, @randall77

About this issue

  • Original URL
  • State: open
  • Created 7 years ago
  • Reactions: 27
  • Comments: 54 (42 by maintainers)

Most upvoted comments

@davecheney, why did you close the issue? Even if Node.js is doing runtime map magic which Go is not willing to accept (but see P.S.), it was already explicitly posted here 3 years ago that:

  • "The pypy is not switching dict implementation", and that
  • the Go version of the benchmark "Still takes ~1.4x - 1.5x longer compared to pypy":

https://github.com/golang/go/issues/19495#issuecomment-289293859

Why is that information being ignored?

Please reopen the issue, Kirill

P.S. Personally I would also say that since small/medium maps are used everywhere all the time, it makes sense to put a V8-style optimization into the map runtime, instead of forcing all programs to be manually rewritten. By speeding up maps, already existing Go code like the fmt, json and similar packages would automatically be sped up too.

( I’m taking a time break due to overload; I hope to reply to @networkimprov in 1 or 2 months. I apologize for the inconvenience )

I’ve been doing some profiling and it looks like a good chunk of time is spent iterating inside buckets. My feeling is that the memory layout is distinctly suboptimal, because every insert needs to check all cells of the matching bucket in case the matching entry is at the end of the bucket. This implies a constant-factor increase of up to 8x. Assuming keys are evenly distributed inside buckets, which seems reasonable in this test case, I would expect an average of 4 probes per insert, which seems roughly in line with the observed performance difference with other hash tables (which I suspect use some variant of open addressing and do much less probing in practice).

I’m going to experiment with a few micro-optimizations but I think eventually we’ll have to use a different memory layout to get significant improvements.
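The bucket-scan cost described above can be illustrated with a toy model; this is not the runtime's code, just a sketch assuming 8 slots per bucket (as in runtime/hashmap.go at the time) and sequential probing within a bucket:

```go
package main

import "fmt"

// Toy model of a single map bucket with 8 slots, probed in order.
// An insert or update must scan slots until it finds the key or an
// empty cell, so a full bucket costs up to 8 probes per operation.
const slotsPerBucket = 8

type bucket struct {
	keys [slotsPerBucket]int
	used int
}

// insert returns the number of slot probes needed to place or update key.
func (b *bucket) insert(key int) int {
	for i := 0; i < b.used; i++ {
		if b.keys[i] == key {
			return i + 1 // existing key found after i+1 probes
		}
	}
	if b.used < slotsPerBucket {
		b.keys[b.used] = key
		b.used++
	}
	return b.used
}

func main() {
	var b bucket
	probes := 0
	// Fill one bucket, then overwrite each key once.
	for i := 0; i < slotsPerBucket; i++ {
		probes += b.insert(i)
	}
	for i := 0; i < slotsPerBucket; i++ {
		probes += b.insert(i)
	}
	// With evenly distributed keys this averages 4.5 probes per op,
	// matching the ~4-probe estimate above.
	fmt.Printf("total probes: %d, average per op: %.2f\n",
		probes, float64(probes)/16)
}
```

An open-addressing table with a good load factor would typically terminate a probe sequence much sooner, which is the intuition behind the observed gap.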

Based on the recent responses I am going to close this issue as the current question has been answered and is unrelated to the original inquiry several years ago.

For future commenters—please open a new issue; fill out the template, explain your problem, and give a procedure for someone else to reproduce what you are seeing. If it turns out this is an issue, it will be addressed.

Thank you all.

V8 (in Node.js) has two separate dictionary modes for numeric and non-numeric keys, i.e. if you store values by consecutive numerical keys in a {} (object), V8 recognizes it and backs the object with a list in the heap.
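A rough illustration of why that helps (this is not V8's code): when keys are dense integers, an array-backed store reduces an assignment to an index computation plus a write, while a hash map must hash, locate a bucket, and probe. The same contrast can be seen in Go by comparing a slice against a map[int]int for the issue's access pattern; the absolute timings are machine-dependent, the shape of the difference is the point:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const n = 10000000
	const mask = 0xffff // 64K distinct keys, as in the original program

	// Array-backed store: one bounds check and a write per operation.
	s := make([]int, mask+1)
	t0 := time.Now()
	for i := 0; i < n; i++ {
		s[i&mask] = i
	}
	sliceT := time.Since(t0)

	// Hash-backed store: hash, bucket lookup, and probing per operation.
	m := make(map[int]int)
	t0 = time.Now()
	for i := 0; i < n; i++ {
		m[i&mask] = i
	}
	mapT := time.Since(t0)

	fmt.Printf("slice: %v, map: %v\n", sliceT, mapT)
}
```

V8's trick is essentially to detect when the array-backed representation applies and switch to it automatically.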

@mvdan Yeah, I screwed up the result concatenation the first time. Here goes

name                  old time/op  new time/op  delta
MapAssignInt32_255-8  25.0ns ± 3%  19.0ns ± 4%  -24.20%  (p=0.000 n=10+10)
MapAssignInt32_64k-8  44.6ns ± 3%  37.6ns ± 3%  -15.72%  (p=0.000 n=10+9)
MapAssignInt64_255-8  24.8ns ± 4%  18.9ns ± 4%  -23.82%  (p=0.000 n=10+10)
MapAssignInt64_64k-8  45.2ns ± 1%  38.5ns ± 6%  -14.86%  (p=0.000 n=9+9)
MapAssignStr_255-8    46.9ns ± 1%  24.6ns ± 4%  -47.49%  (p=0.000 n=8+10)
MapAssignStr_64k-8    80.9ns ± 4%  53.3ns ± 3%  -34.10%  (p=0.000 n=10+9)

Huh. We have mapaccess1_fast* and mapaccess2_fast*, but not mapassign_fast*. Maybe we should add that. (Perhaps in the process we should automate generating the fast versions, like we automate the generation of Slice* routines in package sort.) mapdelete_fast* too? Any others?

I’d suggest that as a first step for this issue. I have a few other things on my plate I’d like to do first, but if no one speaks up for it, I’ll plan to eventually get to this.