go: cmd/link: `go tool dist test testshared` failed if linked with lld or mold

What version of Go are you using (go version)?

$ go version
go version devel go1.17-962d5c997a Fri Jun 4 01:31:23 2021 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ruiu/.cache/go-build"
GOENV="/home/ruiu/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ruiu/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ruiu/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/ruiu/golang"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/ruiu/golang/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="devel go1.17-962d5c997a Fri Jun 4 01:31:23 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/ruiu/golang/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build237113391=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I tried to build Go with my own linker, mold (https://github.com/rui314/mold), and noticed that a CGO-related test fails when linked with mold. The same test also fails with lld, so it seems to pass only when you are using GNU ld or GNU gold.

Specifically, these are the exact commands with which I can reproduce the issue on my Ubuntu 20.04 machine.

$ git clone git@github.com:golang/go.git golang
$ cd golang/src
$ ./make.bash
$ sudo ln -sf /usr/bin/ld.lld-11 /usr/bin/ld
$ ../bin/go tool dist test testshared

If I do not replace the default linker with lld using sudo ln, the last test command succeeds. Before running the above commands, please install LLVM lld 11 with apt-get install lld-11.

To restore the original ld, run (cd /usr/bin; sudo ln -sf x86_64-linux-gnu-ld ld).

What did you expect to see?

The test succeeds

What did you see instead?

The test fails with the following error message.

--- FAIL: TestGCData (0.61s)
    shared_test.go:50: executing ./main (running gcdata/main) failed exit status 2:
        x[4] == -2401053088876216593, want 12345
        panic: FAIL

        goroutine 1 [running]:
        panic({0x7ff705a2d180, 0x556bea3cb938})
                /home/ruiu/golang/src/runtime/panic.go:1147 +0x3d3 fp=0xc0001b1f00 sp=0xc0001b1e40 pc=0x7ff7059b11f3
        main.main()
                /tmp/shared_test2784724792/gopath/src/testshared/gcdata/main/main.go:34 +0x14c fp=0xc0001b1f80 sp=0xc0001b1f00 pc=0x556bea3b6bec
        runtime.main()
                /home/ruiu/golang/src/runtime/proc.go:255 +0x282 fp=0xc0001b1fe0 sp=0xc0001b1f80 pc=0x7ff7059b4d42
        runtime.goexit()
                /home/ruiu/golang/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc0001b1fe8 sp=0xc0001b1fe0 pc=0x7ff7059ed801

So, the test fails because the Go garbage collector wrongly collects live objects.

I have been debugging the issue for two days so far without any luck. It looks like if I build everything except gopath/pkg/linux_amd64_dynlink/libtestshared-gcdata-p.so using lld and link that particular DSO using GNU ld, the test passes. But I can’t find the reason why that test dislikes a shared object file linked with lld or mold. Is there any chance that CGO unnecessarily depends on a GNU ld-specific section or symbol layout or something?

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 2
  • Comments: 15 (14 by maintainers)

Most upvoted comments

So, I think the proper fix is to change loader.go so that it applies dynamic relocations for a DSO before reading its section contents and returning them to decodetypeGcmask.

Thanks. Yes, you made that clear already in your previous comment 8 days ago, and it is indeed evident that the problem is due to the fact that the Go linker is not applying dynamic relocations to the section in question. Please bear with me while I work on this bug; I have many other demands on my time, and I need to balance working on your bug with working on other bugs as well. Thanks for your patience.

It was extremely puzzling, but I think I found the cause of the issue. It looks like there’s a bug in Go’s linker. Here is what was happening:

  • src/cmd/link/internal/loader/loader.go has code that reads section contents from a DSO. decodetypeGcmask in decodesym.go calls ctxt.loader.Data to read GC bitmaps from a DSO’s .data.rel.ro section for type symbols.
  • .data.rel.ro may have dynamic relocations. Therefore, even if the section contents in the file are just zeros, they may hold non-zero values at runtime.
  • For REL-type relocations, relocation addends are stored in the sections. The dynamic loader is expected to add the relocated value to the existing value in the section. On the other hand, for RELA-type relocations, addends are stored in the relocations themselves, and the dynamic loader overwrites the existing values. x86-64 uses RELA-type relocations.
  • GNU linkers always write addends to sections even for RELA-type relocations. That’s not necessary, though it doesn’t hurt anyone, as such values will just be overwritten at runtime.
  • lld and mold don’t write addends to sections for RELA-type relocations. They are just left as zero bytes.
  • When reading section contents, src/cmd/link/internal/loader/loader.go does not seem to apply dynamic relocations first. Therefore, some values that are non-zero when generated by a GNU linker seem to be just zeros when generated by lld or mold. That causes the GC bitmaps to differ: mold/lld-generated GC bitmaps have more zeros, causing the GC to reclaim more objects than it should.

So, I think the proper fix is to change loader.go so that it applies dynamic relocations for a DSO before reading its section contents and returning them to decodetypeGcmask.

I’ll take a look.