omicron: cockroachdb crashed in Go runtime during test run: `s.allocCount != s.nelems`

There’s a lot of detail in this report. For a summary of this problem, the root cause, and a workaround, see this comment below.

Again trying to reproduce #1130, I found a different issue that caused CockroachDB to exit early:

ATTEMPT 4517

running 1 test
test integration_tests::disks::test_disk_create_disk_that_already_exists_fails ... FAILED

failures:

---- integration_tests::disks::test_disk_create_disk_that_already_exists_fails stdout ----
log file: "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
note: configured to log to "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
thread 'integration_tests::disks::test_disk_create_disk_that_already_exists_fails' panicked at 'called `Result::unwrap()` on an `Err` value: Exited', /home/dap/omicron/test-utils/src/dev/mod.rs:141:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


failures:
    integration_tests::disks::test_disk_create_disk_that_already_exists_fails

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 74 filtered out; finished in 0.52s

Test log file:

$ bunyan "/dangerzone/omicron_tmp/try_repro.18644/test_all-d586ea57740e3382-test_disk_create_disk_that_already_exists_fails.4297.0.log"
[2022-06-02T04:28:27.661727495Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach temporary directory: /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb
[2022-06-02T04:28:27.662076216Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach: copying from seed directory (/home/dap/omicron/target/debug/build/nexus-test-utils-308521ed0d0eed98/out/crdb-base) to storage directory (/dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/data)
[2022-06-02T04:28:27.67286125Z]  INFO: test_disk_create_disk_that_already_exists_fails/4297 on ivanova: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/listen-url

The CockroachDB output:

$ ls /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/
total 9
-rw-r--r--   1 dap      staff      10265 Jun  1 21:28 cockroachdb_stderr
-rw-r--r--   1 dap      staff          0 Jun  1 21:28 cockroachdb_stdout
drwxr-xr-x   4 dap      staff         22 Jun  1 21:28 data/
$ cat /dangerzone/omicron_tmp/try_repro.18644/.tmpomCYKb/cockroachdb_stderr 
runtime: s.allocCount= 30 s.nelems= 56
fatal error: s.allocCount != s.nelems && freeIndex == s.nelems

goroutine 1 [running, locked to thread]:
runtime.throw(0x5aa10c3, 0x31)
	/opt/ooce/go-1.16/src/runtime/panic.go:1117 +0x72 fp=0xc000d6d2f8 sp=0xc000d6d2c8 pc=0x10a9512
runtime.(*mcache).nextFree(0xfffffc7fef180108, 0x316, 0x340, 0xfffffc7fef180108, 0x380)
	/opt/ooce/go-1.16/src/runtime/malloc.go:880 +0x228 fp=0xc000d6d330 sp=0xc000d6d2f8 pc=0x107dd88
runtime.mallocgc(0x88, 0x57b9880, 0xfda023cce83e2c01, 0x4)
	/opt/ooce/go-1.16/src/runtime/malloc.go:1069 +0x850 fp=0xc000d6d3b8 sp=0xc000d6d330 pc=0x107e5f0
runtime.newobject(...)
	/opt/ooce/go-1.16/src/runtime/malloc.go:1177
runtime.mapassign(0x5305d00, 0xc000cb8720, 0x712bea0, 0xc000cb8720)
	/opt/ooce/go-1.16/src/runtime/map.go:667 +0x286 fp=0xc000d6d438 sp=0xc000d6d3b8 pc=0x1080286
github.com/aws/aws-sdk-go/aws/endpoints.init()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/aws/aws-sdk-go/aws/endpoints/defaults.go:6880 +0x83005 fp=0xc000d7f7a8 sp=0xc000d6d438 pc=0x479d345
runtime.doInit(0x922b7a0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6315 +0xec fp=0xc000d7f8f8 sp=0xc000d7f7a8 pc=0x10b99ec
runtime.doInit(0x9233c80)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fa48 sp=0xc000d7f8f8 pc=0x10b9972
runtime.doInit(0x9242ae0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fb98 sp=0xc000d7fa48 pc=0x10b9972
runtime.doInit(0x92299a0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fce8 sp=0xc000d7fb98 pc=0x10b9972
runtime.doInit(0x9277c00)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7fe38 sp=0xc000d7fce8 pc=0x10b9972
runtime.doInit(0x92220c0)
	/opt/ooce/go-1.16/src/runtime/proc.go:6292 +0x72 fp=0xc000d7ff88 sp=0xc000d7fe38 pc=0x10b9972
runtime.main()
	/opt/ooce/go-1.16/src/runtime/proc.go:208 +0x205 fp=0xc000d7ffe0 sp=0xc000d7ff88 pc=0x10abd25
runtime.goexit()
	/opt/ooce/go-1.16/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000d7ffe8 sp=0xc000d7ffe0 pc=0x10e35a1

goroutine 52 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x74
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35

goroutine 36 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e1a0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 37 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e270)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 38 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e340)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 39 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e410)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 40 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e4e0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 41 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e5b0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 42 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e680)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 43 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e750)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 44 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e820)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 45 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e8f0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 46 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043e9c0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 47 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ea90)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 48 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043eb60)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 49 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ec30)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 50 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043ed00)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 51 [chan receive]:
github.com/klauspost/compress/zstd.(*blockDec).startDecoder(0xc00043edd0)
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:215 +0x149
created by github.com/klauspost/compress/zstd.newBlockDec
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/vendor/github.com/klauspost/compress/zstd/blockdec.go:118 +0x173

goroutine 53 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x12c
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d

goroutine 55 [syscall]:
os/signal.signal_recv(0x0)
	/opt/ooce/go-1.16/src/runtime/sigqueue.go:168 +0xa5
os/signal.loop()
	/opt/ooce/go-1.16/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/opt/ooce/go-1.16/src/os/signal/signal.go:151 +0x45

goroutine 33 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0.func1()
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:165 +0x16b
created by github.com/cockroachdb/cockroach/pkg/util/goschedstats.init.0
	/ws/gc/cockroach/cache/gopath/src/github.com/cockroachdb/cockroach/pkg/util/goschedstats/runnable.go:157 +0x35

This appears to be dying inside the Go runtime memory allocator. It looks like https://github.com/golang/go/issues/45775, which unfortunately was closed last year as not-reproducible.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 16

Most upvoted comments

The signal memory corruption reminds me of https://github.com/golang/go/issues/35777.

Great write up and very informative!

(part 5/5)

Explaining other data points

While debugging this problem, I took a lot of wrong turns. But this resulted in some data that’s worth reviewing in light of this explanation.

We had trouble reproducing this problem on Linux. Well, that checks out. (We now think this is an illumos bug.)

The problem reproduces on Go 1.19. Go 1.19 appears to still use %ymm0 in memclrNoHeapPointers.

We don’t seem to see this problem in CI. Helios CI runs on AWS hosts, which tend to run on older instance types, which may well not support avx2. (thanks @jclulow)

We had trouble reproducing this problem on Intel systems. I don’t have much data about this except what’s reported in golang/go#53289, which does not say much about the systems. It’s quite possible they don’t support avx2.

Thanks!

It feels cheesy to say, but debugging this one has been quite a saga and I had a lot of help over the last several months from talking this over with so many people. Thanks everyone who helped!