thanos: Thanos Receive (RouterOnly mode) Panic

Thanos, Prometheus and Golang version used: Thanos 0.32.4/0.32.5; Golang go1.21.3

Object Storage Provider: AWS S3

What happened: Thanos Receive running in router-only mode panics frequently. The Receive setup is as follows (a sketch of the referenced hashring file follows the flag list):

      receive
      --debug.name=thanos-writer
      --log.format=logfmt
      --log.level=info
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --hash-func=SHA256
      --label
      replica="$(NAME)"
      --receive.default-tenant-id=unknown
      --remote-write.address=0.0.0.0:19291
      --receive-forward-timeout=15s
      --receive.hashrings-algorithm=ketama
      --receive.hashrings-file=/var/lib/tsdb/hashring.json
      --receive.hashrings-file-refresh-interval=3m
      --receive.replication-factor=3
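
For context, the --receive.hashrings-file flag above points at a file in Thanos's JSON hashring format. A minimal sketch of such a file, with placeholder endpoint addresses (with the ketama algorithm and --receive.replication-factor=3, at least three endpoints are expected):

      [
        {
          "hashring": "default",
          "tenants": [],
          "endpoints": [
            "thanos-receive-ingestor-0.thanos-receive-ingestor:10901",
            "thanos-receive-ingestor-1.thanos-receive-ingestor:10901",
            "thanos-receive-ingestor-2.thanos-receive-ingestor:10901"
          ]
        }
      ]

An empty tenants list acts as the catch-all case, so requests tagged with the default tenant above would also land in this hashring.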

What you expected to happen: No panic

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components: panic captured from the Kubernetes (docker) container logs:

ts=2023-11-29T16:15:45.401596471Z caller=receive.go:535 level=info name=thanos-writer component=receive msg="Set up hashring for the given hashring config."
ts=2023-11-29T16:15:45.401626861Z caller=intrumentation.go:56 level=info name=thanos-writer component=receive msg="changing probe status" status=ready
runtime: g17389085: frame.sp=0xc0055cbe58 top=0xc0055cbfe0
	stack=[0xc00554c000-0xc0055cc000
fatal error: traceback did not unwind completely

runtime stack:
runtime.throw({0x26b2ab8?, 0x0?})
	/usr/local/go/src/runtime/panic.go:1077 +0x5c fp=0xc000a0fd40 sp=0xc000a0fd10 pc=0x43b45c
runtime.(*unwinder).finishInternal(0x0?)
	/usr/local/go/src/runtime/traceback.go:571 +0x12a fp=0xc000a0fd80 sp=0xc000a0fd40 pc=0x461d4a
runtime.(*unwinder).next(0xc000a0fe28?)
	/usr/local/go/src/runtime/traceback.go:452 +0x232 fp=0xc000a0fdf8 sp=0xc000a0fd80 pc=0x461b52
runtime.addOneOpenDeferFrame.func1()
	/usr/local/go/src/runtime/panic.go:648 +0x85 fp=0xc000a0ffc8 sp=0xc000a0fdf8 pc=0x43a605
traceback: unexpected SPWRITE function runtime.systemstack
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:509 +0x4a fp=0xc000a0ffd8 sp=0xc000a0ffc8 pc=0x46f70a

goroutine 17389085 [running]:
runtime.systemstack_switch()
	/usr/local/go/src/runtime/asm_amd64.s:474 +0x8 fp=0xc0055cbd68 sp=0xc0055cbd58 pc=0x46f6a8
runtime.addOneOpenDeferFrame(0x0?, 0x0?, 0x0?)
	/usr/local/go/src/runtime/panic.go:645 +0x65 fp=0xc0055cbda8 sp=0xc0055cbd68 pc=0x43a525
panic({0x2219b40?, 0x4238400?})
	/usr/local/go/src/runtime/panic.go:874 +0x14a fp=0xc0055cbe58 sp=0xc0055cbda8 pc=0x43adca
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:261
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:861 +0x378 fp=0xc0055cbeb8 sp=0xc0055cbe58 pc=0x452418
created by github.com/klauspost/compress/s2.(*Writer).write in goroutine 17389042
	/go/pkg/mod/github.com/klauspost/compress@v1.16.7/s2/writer.go:505 +0xb5

.... more goroutine stack trace

goroutine 37381 [IO wait]:
fatal error: unexpected signal during runtime execution
panic during panic
[signal SIGSEGV: segmentation violation code=0x1 addr=0x118 pc=0x45fcfc]

runtime stack:
runtime.throw({0x27e0bfb?, 0x3fcaf40?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc0010c97d8 sp=0xc0010c97a8 pc=0x43907d
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:825 +0x3e9 fp=0xc0010c9838 sp=0xc0010c97d8 pc=0x4503a9
runtime.gentraceback(0x3f72aa0?, 0x3fcaf40?, 0xc0010c9bf0?, 0xc0007029c0, 0x0, 0x0, 0x64, 0x0, 0xc0010c9c10?, 0x0)
	/usr/local/go/src/runtime/traceback.go:258 +0x8bc fp=0xc0010c9b90 sp=0xc0010c9838 pc=0x45fcfc
runtime.traceback1(0xc0007029c0?, 0x43ab00?, 0x3?, 0xc0007029c0, 0x462afb?)
	/usr/local/go/src/runtime/traceback.go:776 +0x1b6 fp=0xc0010c9d50 sp=0xc0010c9b90 pc=0x461df6
runtime.traceback(...)
	/usr/local/go/src/runtime/traceback.go:723
runtime.tracebackothers.func1(0xc0007029c0)
	/usr/local/go/src/runtime/traceback.go:992 +0xe5 fp=0xc0010c9d90 sp=0xc0010c9d50 pc=0x462d25
runtime.forEachGRace(0xc0010c9df8)
	/usr/local/go/src/runtime/proc.go:604 +0x4d fp=0xc0010c9dc0 sp=0xc0010c9d90 pc=0x43c90d
runtime.tracebackothers(0xc00052d520?)
	/usr/local/go/src/runtime/traceback.go:978 +0xe5 fp=0xc0010c9e28 sp=0xc0010c9dc0 pc=0x462c05
runtime.dopanic_m(0xc00052d520, 0x2cdf6e8?, 0x1?)
	/usr/local/go/src/runtime/panic.go:1273 +0x285 fp=0xc0010c9ea0 sp=0xc0010c9e28 pc=0x439a65
runtime.fatalthrow.func1()
	/usr/local/go/src/runtime/panic.go:1127 +0x6e fp=0xc0010c9ee0 sp=0xc0010c9ea0 pc=0x43946e
runtime.fatalthrow(0x10c9f28?)
	/usr/local/go/src/runtime/panic.go:1120 +0x6c fp=0xc0010c9f20 sp=0xc0010c9ee0 pc=0x4393cc
runtime.throw({0x2798afc?, 0x100000004?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc0010c9f50 sp=0xc0010c9f20 pc=0x43907d
runtime.ready(0xc0054d3520, 0x466fe5?, 0x0?)
	/usr/local/go/src/runtime/proc.go:885 +0x1eb fp=0xc0010c9fa0 sp=0xc0010c9f50 pc=0x43d34b
runtime.goready.func1()
	/usr/local/go/src/runtime/proc.go:392 +0x26 fp=0xc0010c9fc8 sp=0xc0010c9fa0 pc=0x43bee6
runtime.systemstack()
	/usr/local/go/src/runtime/asm_amd64.s:496 +0x49 fp=0xc0010c9fd0 sp=0xc0010c9fc8 pc=0x46e0c9

Anything else we need to know:

Environment:

  • OS (e.g. from /etc/os-release):
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.18.4
PRETTY_NAME="Alpine Linux v3.18"
HOME_URL="https://alpinelinux.org/"
BUG_REPORT_URL="https://gitlab.alpinelinux.org/alpine/aports/-/issues"
  • Kernel (e.g. uname -a): Linux thanos-writer-deployment-77677f8cb8-92h7x 5.4.0-1113-aws-fips #123+fips1-Ubuntu SMP Thu Oct 19 16:21:22 UTC 2023 x86_64 Linux
  • Others:



Most upvoted comments

@dctrwatson - The Go team is asking for Linux kernel versions. I don’t know if you have that, but if you do please add it to the Go issue linked below.

@jnyi

https://github.com/golang/go/issues/64781

OK, it looks like https://github.com/klauspost/compress/pull/867 is the root cause, and it is fixed by https://github.com/thanos-io/thanos/pull/6950. I saw that Thanos main picked up the newer go.mod, but neither 0.32.5 nor v0.33.0-rc.0 did.

I will cherry-pick the updated go.mod in order to fix this internally.
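
For reference, a minimal sketch of what that internal cherry-pick amounts to, assuming the fix is simply bumping github.com/klauspost/compress to a release that includes klauspost/compress#867 (v1.17.4 below is the version mentioned later in this thread; the exact pin is whatever thanos-io/thanos#6950 uses):

      # bump the s2/compress dependency in the Thanos module and tidy the module graph
      go get github.com/klauspost/compress@v1.17.4
      go mod tidy
      # rebuild the receive binary with the patched dependency
      go build ./cmd/thanos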

cc @mhoffm-aiven @yeya24 to make sure this gets patched into the latest v0.33, thanks

Per the request in https://github.com/golang/go/issues/64781 I added GODEBUG="gccheckmark=1,gcshrinkstackoff=1,asyncpreemptoff=1" and we have not had a panic in >24h. We used to see at least a couple per hour.
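
For anyone trying the same mitigation, a minimal sketch of wiring that GODEBUG value into the Receive container, assuming a standard Kubernetes Deployment (container name is a placeholder):

      # excerpt from the Deployment's pod spec; only the added env entry matters here
      containers:
        - name: thanos-receive-router   # placeholder container name
          env:
            - name: GODEBUG
              value: "gccheckmark=1,gcshrinkstackoff=1,asyncpreemptoff=1"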

We have been unable to pinpoint the origin of similar crashes at MinIO. It seems to happen only on select machines, and the only reliable workaround we’ve found is to compile with Go 1.19.x, which avoids the issue. I’ve created an issue (linked above this post) to see if we can get to the bottom of this!

Actually, it might be a false resolution: the panic still seems to happen after I upgraded to compress@v1.17.1, though it is very infrequent. I am trying compress@v1.17.4 now; I will let it run for a bit longer overnight and report back. Sorry for the inconclusive post earlier.