prometheus: tsdb.Open fails with `invalid magic number 0` when running with reverted previously mmaped chunks
Hi,
When starting TSDB f4dd45609a05e8f582cdcd8ef369004d1f9e3c02 (the initial version of mmap + chunks), as used by Thanos Receive, we got the following error:
level=error ts=2020-06-15T13:34:38.5403894Z caller=multitsdb.go:271 component=receive tenant=FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 msg="failed to open tsdb" err="invalid magic number 0"
level=warn ts=2020-06-15T13:34:38.540465482Z caller=intrumentation.go:54 component=receive msg="changing probe status" status=not-ready reason="opening storage: invalid magic number 0"
level=info ts=2020-06-15T13:34:38.540508553Z caller=http.go:81 component=receive service=http/server component=receive msg="internal server shutdown" err="opening storage: invalid magic number 0"
level=info ts=2020-06-15T13:34:38.540523593Z caller=intrumentation.go:66 component=receive msg="changing probe status" status=not-healthy reason="opening storage: invalid magic number 0"
level=error ts=2020-06-15T13:34:38.540633727Z caller=main.go:211 err="invalid magic number 0\nopening storage\nmain.runReceive.func1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:316\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nreceive command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:211\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"
Repro:
- Deploy receive `master-2020-05-25-c733564d` (TSDB cd73b3d33e064bbd846fc7a26dc8c313d46af382, without the mmap chunks feature).
- Upgrade and deploy receive to `master-2020-06-03-20004510`, which maps to a TSDB upgrade to 3268eac2ddda (mainly adds the mmap chunks feature + fixes).
- Revert to Thanos `master-2020-05-25-c733564d` (so back to a TSDB with no mmap chunks).
- Upgrade and deploy to `master-2020-05-28-e7d431d3` (TSDB f4dd45609a05e8f582cdcd8ef369004d1f9e3c02, with the initial mmap feature).
- See crash on startup.
I think we hit either a lack of compatibility or some kind of partial-write race. Also, we might want better error wrapping in TSDB to show which file the error actually relates to.
cc @codesome
About this issue
- State: closed
- Created 4 years ago
- Reactions: 8
- Comments: 18 (4 by maintainers)
My dirty workaround so far is just to always delete the `chunks_head` directory preemptively before starting prometheus. Losing some data points is better than getting stuck in an infinite loop!

In the meantime, you could do `mv prometheus/chunks_head prometheus/chunks_head.bak` to keep moving.

A workaround is to delete the chunk, in your case `/prometheus/chunks_head/000007`, but beware there may be some data loss.

I kinda wish I could enable auto-deleting bad chunks to prevent this. It really sucks that this causes prometheus to just not be able to run.
I think I hit the same error, but I'm not sure whether it is related to this issue. Running with `v2.34.0-rc0`, but I cannot find anything related to this being fixed recently. @codesome

We’re currently facing the same issue with prometheus `v2.32.1`.

Awesome man