prometheus: tsdb.Open fails with `invalid magic number 0` when running with reverted previously mmapped chunks

Hi,

When starting the TSDB at f4dd45609a05e8f582cdcd8ef369004d1f9e3c02 (initial version of the mmapped chunks feature) used by Thanos Receive, we got the following error:

level=error ts=2020-06-15T13:34:38.5403894Z caller=multitsdb.go:271 component=receive tenant=FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 msg="failed to open tsdb" err="invalid magic number 0"
level=warn ts=2020-06-15T13:34:38.540465482Z caller=intrumentation.go:54 component=receive msg="changing probe status" status=not-ready reason="opening storage: invalid magic number 0"
level=info ts=2020-06-15T13:34:38.540508553Z caller=http.go:81 component=receive service=http/server component=receive msg="internal server shutdown" err="opening storage: invalid magic number 0"
level=info ts=2020-06-15T13:34:38.540523593Z caller=intrumentation.go:66 component=receive msg="changing probe status" status=not-healthy reason="opening storage: invalid magic number 0"
level=error ts=2020-06-15T13:34:38.540633727Z caller=main.go:211 err="invalid magic number 0\nopening storage\nmain.runReceive.func1\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/receive.go:316\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373\nreceive command failed\nmain.main\n\t/go/src/github.com/thanos-io/thanos/cmd/thanos/main.go:211\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373"

Repro:

  1. Deploy Receive master-2020-05-25-c733564d (TSDB cd73b3d33e064bbd846fc7a26dc8c313d46af382, without the mmapped chunks feature).
  2. Upgrade and deploy Receive to master-2020-06-03-20004510, which maps to TSDB 3268eac2ddda (mainly adds the mmapped chunks feature + fixes).
  3. Revert to Thanos master-2020-05-25-c733564d (so back to TSDB with no mmapped chunks).
  4. Upgrade and deploy to master-2020-05-28-e7d431d3 (TSDB f4dd45609a05e8f582cdcd8ef369004d1f9e3c02, with the initial mmap feature).
  5. See crash on startup.

I think we hit either a lack of backward compatibility or some kind of partial-write race. Also, we might want better error wrapping in TSDB to make clear which file the error actually relates to.
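For context, TSDB chunk files begin with a 4-byte big-endian magic number, so `invalid magic number 0` suggests the file's first four bytes are all zero, consistent with a zero-filled or partially written header. A minimal diagnostic sketch (the `read_magic` helper is hypothetical, not part of Prometheus or Thanos) to inspect a suspect file directly:

```python
import struct
import sys

def read_magic(path):
    """Return the 4-byte big-endian magic number at the start of a chunk file.

    A result of 0 matches the "invalid magic number 0" error and suggests a
    zero-filled or truncated header.
    """
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        return 0  # file too short to hold a magic number
    return struct.unpack(">I", header)[0]

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(f"{path}: magic=0x{read_magic(path):08X}")
```

Running it over the files in `chunks_head/` would show which one has a zeroed header and needs attention.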

cc @codesome

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 8
  • Comments: 18 (4 by maintainers)

Most upvoted comments

My dirty workaround so far is just to always delete the chunks_head directory before starting prometheus:

  prometheus:
    container_name: prometheus
    image: prom/prometheus:latest
    restart: always
    entrypoint: ["/bin/sh","-c"]
    command:
      - 'rm -rf /prometheus/data/chunks_head; /bin/prometheus --config.file=/etc/prometheus/prometheus.yml'

Losing some data points is better than getting stuck in an infinite loop!

In the meantime, you could do `mv prometheus/chunks_head prometheus/chunks_head.bak` to keep moving.

Any solution to this problem? I have the same problem after restarting the server.

A workaround is to delete the chunk, in your case /prometheus/chunks_head/000007, but beware there may be some data loss.

I kinda wish I could enable auto-deleting bad chunks to prevent this. It really sucks that this causes Prometheus to just not be able to run.
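Auto-deleting bad chunks isn't a built-in Prometheus option, but an external pre-start sweep can be sketched; the `sweep_chunks_head` helper below is hypothetical, assumes the zero-magic symptom described in this issue, and is less destructive than `rm -rf chunks_head` since it only removes files with a zeroed header. Note that deleting any head chunk file still loses samples not yet compacted into blocks.

```python
import os
import struct

def has_zero_magic(path):
    """True if the file's first 4 bytes are missing or all zero."""
    with open(path, "rb") as f:
        header = f.read(4)
    return len(header) < 4 or struct.unpack(">I", header)[0] == 0

def sweep_chunks_head(directory, dry_run=True):
    """Remove (or, with dry_run=True, just report) head chunk files
    whose magic number reads as zero. Returns the affected paths."""
    affected = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and has_zero_magic(path):
            if not dry_run:
                os.remove(path)
            affected.append(path)
    return affected
```

Running with `dry_run=True` first shows what would be deleted before committing to the data loss.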

I think I hit the same error but not sure if it is related to this issue.

ts=2022-04-19T03:05:55.756Z caller=main.go:513 level=info msg="Starting Prometheus" version="(version=2.34.0-rc.0, branch=HEAD, revision=e25a5992555527d26e53ea4d24e1b8bf16528c58)"
ts=2022-04-19T03:05:55.756Z caller=main.go:518 level=info build_context="(go=go1.17.7, user=root@6cedbbd2ba09, date=20220228-15:42:03)"
ts=2022-04-19T03:05:55.756Z caller=main.go:519 level=info host_details="(Linux 5.4.56.bsk.1-amd64 #5.4.56.bsk.1 SMP Debian 5.4.56.bsk.1 Sun Aug 23 12:02:46 UTC 20 x86_64 prometheus-k8s-0 (none))"
ts=2022-04-19T03:05:55.756Z caller=main.go:520 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-04-19T03:05:55.756Z caller=main.go:521 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-04-19T03:05:55.758Z caller=web.go:540 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-04-19T03:05:55.758Z caller=main.go:934 level=info msg="Starting TSDB ..."
ts=2022-04-19T03:05:55.759Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650189600008 maxt=1650196800000 ulid=01G0VSQM5ET0XF8JY29F5PZC6E
ts=2022-04-19T03:05:55.759Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650196800040 maxt=1650204000000 ulid=01G0W0KBDE0D2J5VV0WNCQ2W09
ts=2022-04-19T03:05:55.759Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650204000042 maxt=1650211200000 ulid=01G0W7F2RC4394NGSF6JH7NGP1
ts=2022-04-19T03:05:55.759Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650211200031 maxt=1650218400000 ulid=01G0WEASXETG51QP4PGYS27FT6
ts=2022-04-19T03:05:55.759Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-04-19T03:05:55.764Z caller=main.go:796 level=info msg="Stopping scrape discovery manager..."
ts=2022-04-19T03:05:55.764Z caller=main.go:810 level=info msg="Stopping notify discovery manager..."
ts=2022-04-19T03:05:55.764Z caller=main.go:832 level=info msg="Stopping scrape manager..."
ts=2022-04-19T03:05:55.764Z caller=main.go:806 level=info msg="Notify discovery manager stopped"
ts=2022-04-19T03:05:55.764Z caller=main.go:792 level=info msg="Scrape discovery manager stopped"
ts=2022-04-19T03:05:55.764Z caller=main.go:826 level=info msg="Scrape manager stopped"
ts=2022-04-19T03:05:55.764Z caller=manager.go:946 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-04-19T03:05:55.764Z caller=manager.go:956 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-04-19T03:05:55.764Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-04-19T03:05:55.764Z caller=main.go:1065 level=info msg="Notifier manager stopped"
ts=2022-04-19T03:05:55.764Z caller=main.go:1074 level=error err="opening storage failed: /prometheus/chunks_head/000587: invalid magic number 0"

Running with v2.34.0-rc.0, but I cannot find anything related to this fixed recently. @codesome

We’re currently facing the same issue with prometheus v2.32.1:

ts=2022-04-22T11:03:37.419Z caller=main.go:515 level=info msg="Starting Prometheus" version="(version=2.32.1, branch=HEAD, revision=41f1a8125e664985dd30674e5bdf6b683eff5d32)"
ts=2022-04-22T11:03:37.419Z caller=main.go:520 level=info build_context="(go=go1.17.5, user=root@54b6dbd48b97, date=20211217-22:08:06)"
ts=2022-04-22T11:03:37.420Z caller=main.go:521 level=info host_details="(Linux 5.6.13-100.fc30.x86_64 #1 SMP Fri May 15 00:36:06 UTC 2020 x86_64 prometheus-prometheus-kube-prometheus-prometheus-0 (none))"
ts=2022-04-22T11:03:37.420Z caller=main.go:522 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2022-04-22T11:03:37.420Z caller=main.go:523 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2022-04-22T11:03:37.422Z caller=web.go:570 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2022-04-22T11:03:37.424Z caller=main.go:924 level=info msg="Starting TSDB ..."
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1646956800030 maxt=1647151200000 ulid=01FY1F0KGPM44Z4XXT36VYBEPY
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1647151200026 maxt=1647345600000 ulid=01FY78DJHCX7QZJX7YS064S7AZ
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1647345600006 maxt=1647540000000 ulid=01FYD1TC3ZSFKB5QVA4ACZD3WN
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1647540000029 maxt=1647734400000 ulid=01FYJV6RJE1JJHWC2P3NBYXXSM
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1647734400085 maxt=1647928800000 ulid=01FYRMK26V6YJJ0EB1BQP6THB6
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1647928800049 maxt=1648123200000 ulid=01FYYDZZ775RC9SFGSY9KZJRTX
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1648123200022 maxt=1648317600000 ulid=01G0M6AVGAFWX8R14CVEBKRHXP
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1648317600059 maxt=1648512000000 ulid=01G0M6DWM77BNHE21GB80EMVJ6
ts=2022-04-22T11:03:37.425Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1648512000000 maxt=1648706400000 ulid=01G0M6H4427MC8R4E9H9M88JFP
ts=2022-04-22T11:03:37.425Z caller=tls_config.go:231 level=info component=web msg="TLS is disabled." http2=false
ts=2022-04-22T11:03:37.426Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1648706400006 maxt=1648900800000 ulid=01G0M7SASPC5YWPJPXFG62PEHM
ts=2022-04-22T11:03:37.426Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1648900800022 maxt=1649095200000 ulid=01G0M7VBR3S8ZPJZ5T8YQN20XN
ts=2022-04-22T11:03:37.426Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1649095200007 maxt=1649289600000 ulid=01G0M7XM814DRBDQKTAMPDTHR9
ts=2022-04-22T11:03:37.426Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1649289600020 maxt=1649484000000 ulid=01G0MGTPABEZ4BFYHJJJH9RM6K
ts=2022-04-22T11:03:37.427Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1649484000088 maxt=1649678400000 ulid=01G0MGX53SBHXHN1YAA56HDKPF
ts=2022-04-22T11:03:37.427Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1649678400012 maxt=1649872800000 ulid=01G0MGZEQJGRBFSYZVXAKFJPR3
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1649872800045 maxt=1650067200000 ulid=01G0RBWS8W5K9W89BBSEPXJFAR
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650067200051 maxt=1650261600000 ulid=01G0Y59CBG3XXZ5MQ1MSY5D29P
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650261600171 maxt=1650456000000 ulid=01G13YP2GRED9Z7J5HV61T0QW2
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650456000015 maxt=1650520800000 ulid=01G15WF7KEGTE18EQBZJ0J6TBR
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650542400000 maxt=1650549600000 ulid=01G16A67VZAAA0SQM4HDFBR8PA
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650549600002 maxt=1650556800000 ulid=01G16H1Z3T61KZZZRT9CZCH3D3
ts=2022-04-22T11:03:37.428Z caller=repair.go:57 level=info component=tsdb msg="Found healthy block" mint=1650520800024 maxt=1650542400000 ulid=01G16H27A92Q414VVQC4Y1D161
ts=2022-04-22T11:03:37.432Z caller=main.go:799 level=info msg="Stopping scrape discovery manager..."
ts=2022-04-22T11:03:37.432Z caller=main.go:813 level=info msg="Stopping notify discovery manager..."
ts=2022-04-22T11:03:37.432Z caller=main.go:835 level=info msg="Stopping scrape manager..."
ts=2022-04-22T11:03:37.432Z caller=main.go:809 level=info msg="Notify discovery manager stopped"
ts=2022-04-22T11:03:37.432Z caller=main.go:795 level=info msg="Scrape discovery manager stopped"
ts=2022-04-22T11:03:37.432Z caller=main.go:829 level=info msg="Scrape manager stopped"
ts=2022-04-22T11:03:37.432Z caller=manager.go:945 level=info component="rule manager" msg="Stopping rule manager..."
ts=2022-04-22T11:03:37.432Z caller=manager.go:955 level=info component="rule manager" msg="Rule manager stopped"
ts=2022-04-22T11:03:37.432Z caller=notifier.go:600 level=info component=notifier msg="Stopping notification manager..."
ts=2022-04-22T11:03:37.432Z caller=main.go:1055 level=info msg="Notifier manager stopped"
ts=2022-04-22T11:03:37.432Z caller=main.go:1064 level=error err="opening storage failed: /prometheus/chunks_head/005012: invalid magic number 0"


Awesome man