prometheus: Migrated to 2.4.0 - Crash during startup corrupted WAL "opening storage failed: read WAL: repair corrupted WAL: cannot handle error"

Crashing on startup after moving to 2.4.0

Bug Report

What did you do?

  • Prometheus version:

    2.4.0

  • Prometheus configuration file:

 args:
      - '--config.file=/config/prometheus.yml'
      - '--storage.tsdb.path=/data'
      - '--storage.tsdb.retention=90d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.route-prefix=/'
      - '--web.external-url=https://ingress-platform.live-aws-useast1.bose.io/dev/core-operations/%(environment)s/prometheus2/'
      - '--web.enable-admin-api'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.no-lockfile'

  • Logs:
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444128311Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.0, branch=HEAD, revision=068eaa5dbfce6c08f3d05d3d3c0bfd96267cfed2)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444208601Z caller=main.go:239 build_context="(go=go1.10.3, user=root@d84c15ea5e93, date=20180911-10:46:37)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444232038Z caller=main.go:240 host_details="(Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 prometheus2-78df54755f-sxhzd (none))"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444252099Z caller=main.go:241 fd_limits="(soft=65536, hard=65536)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444290623Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445174596Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445163061Z caller=main.go:554 msg="Starting TSDB ..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445737537Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1524139200000 maxt=1525651200000 ulid=01CCWFHCSNJ6HDYBZP7JYC2YPZ
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446129786Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525651200000 maxt=1527400800000 ulid=01CEGD9PPRP7CBCMSNM5Q5S46W
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446318403Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527400800000 maxt=1527984000000 ulid=01CF1SE7KXPXWSJ2156J1BMECP
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446489737Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527984000000 maxt=1528567200000 ulid=01CFK5NR55KXHPDEMXJ43VGMJG
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446638706Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528567200000 maxt=1529150400000 ulid=01CG4HW6Y1FF16HR11EC178479
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446874633Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529150400000 maxt=1530900000000 ulid=01CHRPM0GA88M75MTVNN3HMNRY
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447141029Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530900000000 maxt=1532649600000 ulid=01CKCV9K8G6E6218842MS5HW4N
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447412757Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1532649600000 maxt=1534399200000 ulid=01CN0ZW1A6T0TVS39DZR2K826Y
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447714569Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534399200000 maxt=1536148800000 ulid=01CPN4FN43S4MQ2HDFKRH9QVS3
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447888122Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536148800000 maxt=1536732000000 ulid=01CQ6GPX7TPKZNV0G4820MRDYC
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447960779Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ74W71A3JATT9K1ETJDXG43
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.448045939Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4XVJDD56G7MC93H05T
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.448129211Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SF6MJRRF8D02GMD16EPWH
prometheus2-78df54755f-sxhzd prometheus2 level=warn ts=2018-09-13T10:30:41.128060768Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 48 at 78834898: unexpected checksum ae7d1bfa, expected 512fafa7"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672805885Z caller=main.go:423 msg="Stopping scrape discovery manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672857035Z caller=main.go:437 msg="Stopping notify discovery manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672868276Z caller=main.go:459 msg="Stopping scrape manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672878612Z caller=main.go:433 msg="Notify discovery manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672908483Z caller=main.go:419 msg="Scrape discovery manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672929891Z caller=main.go:453 msg="Scrape manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672961275Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673005956Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673039451Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673056203Z caller=main.go:608 msg="Notifier manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=error ts=2018-09-13T10:30:43.6731401Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
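
When collecting reports like this from several instances, it can help to pull the segment number, offset, and checksums out of the warning line so they can be compared. The following is only a sketch: the regular expression is written against the message text as it appears in the log above, and the function name `parse_wal_corruption` is made up for illustration, not part of Prometheus.

```python
import re

# Matches the tsdb warning text exactly as it appears in the log above.
# This pattern is an assumption based on that one message format.
CORRUPTION_RE = re.compile(
    r"corruption in segment (?P<segment>\d+) at (?P<offset>\d+): "
    r"unexpected checksum (?P<unexpected>[0-9a-f]+), "
    r"expected (?P<expected>[0-9a-f]+)"
)


def parse_wal_corruption(line):
    """Return the corruption details from a log line, or None if absent."""
    match = CORRUPTION_RE.search(line)
    if match is None:
        return None
    return {
        "segment": int(match.group("segment")),
        "offset": int(match.group("offset")),
        "unexpected": match.group("unexpected"),
        "expected": match.group("expected"),
    }


# Sample taken from the warning line in the log above (prefix shortened).
sample = (
    'level=warn caller=head.go:415 component=tsdb '
    'msg="encountered WAL error, attempting repair" '
    'err="read records: corruption in segment 48 at 78834898: '
    'unexpected checksum ae7d1bfa, expected 512fafa7"'
)

print(parse_wal_corruption(sample))
```

Lines that do not contain a corruption message return `None`, so the function can be run over a whole log file.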

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 36 (16 by maintainers)

Most upvoted comments

Hi, thanks for sending the WALs, everyone. They made finding and fixing the issue very easy. The fix is here: https://github.com/prometheus/tsdb/pull/389 and will hopefully make it into the 2.4.1 release today.

Hit the same problem on 2.4.0. Upgrading to 2.4.2 repaired the DB and fixed the issue by repairing the corrupted block:

 caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 2 at 32238072: unexpected checksum 4cbbf0b2, expected c50d6eb7"

@stefancrain Sorry that is the case. Deleting the WAL and restarting would fix it.

Now that we have an offending WAL, we’ll have a fix quite soon, hopefully by tomorrow. It would also help if you could forward the mail you sent to krasi to me as well, at (gouthamve [at] gmail.com).

Hi @Place1, I’ve opened a new issue for it here: https://github.com/prometheus/prometheus/issues/4705

In the meantime, while we fix it, you could delete the contents of the WAL directory to unblock Prometheus. Please note that some data will be lost (up to 2 hours).
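
A minimal sketch of that workaround, with one deliberate change: it moves the WAL files to a backup directory instead of deleting them, so they can still be sent to the maintainers. It assumes Prometheus is stopped first, and that the WAL lives in the `wal` subdirectory of the `--storage.tsdb.path` directory (`/data` in the report above). The function name `set_aside_wal` and the backup location are hypothetical. The data-loss caveat still applies, since Prometheus will no longer see those samples.

```python
import shutil
import tempfile
from pathlib import Path


def set_aside_wal(data_dir, backup_dir):
    """Move every entry in <data_dir>/wal into backup_dir.

    Returns the sorted names of the entries that were moved.
    Run this only while Prometheus is stopped.
    """
    wal_dir = Path(data_dir) / "wal"
    if not wal_dir.is_dir():
        raise FileNotFoundError(f"no WAL directory at {wal_dir}")
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(wal_dir.iterdir()):
        shutil.move(str(entry), str(backup_dir / entry.name))
        moved.append(entry.name)
    return moved


# Demonstration against a throwaway directory rather than a real /data.
with tempfile.TemporaryDirectory() as tmp:
    fake_data = Path(tmp) / "data"
    (fake_data / "wal").mkdir(parents=True)
    (fake_data / "wal" / "segment-a").write_bytes(b"placeholder")
    (fake_data / "wal" / "segment-b").write_bytes(b"placeholder")
    print(set_aside_wal(fake_data, Path(tmp) / "wal-backup"))
```

For a real instance the call would look like `set_aside_wal("/data", "/some/backup/dir")`, where the backup path is whatever location has enough free space.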

2.4.2 looks to have been released ~7 hours ago with the fix for this issue. Our team has upgraded and will report back if we run into this issue again.

Same issue with version 2.5.0.