prometheus: does not start up after corrupted meta.json file
This ticket is a follow-up of #2805 (there are similar comments at the bottom, posted after it was closed).
What did you do?
Ran Prometheus in a Kubernetes cluster, on a GCE PD disk.
What did you see instead? Under which circumstances?
It crashed upon start; logfile:
level=error ts=2018-04-07T04:28:53.784390578Z caller=main.go:582 err="Opening storage failed unexpected end of JSON input"
level=info ts=2018-04-07T04:28:53.784418708Z caller=main.go:584 msg="See you next time!"
The point here is that the meta.json file has a size of zero:
> ls -l /data/*/meta.json
[...]
-rw-rw-r-- 1 1000 2000 283 Apr 7 03:05 01CAF10VV5FNDJ6PG84E6RSEV3/meta.json
-rw-rw-r-- 1 1000 2000 0 Apr 7 03:15 01CAF1K5SQZT4HBQE9P6W7J56E/meta.json
Manual resolution
I’ve deleted the directory 01CAF1K5SQZT4HBQE9P6W7J56E with the problematic meta.json file in it, and now it starts up fine again.
Environment
- System information:
  Linux 4.10.0-40-generic x86_64
- Prometheus version:
  prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user: root@149e5b3f0829
  build date: 20180314-14:15:45
  go version: go1.10
  (“official” Docker build)
- Logs:
  level=error ts=2018-04-07T08:51:28.789897822Z caller=main.go:582 err="Opening storage failed unexpected end of JSON input"
Expected behavior
What I would wish for is that Prometheus starts up and doesn’t CrashLoop. It should either:
- ignore that directory, noting in the log that meta.json is faulty,
- maybe move it to [directoryname].broken/?
- reconstruct the meta.json file from the data, or
- delete the problematic directory (a bit harsh; ignoring might be better).
About this issue
- State: closed
- Created 6 years ago
- Reactions: 7
- Comments: 37 (26 by maintainers)
Just opened a PR that would close this issue and will go in the next release. Feel free to reopen if you still experience the same issue after that.
I also faced this problem today. Finding and deleting the folder with the empty meta.json solved the issue.
@krasi-georgiev I’ll be happy to test (though definitely not in production 😃), but crash-related bugs are notoriously hard to reproduce. I tried to check the bug using ALICE (http://research.cs.wisc.edu/adsl/Software/alice/doc/adsl-doc.html), which has greatly helped me in the past, and this is what I got:
Here is the write part of the test (tsdb.WriteMetaFile just calls tsdb.writeMetaFile):
And here the checker:
This is what I got with unmodified tsdb:
And this is what I got after adding an f.Sync() call before f.Close in writeMetaFile:

While this is definitely not proof that the bug is indeed fixed, the tool has a great track record and usually finds real problems.
I am not 100% sure, but logically I would say at least 5 times your biggest block. Maybe a bit less will do, but storage is not expensive these days, so it's better to be on the safe side.
BTW, there are plans to add storage-based retention, which should help use cases where storage is limited: https://github.com/prometheus/tsdb/pull/343
@Vlaaaaaaad, @bmihaescu are you sure you have enough free space? (Suggested by Brian on IRC, so worth checking.) During compaction it needs some extra space as a temporary holding area while joining blocks.