thanos: receive: cannot load WAL data on restart
Thanos, Prometheus and Golang version used:
Thanos: self-built from thanos master: a09a4b97c243b7652446685c73f6b80bb9417fe2
Golang: go version go1.12.7 darwin/amd64
Prometheus: 2.13.0
Thanos build command: GOOS=linux GOARCH=amd64 go build -o thanos ./cmd/thanos
What happened:
I’m trying to use thanos receive. The instances ran as expected for hours before I restarted them. After the restart, thanos receive tries to load the WAL data again and again.
What you expected to happen:
Load the WAL data, then listen for and receive new data.
How to reproduce it (as minimally and precisely as possible):
I’m not sure whether there is a logic issue when receive starts. thanos-0.7 restarts successfully, but the master code does not.
Full logs to relevant components:
Logs
level=info ts=2019-10-10T13:11:00.909450843Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:11:00.909646509Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:11:00.910759805Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:11:00.913326316Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:11:00.913810978Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:11:00.913638216Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:11:00.913837063Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:11:00.91392518Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:11:00.914250195Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:11:00.91435138Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:11:00.914334174Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:11:00.914388157Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:11:00.914437502Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:11:01.480532567Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:11:54.313611621Z caller=head.go:533 component=receive component=tsdb component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:11:58.976905831Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:12:03.674326957Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:12:08.825274021Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:12:13.699480427Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:12:18.64123707Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167
level=info ts=2019-10-10T13:12:23.570922965Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2039 maxSegment=2167
.....
level=info ts=2019-10-10T13:22:40.613369573Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2165 maxSegment=2167
level=warn ts=2019-10-10T13:22:41.261691466Z caller=head.go:492 component=receive component=tsdb component=tsdb msg="unknown series references" count=104
level=info ts=2019-10-10T13:22:41.261809977Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2166 maxSegment=2167
level=info ts=2019-10-10T13:22:41.26233365Z caller=head.go:557 component=receive component=tsdb component=tsdb msg="WAL segment loaded" segment=2167 maxSegment=2167
level=info ts=2019-10-10T13:23:18.541986663Z caller=head.go:509 component=receive component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2019-10-10T13:24:06.787360411Z caller=head.go:533 component=receive component=tsdb msg="WAL checkpoint loaded"
level=info ts=2019-10-10T13:24:11.174920806Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2034 maxSegment=2167
level=info ts=2019-10-10T13:24:15.48524744Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2035 maxSegment=2167
level=info ts=2019-10-10T13:24:20.140672496Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2036 maxSegment=2167
level=info ts=2019-10-10T13:24:24.467852131Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2037 maxSegment=2167
level=info ts=2019-10-10T13:24:28.944796843Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2038 maxSegment=2167
....
level=info ts=2019-10-10T13:33:51.954823544Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2136 maxSegment=2167
level=warn ts=2019-10-10T13:33:57.647949531Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=6973
level=info ts=2019-10-10T13:33:57.648056531Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2137 maxSegment=2167
level=warn ts=2019-10-10T13:34:03.485428154Z caller=head.go:492 component=receive component=tsdb msg="unknown series references" count=7034
level=info ts=2019-10-10T13:34:03.485536561Z caller=head.go:557 component=receive component=tsdb msg="WAL segment loaded" segment=2138 maxSegment=2167
level=info ts=2019-10-10T13:34:22.347698132Z caller=main.go:170 msg="Tracing will be disabled"
level=warn ts=2019-10-10T13:34:22.348067504Z caller=receive.go:145 component=receive msg="setting up receive; the Thanos receive component is EXPERIMENTAL, it may break significantly without notice"
level=info ts=2019-10-10T13:34:22.34921212Z caller=factory.go:39 component=receive msg="loading bucket configuration"
level=info ts=2019-10-10T13:34:22.353678479Z caller=receive.go:432 component=receive msg="starting receiver"
level=info ts=2019-10-10T13:34:22.353887745Z caller=main.go:353 component=receive msg="listening for requests and metrics" component=receive address=0.0.0.0:19210
level=info ts=2019-10-10T13:34:22.354027884Z caller=prober.go:143 component=receive msg="changing probe status" status=healthy
level=info ts=2019-10-10T13:34:22.354002076Z caller=handler.go:160 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19211
level=info ts=2019-10-10T13:34:22.354402573Z caller=main.go:257 component=receive msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-10-10T13:34:22.35507048Z caller=receive.go:293 component=receive msg="hashring has changed; server is not ready to receive web requests."
level=info ts=2019-10-10T13:34:22.355474967Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570680000000 maxt=1570687200000 ulid=01DPT7F6K1TJJC4XAJJ9WVBZY5
level=info ts=2019-10-10T13:34:22.355602831Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570687200000 maxt=1570694400000 ulid=01DPTEAZDMJJDYSZ5XVQ9Z76XQ
level=info ts=2019-10-10T13:34:22.355642604Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570694400000 maxt=1570701600000 ulid=01DPTN6PKK3HJA99PH74XZYWPX
level=info ts=2019-10-10T13:34:22.355704854Z caller=repair.go:59 component=receive component=tsdb component=tsdb msg="found healthy block" mint=1570701600000 maxt=1570708800000 ulid=01DPTW2DQ6DQT1YZ9BF40QJS47
level=info ts=2019-10-10T13:34:22.840486871Z caller=head.go:509 component=receive component=tsdb component=tsdb msg="replaying WAL, this may take awhile"
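The repeated replays in the logs above can be picked out mechanically. The following is a minimal Python sketch, not part of Thanos: `parse_logfmt` and `summarize_replays` are hypothetical helper names, and the only assumption is that the log lines keep the `key=value` format and the `msg` texts shown above.

```python
import re

# Matches key=value and key="quoted value" pairs in a logfmt-style line.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return a dict of fields; for repeated keys the last value wins."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def summarize_replays(lines):
    """Group WAL replay progress by replay attempt.

    Every "replaying WAL" message opens a new attempt. Every
    "WAL segment loaded" message updates the last segment reached
    in the attempt that is currently open.
    """
    attempts = []
    for line in lines:
        fields = parse_logfmt(line)
        msg = fields.get("msg", "")
        if msg.startswith("replaying WAL"):
            attempts.append(
                {"started": fields.get("ts"), "last_segment": None, "max_segment": None}
            )
        elif msg == "WAL segment loaded" and attempts:
            attempts[-1]["last_segment"] = int(fields["segment"])
            attempts[-1]["max_segment"] = int(fields["maxSegment"])
    return attempts
```

Run over the excerpt above, this reports three replay attempts: the first reaches segment 2167 of 2167, the second stops at segment 2138, and the third has no segment loaded yet when the excerpt ends.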
Environment:
- OS (e.g. from /etc/os-release): ubuntu
- Kernel (e.g. uname -a): 4.15.0
- Others:
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 41 (14 by maintainers)
Still valid for v0.17.2: after restarting the receiver component, it is OOM-killed.
Still happening on 0.25.1
Still happening on 0.24.0
This is still happening on the latest release
Same problem on v0.12.2: restart receive, then OOM.
This still happens with Thanos v0.11.0
@squat I have tried the new version at commit 48a8fb6e2f6a476bcffa508d6609a19847c695ef, but I got an OOM on hosts with 128GB of memory, while the data folder is only about 6GB. Could you help look into this?
I shared the pprof/heap file with you: https://drive.google.com/file/d/1iKqfMD9brOXbt7mLqJCX689AhRPuzJ_N/view?usp=sharing