thanos: receive: high cpu when upgrading from 0.12.2 with old data
Thanos, Prometheus and Golang version used: thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8) build user: circleci@b8cd18f8b553 build date: 20200622-10:04:50 go version: go1.14.2
Object Storage Provider: GCP
What happened:
We’re using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3.
At 3:30 I restarted one pod (green line) as v0.13.0, and its CPU usage doubled:
Memory usage is 5-10% higher, which is fine.
Here is another graph, from the node where the pod has been running:
What you expected to happen: Statistically negligible resource usage change between v0.12.2 and v0.13.0 for receive, as for other thanos components.
How to reproduce it (as minimally and precisely as possible): We run receive with the following args:
args:
- receive
- |
  --objstore.config=type: GCS
  config:
    bucket: jb-thanos
- --tsdb.path=/data
- --tsdb.retention=1d
- --tsdb.min-block-duration=30m # has to be >2x (Prometheus side is 15m), https://github.com/thanos-io/thanos/issues/2114
- --tsdb.max-block-duration=30m
- --label=replica="$(NAME)"
- --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
- --receive.hashrings-file=/cfg/hashrings.json
- --receive.replication-factor=3
Full logs to relevant components: These are new events that did not appear in v0.12.2. They are also written every ~15 sec, compared to every ~15 min in prometheus v0.19.0 with almost the same tsdb settings:
level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms
level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms
level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s
level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms
level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms
level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s
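To put a number on how often these events fire, the gaps between consecutive log lines can be measured directly. Below is a minimal sketch in Python, assuming the logfmt layout shown above (a `ts=` field in RFC 3339 format and a quoted `msg=` field). The helper names `parse_ts` and `event_gaps` are made up for this illustration and are not part of Thanos or Prometheus.

```python
import re
from datetime import datetime, timezone

# Log lines copied from the excerpt above.
SAMPLE = [
    'level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms',
    'level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms',
    'level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s',
    'level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms',
    'level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms',
    'level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s',
]

TS_FIELD = re.compile(r"\bts=(\S+)")
TS_VALUE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d+))?Z")


def parse_ts(raw: str) -> datetime:
    """Parse a UTC timestamp, truncating nanoseconds to microseconds."""
    m = TS_VALUE.fullmatch(raw)
    if m is None:
        raise ValueError(f"unexpected timestamp: {raw!r}")
    base = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    micros = int((m.group(2) or "")[:6].ljust(6, "0"))
    return base.replace(microsecond=micros, tzinfo=timezone.utc)


def event_gaps(lines, msg):
    """Return the seconds between consecutive lines carrying msg="<msg>"."""
    needle = f'msg="{msg}"'
    stamps = []
    for line in lines:
        found = TS_FIELD.search(line)
        if found and needle in line:
            stamps.append(parse_ts(found.group(1)))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for msg in ("Head GC completed", "WAL checkpoint complete"):
        print(msg, [round(g, 1) for g in event_gaps(SAMPLE, msg)])
```

Running the same function over a longer log capture from each version would allow a direct comparison of event frequency.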
Anything else we need to know:
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 18 (14 by maintainers)
We can repro this on our side, looking into this more (:
It turns out this high CPU usage happens when I start thanos-receive v0.13.0 on a non-empty tsdb folder left over from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc). This leads to each subfolder being converted to a separate tsdb:
And then the log is full of
component=tsdb msg="Head GC completed"
events. If I wipe the tsdb folder before starting v0.13.0, it consumes the same CPU as v0.12.2.
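A quick pre-upgrade check for the condition described above is to list block-style subfolders in the tsdb path. The sketch below is in Python and assumes that the leftover v0.12.2 block directories are named with ULIDs (26 characters of Crockford base32, like 01EC0YX4APXB1MGA39DWFTS96C). The function name `legacy_block_dirs` is hypothetical, and /data is the --tsdb.path from the args above.

```python
import re
import sys
from pathlib import Path

# Assumption: block directories are ULIDs, i.e. 26 characters from the
# Crockford base32 alphabet (digits and upper-case letters without I, L, O, U).
ULID_RE = re.compile(r"[0-9A-HJKMNP-TV-Z]{26}")


def legacy_block_dirs(tsdb_path):
    """Return sorted names of ULID-named directories directly under tsdb_path."""
    root = Path(tsdb_path)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and ULID_RE.fullmatch(entry.name)
    )


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/data"
    found = legacy_block_dirs(path)
    if found:
        print(f"{len(found)} block-style directories found in {path}:")
        for name in found:
            print(" ", name)
    else:
        print(f"no block-style directories found in {path}")
```

If the check reports any directories, the workaround described in this comment (wiping the tsdb folder before starting v0.13.0) applies. Whether those blocks have already been uploaded to object storage should be confirmed first.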