thanos: receive: high cpu when upgrading from 0.12.2 with old data

Thanos, Prometheus and Golang version used:
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
  build user: circleci@b8cd18f8b553
  build date: 20200622-10:04:50
  go version: go1.14.2

Object Storage Provider: GCP

What happened: We’re using 3 thanos-receive v0.12.2 pods running with --receive.replication-factor=3. At 3:30 I restarted one pod (the green line) as v0.13.0, and its CPU usage doubled: [CPU usage graph] Memory usage is 5-10% higher, which is fine. Here is another graph, from the node where the pod has been running: [node CPU graph]

What you expected to happen: A negligible resource usage change between v0.12.2 and v0.13.0 for receive, as with other Thanos components.

How to reproduce it (as minimally and precisely as possible): We run receive with the following args:

          args:
          - receive
          - |
            --objstore.config=type: GCS
            config:
              bucket: jb-thanos
          - --tsdb.path=/data
          - --tsdb.retention=1d
          - --tsdb.min-block-duration=30m  # has to be >2x the Prometheus-side 15m, see https://github.com/thanos-io/thanos/issues/2114
          - --tsdb.max-block-duration=30m
          - --label=replica="$(NAME)"
          - --receive.local-endpoint=$(NAME).thanos-receive.monitoring.svc.cluster.local:10901
          - --receive.hashrings-file=/cfg/hashrings.json
          - --receive.replication-factor=3
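
For reference, the --receive.hashrings-file above points at /cfg/hashrings.json. A minimal sketch of what it could contain for the three replicas (the pod names thanos-receive-0/1/2 are assumptions that follow the --receive.local-endpoint pattern):

[
  {
    "hashring": "default",
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]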

Full logs to relevant components: These are new events, which were not present in v0.12.2. They are also written every ~15s, compared to every ~15min in Prometheus v2.19.0 with almost the same TSDB settings:

Logs

level=info ts=2020-06-23T00:37:15.498358444Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=308.339725ms
level=info ts=2020-06-23T00:37:22.528425096Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=287.226708ms
level=info ts=2020-06-23T00:37:25.321693858Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=51 last=52 duration=2.793211656s
level=info ts=2020-06-23T00:37:34.117011112Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=305.063766ms
level=info ts=2020-06-23T00:37:42.249223736Z caller=head.go:662 component=receive tenant=default-tenant component=tsdb msg="Head GC completed" duration=353.091759ms
level=info ts=2020-06-23T00:37:45.317240183Z caller=head.go:734 component=receive tenant=default-tenant component=tsdb msg="WAL checkpoint complete" first=53 last=54 duration=3.067957396s
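
To quantify this from metrics rather than logs, something like the following could be used (a sketch; it assumes receive exposes the standard Prometheus TSDB metrics, is scraped as job="thanos-receive", and has a pod label attached by relabeling):

# WAL checkpoints per pod over the last 15 minutes
sum by (pod) (increase(prometheus_tsdb_checkpoint_creations_total{job="thanos-receive"}[15m]))

# Head GCs per pod over the last 15 minutes
sum by (pod) (increase(prometheus_tsdb_head_gc_duration_seconds_count{job="thanos-receive"}[15m]))

With the ~15s cadence above that works out to on the order of 60 per 15m, versus roughly 1 for a comparable Prometheus v2.19.0.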

Anything else we need to know:

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 18 (14 by maintainers)

Most upvoted comments

We can repro this on our side, looking into this more (:

They are written every ~15s, compared to every ~15min in Prometheus v2.19.0 with almost the same TSDB settings:

Turns out this high CPU usage happens when I start thanos-receive v0.13.0 on a non-empty TSDB folder from v0.12.2 (which has many subfolders like 01EC0YX4APXB1MGA39DWFTS96C, etc.). This leads to each subfolder being converted to a separate TSDB:

...
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1KGAFNHYWZ6Z8CPMZ8TQQ6"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1HSC4SJ9KSZRYENXX9PS18"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1G2E84CNBJ3NFY7P4PE5RW"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1EBGTBP4HRW2WGG9KEXN7P"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1CMKR2521YKPS1283WGBKJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC1AXMY36Z98AQ64QQRD10MP"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC196PXDRFDA39FW476KR7SJ"}
prometheus_tsdb_head_samples_appended_total{app="thanos-receive",...,statefulset_kubernetes_io_pod_name="thanos-receive-1",tenant="01EC17FSYJY8AWTYXCP3QW9J9W"}
...

And then the log is full of component=tsdb msg="Head GC completed" events. If I wipe the TSDB folder before starting v0.13.0, it consumes the same CPU as v0.12.2.
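
One way to confirm how many per-tenant TSDBs each pod ended up with is to count the distinct tenant labels on the series above (a sketch; the label names app and statefulset_kubernetes_io_pod_name follow the output shown here and depend on the scrape/relabel config):

count by (statefulset_kubernetes_io_pod_name) (
  prometheus_tsdb_head_samples_appended_total{app="thanos-receive"}
)

On a pod started with a clean data dir this should stay at 1 (just default-tenant); on a pod started on top of the old v0.12.2 directory it jumps to one per leftover 01E... block folder, each with its own head GC and WAL checkpoint cycle.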