prometheus: Prometheus v2.26.0 Go panic when scraping a single target with 7 million series

What did you do? I used version v2.26.0 to start up a Prometheus pod in a Kubernetes cluster.

What did you expect to see? I expected the Prometheus pod to keep running and to generate blocks in the path specified by storage.tsdb.path.

What did you see instead? Under which circumstances? The pod keeps crashing with the same Go panic; logs are pasted below.

Environment

  • System information:
-bash-4.2# uname -srm
Linux 5.4.0-59.generic.x86_64 x86_64
  • Prometheus version:
/prometheus # /bin/prometheus --version
prometheus, version 2.26.0 (branch: HEAD, revision: 3cafc58827d1ebd1a67749f88be4218f0bab3d8d)
  build user:       root@a67cafebe6d0
  build date:       20210331-11:56:23
  go version:       go1.16.2
  platform:         linux/amd64
  • Alertmanager version:

  • Prometheus configuration file:

  • Alertmanager configuration file:

  • Logs:
level=info ts=2021-04-26T23:19:42.966Z caller=main.go:767 msg="Server is ready to receive web requests."
panic: snappy: decoded block is too large

goroutine 145955 [running]:
github.com/golang/snappy.Encode(0xc590980000, 0x5e2c9b, 0x5e2c9b, 0x10383c34000, 0xdd77491e, 0xe0728000, 0xb38ebffd, 0xe0728000, 0x18)
	/go/pkg/mod/github.com/golang/snappy@v0.0.3/encode.go:22 +0x2d7
github.com/prometheus/prometheus/tsdb/wal.(*WAL).log(0xc0002ce2d0, 0x10383c34000, 0xdd77491e, 0xe0728000, 0x6aa201, 0xe1daa76b08, 0xfb1e619de0)
	/app/tsdb/wal/wal.go:624 +0x52d
github.com/prometheus/prometheus/tsdb/wal.(*WAL).Log(0xc0002ce2d0, 0xddb4723618, 0x1, 0x1, 0x0, 0x0)
	/app/tsdb/wal/wal.go:596 +0xed
github.com/prometheus/prometheus/tsdb.(*headAppender).log(0xe0778280b0, 0x0, 0x0)
	/app/tsdb/head.go:1368 +0x34a
github.com/prometheus/prometheus/tsdb.(*headAppender).Commit(0xe0778280b0, 0x0, 0x0)
	/app/tsdb/head.go:1388 +0x9e
github.com/prometheus/prometheus/tsdb.dbAppender.Commit(0x32d7258, 0xe0778280b0, 0xc000b5c000, 0x5efc59531429db5e, 0x13)
	/app/tsdb/db.go:808 +0x35
github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit(0xdf13284140, 0x1747e22, 0xed819422e)
	/app/storage/fanout.go:176 +0x49
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport.func1(0xddb4723ce8, 0xddb4723cf8, 0xe4c896cb00)
	/app/scrape/scrape.go:1096 +0x49
github.com/prometheus/prometheus/scrape.(*scrapeLoop).scrapeAndReport(0xe4c896cb00, 0x1bf08eb000, 0x1bf08eb000, 0x0, 0x0, 0x0, 0x1747e22, 0xed819422e, 0x4421000, 0x0, ...)
	/app/scrape/scrape.go:1163 +0xb4d
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run(0xe4c896cb00, 0x1bf08eb000, 0x1bf08eb000, 0x0)
	/app/scrape/scrape.go:1049 +0x365
created by github.com/prometheus/prometheus/scrape.(*scrapePool).sync
	/app/scrape/scrape.go:518 +0x9ce

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 16 (11 by maintainers)

Most upvoted comments

Thank you for the information. scrape_samples_post_metric_relabeling is the number of samples scraped from a target that actually make it into the database. It is the same as scrape_samples_scraped minus any metrics dropped via metric relabeling rules.
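
For example, the difference between the two can be checked with a query like scrape_samples_scraped{job="kube-state-metrics"} - scrape_samples_post_metric_relabeling{job="kube-state-metrics"}, which gives the number of samples dropped by metric relabeling for that target (the job label value here is only an assumption for illustration).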

Judging from your information, you have ~400 characters per metric for labels on the job, plus probably another 100-200 characters for additional labels such as pod and container that are exposed by kube-state-metrics. Multiplying that estimate by the more than 7 million series returned by kube-state-metrics means the WAL record could be as large as 3.3 to 4 GB, while the maximum size snappy can encode is about 3.7 GB. The panic will be fixed for scrapes of this size in Prometheus 2.27.0 (a release candidate will be created shortly): records this large simply won't be compressed, which avoids the panic. In the future we hope to compress them as well, see https://github.com/prometheus/prometheus/issues/8791.
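
For reference, a minimal back-of-the-envelope sketch of that arithmetic in Go. The ~500-600 bytes per series are the estimates above, not measurements, and the exact WAL record layout is ignored; it only checks the sizes against what snappy can encode.

package main

import (
	"fmt"

	"github.com/golang/snappy"
)

func main() {
	const series = 7_000_000

	// Assumed per-series sizes from the estimate above: ~400 B of job-level
	// labels plus 100-200 B of additional labels (pod, container, ...).
	for _, perSeries := range []int{500, 600} {
		record := series * perSeries // total bytes of one record, on a 64-bit platform
		fmt.Printf("%d series x %d B = %.2f GB, snappy can encode it: %v\n",
			series, perSeries, float64(record)/1e9, snappy.MaxEncodedLen(record) >= 0)
	}
	// snappy.MaxEncodedLen returns a negative value once the input exceeds
	// roughly 3.7 GB, and snappy.Encode panics with
	// "snappy: decoded block is too large" for such inputs; that is the
	// panic shown in the stack trace above.
}

With these numbers the estimated record size straddles the ~3.7 GB limit, which is consistent with the panic in the logs.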