prometheus: Possible race condition in TSDB head chunks
What did you do?
Built Prometheus master (d77b56e88e3d554a499e22d2073812b59191256c) with -race.
Ran it with a couple of exporters and --storage.tsdb.min-block-duration=10s.
10s is very aggressive and not a sane production value; it remains to be seen whether the race can be reproduced in a real-world setup.
What did you expect to see?
No errors
What did you see instead? Under which circumstances?
==================
WARNING: DATA RACE
Read at 0x00c0002a0840 by goroutine 18:
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Truncate()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:684 +0x50e
github.com/prometheus/prometheus/tsdb.(*Head).Truncate()
/home/roidelapluie/dev/prometheus/tsdb/head.go:796 +0x62c
github.com/prometheus/prometheus/tsdb.(*DB).reload()
/home/roidelapluie/dev/prometheus/tsdb/db.go:958 +0x171f
github.com/prometheus/prometheus/tsdb.(*DB).compactHead()
/home/roidelapluie/dev/prometheus/tsdb/db.go:792 +0x17c
github.com/prometheus/prometheus/tsdb.(*DB).Compact()
/home/roidelapluie/dev/prometheus/tsdb/db.go:763 +0x302
github.com/prometheus/prometheus/tsdb.(*DB).run()
/home/roidelapluie/dev/prometheus/tsdb/db.go:684 +0x427
Previous write at 0x00c0002a0840 by goroutine 115:
sync/atomic.AddInt64()
/home/roidelapluie/godist/go/src/runtime/race_amd64.s:276 +0xb
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).write()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:397 +0x9e
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).writeCRC32()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:410 +0xa0
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:282 +0x6ca
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk()
/home/roidelapluie/dev/prometheus/tsdb/head.go:2000 +0x1a2
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk()
/home/roidelapluie/dev/prometheus/tsdb/head.go:1974 +0x4f
github.com/prometheus/prometheus/tsdb.(*memSeries).append()
/home/roidelapluie/dev/prometheus/tsdb/head.go:2130 +0x781
github.com/prometheus/prometheus/tsdb.(*headAppender).Commit()
/home/roidelapluie/dev/prometheus/tsdb/head.go:1200 +0x4d1
github.com/prometheus/prometheus/tsdb.dbAppender.Commit()
/home/roidelapluie/dev/prometheus/tsdb/db.go:713 +0x42
github.com/prometheus/prometheus/tsdb.(*dbAppender).Commit()
<autogenerated>:1 +0x7f
github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit()
/home/roidelapluie/dev/prometheus/storage/fanout.go:184 +0x70
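For context, the flagged pattern is a mixed atomic/non-atomic access to the same word: the write path bumps a counter with sync/atomic.AddInt64 while Truncate reads the field with a plain load. Below is a minimal, self-contained sketch of that pattern — the names are hypothetical stand-ins for the ChunkDiskMapper fields, not the actual head_chunks.go code — which `go run -race` can flag with the same kind of report:

```go
package main

import (
	"sync"
	"sync/atomic"
)

// mapper is an illustrative stand-in for ChunkDiskMapper.
type mapper struct {
	curFileNumBytes int64 // hypothetical name for the racy word
}

func (m *mapper) write(b []byte) {
	// Writer side: atomic update, like the sync/atomic.AddInt64
	// in the "Previous write" half of the trace above.
	atomic.AddInt64(&m.curFileNumBytes, int64(len(b)))
}

func (m *mapper) truncate() int64 {
	// Reader side: plain (non-atomic) read. Even though the writer
	// uses sync/atomic, mixing an atomic write with an
	// unsynchronized read is still a data race.
	return m.curFileNumBytes
}

func main() {
	m := &mapper{}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); m.write([]byte("chunk")) }()
	go func() { defer wg.Done(); _ = m.truncate() }()
	wg.Wait()
}
```

Either making the read an atomic.LoadInt64 or guarding both sides with the same mutex would silence the detector; which approach fits here was the subject of the discussion at the end of this issue.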
Environment

- System information: Linux 5.6.8-roidelapluie x86_64
- Prometheus version:
  prometheus, version 2.19.2 (branch: master, revision: d77b56e88e3d554a499e22d2073812b59191256c)
  build user: roidelapluie@oxygen
  build date: 20200713-21:07:26
  go version: go1.14.4
- Prometheus configuration file:
```yaml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 5s
    static_configs:
    - targets: ['localhost:9090']
    - {targets: ['localhost:1234'], labels: {l: a1}}
    - {targets: ['localhost:1234'], labels: {l: a2}}
    - {targets: ['localhost:1234'], labels: {l: a3}}
    - {targets: ['localhost:1234'], labels: {l: a4}}
    - {targets: ['localhost:1234'], labels: {l: a5}}
```
- Alertmanager configuration file: n/a
- Logs:
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:308 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:343 msg="Starting Prometheus" version="(version=2.19.2, branch=master, revision=d77b56e88e3d554a499e22d2073812b59191256c)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:344 build_context="(go=go1.14.4, user=roidelapluie@oxygen, date=20200713-21:07:26)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:345 host_details="(Linux 5.6.8-roidelapluie #1 SMP Thu Apr 30 23:23:52 CEST 2020 x86_64 oxygen (none))"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:346 fd_limits="(soft=1024, hard=102400)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:347 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-07-13T21:22:19.720Z caller=main.go:684 msg="Starting TSDB ..."
level=info ts=2020-07-13T21:22:19.720Z caller=web.go:524 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-07-13T21:22:19.721Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1594674586974 maxt=1594674600000 ulid=01ED50TDXJET4ZTWN06G7E2K2N
level=info ts=2020-07-13T21:22:19.721Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1594674780000 maxt=1594674840000 ulid=01ED50WEHMYESE6RTR69DMJT08
level=info ts=2020-07-13T21:22:19.756Z caller=head.go:645 component=tsdb msg="Replaying WAL and on-disk memory mappable chunks if any, this may take a while"
level=info ts=2020-07-13T21:22:38.306Z caller=head.go:682 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-07-13T21:22:38.525Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=7 maxSegment=11
level=info ts=2020-07-13T21:22:38.611Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=8 maxSegment=11
level=info ts=2020-07-13T21:22:39.117Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=9 maxSegment=11
level=info ts=2020-07-13T21:22:39.465Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=10 maxSegment=11
level=info ts=2020-07-13T21:22:39.466Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=11 maxSegment=11
level=info ts=2020-07-13T21:22:39.466Z caller=head.go:709 component=tsdb msg="WAL replay completed" duration=19.710089391s
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:700 fs_type=EXT4_SUPER_MAGIC
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:701 msg="TSDB started"
level=debug ts=2020-07-13T21:22:40.949Z caller=main.go:702 msg="TSDB options" MinBlockDuration=10s MaxBlockDuration=36h MaxBytes=0B NoLockfile=false RetentionDuration=15d WALSegmentSize=0B AllowOverlappingBlocks=false WALCompression=true
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:805 msg="Loading configuration file" filename=prometheus.yml
level=debug ts=2020-07-13T21:22:40.951Z caller=manager.go:226 component="discovery manager scrape" msg="Starting provider" provider=string/0 subs=[prometheus]
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:244 component="discovery manager scrape" msg="Discoverer channel closed" provider=string/0
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:226 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[config-0]
level=info ts=2020-07-13T21:22:40.952Z caller=main.go:833 msg="Completed loading of configuration file" filename=prometheus.yml
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:244 component="discovery manager notify" msg="Discoverer channel closed" provider=string/0
level=info ts=2020-07-13T21:22:40.952Z caller=main.go:652 msg="Server is ready to receive web requests."
level=info ts=2020-07-13T21:22:54.275Z caller=head.go:792 component=tsdb msg="Head GC completed" duration=859.497583ms
level=info ts=2020-07-13T21:23:11.828Z caller=head.go:872 component=tsdb msg="WAL checkpoint complete" first=7 last=9 duration=17.528750204s
level=info ts=2020-07-13T21:23:31.954Z caller=compact.go:495 component=tsdb msg="write block" mint=1594674850000 maxt=1594674860000 ulid=01ED50YVKKMQWM05Y9EXJ4V37W duration=19.839084786s
level=info ts=2020-07-13T21:23:33.353Z caller=head.go:792 component=tsdb msg="Head GC completed" duration=1.015720132s
==================
WARNING: DATA RACE
Read at 0x00c0002a0840 by goroutine 18:
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Truncate()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:684 +0x50e
github.com/prometheus/prometheus/tsdb.(*Head).Truncate()
/home/roidelapluie/dev/prometheus/tsdb/head.go:796 +0x62c
github.com/prometheus/prometheus/tsdb.(*DB).reload()
/home/roidelapluie/dev/prometheus/tsdb/db.go:958 +0x171f
github.com/prometheus/prometheus/tsdb.(*DB).compactHead()
/home/roidelapluie/dev/prometheus/tsdb/db.go:792 +0x17c
github.com/prometheus/prometheus/tsdb.(*DB).Compact()
/home/roidelapluie/dev/prometheus/tsdb/db.go:763 +0x302
github.com/prometheus/prometheus/tsdb.(*DB).run()
/home/roidelapluie/dev/prometheus/tsdb/db.go:684 +0x427
Previous write at 0x00c0002a0840 by goroutine 115:
sync/atomic.AddInt64()
/home/roidelapluie/godist/go/src/runtime/race_amd64.s:276 +0xb
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).write()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:397 +0x9e
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).writeCRC32()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:410 +0xa0
github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk()
/home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:282 +0x6ca
github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk()
/home/roidelapluie/dev/prometheus/tsdb/head.go:2000 +0x1a2
github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk()
/home/roidelapluie/dev/prometheus/tsdb/head.go:1974 +0x4f
github.com/prometheus/prometheus/tsdb.(*memSeries).append()
/home/roidelapluie/dev/prometheus/tsdb/head.go:2130 +0x781
github.com/prometheus/prometheus/tsdb.(*headAppender).Commit()
/home/roidelapluie/dev/prometheus/tsdb/head.go:1200 +0x4d1
github.com/prometheus/prometheus/tsdb.dbAppender.Commit()
/home/roidelapluie/dev/prometheus/tsdb/db.go:713 +0x42
github.com/prometheus/prometheus/tsdb.(*dbAppender).Commit()
<autogenerated>:1 +0x7f
github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit()
/home/roidelapluie/dev/prometheus/storage/fanout.go:184 +0x70
github.com/prometheus/prometheus/scrape.(*timeLimitAppender).Commit()
<autogenerated>:1 +0x62
github.com/prometheus/prometheus/scrape.(*scrapeLoop).report.func1()
/home/roidelapluie/dev/prometheus/scrape/scrape.go:1293 +0x5b
github.com/prometheus/prometheus/scrape.(*scrapeLoop).report()
/home/roidelapluie/dev/prometheus/scrape/scrape.go:1311 +0x4ed
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run()
/home/roidelapluie/dev/prometheus/scrape/scrape.go:986 +0xc7b
Goroutine 18 (running) created at:
github.com/prometheus/prometheus/tsdb.open()
/home/roidelapluie/dev/prometheus/tsdb/db.go:640 +0xb2e
github.com/prometheus/prometheus/tsdb.Open()
/home/roidelapluie/dev/prometheus/tsdb/db.go:512 +0xdb
main.openDBWithMetrics()
/home/roidelapluie/dev/prometheus/cmd/prometheus/main.go:771 +0x120
main.main.func20()
/home/roidelapluie/dev/prometheus/cmd/prometheus/main.go:690 +0x337
github.com/oklog/run.(*Group).Run.func1()
/home/roidelapluie/dev/prometheus/vendor/github.com/oklog/run/group.go:38 +0x34
Goroutine 115 (running) created at:
github.com/prometheus/prometheus/scrape.(*scrapePool).sync()
/home/roidelapluie/dev/prometheus/scrape/scrape.go:422 +0x7db
github.com/prometheus/prometheus/scrape.(*scrapePool).Sync()
/home/roidelapluie/dev/prometheus/scrape/scrape.go:378 +0x7fb
github.com/prometheus/prometheus/scrape.(*Manager).reload.func1()
/home/roidelapluie/dev/prometheus/scrape/manager.go:197 +0x56
==================
level=error ts=2020-07-13T21:23:37.744Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: write compaction: found chunk with minTime: 1594674853491 maxTime: 1594674864204 outside of compacted minTime: 1594674860000 maxTime: 1594674870000"
level=error ts=2020-07-13T21:23:42.899Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: write compaction: found chunk with minTime: 1594674853491 maxTime: 1594674864204 outside of compacted minTime: 1594674860000 maxTime: 1594674870000"
^Clevel=warn ts=2020-07-13T21:23:47.779Z caller=main.go:530 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:553 msg="Stopping scrape discovery manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:567 msg="Stopping notify discovery manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:589 msg="Stopping scrape manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:549 msg="Scrape discovery manager stopped"
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:563 msg="Notify discovery manager stopped"
level=info ts=2020-07-13T21:23:48.124Z caller=manager.go:888 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-07-13T21:23:48.124Z caller=main.go:583 msg="Scrape manager stopped"
level=info ts=2020-07-13T21:23:48.124Z caller=manager.go:898 component="rule manager" msg="Rule manager stopped"
level=error ts=2020-07-13T21:23:48.975Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: 2 errors: write compaction: add symbol: context canceled; context canceled"
level=info ts=2020-07-13T21:23:49.019Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2020-07-13T21:23:49.020Z caller=main.go:755 msg="Notifier manager stopped"
level=info ts=2020-07-13T21:23:49.020Z caller=main.go:767 msg="See you next time!"
Found 1 data race(s)
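As an aside, the "compaction failed" errors in the log come from a range guard in the compactor: every chunk persisted into a block must lie entirely within the block's time range, and here a head chunk straddles the 10s block boundary. A simplified sketch of that kind of check, with illustrative names rather than the actual compactor code:

```go
package main

import "fmt"

// chunkMeta is a simplified stand-in for a chunk's time range.
type chunkMeta struct {
	MinTime, MaxTime int64
}

// checkChunkInBlock mirrors the kind of guard behind the
// "found chunk ... outside of compacted ..." error in the logs:
// a chunk written into a block must fit inside the block's
// [blockMin, blockMax] range.
func checkChunkInBlock(c chunkMeta, blockMin, blockMax int64) error {
	if c.MinTime < blockMin || c.MaxTime > blockMax {
		return fmt.Errorf(
			"found chunk with minTime: %d maxTime: %d outside of compacted minTime: %d maxTime: %d",
			c.MinTime, c.MaxTime, blockMin, blockMax)
	}
	return nil
}

func main() {
	// Using the values from the log: the chunk starts before the
	// compacted block's minTime, so the check fails.
	err := checkChunkInBlock(
		chunkMeta{MinTime: 1594674853491, MaxTime: 1594674864204},
		1594674860000, 1594674870000)
	fmt.Println(err)
}
```

With --storage.tsdb.min-block-duration=10s the blocks are so narrow that an in-progress head chunk can easily span a whole block, which is consistent with the note above that the setting is not sane.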
From the issue discussion (the issue is now closed):

Checking once again, `curFileSequence` could only be changed with `writePathMtx` held. So I guess there is no issue here.

@codesome it is fixed by using a mutex to protect taking the length of the map.
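A minimal sketch of that fix pattern, assuming illustrative names (the real code lives in tsdb/chunks/head_chunks.go): the reader takes the same writePathMtx the writers hold before touching the map, even for a plain len():

```go
package main

import "sync"

// diskMapper is an illustrative stand-in for ChunkDiskMapper.
// writePathMtx guards all mutations of mmappedFiles, so any
// reader of the map — even one that only calls len — must take
// it as well.
type diskMapper struct {
	writePathMtx sync.Mutex
	mmappedFiles map[int]string // hypothetical sequence -> path map
}

func (d *diskMapper) addFile(seq int, path string) {
	d.writePathMtx.Lock()
	defer d.writePathMtx.Unlock()
	d.mmappedFiles[seq] = path
}

// fileCount shows the fix: reading len(d.mmappedFiles) without the
// mutex would race with addFile; holding the same lock on the read
// side removes the race.
func (d *diskMapper) fileCount() int {
	d.writePathMtx.Lock()
	defer d.writePathMtx.Unlock()
	return len(d.mmappedFiles)
}

func main() {
	d := &diskMapper{mmappedFiles: map[int]string{}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); d.addFile(1, "chunks/000001") }()
	go func() { defer wg.Done(); _ = d.fileCount() }()
	wg.Wait()
}
```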