prometheus: Possible race condition in TSDB head chunks

What did you do?

Built Prometheus master (d77b56e88e3d554a499e22d2073812b59191256c) with -race.

Ran it with a couple of exporters and --storage.tsdb.min-block-duration=10s.

Note that 10s is very aggressive and not a sane production value; it remains to be seen whether the race can be reproduced in a realistic setup.
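For reference, a minimal sketch of the repro steps (assumptions: a plain go build -race of ./cmd/prometheus is used to get a race-enabled binary, and the exporter setup is omitted; the only non-default flag is the one quoted above):

go build -race ./cmd/prometheus
./prometheus --config.file=prometheus.yml --storage.tsdb.min-block-duration=10s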

What did you expect to see?

No errors

What did you see instead? Under which circumstances?

==================
WARNING: DATA RACE                                                                                                                         
Read at 0x00c0002a0840 by goroutine 18:                                                                                                    
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Truncate()                                                               
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:684 +0x50e                                                              
  github.com/prometheus/prometheus/tsdb.(*Head).Truncate()                                                                                 
      /home/roidelapluie/dev/prometheus/tsdb/head.go:796 +0x62c                                                                            
  github.com/prometheus/prometheus/tsdb.(*DB).reload()                                                                                     
      /home/roidelapluie/dev/prometheus/tsdb/db.go:958 +0x171f                                                                             
  github.com/prometheus/prometheus/tsdb.(*DB).compactHead()                                                                                
      /home/roidelapluie/dev/prometheus/tsdb/db.go:792 +0x17c                                                                              
  github.com/prometheus/prometheus/tsdb.(*DB).Compact()                                                                                    
      /home/roidelapluie/dev/prometheus/tsdb/db.go:763 +0x302                                                                              
  github.com/prometheus/prometheus/tsdb.(*DB).run()                                                                                        
      /home/roidelapluie/dev/prometheus/tsdb/db.go:684 +0x427                                                                              
                                                                                                                                           
Previous write at 0x00c0002a0840 by goroutine 115:                                                                                         
  sync/atomic.AddInt64()                                                                                                                   
      /home/roidelapluie/godist/go/src/runtime/race_amd64.s:276 +0xb                                                                       
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).write()                                                                  
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:397 +0x9e                                                               
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).writeCRC32()       
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:410 +0xa0                                                               
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk()                                                             
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:282 +0x6ca                                                              
  github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk()                                                                
      /home/roidelapluie/dev/prometheus/tsdb/head.go:2000 +0x1a2
  github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:1974 +0x4f
  github.com/prometheus/prometheus/tsdb.(*memSeries).append()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:2130 +0x781
  github.com/prometheus/prometheus/tsdb.(*headAppender).Commit()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:1200 +0x4d1
  github.com/prometheus/prometheus/tsdb.dbAppender.Commit()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:713 +0x42
  github.com/prometheus/prometheus/tsdb.(*dbAppender).Commit()
      <autogenerated>:1 +0x7f
  github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit()
      /home/roidelapluie/dev/prometheus/storage/fanout.go:184 +0x70
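
The two stacks above pin the race to a single 8-byte word: ChunkDiskMapper.Truncate() reads it directly, while ChunkDiskMapper.write() updates it through sync/atomic.AddInt64. (The trace is truncated here; the full report, including both goroutines' creation sites, appears in the logs below.) As a minimal, self-contained sketch (illustration only, not Prometheus code), the following program shows the same mixed atomic/non-atomic access pattern, which go run -race flags as a data race:

package main

import (
	"sync"
	"sync/atomic"
)

func main() {
	var n int64 // stand-in for the word at 0x00c0002a0840
	var wg sync.WaitGroup
	wg.Add(2)

	// Plays the role of ChunkDiskMapper.write(): an atomic update.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			atomic.AddInt64(&n, 1)
		}
	}()

	// Plays the role of ChunkDiskMapper.Truncate(): a plain read.
	// atomic.LoadInt64(&n), or sharing a mutex with the writer,
	// would make it safe.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			if n < 0 { // racy non-atomic read
				panic("unreachable")
			}
		}
	}()

	wg.Wait()
}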

Environment

  • System information:

    Linux 5.6.8-roidelapluie x86_64

  • Prometheus version:

prometheus, version 2.19.2 (branch: master, revision: d77b56e88e3d554a499e22d2073812b59191256c)
      build user: roidelapluie@oxygen
      build date: 20200713-21:07:26
      go version: go1.14.4

  • Prometheus configuration file:


# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 5s

    static_configs:
    - targets: ['localhost:9090']
    - {targets: ['localhost:1234'], labels: {l: a1}}
    - {targets: ['localhost:1234'], labels: {l: a2}}
    - {targets: ['localhost:1234'], labels: {l: a3}}
    - {targets: ['localhost:1234'], labels: {l: a4}}
    - {targets: ['localhost:1234'], labels: {l: a5}}
  • Alertmanager configuration file:

    n/a (not relevant to this issue)
  • Logs:
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:308 msg="No time or size retention was set so using the default time retention" duration=15d
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:343 msg="Starting Prometheus" version="(version=2.19.2, branch=master, revision=d77b56e88e3d554a499e22d2073812b59191256c)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:344 build_context="(go=go1.14.4, user=roidelapluie@oxygen, date=20200713-21:07:26)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:345 host_details="(Linux 5.6.8-roidelapluie #1 SMP Thu Apr 30 23:23:52 CEST 2020 x86_64 oxygen (none))"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:346 fd_limits="(soft=1024, hard=102400)"
level=info ts=2020-07-13T21:22:19.710Z caller=main.go:347 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-07-13T21:22:19.720Z caller=main.go:684 msg="Starting TSDB ..."
level=info ts=2020-07-13T21:22:19.720Z caller=web.go:524 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-07-13T21:22:19.721Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1594674586974 maxt=1594674600000 ulid=01ED50TDXJET4ZTWN06G7E2K2N
level=info ts=2020-07-13T21:22:19.721Z caller=repair.go:59 component=tsdb msg="Found healthy block" mint=1594674780000 maxt=1594674840000 ulid=01ED50WEHMYESE6RTR69DMJT08
level=info ts=2020-07-13T21:22:19.756Z caller=head.go:645 component=tsdb msg="Replaying WAL and on-disk memory mappable chunks if any, this may take a while"
level=info ts=2020-07-13T21:22:38.306Z caller=head.go:682 component=tsdb msg="WAL checkpoint loaded"
level=info ts=2020-07-13T21:22:38.525Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=7 maxSegment=11
level=info ts=2020-07-13T21:22:38.611Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=8 maxSegment=11
level=info ts=2020-07-13T21:22:39.117Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=9 maxSegment=11
level=info ts=2020-07-13T21:22:39.465Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=10 maxSegment=11
level=info ts=2020-07-13T21:22:39.466Z caller=head.go:706 component=tsdb msg="WAL segment loaded" segment=11 maxSegment=11
level=info ts=2020-07-13T21:22:39.466Z caller=head.go:709 component=tsdb msg="WAL replay completed" duration=19.710089391s
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:700 fs_type=EXT4_SUPER_MAGIC
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:701 msg="TSDB started"
level=debug ts=2020-07-13T21:22:40.949Z caller=main.go:702 msg="TSDB options" MinBlockDuration=10s MaxBlockDuration=36h MaxBytes=0B NoLockfile=false RetentionDuration=15d WALSegmentSize=0B AllowOverlappingBlocks=false WALCompression=true
level=info ts=2020-07-13T21:22:40.949Z caller=main.go:805 msg="Loading configuration file" filename=prometheus.yml
level=debug ts=2020-07-13T21:22:40.951Z caller=manager.go:226 component="discovery manager scrape" msg="Starting provider" provider=string/0 subs=[prometheus]
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:244 component="discovery manager scrape" msg="Discoverer channel closed" provider=string/0
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:226 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[config-0]
level=info ts=2020-07-13T21:22:40.952Z caller=main.go:833 msg="Completed loading of configuration file" filename=prometheus.yml
level=debug ts=2020-07-13T21:22:40.952Z caller=manager.go:244 component="discovery manager notify" msg="Discoverer channel closed" provider=string/0
level=info ts=2020-07-13T21:22:40.952Z caller=main.go:652 msg="Server is ready to receive web requests."
level=info ts=2020-07-13T21:22:54.275Z caller=head.go:792 component=tsdb msg="Head GC completed" duration=859.497583ms
level=info ts=2020-07-13T21:23:11.828Z caller=head.go:872 component=tsdb msg="WAL checkpoint complete" first=7 last=9 duration=17.528750204s
level=info ts=2020-07-13T21:23:31.954Z caller=compact.go:495 component=tsdb msg="write block" mint=1594674850000 maxt=1594674860000 ulid=01ED50YVKKMQWM05Y9EXJ4V37W duration=19.839084786s
level=info ts=2020-07-13T21:23:33.353Z caller=head.go:792 component=tsdb msg="Head GC completed" duration=1.015720132s
==================
WARNING: DATA RACE
Read at 0x00c0002a0840 by goroutine 18:
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).Truncate()
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:684 +0x50e
  github.com/prometheus/prometheus/tsdb.(*Head).Truncate()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:796 +0x62c
  github.com/prometheus/prometheus/tsdb.(*DB).reload()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:958 +0x171f
  github.com/prometheus/prometheus/tsdb.(*DB).compactHead()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:792 +0x17c
  github.com/prometheus/prometheus/tsdb.(*DB).Compact()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:763 +0x302
  github.com/prometheus/prometheus/tsdb.(*DB).run()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:684 +0x427
Previous write at 0x00c0002a0840 by goroutine 115:
  sync/atomic.AddInt64()
      /home/roidelapluie/godist/go/src/runtime/race_amd64.s:276 +0xb
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).write()
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:397 +0x9e
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).writeCRC32()
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:410 +0xa0
  github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk()
      /home/roidelapluie/dev/prometheus/tsdb/chunks/head_chunks.go:282 +0x6ca
  github.com/prometheus/prometheus/tsdb.(*memSeries).mmapCurrentHeadChunk()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:2000 +0x1a2
  github.com/prometheus/prometheus/tsdb.(*memSeries).cutNewHeadChunk()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:1974 +0x4f
  github.com/prometheus/prometheus/tsdb.(*memSeries).append()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:2130 +0x781
  github.com/prometheus/prometheus/tsdb.(*headAppender).Commit()
      /home/roidelapluie/dev/prometheus/tsdb/head.go:1200 +0x4d1
  github.com/prometheus/prometheus/tsdb.dbAppender.Commit()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:713 +0x42
  github.com/prometheus/prometheus/tsdb.(*dbAppender).Commit()
      <autogenerated>:1 +0x7f
  github.com/prometheus/prometheus/storage.(*fanoutAppender).Commit()
      /home/roidelapluie/dev/prometheus/storage/fanout.go:184 +0x70
  github.com/prometheus/prometheus/scrape.(*timeLimitAppender).Commit()
      <autogenerated>:1 +0x62
  github.com/prometheus/prometheus/scrape.(*scrapeLoop).report.func1()
      /home/roidelapluie/dev/prometheus/scrape/scrape.go:1293 +0x5b
  github.com/prometheus/prometheus/scrape.(*scrapeLoop).report()
      /home/roidelapluie/dev/prometheus/scrape/scrape.go:1311 +0x4ed
  github.com/prometheus/prometheus/scrape.(*scrapeLoop).run()
      /home/roidelapluie/dev/prometheus/scrape/scrape.go:986 +0xc7b

Goroutine 18 (running) created at: 
  github.com/prometheus/prometheus/tsdb.open()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:640 +0xb2e
  github.com/prometheus/prometheus/tsdb.Open()
      /home/roidelapluie/dev/prometheus/tsdb/db.go:512 +0xdb
  main.openDBWithMetrics()
      /home/roidelapluie/dev/prometheus/cmd/prometheus/main.go:771 +0x120
  main.main.func20()
      /home/roidelapluie/dev/prometheus/cmd/prometheus/main.go:690 +0x337
  github.com/oklog/run.(*Group).Run.func1()
      /home/roidelapluie/dev/prometheus/vendor/github.com/oklog/run/group.go:38 +0x34

Goroutine 115 (running) created at:
  github.com/prometheus/prometheus/scrape.(*scrapePool).sync()
      /home/roidelapluie/dev/prometheus/scrape/scrape.go:422 +0x7db
  github.com/prometheus/prometheus/scrape.(*scrapePool).Sync()
      /home/roidelapluie/dev/prometheus/scrape/scrape.go:378 +0x7fb
  github.com/prometheus/prometheus/scrape.(*Manager).reload.func1()
      /home/roidelapluie/dev/prometheus/scrape/manager.go:197 +0x56
==================
level=error ts=2020-07-13T21:23:37.744Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: write compaction: found chunk with minTime: 1594674853491 maxTime: 1594674864204 outside of compacted minTime: 1594674860000 maxTime: 1594674870000"
level=error ts=2020-07-13T21:23:42.899Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: write compaction: found chunk with minTime: 1594674853491 maxTime: 1594674864204 outside of compacted minTime: 1594674860000 maxTime: 1594674870000"
^Clevel=warn ts=2020-07-13T21:23:47.779Z caller=main.go:530 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:553 msg="Stopping scrape discovery manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:567 msg="Stopping notify discovery manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:589 msg="Stopping scrape manager..."
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:549 msg="Scrape discovery manager stopped"
level=info ts=2020-07-13T21:23:47.780Z caller=main.go:563 msg="Notify discovery manager stopped"
level=info ts=2020-07-13T21:23:48.124Z caller=manager.go:888 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-07-13T21:23:48.124Z caller=main.go:583 msg="Scrape manager stopped"
level=info ts=2020-07-13T21:23:48.124Z caller=manager.go:898 component="rule manager" msg="Rule manager stopped"
level=error ts=2020-07-13T21:23:48.975Z caller=db.go:685 component=tsdb msg="compaction failed" err="persist head block: 2 errors: write compaction: add symbol: context canceled; context canceled"
level=info ts=2020-07-13T21:23:49.019Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
level=info ts=2020-07-13T21:23:49.020Z caller=main.go:755 msg="Notifier manager stopped"
level=info ts=2020-07-13T21:23:49.020Z caller=main.go:767 msg="See you next time!"
Found 1 data race(s)

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

Checking once again: curFileSequence can only be changed while writePathMtx is held, so I guess there is no issue here.

@codesome it is fixed by using a mutex to protect the operation that takes the length of a map.
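
For readers landing here later, a hedged sketch of the fix pattern described in that comment (all names here are hypothetical, not the actual patch): len() on a map is itself an access to the map, so if other goroutines mutate the map concurrently, the length must be read under the same lock as the writes.

package main

import "sync"

// chunkFiles is a stand-in for the map whose length is needed during
// truncation; the field and method names are illustrative only.
type chunkFiles struct {
	mtx sync.Mutex
	m   map[int]string // e.g. file sequence -> file name
}

func (c *chunkFiles) add(seq int, name string) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.m[seq] = name
}

// count holds the same mutex as add, so len(c.m) can never race
// with a concurrent map write.
func (c *chunkFiles) count() int {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	return len(c.m)
}

func main() {
	c := &chunkFiles{m: make(map[int]string)}
	done := make(chan struct{})
	go func() { c.add(1, "000001"); close(done) }()
	_ = c.count() // safe even while add runs concurrently
	<-done
}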