prometheus: Data corruption on OpenBSD
Hello guys
I wanted to run my prometheus on OpenBSD but encountered an issue. So I let the tests run and many of them were failing.
For simplicity I will focus on the TestCreateBlock
test.
The error I encountered is:
$ cd tsdb/
$ GO111MODULE=on go test -run TestCreateBlock
--- FAIL: TestCreateBlock (0.07s)
block_test.go:424:
Error Trace: block_test.go:424
block_test.go:86
Error: Received unexpected error:
2 errors: populate block: add series: read symbols: invalid checksum; read symbols: invalid checksum
compactor write
github.com/prometheus/prometheus/tsdb.(*BlockWriter).Flush
/tmp/prometheus/tsdb/blockwriter.go:109
github.com/prometheus/prometheus/tsdb.CreateBlock
/tmp/prometheus/tsdb/tsdbblockutil.go:70
github.com/prometheus/prometheus/tsdb.createBlock
/tmp/prometheus/tsdb/block_test.go:423
github.com/prometheus/prometheus/tsdb.TestCreateBlock
/tmp/prometheus/tsdb/block_test.go:86
testing.tRunner
/usr/local/go/src/testing/testing.go:1194
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371
Test: TestCreateBlock
I tracked this issue down to the function finishSymbols().
On line L537 the value "hash"
is written to the index file.
On line L544 the file is opened via mmap.
On line L552 the actual CRC32 checksum is written to the file, overwriting the previously written "hash"
.
On line L557 the mmap contents are fed to NewSymbols()
.
Now since OpenBSD lacks a unified buffer cache for mmap the checksum gets written to the underlying file descriptor but is not synced into the mmaped region. The mmap still contains the value "hash"
.
As a quick fix I added this patch, reopening the mmap after the CRC32 checksum write:
diff --git a/tsdb/index/index.go b/tsdb/index/index.go
index a6ade9455..f6698a0cc 100644
--- a/tsdb/index/index.go
+++ b/tsdb/index/index.go
@@ -553,6 +553,16 @@ func (w *Writer) finishSymbols() error {
return err
}
+ // close and reopen mmaped file
+ if err := sf.Close(); err != nil {
+ return err
+ }
+ sf, err = fileutil.OpenMmapFile(w.f.name)
+ if err != nil {
+ return err
+ }
+ w.symbolFile = sf
+
// Load in the symbol table efficiently for the rest of the index writing.
w.symbols, err = NewSymbols(realByteSlice(w.symbolFile.Bytes()), FormatV2, int(w.toc.Symbols))
if err != nil {
Now letting the tests run again I encountered another issue:
GO111MODULE=on go test -run TestCreateBlock
--- FAIL: TestCreateBlock (0.07s)
block_test.go:424:
Error Trace: block_test.go:424
block_test.go:86
Error: Received unexpected error:
series not 16-byte aligned at 43
compactor write
github.com/prometheus/prometheus/tsdb.(*BlockWriter).Flush
/tmp/prometheus/tsdb/blockwriter.go:109
github.com/prometheus/prometheus/tsdb.CreateBlock
/tmp/prometheus/tsdb/tsdbblockutil.go:70
github.com/prometheus/prometheus/tsdb.createBlock
/tmp/prometheus/tsdb/block_test.go:423
github.com/prometheus/prometheus/tsdb.TestCreateBlock
/tmp/prometheus/tsdb/block_test.go:86
testing.tRunner
/usr/local/go/src/testing/testing.go:1194
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1371
Test: TestCreateBlock
I tracked this error down to the function writePostingsToTmpFiles().
On line L806 the file is flushed to the disk. On line L809 a new mmap is opened for the index file.
The issue now is because of an already open mmap to the index file in another place of the program. The mmap of line L809 reuses the already open mmap and does not create a new one. This old mmap does not contain the changes flushed to the file from line L806.
For better understanding I created a simple PoC which needs to be run on OpenBSD. I also tried playing with msync(2) which does not seem to have any effect on syncing FD writes back to the mmaped regoin.
package main
import (
"fmt"
"github.com/edsrzf/mmap-go"
"io/ioutil"
"os"
"syscall"
"unsafe"
)
func getMMapedFile(filename string, filesize int) ([]byte, mmap.MMap, error) {
file, err := os.Open(filename)
if err != nil {
return nil, nil, err
}
fileAsBytes, err := mmap.Map(file, mmap.RDONLY, 0)
if err != nil {
return nil, nil, err
}
return fileAsBytes, fileAsBytes, err
}
const (
msAsync = 1 << iota
msSync
msInvalidate
)
func msync(buf []byte) error {
_, _, errno := syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)), msInvalidate)
if errno != 0 {
return errno
}
return nil
}
func main() {
file, err := ioutil.TempFile("", "mmapedFile")
if err != nil {
fmt.Println(err)
}
filename := file.Name()
defer os.Remove(filename)
// first write
file.Write([]byte("1234567890"))
buf, mm, err := getMMapedFile(filename, 10)
if err != nil {
fmt.Println(err)
}
// mmap has file contents
fmt.Println(string(buf))
// second write
file.WriteAt([]byte("AAAA"), 5)
msync(buf) // no effect
// mmap has old content
fmt.Println(string(buf))
// closing and reopening mmap
mm.Unmap()
buf, mm, err = getMMapedFile(filename, 10)
// mmap has new content
fmt.Println(string(buf))
// third write
file.WriteAt([]byte("CCCC"), 0)
// mm.Unmap() // keeping mmap open
// opening another mmap
buf2, _, _ := getMMapedFile(filename, 10)
msync(buf2) // no effect
// mmap has old contents as long as the same file is still opened in another mmap
fmt.Println(string(buf2))
}
Now I am a bit stuck on how to continue from here, since I don’t know much about the tsdb internals and how/where the mmaps are kept open all over the place. I hope the explanation is well understandable.
Greetings, ston1th
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 17 (5 by maintainers)
Commits related to this issue
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Explanation: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mmap... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Explanation: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mmap... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- tsdb: index write via mmap This is a possible fix for the issues #8799 and #8877. Changes made: * introduce two mmaps: * mmapRw: to read and write an mmaped region * mmapRo: to read from an mma... — committed to ston1th/prometheus by ston1th 3 years ago
- Update prometheus to 2.37.1 (2.37 is a LTS branch) This still includes a patch to workaround the UBC issues in TSDB. For more info check out https://github.com/prometheus/prometheus/issues/8799 and h... — committed to openbsd/ports by cjeker 2 years ago
Thank you so much for deep diving and trying to fix this. I use Prometheus on OpenBSD extensively 😃
prometheus.ports.tar.gz
This modified ports of OpenBSD apply the PR patch that remove mmap reference, so anyone can test ( so far the compacting occurs , more test will follow )
I dug up the internet a bit.
On internet there’s a test about the way ‘everyone’ use a mix of mmap and read/write and expect that the kernel treat it as the same.
[ censored rant ]
https://lists.samba.org/archive/samba-technical/2001-May/013552.html
thank you Tridge
Reading https://man.openbsd.org/msync.2 this , imho cannot work - for legit good reason -.
The above program mmap the file then use read/write on the open fd to check in modifying map actually modified the map , against the documentation ( write to map are not guaranteed until msync )
and in https://misc.openbsd.narkive.com/Hp3vXMls/will-mmap-and-the-read-buffer-cache-be-unified-anyone-working-with-it OpenBSD mailing list is pointing to a solution clearly :
Thank you Stuart Henderson.
So yeah telling the OS I did stuff in mmap pls put it in the actual file FD and telling the OS, I wrote the FD please update mmap
works perfectly
$ ./fun test 1 test 2 (amd64 arch)
I am not sure why it is expected that accessing a similar resource multiple different way is expected to work nor why you would mix up API and I am not fluent in GO.
I put this here so the root problem is somewhat documented. I will now look at the go thingy, unsure why would go mix up read/write that way though.
https://mobile.twitter.com/annoyedobrien