prometheus: Data corruption on OpenBSD

Hello guys

I wanted to run my prometheus on OpenBSD but encountered an issue. So I let the tests run and many of them were failing.

For simplicity I will focus on the TestCreateBlock test.

The error I encountered is:

$ cd tsdb/
$ GO111MODULE=on go test -run TestCreateBlock
--- FAIL: TestCreateBlock (0.07s)
    block_test.go:424: 
        	Error Trace:	block_test.go:424
        	            				block_test.go:86
        	Error:      	Received unexpected error:
        	            	2 errors: populate block: add series: read symbols: invalid checksum; read symbols: invalid checksum
        	            	compactor write
        	            	github.com/prometheus/prometheus/tsdb.(*BlockWriter).Flush
        	            		/tmp/prometheus/tsdb/blockwriter.go:109
        	            	github.com/prometheus/prometheus/tsdb.CreateBlock
        	            		/tmp/prometheus/tsdb/tsdbblockutil.go:70
        	            	github.com/prometheus/prometheus/tsdb.createBlock
        	            		/tmp/prometheus/tsdb/block_test.go:423
        	            	github.com/prometheus/prometheus/tsdb.TestCreateBlock
        	            		/tmp/prometheus/tsdb/block_test.go:86
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1194
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1371
        	Test:       	TestCreateBlock

I tracked this issue down to the function finishSymbols().

On line L537 the value "hash" is written to the index file. On line L544 the file is opened via mmap. On line L552 the actual CRC32 checksum is written to the file, overwriting the previously written "hash". On line L557 the mmap contents are fed to NewSymbols().

Now since OpenBSD lacks a unified buffer cache for mmap the checksum gets written to the underlying file descriptor but is not synced into the mmaped region. The mmap still contains the value "hash".

As a quick fix I added this patch, reopening the mmap after the CRC32 checksum write:

diff --git a/tsdb/index/index.go b/tsdb/index/index.go
index a6ade9455..f6698a0cc 100644
--- a/tsdb/index/index.go
+++ b/tsdb/index/index.go
@@ -553,6 +553,16 @@ func (w *Writer) finishSymbols() error {
                return err
        }
 
+       // close and reopen mmaped file
+       if err := sf.Close(); err != nil {
+               return err
+       }
+       sf, err = fileutil.OpenMmapFile(w.f.name)
+       if err != nil {
+               return err
+       }
+       w.symbolFile = sf
+
        // Load in the symbol table efficiently for the rest of the index writing.
        w.symbols, err = NewSymbols(realByteSlice(w.symbolFile.Bytes()), FormatV2, int(w.toc.Symbols))
        if err != nil {

Now letting the tests run again I encountered another issue:

GO111MODULE=on go test -run TestCreateBlock 
--- FAIL: TestCreateBlock (0.07s)
    block_test.go:424: 
        	Error Trace:	block_test.go:424
        	            				block_test.go:86
        	Error:      	Received unexpected error:
        	            	series not 16-byte aligned at 43
        	            	compactor write
        	            	github.com/prometheus/prometheus/tsdb.(*BlockWriter).Flush
        	            		/tmp/prometheus/tsdb/blockwriter.go:109
        	            	github.com/prometheus/prometheus/tsdb.CreateBlock
        	            		/tmp/prometheus/tsdb/tsdbblockutil.go:70
        	            	github.com/prometheus/prometheus/tsdb.createBlock
        	            		/tmp/prometheus/tsdb/block_test.go:423
        	            	github.com/prometheus/prometheus/tsdb.TestCreateBlock
        	            		/tmp/prometheus/tsdb/block_test.go:86
        	            	testing.tRunner
        	            		/usr/local/go/src/testing/testing.go:1194
        	            	runtime.goexit
        	            		/usr/local/go/src/runtime/asm_amd64.s:1371
        	Test:       	TestCreateBlock

I tracked this error down to the function writePostingsToTmpFiles().

On line L806 the file is flushed to the disk. On line L809 a new mmap is opened for the index file.

The issue now is because of an already open mmap to the index file in another place of the program. The mmap of line L809 reuses the already open mmap and does not create a new one. This old mmap does not contain the changes flushed to the file from line L806.

For better understanding I created a simple PoC which needs to be run on OpenBSD. I also tried playing with msync(2) which does not seem to have any effect on syncing FD writes back to the mmaped regoin.

package main

import (
	"fmt"
	"github.com/edsrzf/mmap-go"
	"io/ioutil"
	"os"
	"syscall"
	"unsafe"
)

func getMMapedFile(filename string, filesize int) ([]byte, mmap.MMap, error) {
	file, err := os.Open(filename)
	if err != nil {
		return nil, nil, err
	}

	fileAsBytes, err := mmap.Map(file, mmap.RDONLY, 0)
	if err != nil {
		return nil, nil, err
	}

	return fileAsBytes, fileAsBytes, err
}

const (
	msAsync = 1 << iota
	msSync
	msInvalidate
)

func msync(buf []byte) error {
	_, _, errno := syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&buf[0])), uintptr(len(buf)), msInvalidate)
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	file, err := ioutil.TempFile("", "mmapedFile")
	if err != nil {
		fmt.Println(err)
	}

	filename := file.Name()
	defer os.Remove(filename)
	// first write
	file.Write([]byte("1234567890"))

	buf, mm, err := getMMapedFile(filename, 10)
	if err != nil {
		fmt.Println(err)
	}
	// mmap has file contents
	fmt.Println(string(buf))

	// second write
	file.WriteAt([]byte("AAAA"), 5)
	msync(buf) // no effect
	// mmap has old content
	fmt.Println(string(buf))

	// closing and reopening mmap
	mm.Unmap()
	buf, mm, err = getMMapedFile(filename, 10)
	// mmap has new content
	fmt.Println(string(buf))

	// third write
	file.WriteAt([]byte("CCCC"), 0)

	// mm.Unmap() // keeping mmap open

	// opening another mmap
	buf2, _, _ := getMMapedFile(filename, 10)
	msync(buf2) // no effect
	// mmap has old contents as long as the same file is still opened in another mmap
	fmt.Println(string(buf2))
}

Now I am a bit stuck on how to continue from here, since I don’t know much about the tsdb internals and how/where the mmaps are kept open all over the place. I hope the explanation is well understandable.

Greetings, ston1th

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 17 (5 by maintainers)

Commits related to this issue

Most upvoted comments

Thank you so much for deep diving and trying to fix this. I use Prometheus on OpenBSD extensively 😃

prometheus.ports.tar.gz

This modified ports of OpenBSD apply the PR patch that remove mmap reference, so anyone can test ( so far the compacting occurs , more test will follow )

I dug up the internet a bit.

On internet there’s a test about the way ‘everyone’ use a mix of mmap and read/write and expect that the kernel treat it as the same.

[ censored rant ]

https://lists.samba.org/archive/samba-technical/2001-May/013552.html

thank you Tridge

/*
  trivial test program to see if file IO and mmap are coherent. 
  [tridge at samba.org](http://lists.samba.org/mailman/listinfo/samba-technical), May 2001
*/

#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define SIZE 20000

static void test(char *map, int fd, int line)
{
	unsigned char buf1[SIZE], buf2[SIZE];

	memcpy(buf1, map, SIZE);
	lseek(fd, 0, SEEK_SET);
	read(fd, buf2, SIZE);
	if (memcmp(buf1, buf2, SIZE) != 0) {
		int i;
		for (i=0;i<SIZE;i++) {
			if (buf1[i] != buf2[i]) {
				printf("mismatch at %d (%d %d)\n",
				       i, buf1[i], buf2[i]);
			}
		}
		printf("not equal on line %d!\n", line);
		exit(1);
	}
}

#ifndef MAP_FILE
#define MAP_FILE 0
#endif

int main(void)
{
	int fd;
	char *map;
	char b = 0;

	fd = open("test.dat", O_RDWR|O_CREAT|O_TRUNC, 0600);

	lseek(fd, SIZE-1, SEEK_SET);
	write(fd, &b, 1);

	map = mmap(NULL,SIZE,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_FILE, fd,
0);

	map[3000] = 17;
	fsync(fd);
	test(map, fd, __LINE__);

	lseek(fd, 76, SEEK_SET);
	write(fd, &fd, sizeof(fd));
	
	test(map, fd, __LINE__);
	return 0;
}

Reading https://man.openbsd.org/msync.2 this , imho cannot work - for legit good reason -.

The above program mmap the file then use read/write on the open fd to check in modifying map actually modified the map , against the documentation ( write to map are not guaranteed until msync )

and in https://misc.openbsd.narkive.com/Hp3vXMls/will-mmap-and-the-read-buffer-cache-be-unified-anyone-working-with-it OpenBSD mailing list is pointing to a solution clearly :

Stuart Henderson
6 years ago
[Permalink](https://narkive.com/Hp3vXMls.9)
[Post by Tinker](https://misc.openbsd.narkive.com/Hp3vXMls/will-mmap-and-the-read-buffer-cache-be-unified-anyone-working-with-it#post8)
A separate question about combining mmap access and file
If I have a readonly mmap and do fwrite to it, could I use
fsync (or msync or any other call) right after the fwrite, as a tool to
guarantee that the memory mapping interface is up to date?
It needs more than fsync.

Playing around with the simple test program from
https://lists.samba.org/archive/samba-technical/2001-May/013552.html
(at least on amd64) it appears that msync (with either MS_SYNC,
MS_ASYNC or MS_INVALIDATE) is needed after changes are made on the
mmap side of things, and msync with MS_INVALIDATE is needed after
changes done on the file io side of things.

Thank you Stuart Henderson.

So yeah telling the OS I did stuff in mmap pls put it in the actual file FD and telling the OS, I wrote the FD please update mmap
works perfectly

int main(void)
{
        int fd;
        char *map;
        char b = 0;

        fd = open("test.dat", O_RDWR|O_CREAT|O_TRUNC, 0600);

        lseek(fd, SIZE-1, SEEK_SET);
        write(fd, &b, 1);

        map = mmap(NULL,SIZE,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_FILE, fd,
0);

        map[3000] = 17;
        // nop: fsync(fd);
        msync(&map[3000], 1, MS_SYNC);

        printf("test 1\n");
        test(map, fd, __LINE__);

        lseek(fd, 76, SEEK_SET);
        write(fd, &fd, sizeof(fd));

        msync(&map[3000], 1, MS_INVALIDATE);

        printf("test 2\n");
        test(map, fd, __LINE__);
        return 0;
}

$ ./fun test 1 test 2 (amd64 arch)

I am not sure why it is expected that accessing a similar resource multiple different way is expected to work nor why you would mix up API and I am not fluent in GO.

I put this here so the root problem is somewhat documented. I will now look at the go thingy, unsure why would go mix up read/write that way though.

image

https://mobile.twitter.com/annoyedobrien