go: runtime: possible memory corruption on FreeBSD

Several failures in the last month on FreeBSD builders have failure modes that are very difficult to explain, such as SIGSEGVs in hot runtime paths accessing values that don’t quite make sense (e.g. spanOf sees a bad arena entry, fixalloc accesses an out-of-bounds slot, a broken return PC in a runtime stack frame). I suspect they share an underlying cause. Three issues have already been opened for these: #45977, #46103, #46182.

As far as I know, these all seem to be specific to FreeBSD, and even more specifically, the “race” and “12_2” builders.

The relevant logs are available below.

2021-05-04T20:50:35-d19e549/freebsd-amd64-race 2021-05-10T23:42:56-5c48951/freebsd-amd64-12_2 2021-05-14T16:42:01-3d324f1/freebsd-amd64-race

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 86 (59 by maintainers)

Commits related to this issue

Most upvoted comments

Extracting more bits from that failure log.

There are two tests failures, both using the same global test data table. They both failed on the 100,000 zero-byte entry. At first glance it looks like they failed in different ways, because they report different hash values, but they did not. The first is reporting the hash of all 100,000 bytes, while the second is reporting the hash of only the first half - 50,000 bytes. Both of them make copies in a string -> []byte conversion before running the hash code, but the error prints the original string. And the reported hashes for both the full and half string are correct for the corrupted string.

This suggests the string in the table, built at init time with strings.Repeat(“\x00”, 1e5), is the problem, not a copy.

Also, GODEBUG=gctrace=1 ./adler32.test -test.short literally doesn’t run a garbage collection. That suggests the GC is not involved, and in particular that the memory in question was not being recycled by Go. (Could be non-GC’ed memory freed and then reused for heap-allocated memory, but since it’s huge and allocated at init time, that seems unlikely as well.)

That means in all likelihood Go mmap’ed new pages for this allocation. It crossed my mind maybe the OS is bad at zeroing, but we would have reinitialized the data in strings.Repeat.

strings.Repeat is interesting. For efficient copying, we do it by doubling the input as far as it will go: 1 byte becomes 2, 2 become 4, and so on, up to 32,768 in this case. Then we will double that, to 65,536, and then we append the first 34,464 bytes once more to complete the 100,000. So that’s why the pattern is there three times: it appeared during the doubling of the first 16kB to 32kB, and then it was copied twice more while tripling the 32kB into 100,000 bytes.

Now the question is: why did the 16kB copy introduce spurious bytes? That happens in strings.Builder.WriteString, which does:

b.buf = append(b.buf, s...)

In this case s is b.buf, as it exists before the append. So there’s no overlap, but the destination is adjacent to (just beyond) the source. That append boils down to runtime.memmove, and again the destination should already be zeroed by the OS, so it’s not that the move was short.

runtime.memmove on freebsd/amd64 depends on runtime.useAVXmemmove to decide whether to use AVX. That is set up as:

isIntelBridgeFamily := isIntel &&
	processor == 0x206A0 ||
	processor == 0x206D0 ||
	processor == 0x306A0 ||
	processor == 0x306E0

useAVXmemmove = cpu.X86.HasAVX && !isIntelBridgeFamily

I’m not sure how to find out what processor we have in that trace. Let’s assume for now that useAVXmemmove is true.

The AVX code starts with:

// There are two implementations of move algorithm.
// The first one for non-overlapped memory regions. It uses forward copying.
// The second one for overlapped regions. It uses backward copying
MOVQ	DI, CX
SUBQ	SI, CX
// Now CX contains distance between SRC and DEST
CMPQ	CX, BX
// If the distance lesser than region length it means that regions are overlapped
JC	copy_backward

// Non-temporal copy would be better for big sizes.
CMPQ	BX, $0x100000
JAE	gobble_big_data_fwd

CX and BX will be equal in this case, so as I read that, there should be no carry in JC and we fall through to the forward loop. And it’s not big enough for the big data case. So we end up in this nicely commented algorithm, which handles the end of the buffer specially:

// Memory layout on the source side
// SI                                       CX
// |<---------BX before correction--------->|
// |       |<--BX corrected-->|             |
// |       |                  |<--- AX  --->|
// |<-R11->|                  |<-128 bytes->|
// +----------------------------------------+
// | Head  | Body             | Tail        |
// +-------+------------------+-------------+
// ^       ^                  ^
// |       |                  |
// Save head into Y4          Save tail into X5..X12
//         |
//         SI+R11, where R11 = ((DI & -32) + 32) - DI
// Algorithm:
// 1. Unaligned save of the tail's 128 bytes
// 2. Unaligned save of the head's 32  bytes
// 3. Destination-aligned copying of body (128 bytes per iteration)
// 4. Put head on the new place
// 5. Put the tail on the new place
// It can be important to satisfy processor's pipeline requirements for
// small sizes as the cost of unaligned memory region copying is
// comparable with the cost of main loop. So code is slightly messed there.
// There is more clean implementation of that algorithm for bigger sizes
// where the cost of unaligned part copying is negligible.
// You can see it after gobble_big_data_fwd label.

Here’s the assembly:

	LEAQ	(SI)(BX*1), CX
	MOVQ	DI, R10
	// CX points to the end of buffer so we need go back slightly. We will use negative offsets there.
	MOVOU	-0x80(CX), X5
	MOVOU	-0x70(CX), X6
	MOVQ	$0x80, AX
	// Align destination address
	ANDQ	$-32, DI
	ADDQ	$32, DI
	// Continue tail saving.
	MOVOU	-0x60(CX), X7
	MOVOU	-0x50(CX), X8
	// Make R11 delta between aligned and unaligned destination addresses.
	MOVQ	DI, R11
	SUBQ	R10, R11
	// Continue tail saving.
	MOVOU	-0x40(CX), X9
	MOVOU	-0x30(CX), X10
	// Let's make bytes-to-copy value adjusted as we've prepared unaligned part for copying.
	SUBQ	R11, BX
	// Continue tail saving.
	MOVOU	-0x20(CX), X11
	MOVOU	-0x10(CX), X12
	// The tail will be put on its place after main body copying.
	// It's time for the unaligned heading part.
	VMOVDQU	(SI), Y4
	// Adjust source address to point past head.
	ADDQ	R11, SI
	SUBQ	AX, BX
	// Aligned memory copying there
gobble_128_loop:
	VMOVDQU	(SI), Y0
	VMOVDQU	0x20(SI), Y1
	VMOVDQU	0x40(SI), Y2
	VMOVDQU	0x60(SI), Y3
	ADDQ	AX, SI
	VMOVDQA	Y0, (DI)
	VMOVDQA	Y1, 0x20(DI)
	VMOVDQA	Y2, 0x40(DI)
	VMOVDQA	Y3, 0x60(DI)
	ADDQ	AX, DI
	SUBQ	AX, BX
	JA	gobble_128_loop
	// Now we can store unaligned parts.
	ADDQ	AX, BX
	ADDQ	DI, BX
	VMOVDQU	Y4, (R10)
	VZEROUPPER
	MOVOU	X5, -0x80(BX)
	MOVOU	X6, -0x70(BX)
	MOVOU	X7, -0x60(BX)
	MOVOU	X8, -0x50(BX)
	MOVOU	X9, -0x40(BX)
	MOVOU	X10, -0x30(BX)
	MOVOU	X11, -0x20(BX)
	MOVOU	X12, -0x10(BX)
	RET

So the bytes that ended up all FF are the ones that were stored in X8 through X12.

Sounds like something in FreeBSD is smashing those registers out from under us?

I got through 912 iterations of all.bash on 12_3 and 898 on 12_2. I got two failures that appear to be memory corruption on 12_2 and zero on 12_3. That alone isn’t enough to base particularly strong statistical conclusions on, I’m afraid, so I’m going to take a Bayesian out and say that in combination with the fact that we’re already pretty sure the bug was fixed, we can close this issue as fixed.

Thank you everyone!

The FreeBSD fix has been cherry-picked to stable/13 in https://github.com/freebsd/freebsd-src/commit/1d6ebddb62bc18833b21a15f8e7eb86102140100 and to stable/12 in https://github.com/freebsd/freebsd-src/commit//7e45b72844768d7fd5c3c4d4e29f4e84b4bc0402. We release regular weekly snapshot builds from these branches; I will leave a note here when it is available in an errata update.

I do not think right now, that the context switch is vulnerable.

That said, why do you want an app workaround while it is OS issue, and should be fixed in OS. The fix is already in HEAD and will be merged to stable branches shortly. Most likely, there will be errata notices for supported releases.

It only matters when there are two factors in play

  1. CPU supports so called init optimization for XSAVE, which not all CPUs do. And when they do, ‘according to complex internal microarchitecural conditions’, the optimization might happens or not. Basically, this means that sometimes CPU does not write the portion of the state on XSAVE, and record in xstate_bv that it did not.
  2. On the signal was delivery, OS provides the saved context, interrupted by the signal, to the signal handler. Context includes all CPU state available to userspace, including FPU bits (XSAVE area). Also, on the return from the signal handler, context is restored, which allows the handler to modify execution of the main program flow. In case when init optimization kicked in, OS tries to hide init state optimization from the signal handler, by filling non-saved parts of the XSAVE area.

This is where the problem happen. For states parts 0 (x87) and 1 (SSE/XMM), Intel does not provide enumeration of layout in CPUID, assuming that OS knows about the regions anyway. The bug was that amd64 kernel hardcoded 32bit size of XMM save area, effectively filling %XMM8-%XMM15 with garbage on signal return when init optimization kicked in, because only specified part of the SSE save area was copied from the canonical copy.

So to get this bug you must have CPU that does the init optimization at all (not just claim that it has it), and also you need to get a signal delivered at right time.

I would not expect this to be fixed in a stock FreeBSD 12.3 release image (but should be fixed by an errata update available later today via FreeBSD-update).

From the change referenced above (https://golang.org/cl/375695) it looks like Go has an entry host-freebsd-12_3 which uses image freebsd-amd64-123-stable-20211230 which is a snapshot built from the stable/12 branch in late Dec 2021, it will indeed have the fix.

The distinction is likely immaterial for Go CI, but in case anyone encounters this issue on a production FreeBSD 12.3 deployment they’ll want to ensure that they’ve updated to at least 12.3-RELEASE-p1.

for i in $(seq 8); do while ./adler32.test; do true; done & done 2>&1 | grep -v PASS

but I don’t know whether that will work or not.

This reproduces pretty consistently on a 13.0-RELEASE-p5 running in a bhyve VM with 6 cores, 8G RAM under an AMD Ryzen 7 5800X.

Also look at the uninitialized memory on this one:

0\x00device\thyperv\ndevice\txenpci\ndevice\tnetmap\ndevice\tevdev\ndevice\tuinput\ndevice\thid\n"

--- FAIL: TestGolden (0.00s)
    adler32_test.go:83: simple implementation: checksum("\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00...\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00device\thyperv\ndevice\txenpci\ndevice\tnetmap\ndevice\tevdev\ndevice\tuinput\ndevice\thid\n") = 0x25891cb1 want 0x86af0001

The failure in https://build.golang.org/log/46d12b0622c0a0cc8f7ede6959235b1b06f6ecef shows the actual corrupted memory. In an allocation of 100,000 (0x186a0) zeros, these indexes ended up being 0xFF instead:

7fb0 - 7fff ffb0 - ffff 17fb0 - 17fff

!!!

For information, the freebsd-amd64-12_2 builder was recently added as part of #44431, and in that same change, the freebsd-amd64-race builder was updated to also use FreeBSD 12.2 (up from FreeBSD 12.0 previously).

CC @paulzhol, @dmgk, @cagedmantis.

@tarkhil We believe this is fixed by a FreeBSD update. You want to be using at least version 12.3-RELEASE-p1.

We believe this is fixed with FreeBSD 12.3-RELEASE-p1. I’m doing a stress test run on both the before and after builders to verify by running the following with VER=12_2 and VER=12_3:

gopool -pool-path /var/run/user/$UID/gopool-$VER create -setup 'gomote push $VM' freebsd-amd64-$VER
stress2 -o austin/46272-$VER -p 8 -timeout 20m -max-total-runs 1000 gopool -pool-path /var/run/user/$UID/gopool-$VER run 'gomote run $VM go/src/all.bash'

Edited: Clarified that this is 12.3-RELEASE-p1, not just 12.3.

@kostikbel Thanks, that makes perfect sense.

For Go programs, GODEBUG=asyncpreemptoff=1 will be a partial workaround by reducing the number of signals, but not complete.

For a more complete workaround (I’m not sure we want to bother with this) I think we may be able to look at the signal context and if XSTATE_BV[1] (SSE state present) is zero, then we explicitly zero the XMM region of the XSAVE area in the signal context. I think this is OK modulo extra details I missed, per Intel SDM Vol 1, Ch 13.9 “Operation of XSAVEOPT”: “If RFBM[i] = 1, XSTATE_BV[i] is set to the value of XINUSE[i].”

However, I think this would still leave us vulnerable to corruption if the kernel interrupted execution just for a context switch rather than a signal?

@kostikbel sorry for the delay. With your fix applied directly on the 13.0-RELEASE-p5 kernel, neither the gist nor is the ./adler32.test loop trigger under bhyve. Thanks!

XSAVE != XSAVES or XSAVEC. Yes, FreeBSD uses XSAVE/XSAVEOPT for saving FPU context either on context switches or on signal delivery (it is slighly more nuanced, sometimes kernel needs to save FPU registers for other reasons). For signals, it is then copied out to userspace as part of the ucontext, and then copied in and XRSTOR-ed as part of return from the signal handler.

Bad news are that I cannot reproduce the abort from avx_sig neither on my hosts nor in bhyve VM. BTW I updated the gist, there was a stupid typo with the signal generation, also the program also works under Linux. I tested on fresh HEAD and several days old stable/13, which might be relevant. There are some changes that could affect this problem, done around this moment https://cgit.freebsd.org/src/commit/?id=bd9e0f5df681da8b5ef05a587b4b5b07572d3fc2 But then I tried it on 12.2 with no abort as well.

It’s particularly interesting because the hash is running on 100,000 zero bytes, and the low 16 bits of the hash are 1 + sum of the bytes mod 65521, which of course is just 1. But the failure got 0x4fb1 in the low 16 bits. So clearly the bytes were not zeroed.

If only the print statement had not elided the bytes in the middle of the input, although maybe by then they would have printed as zeros.

We have a potentially interesting test case from #49862 that must be memory corruption on freebsd/amd64 in a very small and deterministic hash function. It’s particularly interesting that the function got the same wrong answer two out of the three times we’ve seen it fail: 2021-11-29T19:45:58-f598e29/freebsd-amd64-12_2 and 2021-11-29T22:02:45-1970e3e/freebsd-amd64-13_0.

Well, here’s a theory. We upgraded our builders to 12.2 and 11.4 at the same time. Perhaps its possible that there was some bug fix that was backported to 11.X but happens to be broken?

@problame Back when this first started happening, what FreeBSD did you upgrade from?

@problame Wow, thank you for that reference. That’s incredibly useful. Given the crashes that I’ve poked in that issue, this definitely seems to be the same thing. It’s good to know that this has been happening for y’all since at least January.

@dmitshur I’m not sure this qualifies as a release blocker anymore. Based on zrepl/zrepl#411 I don’t think it’s a new issue with Go 1.17.

FWIW, I think we have been hitting this bug in zrepl since FreeBSD 12.2 was released: https://github.com/zrepl/zrepl/issues/411

The workaround is to pin the process to a single core using cpuset.

Another friendly ping via a weekly release meeting, since this is currently labeled release-blocker.

Since we’re approaching the RC and the porting policy doesn’t include FreeBSD as a first class port, we may need to move in the direction of resolving the release-blocking issue by documenting the known problem, rather than blocking the release on its resolution.