go: runtime: Darwin slow when using signals + int->float instructions

Split off from #37121

bench1_test.go:

package bench1

import (
	"math"
	"testing"
)

const N = 64

func BenchmarkFast(b *testing.B) {
	var x, y, z [N]float32

	for i := 0; i < b.N; i++ {
		mulFast(&x, &y, &z)
	}
}

func mulFast(x, y, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = x[i] * y[i]
	}
}

func BenchmarkSlow(b *testing.B) {
	var z [N]float32
	var x, y [N]uint32

	for i := 0; i < b.N; i++ {
		mulSlow(&x, &y, &z)
	}
}

// mulSlow is mulFast with its inputs passed as raw bits; math.Float32frombits
// forces the int->float instructions this issue is about.
func mulSlow(x, y *[N]uint32, z *[N]float32) {
	for i := 0; i < N; i++ {
		z[i] = math.Float32frombits(x[i]) * math.Float32frombits(y[i])
	}
}
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x 
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        55.9 ns/op
BenchmarkSlow-16    	10000000	        61.1 ns/op
PASS
% ~/go1.12.9/bin/go test bench1_test.go -test.bench .\* -test.benchtime=10000000x -test.cpuprofile=cpu.prof
goos: darwin
goarch: amd64
BenchmarkFast-16    	10000000	        89.7 ns/op
BenchmarkSlow-16    	10000000	       223 ns/op
PASS

For some strange reason, code that includes int->float instructions runs a lot slower when CPU profiling is enabled.

This bug is reproducible with Go versions back to at least 1.11.
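
The connection to signals: -test.cpuprofile turns on SIGPROF delivery, so the kernel interrupts the program roughly 100 times per second to take a sample, and each interruption runs a signal handler. A minimal C sketch of that mechanism (illustrative only, not the Go runtime's actual profiler code):

#include <signal.h>
#include <string.h>
#include <sys/time.h>

// A real profiler would record the interrupted PC here; for this sketch the
// only point is that SIGPROF arrives periodically and runs a handler.
void onprof(int sig, siginfo_t *info, void *ctx) {
}

void start_cpu_profile(void) {
  struct sigaction act;
  memset(&act, 0, sizeof(act));
  act.sa_sigaction = onprof;
  act.sa_flags = SA_SIGINFO | SA_RESTART;
  sigaction(SIGPROF, &act, NULL);

  // ~100 samples per second, matching Go's default profiling rate.
  struct itimerval it = {{0, 10000}, {0, 10000}};
  setitimer(ITIMER_PROF, &it, NULL);
}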

Most upvoted comments

We have an office full of people playing musical VZEROUPPERs trying to figure out what the problem is. A whole bunch of things don’t help; we think the bug is in Darwin…

I reported a bug to Apple here. We’ll see if we get any traction on that.

Here’s a C/assembly reproducer:

main.c

#include <stdio.h>
#include <stdint.h>
#include <signal.h>

#define N 64
#define SIG SIGFPE

// Does z = y+float64(x) for 64 entries. This is the main benchmark.
void add(uint64_t *x, double *y, double *z);

uint64_t x[N];
double y[N];
double z[N];

// The handler body is empty on purpose; merely having the signal delivered
// through sa_sigaction is enough to trigger the bug.
void handler(int sig, siginfo_t *info, void *arg) {
}

int main(int argc, char *argv[]) {
  // Install handler for a signal.
  // sa_sigaction is required; sa_handler doesn't trigger the bug.
  struct sigaction act;
  sigemptyset(&act.sa_mask);
  act.sa_sigaction = handler;
  act.sa_flags = SA_SIGINFO;
  sigaction(SIG, &act, NULL);
  // Raise the signal, which dirties the upper bits of YMM registers.
  raise(SIG);

  // Run the benchmark.
  for (int i = 0; i < 40000000; i++) {
    add(&x[0], &y[0], &z[0]);
  }
  return 0;
}

add.s:

	.globl	_add
_add:
	// Uncomment the vzeroupper to "fix" the bug.
	//vzeroupper
	movq	$0, %rax
loop:
	cmpq	$64, %rax
	je	done
	movq	(%rdi, %rax, 8), %rcx
	movq	%rcx, %xmm0	// integer->XMM move (legacy SSE); slow while the YMM upper state is dirty
	addsd	(%rsi, %rax, 8), %xmm0
	movsd	%xmm0, (%rdx, %rax, 8)
	incq	%rax
	jmp	loop

done:
	ret
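
One way to build and run the reproducer on macOS (assuming clang; any equivalent toolchain invocation should do):

% clang -c add.s -o add.o
% clang -O2 main.c add.o -o repro
% time ./repro

Toggling the vzeroupper in add.s and re-running is enough to see the difference on an affected machine.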

On our 10.11 builders, my C repro is ~14% slower with the VZEROUPPER commented out. On 10.12, ~13% slower. On 10.14, ~11% faster. On 10.15, ~11% faster.

So for some reason this bug doesn’t show up on the builders the way it does on my desktop. I could imagine that some virtualization layer hides the bug, and that the VZEROUPPER instruction itself has a cost.

Yes, #41152 is all we need now.

OK, sounds fixed. So once we stop supporting macOS releases < 10.15.6, we can get rid of the VZEROUPPER patch. We need a way to ask gopherbot to reopen issues once a given minimum supported OS level is reached.

@jyknight suggests the cause is a missing vzeroupper, perhaps in the Darwin signal handler.
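
To spell that theory out: on AVX-capable hardware, if the kernel's signal-delivery/return path leaves the upper halves of the YMM registers marked dirty and never issues a vzeroupper, then legacy SSE instructions afterwards (like the movq/addsd/movsd in the reproducer) pay a state-transition penalty until something clears that state. A hedged C sketch of the mitigation idea (illustrative inline assembly, not the actual Go runtime patch):

// Zero the upper 128 bits of every YMM register so that subsequent legacy
// SSE instructions stop paying the transition penalty. Assumes AVX hardware.
static inline void clear_ymm_upper(void) {
  __asm__ volatile("vzeroupper");
}

In the reproducer above, calling clear_ymm_upper() right after raise(SIG), or uncommenting the vzeroupper at the top of _add, has the same effect; Go's temporary workaround does the equivalent inside the runtime.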