MSYS2-packages: Investigation into the mingw-w64 AVX AND AVX2 misalignment bug.

Hello.

Please upgrade the following setups from msys2 home page:

msys2-i686-20161025.exe
msys2-x86_64-20161025.exe

This will prevent beginners from thinking that the project was left in 2016.

Thank you!

About this issue

  • Original URL
  • State: open
  • Created 6 years ago
  • Comments: 27 (11 by maintainers)

Most upvoted comments

FWIW, all of this is very much unrelated to the original issue, while this thread now is hijacked for a completely different matter.

As mentioned in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54412, clang and MSVC don’t seem to have the same issue. An adjusted version of the example code, avxalign.c:

#include <immintrin.h>

void foo(__m256 x);

void func(void)
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(r);
}

If compiled with GCC:

$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign.c -o -
func:   
        subq    $72, %rsp
        .seh_stackalloc 72
        .seh_endprologue
        vxorps  %xmm0, %xmm0, %xmm0
        leaq    32(%rsp), %rcx
        vmovaps %ymm0, 32(%rsp)
        vzeroupper
        call    foo
        nop
        addq    $72, %rsp
        ret

This doesn’t align the pointer where the argument is stored, and writes into it with vmovaps.

With clang:

$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o -
func:                                   # @func
.seh_proc func
        pushq   %rbp
        .seh_pushreg 5
        subq    $80, %rsp
        .seh_stackalloc 80
        leaq    80(%rsp), %rbp
        .seh_setframe 5, 80
        .seh_endprologue
        andq    $-32, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 32(%rsp)
        leaq    32(%rsp), %rcx
        vzeroupper
        callq   foo
        nop
        movq    %rbp, %rsp
        popq    %rbp
        retq

This overallocates stack space in order to be able to align it, and then writes into it with an aligned write.

If not using SEH, by adding -fdwarf-exceptions, it produces different code that also does the alignment:

$ clang -target x86_64-w64-mingw32 -S -O2 -mavx avxalign.c -o - -fdwarf-exceptions
func:                                   # @func
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        andq    $-32, %rsp
        subq    $96, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 32(%rsp)
        leaq    32(%rsp), %rcx
        vzeroupper
        callq   foo
        movq    %rbp, %rsp
        popq    %rbp
        retq

This is what MSVC produces:

$ cl -c -O2 avxalign.c
$ x86_64-w64-mingw32-objdump -d avxalign.obj
0000000000000000 <func>:
   0:   40 55                   rex push %rbp
   2:   48 83 ec 60             sub    $0x60,%rsp
   6:   48 8d 6c 24 40          lea    0x40(%rsp),%rbp
   b:   48 83 e5 e0             and    $0xffffffffffffffe0,%rbp
   f:   c5 fc 10 05 00 00 00    vmovups 0x0(%rip),%ymm0        # 17 <func+0x17>
  16:   00 
  17:   c5 fc 11 45 00          vmovups %ymm0,0x0(%rbp)
  1c:   48 8d 4d 00             lea    0x0(%rbp),%rcx
  20:   c5 f8 77                vzeroupper 
  23:   e8 00 00 00 00          callq  28 <func+0x28>
  28:   48 83 c4 60             add    $0x60,%rsp
  2c:   5d                      pop    %rbp
  2d:   c3                      retq   

This both aligns the pointer, and uses unaligned stores to write it onto the stack

However, this only seems to be an issue when passing such variables by value. Local variables seem to be properly aligned even with GCC:

$ cat avxalign2.c 
#include <immintrin.h>

void foo(__m256 *x);

void func(void)
{
    __m256 r = _mm256_set1_ps(0.0f);
    foo(&r);
}
$ x86_64-w64-mingw32-gcc -S -O2 -mavx avxalign2.c -o -
func:
        pushq   %rbp
        .seh_pushreg    %rbp
        movq    %rsp, %rbp
        .seh_setframe   %rbp, 0
        subq    $32, %rsp
        .seh_stackalloc 32
        .seh_endprologue
        vxorps  %xmm0, %xmm0, %xmm0
        subq    $64, %rsp
        leaq    63(%rsp), %rcx
        andq    $-32, %rcx
        vmovaps %ymm0, (%rcx)
        vzeroupper
        call    foo
        nop
        movq    %rbp, %rsp
        popq    %rbp
        ret

Here the local variable is properly aligned.