runtime: BroadcastScalarToVector256(byte*) broadcasts an incorrect value?

From https://twitter.com/HaroldAptroot/status/1099389327245828096

public static void Main(string[] args)
{
    Console.WriteLine(test(128));
}

static unsafe Vector256<byte> test(byte v)
{
    Vector256<byte> x = Avx2.BroadcastScalarToVector256(&v);
    return x;
}

Output in Debug:

<224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224, 224>

Output in Release:

<32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32>

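The test method produces the following asm. Note that vpbroadcastb is printed with the operand yrax, which suggests the pointer in rax is being treated as a register operand rather than as a memory operand: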

; Lcl frame size = 0

G_M25726_IG01:
       vzeroupper 
       nop      
       mov      dword ptr [rsp+10H], edx

G_M25726_IG02:
       lea      rax, bword ptr [rsp+10H]
       vpbroadcastb ymm0, yrax
       vmovupd  ymmword ptr[rcx], ymm0
       mov      rax, rcx

G_M25726_IG03:
       vzeroupper 
       ret      

; Total bytes of code 30, prolog size 5 for method Program:test(ubyte):struct

It inlines into the Main method, producing the following asm:

G_M42296_IG01:
       sub      rsp, 88
       vzeroupper 
       vmovaps  qword ptr [rsp+40H], xmm6
       vmovaps  qword ptr [rsp+30H], xmm7

G_M42296_IG02:
       mov      dword ptr [rsp+2CH], 128
       lea      rcx, bword ptr [rsp+2CH]
       vpbroadcastb ymm6, yrcx
       mov      rcx, 0xD1FFAB1E
       vextractf128 ymm7, ymm6, 1
       call     CORINFO_HELP_NEWSFAST
       vinsertf128 ymm6, ymm7, 1
       vmovupd  ymmword ptr[rax+8], ymm6
       mov      rcx, rax
       call     Console:WriteLine(ref)
       nop      

G_M42296_IG03:
       vmovaps  xmm6, qword ptr [rsp+40H]
       vmovaps  xmm7, qword ptr [rsp+30H]
       vzeroupper 
       add      rsp, 88
       ret      

; Total bytes of code 98, prolog size 19 for method Program:Main()

Contrast with

static unsafe Vector256<byte> test(byte v)
{
    Vector256<byte> x = Vector256.Create(v);
    return x;
}

Which produces

G_M25727_IG02:
       movzx    rax, dl
       vmovd    xmm0, eax
       vpbroadcastb ymm0, ymm0
       vmovupd  ymmword ptr[rcx], ymm0
       mov      rax, rcx

And correctly outputs

<128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128, 128>
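
To make the mismatch easy to check on a given runtime, the two approaches above can be compared directly. The following is a minimal sketch, not part of the original report: the class and method names (BroadcastCheck, PointerBroadcastMatches) are made up for illustration, and it assumes the project is compiled with unsafe code enabled. It only uses Avx2.BroadcastScalarToVector256(byte*) and Vector256.Create(byte), the same two calls shown above.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper, for illustration only.
static class BroadcastCheck
{
    // Returns true when the pointer overload produces the same vector
    // as Vector256.Create for the given byte.
    public static unsafe bool PointerBroadcastMatches(byte v)
    {
        if (!Avx2.IsSupported)
        {
            // Without AVX2 there is nothing to compare.
            return true;
        }

        Vector256<byte> expected = Vector256.Create(v);
        Vector256<byte> actual = Avx2.BroadcastScalarToVector256(&v);
        return expected.Equals(actual);
    }

    public static void Main()
    {
        // Same input value as the repro above.
        Console.WriteLine(PointerBroadcastMatches(128));
    }
}
```

On a runtime affected by this issue the comparison should fail for the value 128 used in the repro; on a runtime with the fix it should succeed.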

/cc @tannergooding @fiigii

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 18 (18 by maintainers)

Most upvoted comments

Not exactly an elegant implementation but it gets the job done as a prototype: https://github.com/mikedn/coreclr/commit/a00b42e9442ceeeacda7981bbb462bb8ec7a8a21

        Avx2.Store(p, Avx2.ExtractVector128(Avx2.Add(x, y), 0));

generates

       C5FD1001             vmovupd  ymm0, ymmword ptr[rcx]
       C5FC5802             vaddps   ymm0, ymm0, ymmword ptr[rdx]
       C4C37D190002         vextractf128 xmmword ptr [r8], ymm0, 0

Looking at this (and I’ve likely just forgotten), why do we need both an InsertVector128(V256<T>, V128<T>, byte) and an InsertVector128(V256<T>, T*, byte) overload?

Uh oh, didn’t notice that. I think those should be removed.

I’m also not sure why Gather is marked as MaybeMemoryLoad, since it is always a memory load.

Yeah, but it’s also HW_Category_IMM so it can’t be HW_Category_MemoryLoad as well. That’s fishy but I don’t know if there’s a better way. It seems that HW_Category_MemoryLoad should perhaps be a flag and not a category.
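
The "flag and not a category" idea can be sketched roughly as follows. This is only an illustration, written in C# to match the rest of this issue. The JIT's real intrinsic table is C++, and every type and member name here (HWCategory, HWFlags, IntrinsicInfo, FlagSketch) is a hypothetical stand-in rather than the actual definition.

```csharp
using System;

// Hypothetical stand-ins for the JIT's HW_Category_* values.
enum HWCategory { SimpleSIMD, IMM, Special }

// Hypothetical flags: "is a memory load" becomes orthogonal to the category.
[Flags]
enum HWFlags
{
    None = 0,
    MemoryLoad = 1 << 0,      // the intrinsic always reads from memory
    MaybeMemoryLoad = 1 << 1, // only some overloads take a pointer
}

readonly struct IntrinsicInfo
{
    public readonly string Name;
    public readonly HWCategory Category;
    public readonly HWFlags Flags;

    public IntrinsicInfo(string name, HWCategory category, HWFlags flags)
    {
        Name = name;
        Category = category;
        Flags = flags;
    }

    public bool IsMemoryLoad => (Flags & HWFlags.MemoryLoad) != 0;
}

static class FlagSketch
{
    // With a flag, Gather can stay in the IMM category and still be
    // marked as an unconditional memory load.
    public static readonly IntrinsicInfo Gather =
        new IntrinsicInfo("GatherVector256", HWCategory.IMM, HWFlags.MemoryLoad);

    public static void Main()
    {
        Console.WriteLine($"{Gather.Name}: {Gather.Category}, load={Gather.IsMemoryLoad}");
    }
}
```

Because the category and the load property are stored separately, an intrinsic that takes an immediate no longer has to choose between the two, which is the conflict described in the comment above.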