runtime: Vector128:get_Zero() doesn't inline (or intrinsicify) at crossgen
Nor does Vector128<byte>.Count or Vector128:AsByte(Vector128`1):Vector128`1.
ASCIIUtility.WidenAsciiToUtf16_Sse2 calls Vector128<byte>.Zero,
which ends up reserving stack space, making a call, and reading the result back from the stack just to zero an xmm register:
G_M55642_IG05:
lea rcx, [rsp+20H]
call [Vector128`1:get_Zero():Vector128`1]
movaps xmm0, xmmword ptr [rsp+20H]
movaps xmm1, xmm6
punpcklbw xmm1, xmm0
movdqu xmmword ptr [rsi], xmm1
mov rax, rsi
shr rax, 1
and rax, 7
mov edx, 8
sub rdx, rax
mov rax, rdx
sub rbx, 16
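This is quite inefficient: zeroing an xmm register normally takes a single register-to-register instruction such as xorps xmm0, xmm0, with no call or stack traffic.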
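For comparison, here is a minimal C sketch of the same widening step written with SSE2 intrinsics, where the zero vector is materialized inline rather than through a call. The helper name widen_ascii_low8 is hypothetical and exists only for illustration; this is not the runtime's implementation, just the instruction pattern one would hope to see once Vector128<byte>.Zero is treated as an intrinsic.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (illustration only): widen 8 ASCII bytes at src
   into 8 little-endian UTF-16 code units at dst. */
void widen_ascii_low8(const uint8_t *src, uint16_t *dst)
{
    /* Load 8 bytes into the low half of an xmm register. */
    __m128i ascii = _mm_loadl_epi64((const __m128i *)src);

    /* Zero vector: compilers typically emit a single pxor/xorps here,
       with no call and no stack round trip. */
    __m128i zero = _mm_setzero_si128();

    /* punpcklbw: interleave each ASCII byte with a zero byte, which
       zero-extends every byte to a 16-bit code unit. */
    __m128i utf16 = _mm_unpacklo_epi8(ascii, zero);

    /* movdqu: unaligned 16-byte store of the 8 widened code units. */
    _mm_storeu_si128((__m128i *)dst, utf16);
}
```

The interleave with zero corresponds to the punpcklbw xmm1, xmm0 instruction in the listing above; the only difference is how the zero operand gets into the register.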
/cc @tannergooding @GrabYourPitchforks
category:cq theme:intrinsics skill-level:expert cost:medium
About this issue
- State: closed
- Created 4 years ago
- Comments: 32 (32 by maintainers)
That seems reasonable to do, although we need to take care to put documentation in the code about why this is ok, and notes in the codegen code about this.
Here is what I found so far.
Allowing Vector128.As and Vector128.AsByte through Vector128.AsUInt64 to be unconditionally expanded when compiling S.P.C.dll leads to these results:
If, in addition to the As* methods, the following are also treated as intrinsics (Vector128.Create, Vector128.CreateScalarUnsafe, Vector128.ToScalar):
Doing this for Vector128<T>.Count and Vector256<T>.Count only:
Note that the code size increase in the GetHashCode() methods is due to loop unrolling kicking in when compiling the methods, since the loop boundary becomes constant.
Only Vector128<T>.Zero:
All of these together:
As an example, the following is what such a change would do for the pre-jitted code of System.Text.ASCIIUtility:NarrowUtf16ToAscii_Sse2(long,long,long):long:
Right now all Vector128 and Vector256 methods are treated as regular calls on x86/x64 per this condition:
https://github.com/dotnet/runtime/blob/189e1aa8f91632c196fe0e6cb1410a85bb7d2283/src/coreclr/src/zap/zapinfo.cpp#L2173-L2178
@davidwrighton @jkotas
Do you think that we can relax this condition and allow some of Vector128 methods to be treated as intrinsics when compiling System.Private.CoreLib?
As far as I understand, Sse and Sse2 are required ISAs on x86/x64 platforms, so it appears to be safe to do this, at least, for the following ones:
- Vector128<T>.As*
- Vector128<T>.get_Count
- Vector128<T>.get_Zero