runtime: String.Replace(char, char) now slower in some cases

Moving discussion from the PR https://github.com/dotnet/runtime/pull/67049

@gfoidl, at least on my machine, comparing string.Replace in .NET 6 vs .NET 7, multiple examples I’ve tried have shown .NET 7 to have regressed, e.g.

const string Input = """
    Whose woods these are I think I know.
    His house is in the village though;
    He will not see me stopping here
    To watch his woods fill up with snow.
    My little horse must think it queer
    To stop without a farmhouse near
    Between the woods and frozen lake
    The darkest evening of the year.
    He gives his harness bells a shake
    To ask if there is some mistake.
    The only other sound’s the sweep
    Of easy wind and downy flake.
    The woods are lovely, dark and deep,
    But I have promises to keep,
    And miles to go before I sleep,
    And miles to go before I sleep.
    """;

[Benchmark]
public string Replace() => Input.Replace('I', 'U');

Method Runtime Mean Ratio
Replace .NET 6.0 108.1 ns 1.00
Replace .NET 7.0 136.0 ns 1.26

Do you see otherwise?

@gfoidl gfoidl commented yesterday

Hm, that is not expected…

When I duplicate the string.Replace(char, char) method in order to compare the old and the new implementation, both on .NET 7, I see

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1889 (21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT

Method Mean Error StdDev Median Ratio RatioSD
Default 142.0 ns 3.48 ns 9.98 ns 138.6 ns 1.00 0.00
PR 132.9 ns 2.68 ns 3.40 ns 132.8 ns 0.92 0.07

So this is the result I’d expect: after the vectorized loop 6 chars remain, which the old code processes in the scalar for-loop, whilst the new code handles them with one more vectorized pass.
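The tail handling described here (a scalar loop for the leftover chars versus one extra, overlapping vector pass) can be illustrated with a small sketch. This is not the runtime's implementation: `ReplaceSketch` and `ReplaceOverlappingTail` are hypothetical names, and the code uses the portable `System.Numerics.Vector<T>` API on a temporary array rather than writing into a freshly allocated string.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class ReplaceSketch
{
    // Hypothetical helper for illustration only; not the code in System.String.
    public static string ReplaceOverlappingTail(string input, char oldChar, char newChar)
    {
        ReadOnlySpan<ushort> src = MemoryMarshal.Cast<char, ushort>(input.AsSpan());
        ushort[] dst = new ushort[src.Length];
        int count = Vector<ushort>.Count;

        if (!Vector.IsHardwareAccelerated || src.Length < count)
        {
            // Input shorter than one vector: plain scalar loop.
            for (int i = 0; i < src.Length; i++)
                dst[i] = src[i] == oldChar ? newChar : src[i];
        }
        else
        {
            var oldChars = new Vector<ushort>(oldChar);
            var newChars = new Vector<ushort>(newChar);
            int lastStart = src.Length - count;

            // Full vectors, stopping before the final (possibly partial) chunk.
            for (int i = 0; i < lastStart; i += count)
            {
                var original = new Vector<ushort>(src.Slice(i));
                Vector<ushort> equals = Vector.Equals(original, oldChars);
                Vector.ConditionalSelect(equals, newChars, original).CopyTo(dst, i);
            }

            // One last vector aligned to the end of the input. It may overlap
            // elements already written, which is harmless because the result
            // for each element depends only on the source.
            var tail = new Vector<ushort>(src.Slice(lastStart));
            Vector<ushort> tailEquals = Vector.Equals(tail, oldChars);
            Vector.ConditionalSelect(tailEquals, newChars, tail).CopyTo(dst, lastStart);
        }

        return new string(MemoryMarshal.Cast<ushort, char>(dst.AsSpan()));
    }
}
```

The overlapping final pass trades a short scalar loop for a single unaligned vector load and store, which is the effect attributed to the PR in the numbers above.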

I checked the dasm (via DisassemblyDiagnoser of BDN) and that looks OK.

Can this be something from different machine-code layout (loops), PGO, etc. that causes the difference between .NET 6 and .NET 7? How can I investigate this further – need some guidance on how to check code-layout please.

@stephentoub stephentoub commented yesterday

Thanks, @gfoidl. Do you see a similar 6 vs 7 difference as I do? (It might not be specific to this PR.) @EgorBo, can you advise?

@tannergooding tannergooding commented yesterday

"When i duplicate the string.Replace(char, char)-method in order to compare the old and the new implementation both on .NET 7 then I see"

This could be related to stale PGO data

@danmoseley danmoseley commented yesterday

Is there POGO data en-route that has trained with this change in place? I am not sure how to follow it.

@danmoseley danmoseley commented yesterday

Also, it wouldn’t matter here, but are we consuming POGO data trained on main bits in the release branches?

@stephentoub stephentoub commented yesterday

I don’t think this particular case is related to stale PGO data. I set COMPlus_JitDisablePGO=1, and I still see an ~20% regression from .NET 6 to .NET 7.

@danmoseley danmoseley commented 21 hours ago

I ran the example above with

    var config = DefaultConfig.Instance
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "1"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core31).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"))
        .AddJob(Job.Default.WithRuntime(CoreRuntime.CreateForNewVersion("net7.0", ".NET 7.0")).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0").AsBaseline())
        .AddJob(Job.Default.WithRuntime(ClrRuntime.Net48).WithEnvironmentVariable("COMPlus_JitDisablePGO", "0"));
    BenchmarkRunner.Run(typeof(Program).Assembly, args: args, config: config);

and got

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.856/21H2)
Intel Core i7-10510U CPU 1.80GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-rc.2.22426.5
  [Host] : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-DGTURM : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-PYGDYG : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-ZEPFOF : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-PSEWWK : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256
  Job-WGVIGL : .NET 6.0.8 (6.0.822.36306), X64 RyuJIT AVX2
  Job-HBSVYM : .NET 7.0.0 (7.0.22.42212), X64 RyuJIT AVX2
  Job-VWWZUC : .NET Core 3.1.28 (CoreCLR 4.700.22.36202, CoreFX 4.700.22.36301), X64 RyuJIT AVX2
  Job-LDCOEC : .NET Framework 4.8 (4.8.4510.0), X64 RyuJIT VectorSize=256

Method EnvironmentVariables Runtime Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Replace COMPlus_JitDisablePGO=0 .NET 6.0 130.5 ns 6.76 ns 18.51 ns 124.0 ns 0.92 0.17 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET 7.0 144.0 ns 2.95 ns 5.54 ns 142.5 ns 1.00 0.00 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET Core 3.1 822.1 ns 16.09 ns 23.07 ns 814.0 ns 5.69 0.31 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=0 .NET Framework 4.8 750.2 ns 28.86 ns 82.82 ns 730.3 ns 4.97 0.49 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 6.0 127.1 ns 2.64 ns 4.75 ns 126.4 ns 0.88 0.05 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 7.0 144.5 ns 2.96 ns 5.97 ns 144.1 ns 1.01 0.06 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET Core 3.1 936.2 ns 17.96 ns 22.06 ns 931.9 ns 6.50 0.37 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET Framework 4.8 673.2 ns 12.41 ns 23.91 ns 670.5 ns 4.68 0.23 0.3262 1.34 KB 1.00
code https://gist.github.com/danmoseley/c31bc023d6ec671efebff7352e3b3251

(should we be surprised that disabling PGO didn’t make any difference? perhaps it doesn’t exercise this method? cc @AndyAyersMS )

@danmoseley danmoseley commented 21 hours ago

and just for interest

Method EnvironmentVariables Runtime Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
Replace COMPlus_JitDisablePGO=1 .NET 6.0 127.8 ns 2.55 ns 5.91 ns 125.8 ns 0.95 0.05 0.3266 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1 .NET 7.0 141.0 ns 2.73 ns 2.42 ns 141.1 ns 1.00 0.00 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 .NET 6.0 163.9 ns 3.35 ns 4.81 ns 163.8 ns 1.15 0.05 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX2=0 .NET 7.0 184.9 ns 3.59 ns 4.79 ns 183.7 ns 1.32 0.05 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 .NET 6.0 176.1 ns 3.44 ns 4.09 ns 175.9 ns 1.25 0.03 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableAVX=0 .NET 7.0 192.1 ns 3.81 ns 4.53 ns 190.1 ns 1.37 0.05 0.3271 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 .NET 6.0 1,057.4 ns 20.95 ns 40.86 ns 1,047.2 ns 7.65 0.35 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableHWIntrinsic=0 .NET 7.0 947.1 ns 13.34 ns 11.83 ns 948.3 ns 6.72 0.15 0.3262 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 .NET 6.0 496.0 ns 51.61 ns 152.17 ns 463.3 ns 3.67 1.67 0.3269 1.34 KB 1.00
Replace COMPlus_JitDisablePGO=1,COMPlus_EnableSSE3=0 .NET 7.0 395.3 ns 14.32 ns 41.10 ns 388.4 ns 2.95 0.27 0.3271 1.34 KB 1.00

@gfoidl gfoidl commented 9 hours ago

"Do you see a similar 6 vs 7 difference as I do?"

Yes (sorry for slow response, was Sunday…). @danmoseley thanks for your numbers.

This is the machine code I get (from BDN) when running @danmoseley’s benchmark (.NET 7 only). I left some comments in it.

; Program.Replace()
       mov       rcx,1C003C090A0
       mov       rcx,[rcx]
       mov       edx,49
       mov       r8d,55
       jmp       qword ptr [7FFEFA7430C0]
; Total bytes of code 30

; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       mov       edi,edx
       mov       ebx,r8d
       movzx     ecx,di
       movzx     r8d,bx
       cmp       ecx,r8d
       je        near ptr M01_L09
       lea       rcx,[rsi+0C]
       mov       r8d,[rsi+8]
       movsx     rdx,di
       call      qword ptr [7FFEFA7433C0]
       mov       ebp,eax
       test      ebp,ebp
       jge       short M01_L00
       mov       rax,rsi                ; uncommon case, could jump to M01_L09 instead
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       mov       ecx,[rsi+8]
       sub       ecx,ebp
       mov       r14d,ecx
       mov       ecx,[rsi+8]
       call      System.String.FastAllocateString(Int32)
       mov       r15,rax
       test      ebp,ebp
       jg        near ptr M01_L10       ; should be common path, I don't expect to jump to the end, then back to here
M01_L01:
       mov       eax,ebp
       lea       rax,[rsi+rax*2+0C]
       cmp       [r15],r15b
       mov       edx,ebp
       lea       rdx,[r15+rdx*2+0C]
       xor       ecx,ecx
       cmp       dword ptr [rsi+8],10
       jl        near ptr M01_L07
       movzx     r8d,di
       imul      r8d,10001              ; this is tracked in https://github.com/dotnet/runtime/issues/67038, .NET 6 has the same issue, so no difference expected
       vmovd     xmm0,r8d
       vpbroadcastd ymm0,xmm0           ; should be vpbroadcastb, see comment above
       movzx     r8d,bx
       imul      r8d,10001
       vmovd     xmm1,r8d
       vpbroadcastd ymm1,xmm1           ; vpbroadcastb (see above)
       cmp       r14,10
       jbe       short M01_L03
       add       r14,0FFFFFFFFFFFFFFF0
M01_L02:
       lea       r8,[rax+rcx*2]
       vmovupd   ymm2,[r8]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm3,ymm1         ; the vpand, vpandn, vpor series should be vpblendvb, https://github.com/dotnet/runtime/issues/67039 tracked this
       vpandn    ymm2,ymm3,ymm2         ; the "duplicated code for string.Replace" method emits vpblendvb as expected, but
       vpor      ymm2,ymm4,ymm2         ; if string.Replace from .NET 7.0.0 (7.0.22.42212) (.NET SDK=7.0.100-rc.2.22426.5) is used, then it's this series
       lea       r8,[rdx+rcx*2]
       vmovupd   [r8],ymm2
       add       rcx,10
       cmp       rcx,r14
       jb        short M01_L02
M01_L03:
       mov       ecx,[rsi+8]
       add       ecx,0FFFFFFF0
       add       rsi,0C
       lea       rsi,[rsi+rcx*2]
       vmovupd   ymm2,[rsi]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm0,ymm3,ymm1
       vpandn    ymm1,ymm3,ymm2
       vpor      ymm2,ymm0,ymm1
       lea       rax,[r15+0C]
       lea       rax,[rax+rcx*2]
       vmovupd   [rax],ymm2
       jmp       short M01_L08
M01_L04:
       movzx     r8d,word ptr [rax+rcx*2]
       lea       r9,[rdx+rcx*2]
       movzx     r10d,di
       cmp       r8d,r10d
       je        short M01_L05          ; not relevant for .NET 6 -> .NET 7 comparison in this test-case, but
       jmp       short M01_L06          ; one jump could be avoided?!
M01_L05:
       movzx     r8d,bx
M01_L06:
       mov       [r9],r8w
       inc       rcx
M01_L07:
       cmp       rcx,r14
       jb        short M01_L04
M01_L08:
       mov       rax,r15
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L09:                                ; expect the mov rax,{r15,rsi} the epilogs are the same, can they be collapsed to
       mov       rax,rsi                ; get less machine code?
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L10:                                ; this block should be common enough, so should be on the jump-root (see comment above)
       cmp       [r15],r15b             ; it's the Memmove-call
       lea       rcx,[r15+0C]
       lea       rdx,[rsi+0C]
       mov       r8d,ebp
       add       r8,r8
       call      qword ptr [7FFEFA7399F0]
       jmp       near ptr M01_L01
; Total bytes of code 383
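As a side note on the `imul r8d,10001` / `vpbroadcastd` pair commented in the listing: the disassembler prints the constant in hex, so the multiply is by 0x10001, which copies a 16-bit char into both halves of a 32-bit value before that value is broadcast across the vector. A minimal sketch of the arithmetic (`BroadcastSketch` and `PackTwice` are hypothetical names used only for illustration):

```csharp
using System;

static class BroadcastSketch
{
    // c * 0x10001 == c + (c << 16): the same 16-bit value lands in the low
    // and the high half of the 32-bit result. The product never exceeds
    // 0xFFFFFFFF for a 16-bit input, so it cannot overflow a uint.
    public static uint PackTwice(char c) => (uint)c * 0x10001u;
}
```

For the benchmark's `'I'` (0x49), `PackTwice('I')` is 0x00490049, matching the `mov edx,49` argument seen in `Program.Replace()`.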

So in terms of code layout, one major difference from .NET 6 is that the call to System.Buffer.Memmove is moved out of the hot path. But I doubt that this alone is the cause of the regression.

I also wonder why vpblendvb is gone when the benchmark uses string.Replace from the .NET bits. If I benchmark a duplicated copy of the string.Replace code instead, then it’s emitted, which is what I expect as https://github.com/dotnet/runtime/commit/10d8a36ab669ac95f554e5efcc3c8780b5c50f11 got merged on 2022-05-25. But that shouldn’t cause the regression either, as for .NET 6 the same series of vector instructions is emitted.
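For readers unfamiliar with the two instruction sequences: the vpand / vpandn / vpor series computes a bitwise select, which is the same result a single blend instruction would produce when the mask is all-ones or all-zeros per element. A scalar sketch of that select for one 16-bit lane follows (`BlendSketch` and `Select` are hypothetical names, not runtime APIs):

```csharp
using System;

static class BlendSketch
{
    // mask is 0xFFFF where the lane compared equal to oldChar, otherwise 0.
    // (mask & ifTrue)   corresponds to vpand
    // (~mask & ifFalse) corresponds to vpandn
    // the final |       corresponds to vpor
    public static ushort Select(ushort mask, ushort ifTrue, ushort ifFalse)
        => (ushort)((mask & ifTrue) | (~mask & ifFalse));
}
```

With a matching lane, `Select(0xFFFF, 'U', 'I')` yields `'U'`; with a non-matching lane, `Select(0, 'U', 'x')` yields `'x'`. Both instruction sequences therefore produce identical output, so the difference between them is purely a matter of instruction count and latency.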

The beginning of the method, right after the prolog, looks different between .NET 6 and .NET 7, although this PR didn’t change anything here. I don’t expect that this causes the regression, as with the given input the vectorized loop with 33 iterations should be dominant enough (just my feeling, maybe wrong).

So much for the “static analysis”, but I doubt this is enough. With Intel VTune I see some results, but my interpretation leads to the same conclusions as stated in this comment. I hope some JIT experts can shed some light on this (and give some advice on how to investigate, as I’m eager to learn).

Machine code for .NET 6 (for reference)
; System.String.Replace(Char, Char)
       push      r15
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       vzeroupper
       mov       rsi,rcx
       movzx     edi,dx
       movzx     ebx,r8w
       cmp       edi,ebx
       jne       short M01_L00
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L00:
       lea       rbp,[rsi+0C]
       mov       rcx,rbp
       mov       r14d,[rsi+8]
       mov       r8d,r14d
       mov       edx,edi
       call      System.SpanHelpers.IndexOf(Char ByRef, Char, Int32)
       mov       r15d,eax
       test      r15d,r15d
       jge       short M01_L01
       mov       rax,rsi
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
M01_L01:
       mov       esi,r14d
       sub       esi,r15d
       mov       ecx,r14d
       call      System.String.FastAllocateString(Int32)
       mov       r14,rax
       test      r15d,r15d
       jle       short M01_L02
       cmp       [r14],r14d
       lea       rcx,[r14+0C]
       mov       rdx,rbp
       mov       r8d,r15d
       add       r8,r8
       call      System.Buffer.Memmove(Byte ByRef, Byte ByRef, UIntPtr)
M01_L02:
       movsxd    rax,r15d
       add       rax,rax
       add       rbp,rax
       cmp       [r14],r14d
       lea       rdx,[r14+0C]
       add       rdx,rax
       cmp       esi,10
       jl        short M01_L04
       imul      eax,edi,10001
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       imul      eax,ebx,10001
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
M01_L03:
       vmovupd   ymm2,[rbp]
       vpcmpeqw  ymm3,ymm2,ymm0
       vpand     ymm4,ymm1,ymm3
       vpandn    ymm2,ymm3,ymm2
       vpor      ymm2,ymm4,ymm2
       vmovupd   [rdx],ymm2
       add       rbp,20
       add       rdx,20
       add       esi,0FFFFFFF0
       cmp       esi,10
       jge       short M01_L03
M01_L04:
       test      esi,esi
       jle       short M01_L08
       nop       word ptr [rax+rax]
M01_L05:
       movzx     eax,word ptr [rbp]
       mov       rcx,rdx
       cmp       eax,edi
       je        short M01_L06
       jmp       short M01_L07
M01_L06:
       mov       eax,ebx
M01_L07:
       mov       [rcx],ax
       add       rbp,2
       add       rdx,2
       dec       esi
       test      esi,esi
       jg        short M01_L05
M01_L08:
       mov       rax,r14
       vzeroupper
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       pop       r15
       ret
; Total bytes of code 307

@AndyAyersMS AndyAyersMS commented 2 hours ago

"(should we be surprised that disabling PGO didn’t make any difference? perhaps it doesn’t exercise this method? cc @AndyAyersMS )"

Hard to say without looking deeper – from the .NET 7 code above I would guess PGO is driving the code layout changes.

For .NET 7 you can use DOTNET_JitDisasm in BDN to obtain the JIT disasm, which will tell you if there was PGO data found (at least for the root method).

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 47 (47 by maintainers)

Most upvoted comments

This should be investigated to understand what exactly is causing the regression before we move the milestone to 8.0. Last time I was going to investigate these regressions with Moko, we discovered that BDN was doing things very differently between two runs, so he’s been focusing on making the runs repeatable. I’m going to take another look with him.

What change? The most recent benchmark examples aren’t using Replace.

Ah, I didn’t see that. Scratch that option then. So we should either confirm it’s not something real code would see, or find a fix.

Seeing possibly similar issues over in #64626 (ignore the PGO aspect; we’re regressed even w/o PGO).

Does disabling GC regions help?

Latest RC2:

Method Mean
WithContent1 121.54 ns
WithContent2 121.04 ns
WithoutContent1 95.10 ns
WithoutContent2 93.52 ns

Latest RC2 w/ DOTNET_GCName="clrgc.dll"

Method Mean
WithContent1 107.23 ns
WithContent2 108.93 ns
WithoutContent1 87.14 ns
WithoutContent2 84.36 ns

@mangod9, is this expected?

I spent a bit more time running various tests. I suspect this is actually not related to the Replace PR and instead related more to something allocation-related, like regions in .NET 7. I see comparable regressions with these:

const string Input = """
    Whose woods these are I think I know.
    His house is in the village though;
    He will not see me stopping here
    To watch his woods fill up with snow.
    My little horse must think it queer
    To stop without a farmhouse near
    Between the woods and frozen lake
    The darkest evening of the year.
    He gives his harness bells a shake
    To ask if there is some mistake.
    The only other sound’s the sweep
    Of easy wind and downy flake.
    The woods are lovely, dark and deep,
    But I have promises to keep,
    And miles to go before I sleep,
    And miles to go before I sleep.
    """;
private char[] _chars = Input.ToCharArray();

[Benchmark]
public string WithContent1() => new string(_chars);

[Benchmark]
public string WithContent2() => string.Create(Input.Length, Input, (dest, state) => state.AsSpan().CopyTo(dest));

[Benchmark]
public string WithoutContent1() => string.Create(Input.Length, Input, (dest, state) => { });

[Benchmark]
public string WithoutContent2() => new string('\0', Input.Length);

Method Runtime Mean Ratio
WithContent1 .NET 6.0 105.97 ns 1.00
WithContent1 .NET 7.0 120.65 ns 1.15
WithContent2 .NET 6.0 104.12 ns 1.00
WithContent2 .NET 7.0 122.60 ns 1.18
WithoutContent1 .NET 6.0 79.04 ns 1.00
WithoutContent1 .NET 7.0 103.15 ns 1.30
WithoutContent2 .NET 6.0 76.69 ns 1.00
WithoutContent2 .NET 7.0 100.13 ns 1.31

@AndyAyersMS @kunalspathak is there some way we can help determine whether it is an alignment issue?

In both .NET 6 and .NET 7, the vectorized loop is not aligned, and I don’t see any JCC erratum getting in the way in those loops.

.NET 6 code
; Assembly listing for method System.String:Replace(ushort,ushort):System.String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 this         [V00,T08] (  6,  4   )     ref  ->  rsi         this class-hnd single-def
;  V01 arg1         [V01,T09] (  3,  3   )  ushort  ->  rdx         single-def
;  V02 arg2         [V02,T10] (  3,  3   )  ushort  ->   r8         single-def
;  V03 loc0         [V03,T14] (  4,  2   )     int  ->  r15        
;  V04 loc1         [V04,T00] (  9, 25.50)     int  ->  rsi        
;  V05 loc2         [V05,T12] (  6,  3   )     ref  ->  r14         class-hnd single-def
;  V06 loc3         [V06,T15] (  4,  2   )     int  ->  r15        
;  V07 loc4         [V07,T01] (  7, 24.50)   byref  ->  rbp        
;  V08 loc5         [V08,T02] (  7, 24.50)   byref  ->  rdx        
;  V09 loc6         [V09,T26] (  2,  4.50)  simd32  ->  mm0         ld-addr-op
;  V10 loc7         [V10,T27] (  2,  4.50)  simd32  ->  mm1         ld-addr-op
;  V11 loc8         [V11,T23] (  3, 12   )  simd32  ->  mm2        
;  V12 loc9         [V12,T24] (  3, 12   )  simd32  ->  mm3        
;  V13 loc10        [V13,T25] (  2,  8   )  simd32  ->  mm2        
;  V14 loc11        [V14,T03] (  3, 10   )  ushort  ->  rax        
;  V15 OutArgs      [V15    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V16 tmp1         [V16,T04] (  3,  8   )   byref  ->  rcx        
;  V17 tmp2         [V17,T05] (  3,  8   )   byref  ->  rcx        
;  V18 tmp3         [V18,T06] (  3,  8   )     int  ->  rax        
;  V19 tmp4         [V19,T17] (  2,  2   )   byref  ->  rcx         single-def "Inlining Arg"
;  V20 tmp5         [V20,T18] (  2,  2   )   byref  ->  rdx         single-def "Inlining Arg"
;* V21 tmp6         [V21    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;  V22 tmp7         [V22,T21] (  2,  2   )    long  ->   r8         "Inlining Arg"
;* V23 tmp8         [V23    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;* V24 tmp9         [V24    ] (  0,  0   )   byref  ->  zero-ref    single-def "impAppendStmt"
;* V25 tmp10        [V25    ] (  0,  0   )   byref  ->  zero-ref    single-def "impAppendStmt"
;* V26 tmp11        [V26    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V27 tmp12        [V27,T19] (  2,  2   )   byref  ->  rbp         single-def "Inlining Arg"
;  V28 tmp13        [V28,T20] (  2,  2   )   byref  ->  rdx         single-def "Inlining Arg"
;* V29 tmp14        [V29    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg"
;  V30 cse0         [V30,T22] (  3,  1.50)    long  ->  rax         "CSE - moderate"
;  V31 cse1         [V31,T07] (  5,  7   )     int  ->  rdi         "CSE - aggressive"
;  V32 cse2         [V32,T11] (  4,  4.50)     int  ->  rbx         "CSE - moderate"
;  V33 cse3         [V33,T16] (  4,  2   )     int  ->  r14         "CSE - moderate"
;  V34 cse4         [V34,T13] (  4,  2   )   byref  ->  rbp         "CSE - moderate"
;
; Lcl frame size = 40

G_M34983_IG01:
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
       vzeroupper 
       mov      rsi, rcx
						;; bbWeight=1    PerfScore 7.50
G_M34983_IG02:
       movzx    rdi, dx
       movzx    rbx, r8w
       cmp      edi, ebx
       jne      SHORT G_M34983_IG05
						;; bbWeight=1    PerfScore 1.75
G_M34983_IG03:
       mov      rax, rsi
; ............................... 32B boundary ...............................
						;; bbWeight=0.50 PerfScore 0.12
G_M34983_IG04:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       ret      
						;; bbWeight=0.50 PerfScore 2.62
G_M34983_IG05:
       lea      rbp, bword ptr [rsi+12]
       mov      rcx, rbp
       mov      r14d, dword ptr [rsi+8]
       mov      r8d, r14d
       mov      edx, edi
; ............................... 32B boundary ...............................
       call     System.SpanHelpers:IndexOf(byref,ushort,int):int
       mov      r15d, eax
       test     r15d, r15d
       jge      SHORT G_M34983_IG07
       mov      rax, rsi
						;; bbWeight=0.50 PerfScore 3.00
G_M34983_IG06:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       ret      
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ret: 0 ; jcc erratum) 32B boundary ...............................
						;; bbWeight=0.50 PerfScore 2.62
G_M34983_IG07:
       mov      esi, r14d
       sub      esi, r15d
       mov      ecx, r14d
       call     System.String:FastAllocateString(int):System.String
       mov      r14, rax
       test     r15d, r15d
       jle      SHORT G_M34983_IG08
       cmp      dword ptr [r14], r14d
       lea      rcx, bword ptr [r14+12]
       mov      rdx, rbp
; ............................... 32B boundary ...............................
       mov      r8d, r15d
       add      r8, r8
       call     System.Buffer:Memmove(byref,byref,long)
						;; bbWeight=0.50 PerfScore 3.75
G_M34983_IG08:
       movsxd   rax, r15d
       add      rax, rax
       add      rbp, rax
       cmp      dword ptr [r14], r14d
       lea      rdx, bword ptr [r14+12]
       add      rdx, rax
       cmp      esi, 16
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ...............................
       jl       SHORT G_M34983_IG10
       imul     eax, edi, 0x10001
       vmovd    xmm0, eax
       vpbroadcastd ymm0, ymm0
       imul     eax, ebx, 0x10001
       vmovd    xmm1, eax
       vpbroadcastd ymm1, ymm1
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vpbroadcastd: 3) 32B boundary ...............................
       align    [0 bytes]
						;; bbWeight=0.50 PerfScore 8.38
G_M34983_IG09:
       vmovupd  ymm2, ymmword ptr[rbp]
       vpcmpeqw ymm3, ymm2, ymm0
       vpand    ymm4, ymm1, ymm3
       vpandn   ymm2, ymm3, ymm2
       vpor     ymm2, ymm4, ymm2
       vmovupd  ymmword ptr[rdx], ymm2
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vmovupd: 2) 32B boundary ...............................
       add      rbp, 32
       add      rdx, 32
       add      esi, -16
       cmp      esi, 16
       jge      SHORT G_M34983_IG09
						;; bbWeight=4    PerfScore 42.00
G_M34983_IG10:
       test     esi, esi
       jle      SHORT G_M34983_IG15
       align    [10 bytes]
; ............................... 32B boundary ...............................
						;; bbWeight=0.50 PerfScore 0.75
G_M34983_IG11:
       movzx    rax, word  ptr [rbp]
       mov      rcx, rdx
       cmp      eax, edi
       je       SHORT G_M34983_IG13
						;; bbWeight=4    PerfScore 14.00
G_M34983_IG12:
       jmp      SHORT G_M34983_IG14
						;; bbWeight=2    PerfScore 4.00
G_M34983_IG13:
       mov      eax, ebx
						;; bbWeight=2    PerfScore 0.50
G_M34983_IG14:
       mov      word  ptr [rcx], ax
       add      rbp, 2
       add      rdx, 2
       dec      esi
       test     esi, esi
       jg       SHORT G_M34983_IG11
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (jg: 0 ; jcc erratum) 32B boundary ...............................
						;; bbWeight=4    PerfScore 12.00
G_M34983_IG15:
       mov      rax, r14
						;; bbWeight=0.50 PerfScore 0.12
G_M34983_IG16:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       ret      
						;; bbWeight=0.50 PerfScore 2.62

; Total bytes of code 307, prolog size 18, PerfScore 136.45, instruction count 106, allocated bytes for code 307 (MethodHash=695a7758) for method System.String:Replace(ushort,ushort):System.String:this
; ============================================================

.NET 7 code
; Assembly listing for method System.String:Replace(ushort,ushort):System.String:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Static PGO: edge weights are invalid, and fgCalledCount is 47166
; 1 inlinees with PGO data; 16 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T00] ( 13,  5.10)     ref  ->  rsi         this class-hnd single-def
;  V01 arg1         [V01,T01] (  6,  4.00)  ushort  ->  rdi         single-def
;  V02 arg2         [V02,T02] (  5,  3   )  ushort  ->  rbx         single-def
;  V03 loc0         [V03,T03] (  4,  2.21)     int  ->  rbp        
;  V04 loc1         [V04,T06] (  4,  0.10)    long  ->  r14        
;  V05 loc2         [V05,T05] (  7,  0.10)     ref  ->  r15         class-hnd single-def
;  V06 loc3         [V06,T04] (  5,  0.21)     int  ->  rbp        
;  V07 loc4         [V07,T13] (  3,  0   )   byref  ->  rax         single-def
;  V08 loc5         [V08,T14] (  3,  0   )   byref  ->  rdx         single-def
;  V09 loc6         [V09,T07] ( 14,  0   )    long  ->  rcx        
;  V10 loc7         [V10,T17] (  3,  0   )  simd32  ->  mm0         ld-addr-op
;  V11 loc8         [V11,T18] (  3,  0   )  simd32  ->  mm1         ld-addr-op
;  V12 loc9         [V12,T08] (  6,  0   )  simd32  ->  mm2        
;  V13 loc10        [V13,T09] (  6,  0   )  simd32  ->  mm3        
;  V14 loc11        [V14,T12] (  4,  0   )  simd32  ->  mm2        
;  V15 loc12        [V15,T25] (  2,  0   )    long  ->  r14        
;  V16 loc13        [V16,T19] (  3,  0   )  ushort  ->   r8        
;  V17 OutArgs      [V17    ] (  1,  1   )  lclBlk (32) [rsp+00H]   "OutgoingArgSpace"
;  V18 tmp1         [V18,T15] (  3,  0   )   byref  ->   r9        
;  V19 tmp2         [V19,T16] (  3,  0   )   byref  ->   r9        
;  V20 tmp3         [V20,T20] (  3,  0   )     int  ->   r8        
;* V21 tmp4         [V21    ] (  0,  0   )   byref  ->  zero-ref    "Inlining Arg"
;* V22 tmp5         [V22    ] (  0,  0   )     int  ->  zero-ref    "Inlining Arg"
;* V23 tmp6         [V23    ] (  0,  0   )   short  ->  zero-ref    "Inlining Arg"
;  V24 tmp7         [V24,T21] (  2,  0   )   byref  ->  rcx         single-def "Inlining Arg"
;  V25 tmp8         [V25,T22] (  2,  0   )   byref  ->  rdx         single-def "Inlining Arg"
;  V26 tmp9         [V26,T26] (  2,  0   )    long  ->   r8         "Inlining Arg"
;  V27 tmp10        [V27,T23] (  2,  0   )   byref  ->   r8         "Inlining Arg"
;  V28 tmp11        [V28,T24] (  2,  0   )   byref  ->   r8         "Inlining Arg"
;* V29 tmp12        [V29    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg"
;* V30 tmp13        [V30    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg"
;  V31 tmp14        [V31,T10] (  4,  0   )   byref  ->  rsi         "Inlining Arg"
;  V32 tmp15        [V32,T11] (  4,  0   )   byref  ->  rax         "Inlining Arg"
;* V33 tmp16        [V33    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg"
;* V34 tmp17        [V34    ] (  0,  0   )  simd32  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 40

G_M34983_IG01:
       push     r15
       push     r14
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
       vzeroupper 
       mov      rsi, rcx
       mov      edi, edx
       mov      ebx, r8d
						;; size=23 bbWeight=1    PerfScore 8.00
G_M34983_IG02:
       movzx    rcx, di
       movzx    r8, bx
       cmp      ecx, r8d
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (cmp: 1 ; jcc erratum) 32B boundary ...............................
       je       G_M34983_IG16
						;; size=16 bbWeight=1    PerfScore 1.75
G_M34983_IG03:
       lea      rcx, bword ptr [rsi+0CH]
       mov      r8d, dword ptr [rsi+08H]
       movsx    rdx, di
       call     [System.SpanHelpers:IndexOfValueType(byref,short,int):int]
       mov      ebp, eax
       test     ebp, ebp
       jge      SHORT G_M34983_IG06
						;; size=24 bbWeight=1.00 PerfScore 7.25
G_M34983_IG04:
       mov      rax, rsi
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (mov: 2) 32B boundary ...............................
						;; size=3 bbWeight=0.90 PerfScore 0.22
G_M34983_IG05:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       ret      
						;; size=16 bbWeight=0.90 PerfScore 4.71
G_M34983_IG06:
       mov      ecx, dword ptr [rsi+08H]
       sub      ecx, ebp
       mov      r14d, ecx
       mov      ecx, dword ptr [rsi+08H]
       call     System.String:FastAllocateString(int):System.String
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (call: 2 ; jcc erratum) 32B boundary ...............................
       mov      r15, rax
       test     ebp, ebp
       jg       G_M34983_IG18
						;; size=27 bbWeight=0.10 PerfScore 0.72
G_M34983_IG07:
       mov      eax, ebp
       lea      rax, bword ptr [rsi+2*rax+0CH]
       cmp      byte  ptr [r15], r15b
       mov      edx, ebp
       lea      rdx, bword ptr [r15+2*rdx+0CH]
       xor      ecx, ecx
; ............................... 32B boundary ...............................
       cmp      dword ptr [rsi+08H], 16
       jl       G_M34983_IG13
       movzx    r8, di
       imul     r8d, r8d, 0x10001
       vmovd    xmm0, r8d
       vpbroadcastd ymm0, ymm0
       movzx    r8, bx
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (movzx: 3) 32B boundary ...............................
       imul     r8d, r8d, 0x10001
       vmovd    xmm1, r8d
       vpbroadcastd ymm1, ymm1
       cmp      r14, 16
       jbe      SHORT G_M34983_IG09
       add      r14, -16
						;; size=81 bbWeight=0    PerfScore 0.00
G_M34983_IG08:
       lea      r8, bword ptr [rax+2*rcx]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (lea: 2) 32B boundary ...............................
       vmovupd  ymm2, ymmword ptr[r8]
       vpcmpeqw ymm3, ymm2, ymm0
       vpand    ymm4, ymm3, ymm1
       vpandn   ymm2, ymm3, ymm2
       vpor     ymm2, ymm4, ymm2
       lea      r8, bword ptr [rdx+2*rcx]
       vmovupd  ymmword ptr[r8], ymm2
; ............................... 32B boundary ...............................
       add      rcx, 16
       cmp      rcx, r14
       jb       SHORT G_M34983_IG08
						;; size=43 bbWeight=0    PerfScore 0.00
G_M34983_IG09:
       mov      ecx, dword ptr [rsi+08H]
       add      ecx, -16
       add      rsi, 12
       lea      rsi, bword ptr [rsi+2*rcx]
       vmovupd  ymm2, ymmword ptr[rsi]
       vpcmpeqw ymm3, ymm2, ymm0
       vpand    ymm0, ymm3, ymm1
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (vpand: 3) 32B boundary ...............................
       vpandn   ymm1, ymm3, ymm2
       vpor     ymm2, ymm0, ymm1
       lea      rax, bword ptr [r15+0CH]
       lea      rax, bword ptr [rax+2*rcx]
       vmovupd  ymmword ptr[rax], ymm2
       jmp      SHORT G_M34983_IG14
						;; size=48 bbWeight=0    PerfScore 0.00
G_M34983_IG10:
       movzx    r8, word  ptr [rax+2*rcx]
       lea      r9, bword ptr [rdx+2*rcx]
; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (lea: 2) 32B boundary ...............................
       movzx    r10, di
       cmp      r8d, r10d
       je       SHORT G_M34983_IG11
       jmp      SHORT G_M34983_IG12
						;; size=20 bbWeight=0    PerfScore 0.00
G_M34983_IG11:
       movzx    r8, bx
						;; size=4 bbWeight=0    PerfScore 0.00
G_M34983_IG12:
       mov      word  ptr [r9], r8w
       inc      rcx
						;; size=7 bbWeight=0    PerfScore 0.00
G_M34983_IG13:
       cmp      rcx, r14
       jb       SHORT G_M34983_IG10
						;; size=5 bbWeight=0    PerfScore 0.00
G_M34983_IG14:
       mov      rax, r15
; ............................... 32B boundary ...............................
						;; size=3 bbWeight=0    PerfScore 0.00
G_M34983_IG15:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
       pop      r15
       ret      
						;; size=16 bbWeight=0    PerfScore 0.00
G_M34983_IG16:
       mov      rax, rsi
						;; size=3 bbWeight=0    PerfScore 0.00
G_M34983_IG17:
       vzeroupper 
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r14
; ............................... 32B boundary ...............................
       pop      r15
       ret      
						;; size=16 bbWeight=0    PerfScore 0.00
G_M34983_IG18:
       cmp      byte  ptr [r15], r15b
       lea      rcx, bword ptr [r15+0CH]
       lea      rdx, bword ptr [rsi+0CH]
       mov      r8d, ebp
       add      r8, r8
       call     [System.Buffer:Memmove(byref,byref,long)]
       jmp      G_M34983_IG07
						;; size=28 bbWeight=0    PerfScore 0.00
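
For anyone reading the dump above, here is a minimal C# sketch of what the vectorized blocks (IG08 and IG09) appear to compute: compare against the old char (`vpcmpeqw`), then blend with the new char (`vpand`/`vpandn`/`vpor`), with one final overlapping pass over the last vector instead of a scalar tail. `ReplaceSketch`, `ReplaceChar` and `SelectBlock` are hypothetical names for illustration only. This is not the runtime's source, and it leaves out the `IndexOfValueType`/`Memmove` prefix handling visible in IG03 and IG18.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class ReplaceSketch
{
    // Hypothetical helper for illustration; not the runtime's String.Replace.
    public static string ReplaceChar(string input, char oldChar, char newChar)
    {
        if (oldChar == newChar)
            return input; // same idea as the early exit to IG16

        ushort oldValue = oldChar;
        ushort newValue = newChar;

        char[] result = new char[input.Length];
        ReadOnlySpan<ushort> src = MemoryMarshal.Cast<char, ushort>(input.AsSpan());
        Span<ushort> dst = MemoryMarshal.Cast<char, ushort>(result.AsSpan());

        int width = Vector<ushort>.Count; // 16 when Vector<T> is 256 bits wide
        if (src.Length >= width)
        {
            var oldChars = new Vector<ushort>(oldValue); // broadcast, like vpbroadcastd
            var newChars = new Vector<ushort>(newValue);
            int lastBlock = src.Length - width;

            // Main loop: blocks that start before the last full block.
            for (int i = 0; i < lastBlock; i += width)
                SelectBlock(src.Slice(i), dst.Slice(i), oldChars, newChars);

            // Final pass over the last `width` elements. It may overlap the
            // previous block, which is safe because it reads only from src.
            SelectBlock(src.Slice(lastBlock), dst.Slice(lastBlock), oldChars, newChars);
        }
        else
        {
            // Scalar path for inputs shorter than one vector.
            for (int i = 0; i < src.Length; i++)
                dst[i] = src[i] == oldValue ? newValue : src[i];
        }

        return new string(result);
    }

    private static void SelectBlock(
        ReadOnlySpan<ushort> src, Span<ushort> dst,
        Vector<ushort> oldChars, Vector<ushort> newChars)
    {
        var original = new Vector<ushort>(src);
        Vector<ushort> mask = Vector.Equals(original, oldChars);        // vpcmpeqw
        Vector.ConditionalSelect(mask, newChars, original).CopyTo(dst); // vpand / vpandn / vpor
    }
}

static class Program
{
    static void Main()
    {
        string text = "Whose woods these are I think I know.";
        // Compares the sketch against the built-in implementation.
        Console.WriteLine(ReplaceSketch.ReplaceChar(text, 'I', 'U') == text.Replace('I', 'U'));
    }
}
```

The overlapping final pass is the part that differs from a plain scalar remainder loop: for an input whose length is not a multiple of the vector width, a few elements get processed twice rather than one at a time.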

One thing I noticed is that in .NET 6 Replace is compiled straight to optimized code (I believe because of QuickJitForLoops), but in .NET 7 it goes through tiering.

We can only disable dynamic PGO (D-PGO), not the static PGO. Correct?

DOTNET_JitDisablePGO=1 should disable both. Static PGO on its own can be disabled by setting DOTNET_ReadyToRun=0.

What is the meaning of “edge weights are invalid”?

Never mind, it's just a sign that the JIT made some mistakes when calculating edge weights. It happens in many cases.

@DrewScoggins or @EgorBo can you help point me at recent results for our perf test here?

Here is the link: https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/TestHistoryIndexIndex.html

You can open any machine, e.g. Ubuntu x64, and then find the benchmark you need via Ctrl+F, e.g.:

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2Frefs%2Fheads%2Fmain_x64_Windows 10.0.18362%2FSystem.Tests.Perf_String.Replace_Char(text%3A "Hello"%2C oldChar%3A 'a'%2C newChar%3A 'b').html