runtime: String::Format Memory Optimization Degrades Performance under RyuJIT x64
The String::Format methods were updated some time ago to eliminate excessive array object allocations and reduce memory traffic.
Was:
public static String Format(String format, Object arg0) {
    return Format(null, format, new Object[] {arg0});
}
Now:
public static string Format(string format, object arg0) {
    return string.FormatHelper((IFormatProvider) null, format, new ParamsArray(arg0));
}
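To make the change easier to picture, here is a minimal sketch of what a struct like ParamsArray could look like. The type name ParamsArraySketch and its members are hypothetical; the layout of four object references is an assumption based on the sizes discussed in this issue, not a copy of the framework source.

```csharp
using System;

// Hypothetical, simplified stand-in for the internal System.ParamsArray struct.
// Assumption: four reference-typed fields, i.e. 4 * 8 = 32 bytes on a 64-bit runtime.
public struct ParamsArraySketch
{
    // Shared array used only so that Length has something to report.
    private static readonly object[] OneArg = new object[1];

    private readonly object _arg0;
    private readonly object _arg1;
    private readonly object _arg2;
    private readonly object[] _args;

    public ParamsArraySketch(object arg0)
    {
        _arg0 = arg0;
        _arg1 = null;
        _arg2 = null;
        _args = OneArg;
    }

    public int Length => _args.Length;

    // The first three arguments come from fields; anything else from the array.
    public object this[int index] =>
        index == 0 ? _arg0 :
        index == 1 ? _arg1 :
        index == 2 ? _arg2 : _args[index];
}
```

Because such a struct is a local value containing GC references, it occupies stack space in the caller once the Format call is inlined, instead of being allocated on the heap.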
Unfortunately, with RyuJIT x64 this makes simple functions execute up to 10 (ten!) times slower than before in code paths which do not call String::Format.
Here is a code sample.
using System;
using System.Runtime.CompilerServices;

public static class Program
{
    // Has string.Format on its error paths, which are never taken in this test.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int FastFunctionNotCallingStringFormat(int param)
    {
        if (param < 0)
            throw new Exception(string.Format("We do not like the value {0:N0}.", param));
        if (param == int.MaxValue)
            throw new Exception(string.Format("{0:N0} is maxed out.", param));
        if (param > int.MaxValue / 2)
            throw new Exception(string.Format("We do not like the value {0:N0} either.", param));
        return param * 2;
    }

    // Same checks, but the error paths do not use string.Format at all.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int FastFunctionNotHavingStringFormat(int param)
    {
        if (param < 0)
            throw new ArgumentOutOfRangeException("param", "We do not like this value.");
        if (param == int.MaxValue)
            throw new ArgumentOutOfRangeException("param", "Maxed out.");
        if (param > int.MaxValue / 2)
            throw new ArgumentOutOfRangeException("param", "We do not like this value either.");
        return param * 2;
    }

    private static int Main()
    {
        int sum = 0;
        // The body sees a from int.MaxValue / 2 - 1 down to 0, so no range check ever fails.
        for (int a = int.MaxValue / 2; a-- > 0;)
        {
            sum = unchecked(sum + FastFunctionNotHavingStringFormat(a));
            sum = unchecked(sum + FastFunctionNotCallingStringFormat(a));
        }
        Console.WriteLine(sum);
        return 0;
    }
}
A real-life example of this mock-up would be a collection object which has range checks with diagnostics in its get_Item indexer. That’s how we’ve hit it: a simple enough indexer suddenly showed up as the hottest function in the profile.
Notice that String::Format is never called within the course of execution. Supposedly, these two functions should run at the same speed.
The profiler shows the difference clearly. Neither of these functions calls formatting, but the one which merely has it on unused code paths takes almost ten times as long to execute.
If we go even deeper, the code which takes up the extra time is:
lea rdi, ptr [rsp+0x28]
mov ecx, 0x20
xor eax, eax
rep stosd dword ptr [rdi]
It is located in the function head, before any user code.
That’s like doing ZeroMemory on the stack frame, and on quite a large chunk of it.
From this I can guess that:
- JIT has inlined the String::Format function into the FastFunctionNotCallingStringFormat body. System.ParamsArray now lives on the stack frame of FastFunctionNotCallingStringFormat.
- Even though the scopes of the three System.ParamsArray copies do not intersect, each seems to be getting its own space in the stack frame. The amount is 0x20 * sizeof(DWORD), or 128 bytes, which is enough for an entire 4 * sizeof(ParamsArray).
- Even though these System.ParamsArrays are never used or created in the actually executed code, ZeroMemory for them has been pulled up to the head of the function, and it affects any call to this function.
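The 128-byte figure can be checked with a little arithmetic. The sketch below assumes that ParamsArray consists of four references (so 32 bytes with 8-byte pointers); the class and member names are hypothetical and exist only for this illustration.

```csharp
using System;

// Hypothetical helper that reproduces the arithmetic behind the prolog zeroing.
public static class ZeroedFrameMath
{
    public const int DwordCount = 0x20;   // value loaded into ecx before rep stosd
    public const int BytesPerDword = 4;   // rep stosd writes 4 bytes per iteration

    // Total bytes cleared in the function head: 0x20 * 4.
    public static int ZeroedBytes => DwordCount * BytesPerDword;

    // Assumption: the struct holds four references and nothing else.
    public static int AssumedStructBytes(int pointerSize) => 4 * pointerSize;

    // How many such structs fit into the zeroed region.
    public static int StructsCovered(int pointerSize) =>
        ZeroedBytes / AssumedStructBytes(pointerSize);
}
```

With 8-byte pointers, ZeroedFrameMath.StructsCovered(8) evaluates to 4, matching the estimate of 4 * sizeof(ParamsArray) above.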
So string.Format calls which merely sit in the code but are never executed suddenly do have a cost.
UPD: Mostly the same happens in 32-bit runtimes. Even the relative speeds are the same, although both are somewhat lower.
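Until the JIT improves, one possible workaround (not part of the original report) is the common throw-helper pattern: move the formatting into a separate non-inlined method so that the struct temporaries live in the helper's frame, not in the hot function's. The names below are hypothetical, and whether this actually removes the prolog zeroing on a given JIT version is an assumption that should be confirmed with a profiler.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class ThrowHelperSketch
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int FastFunctionWithThrowHelper(int param)
    {
        // The hot function only passes plain arguments to the helper.
        if (param < 0)
            ThrowFormatted("We do not like the value {0:N0}.", param);
        if (param == int.MaxValue)
            ThrowFormatted("{0:N0} is maxed out.", param);
        if (param > int.MaxValue / 2)
            ThrowFormatted("We do not like the value {0:N0} either.", param);
        return param * 2;
    }

    // Kept out of line on purpose, so that string.Format is inlined here (if at all)
    // and not into the caller.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void ThrowFormatted(string format, int value)
    {
        throw new Exception(string.Format(format, value));
    }
}
```

The behavior is the same as in the original sample: valid inputs return param * 2, and invalid inputs throw an Exception with the formatted message.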
I hope for the following improvements:
- Costly ZeroMemory could be moved closer to the usage; the JITter might even know when it is safe to omit it altogether, but the point is to avoid this until (and unless) the structure is put to use.
- Memory could be reused for structures whose lifetimes never intersect (and even are of the same type).
About this issue
- State: closed
- Created 8 years ago
- Reactions: 14
- Comments: 22 (17 by maintainers)
Lolz, RegisterForFullGCNotification has more exception related code than actual code.

@omariom it’s related to first class structs but not handled by the current set of changes going in. It boils down to how we represent and report the struct for the GC. Today we’re treating it as untracked and thus zeroing it at entry. If we could instead track the liveness of the struct - or the underlying GC fields - we could then report more minimal GC lifetimes and avoid the zeroing. Keeping three copies is a separate issue: we don’t implement stack packing. @swaroop-sridhar can you take a peek at this?