runtime: Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy mono on device (tvOS)
Description
Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.
Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8
Outputs something like this with legacy Mono (Xamarin.iOS from d16-10):
[...]
Elapsed: 17.03 seconds Counter: 265,779,440.00 Iterations per second: 15,608,870
Elapsed: 18.03 seconds Counter: 281,406,670.00 Iterations per second: 15,608,886
Elapsed: 19.03 seconds Counter: 297,098,501.00 Iterations per second: 15,612,343
which is roughly 15.6M calls to Marshal.AllocHGlobal+FreeHGlobal per second.
Now in .NET I get this:
[...]
Elapsed: 17.10 seconds Counter: 1,773,883.00 Iterations per second: 103,745
Elapsed: 18.10 seconds Counter: 1,878,446.00 Iterations per second: 103,784
Elapsed: 19.10 seconds Counter: 1,982,100.00 Iterations per second: 103,771
that’s roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.
This is on an Apple TV 4K from 2017.
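The linked gist is not reproduced here. As a rough idea of the kind of loop behind these numbers, the following is a minimal sketch and not the gist itself: the class name AllocBenchmark, the Measure method, the 16-byte block size and the batch size of 1000 are all assumptions made for illustration.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class AllocBenchmark
{
    // Hypothetical helper: runs AllocHGlobal/FreeHGlobal pairs for roughly
    // `seconds` seconds and returns the observed pairs per second.
    public static double Measure(double seconds, int blockSize = 16)
    {
        var watch = Stopwatch.StartNew();
        long counter = 0;

        while (watch.Elapsed.TotalSeconds < seconds)
        {
            // Check the clock once per batch so the timing call itself
            // does not dominate the measured loop.
            for (int i = 0; i < 1000; i++)
            {
                IntPtr p = Marshal.AllocHGlobal(blockSize);
                Marshal.FreeHGlobal(p);
                counter++;
            }
        }

        double elapsed = watch.Elapsed.TotalSeconds;
        double perSecond = counter / elapsed;

        // Formatted to resemble the log lines quoted in this issue.
        Console.WriteLine(
            $"Elapsed: {elapsed:F2} seconds Counter: {counter:N2} Iterations per second: {perSecond:N0}");
        return perSecond;
    }
}
```

Calling AllocBenchmark.Measure(1.0) on each runtime would give one comparable throughput figure per run. Absolute numbers depend on the device, build configuration and allocation size.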
There’s a difference in the simulator too (on an iMac Pro), just not as stark.
Legacy Mono:
[...]
Elapsed: 12.02 seconds Counter: 165,442,863.00 Iterations per second: 13,759,790
Elapsed: 13.02 seconds Counter: 179,568,197.00 Iterations per second: 13,787,606
Elapsed: 14.02 seconds Counter: 193,717,056.00 Iterations per second: 13,812,942
and with .NET:
[...]
Elapsed: 12.03 seconds Counter: 39,917,875.00 Iterations per second: 3,317,120
Elapsed: 13.03 seconds Counter: 43,252,347.00 Iterations per second: 3,318,392
Elapsed: 14.04 seconds Counter: 46,617,318.00 Iterations per second: 3,321,424
so ~4x slower.
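The slowdown factors quoted in this issue follow directly from the last reported line of each run. A small sketch of that arithmetic, with the figures copied from the logs and a hypothetical helper class named Slowdown:

```csharp
using System;

static class Slowdown
{
    // Ratio of two throughput figures (iterations per second).
    public static double Factor(double baseline, double measured) => baseline / measured;

    // Average time per AllocHGlobal+FreeHGlobal pair, in nanoseconds.
    public static double NanosecondsPerCall(double iterationsPerSecond) => 1e9 / iterationsPerSecond;
}
```

With the device numbers, Slowdown.Factor(15_612_343, 103_771) is about 150, and with the simulator numbers, Slowdown.Factor(13_812_942, 3_321_424) is about 4.2. In absolute terms that is roughly 64 ns per alloc/free pair on legacy Mono versus roughly 9.6 µs on .NET on the device.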
I profiled the .NET version on device using Instruments: Marshal.trace.zip
It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced.
This function isn’t even called in legacy Mono.
Here’s an Instruments trace of the legacy Mono version: MarshalMono.trace.zip
I don’t know if this applies to other platforms as well; I only tested tvOS.
About this issue
- State: open
- Created 3 years ago
- Reactions: 1
- Comments: 18 (16 by maintainers)
Commits related to this issue
- [tests] Add perf test for calling Marshal.AllocHGlobal/FreeHGlobal. Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project file, to cope with recent changes in our .NET support. ... — committed to rolfbjarne/xamarin-macios by rolfbjarne 3 years ago
- [tests] Add perf test for calling Marshal.AllocHGlobal/FreeHGlobal. (#12696) Also add a 'SupportedOSPlatformVersion' value to the .NET perftest project file, to cope with recent changes in our .NET ... — committed to xamarin/xamarin-macios by rolfbjarne 3 years ago
- [release/6.0-rc2] [MonoVM] Reduce P/Invoke GC transition asserts in release builds (#59269) Backport of #59029 Profiling shows that large part of the GC transition overhead (~30%) in #58939 is cau... — committed to dotnet/runtime by github-actions[bot] 3 years ago
I think the root cause of this problem is the high overhead implementation strategy used for PInvoke transitions on tvOS. This high overhead is a problem for every other PInvoke. For example, globalization PInvokes will hit it too.
I’ve done some experiments locally that ensure registers are saved to the stack and then save only part of the context on the thread state transition. This saves some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. Mono ARM64 gets "Iterations per second: 8,807,528" with the changes, or about a 28% improvement.
Not sure if I can get it ready anytime soon, but here’s a gist of what I was testing:
- [...] SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
- [...] thread_state_init can be simplified). This saves about 12% of the run time.
- [...] copy_stack method is no longer needed in the P/Invoke flow, since everything is already on the stack by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves another ±15% of the run time.
- [...] save_lmf logic. It should emit the llvm.eh.unwind.init intrinsic to spill the callee-saved registers.
I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path.
That is, the calls that make it “incompatible” with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this, as do many medium-sized allocations.
It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.
On Unix systems, Marshal.AllocHGlobal is just a wrapper around NativeMemory.Alloc. On legacy Mono it was an icall; now it’s a P/Invoke. It doesn’t have the SuppressGCTransition attribute applied, so it goes through the expensive GC transitions.
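To make the distinction concrete, here is a hedged sketch of the two P/Invoke shapes being discussed, declared directly against the C allocator. This is not how the runtime declares its own imports. The class name RawLibc, the method names and the "libc" library name are assumptions for illustration, and the right library name varies by platform. As noted in the comments on this issue, malloc does not strictly fit the attribute's documented constraints, so this is for experimentation only and is not a recommended fix.

```csharp
using System;
using System.Runtime.InteropServices;

static class RawLibc
{
    // Regular P/Invoke: the runtime performs a GC transition around each call.
    [DllImport("libc", EntryPoint = "malloc")]
    public static extern IntPtr MallocWithTransition(nuint size);

    [DllImport("libc", EntryPoint = "free")]
    public static extern void FreeWithTransition(IntPtr ptr);

    // Same native functions, but asking the runtime to skip the GC transition.
    // Assumption: the runtime in use honors SuppressGCTransition; the callee
    // must then be short, non-blocking, and must not call back into managed code.
    [DllImport("libc", EntryPoint = "malloc")]
    [SuppressGCTransition]
    public static extern IntPtr MallocNoTransition(nuint size);

    [DllImport("libc", EntryPoint = "free")]
    [SuppressGCTransition]
    public static extern void FreeNoTransition(IntPtr ptr);
}
```

Timing RawLibc.MallocWithTransition/FreeWithTransition against the NoTransition pair in the same loop would give a rough idea of how much of the per-call cost is the transition itself rather than the allocator.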