runtime: Marshal.AllocHGlobal/FreeHGlobal is ~150x slower in .NET than legacy Mono on device (tvOS)

Description

Calling Marshal.AllocHGlobal / Marshal.FreeHGlobal is ~150x slower in .NET compared to legacy Mono when running on a tvOS device.

Sample test code: https://gist.github.com/rolfbjarne/b22b844e6f351ad40c4f30e20a2a36d8

It outputs something like this with legacy Mono (Xamarin.iOS from d16-10):

[...]
Elapsed: 17.03 seconds Counter: 265,779,440.00 Iterations per second: 15,608,870
Elapsed: 18.03 seconds Counter: 281,406,670.00 Iterations per second: 15,608,886
Elapsed: 19.03 seconds Counter: 297,098,501.00 Iterations per second: 15,612,343

which is roughly 15.6M calls to Marshal.AllocHGlobal+FreeHGlobal per second.

Now in .NET I get this:

[...]
Elapsed: 17.10 seconds Counter: 1,773,883.00 Iterations per second: 103,745
Elapsed: 18.10 seconds Counter: 1,878,446.00 Iterations per second: 103,784
Elapsed: 19.10 seconds Counter: 1,982,100.00 Iterations per second: 103,771

that’s roughly 103k calls to Marshal.AllocHGlobal+FreeHGlobal per second; ~150x slower.

This is on an Apple TV 4K from 2017.

There’s a difference in the simulator too, just not as stark (on an iMac Pro).

Legacy Mono:

[...]
Elapsed: 12.02 seconds Counter: 165,442,863.00 Iterations per second: 13,759,790
Elapsed: 13.02 seconds Counter: 179,568,197.00 Iterations per second: 13,787,606
Elapsed: 14.02 seconds Counter: 193,717,056.00 Iterations per second: 13,812,942

and with .NET:

[...]
Elapsed: 12.03 seconds Counter: 39,917,875.00 Iterations per second: 3,317,120
Elapsed: 13.03 seconds Counter: 43,252,347.00 Iterations per second: 3,318,392
Elapsed: 14.04 seconds Counter: 46,617,318.00 Iterations per second: 3,321,424

so ~4x slower.
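As a sanity check on the ~150x and ~4x figures, here is a minimal Python sketch that recomputes the slowdown from the "Iterations per second" values in the logs above. The helper names (`slowdown`, `nanoseconds_per_iteration`) are made up for this illustration, and averaging the last three samples is an assumption about how to summarize the runs, not something the test code does.

```python
from statistics import mean

# Iterations per second, copied from the logs above (last three samples each).
device_legacy = [15_608_870, 15_608_886, 15_612_343]
device_dotnet = [103_745, 103_784, 103_771]
simulator_legacy = [13_759_790, 13_787_606, 13_812_942]
simulator_dotnet = [3_317_120, 3_318_392, 3_321_424]


def slowdown(legacy, dotnet):
    """How many times slower .NET is than legacy Mono, using mean throughput."""
    return mean(legacy) / mean(dotnet)


def nanoseconds_per_iteration(rates):
    """Average wall-clock cost of one AllocHGlobal+FreeHGlobal iteration."""
    return 1e9 / mean(rates)


device_ratio = slowdown(device_legacy, device_dotnet)
simulator_ratio = slowdown(simulator_legacy, simulator_dotnet)

print(f"device:    {device_ratio:.1f}x slower")
print(f"simulator: {simulator_ratio:.1f}x slower")
print(f"device legacy Mono: {nanoseconds_per_iteration(device_legacy):.0f} ns per iteration")
print(f"device .NET:        {nanoseconds_per_iteration(device_dotnet):.0f} ns per iteration")
```

On these numbers the device ratio comes out at about 150x and the simulator ratio at a little over 4x. Expressed per iteration, that is roughly 64 ns for legacy Mono versus roughly 9.6 µs for .NET on the device.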

I profiled the .NET version on device using Instruments: Marshal.trace.zip

Here’s a preview:

[Screenshot: Screen Shot 2021-09-10 at 15 31 11]

It seems most of the time is spent inside mono_threads_enter_gc_safe_region_unbalanced.

This function isn’t even called in legacy Mono.

Here’s an Instruments trace of the legacy Mono run: MarshalMono.trace.zip

and a preview:

[Screenshot: Screen Shot 2021-09-10 at 15 35 25]

I don’t know if this applies to other platforms as well; I only tested tvOS.

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 1
  • Comments: 18 (16 by maintainers)

Most upvoted comments

I think the root cause of this problem is the high-overhead implementation strategy used for PInvoke transitions on tvOS. This overhead affects every other PInvoke as well. For example, globalization PInvokes will hit it too.

I’ve done a local experiment that ensures registers are saved to the stack and then saves only part of the context on the thread state transition, which avoids some memory copying. It would need a lot of polishing and validation to ensure it does not break anything. With the changes, Mono ARM64 gets Iterations per second: 8,807,528, or about a 28% improvement.

I’m not sure I can get it ready anytime soon, but here’s the gist of what I was testing:

  • The marshalling method already saves the LMF structure (last managed frame), which contains the callee-saved registers. I added saving of the reference-type arguments into stack variables. This ensures that all the passed parameters (such as SafeHandle) exist somewhere on the stack before the native method is called or before the GC transition frame is established.
  • Since everything is on the stack, there’s no need to accurately capture all registers, because variables/arguments will have to be spilled to the stack anyway. Thus we need to capture only IP/SP (i.e. thread_state_init can be simplified). This saves about 12% of the run time.
  • The copy_stack method is no longer needed in the P/Invoke flow, since everything is already on the stack by the time mono_threads_enter_gc_safe_region_unbalanced is called and the stack is not unwound. This saves roughly another 15% of the run time.
  • LLVM-only mode is missing support for the save_lmf logic. It should emit the llvm.eh.unwind.init intrinsic to spill the callee-saved registers.
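The comment gives the improved rate (8,807,528 iterations per second) and "about 28%", but not the baseline, and it does not say whether the 28% refers to throughput or to run time. The short Python sketch below works out the baseline implied by each reading. Both readings are assumptions made here for illustration; neither baseline is a measured number.

```python
improved_rate = 8_807_528  # iterations per second with the experimental changes
improvement = 0.28         # "about 28% improvement"

# Reading A (assumption): throughput went up by 28%.
baseline_if_throughput = improved_rate / (1 + improvement)

# Reading B (assumption): run time per iteration went down by 28%,
# so the old rate is the new rate scaled by the remaining fraction of time.
baseline_if_runtime = improved_rate * (1 - improvement)

print(f"baseline if 28% more iterations/second: {baseline_if_throughput:,.0f}")
print(f"baseline if 28% less time/iteration:    {baseline_if_runtime:,.0f}")
```

The two readings put the unmodified rate at roughly 6.9M and roughly 6.3M iterations per second respectively. The 12% and 15% run-time savings listed above add up to about the same figure, which may hint at the run-time reading, but that is only a guess.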

I think malloc is an interesting case where common implementations try to ensure it is as fast as possible. If you look at the various implementations available, most only take locks or do syscalls in the rare edge case, not in the common path.

That is, the calls that make it “incompatible” with SuppressGCTransition largely only happen when the underlying heap needs to be created or expanded. Otherwise, small allocations avoid all of this, as do many medium-sized allocations.
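To make the fast-path/slow-path distinction concrete, here is a toy Python model of an allocator that only does "expensive" work when its free list is empty. It is purely illustrative: `ToyHeap`, its fields, and the chunk size are invented for this sketch and do not describe any real malloc implementation.

```python
class ToyHeap:
    """Toy allocator: a free list of fixed-size blocks, refilled in chunks."""

    def __init__(self, chunk_blocks=64):
        self.chunk_blocks = chunk_blocks
        self.free_list = []      # fast path: plain push/pop, no lock, no syscall
        self.next_address = 0    # stand-in for memory obtained from the OS
        self.slow_path_hits = 0  # how often the heap had to be grown

    def alloc(self):
        if not self.free_list:
            self._grow()
        return self.free_list.pop()

    def free(self, block):
        self.free_list.append(block)

    def _grow(self):
        # Slow path: a real allocator might take a lock or make a syscall here.
        self.slow_path_hits += 1
        start = self.next_address
        self.next_address += self.chunk_blocks
        self.free_list.extend(range(start, start + self.chunk_blocks))


heap = ToyHeap()
for _ in range(100_000):
    block = heap.alloc()
    heap.free(block)

print(heap.slow_path_hits)
```

In an alloc-then-free loop like the benchmark in this issue, the toy heap grows once on the first allocation and then keeps reusing the same block, so almost every call stays on the fast path.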

It would perhaps be interesting to see if there was something we could do that could help support this kind of scenario.

There is also NativeMemory.Alloc, which should be faster than AllocHGlobal.

On Unix systems, Marshal.AllocHGlobal is just a wrapper around NativeMemory.Alloc. On legacy Mono it was an icall; now it’s a P/Invoke. It doesn’t have the SuppressGCTransition attribute applied, so it goes through the expensive GC transitions.