runtime: Exception throw/catch dramatically slower when debugger attached

Repro:

using System;
using System.Diagnostics;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { throw new Exception(); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

When I run that with ctrl-F5, I get output like:

00:00:00.0004349
00:00:00.0004371
00:00:00.0004347
00:00:00.0004366
00:00:00.0004440

When I run that with F5, I get output like:

00:00:01.9491196
00:00:01.8550998
00:00:01.8392192
00:00:01.8323522
00:00:01.8168992

That’s a 4500x slowdown, with every exception throw/catch consuming ~20ms. And this does scale with exception count: if I increase the repro to 1000 exceptions, without the debugger it takes ~0.004s and with the debugger it takes ~18s.
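A variant of the repro with the iteration count as a parameter makes the linear-scaling claim easy to check directly (this is my sketch, not code from the original report):

```csharp
using System;
using System.Diagnostics;

// Measure throw/catch cost for a configurable exception count, so the
// per-exception cost can be compared across counts.
static TimeSpan MeasureThrows(int count)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < count; i++)
    {
        try { throw new Exception(); } catch { }
    }
    return sw.Elapsed;
}

foreach (int n in new[] { 100, 1000 })
{
    TimeSpan elapsed = MeasureThrows(n);
    Console.WriteLine($"{n} exceptions: {elapsed.TotalMilliseconds / n:F4} ms each");
}
```

With the debugger attached the per-exception cost should stay roughly constant as the count grows, which is what the ~18s figure for 1000 exceptions suggests.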

While ideally well-behaved apps wouldn’t throw lots of exceptions, this has proven to be a significant source of slowdown for real apps while they are being developed, with a noticeable impact on developer inner-loop performance, especially when the exceptions occur at app startup.

cc: @noahfalk, @mikem8361, @gregg-miskelly

About this issue

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 19 (19 by maintainers)

Most upvoted comments

I spent a little time with a profiler to assess where we are spending our time. I profiled Steve’s example that uses depth 10 callstacks without async:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

while (true)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { Run(10); } catch { }
    }
    Console.WriteLine(sw.Elapsed);
}

static void Run(int depth)
{
    if (depth > 0)
    {
        Run(depth - 1);
    }

    throw new Exception();
}

On my particular machine (Windows, .Net Core 3.1.11, VS 16.8.4) using Ctrl-F5 is ~1.5ms per iteration of the outer while loop and F5 is ~1.5s, so about a 1000x overhead for me. During the 15ms handling of each exception there are 3 events dispatched from the runtime to the debugger:

  1. First chance exception - sent when the exception is thrown
  2. User first chance exception - sent when 1st pass exception dispatch first encounters a stack frame marked as user-code
  3. Catch handler found - sent when 1st pass exception dispatch locates the frame with the catch handler

This shows the breakdown of the ~15ms window for one exception: (profiler timeline image)

Each event logically traverses across 3 components:

  1. It starts in the runtime by suspending all threads and then transmitting a message over an IPC channel. This part of the work is in the white space of the graph above.
  2. The message is received by the mscordbi library running in the msvsmon process. This part of the work is the yellow-green DBI callback thread. Most of the perf cost goes to doing a stackwalk. The purpose of the stackwalk is to map a frame pointer to an ICorDebugFrame object.
  3. mscordbi invokes a callback implemented by vsdebugeng.impl. This is where Visual Studio specific logic runs to determine what to do with the exception. vsdebugeng may also call back into mscordbi to request more information about the current state of the debuggee. This is shown in the dark green portion.
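For contrast with the three cross-process hops above, the runtime also exposes an in-process first-chance hook, AppDomain.CurrentDomain.FirstChanceException, which fires on the throwing thread with no suspension or IPC cost. A minimal sketch (mine, not from the thread):

```csharp
using System;
using System.Runtime.ExceptionServices;

int seen = 0;

// In-process first-chance notification: runs inline on the throwing
// thread, before any catch handler is located.
AppDomain.CurrentDomain.FirstChanceException +=
    (sender, e) =>
    {
        seen++;
        Console.WriteLine($"First chance: {e.Exception.GetType().Name}");
    };

try { throw new InvalidOperationException(); }
catch { Console.WriteLine($"caught after {seen} first-chance notification(s)"); }
```

This hook has no filtering or frame-mapping machinery, but it illustrates the kind of in-proc notification path the rest of the thread discusses.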

My take is there are two different approaches we could follow at this point:

  1. We could chip away at the problem trying to make small to medium sized code changes but leaving the overall design intact. This might take a week or more of dev time, and reduce the time taken to process an exception by 10-30%. The changes that would be most impactful would be figuring out how to eliminate an entire chunk of the work, for example suppressing one of the notifications in this scenario or avoiding the stackwalk. It is unlikely that we have trivial changes available with large returns as this scenario has already been incrementally improved on more than one occasion in the past.
  2. We could switch to a substantially different design for handling exceptions, most likely something that lets VS run its filtering logic in-proc. This could involve at least a few weeks of effort on both the CLR and VS side, but it would probably give us cost reductions >= 99%.

Personally my preference is that if we are going to spend time on this we should look into option (2), the major changes. I worry that we don’t have much runway left to optimize incrementally given how easy it is for apps to start throwing yet more exceptions. I also expect building in-proc extensibility points opens the door to a bunch of other useful diagnostic scenarios. For example in the debugger space it could contribute towards other high performance events like conditional breakpoints, or being able to do more sophisticated analysis of rethrown exceptions for async. Outside debugging another example is having a dump collector that can capture dumps on 1st chance exceptions.

Priority-wise the runtime diagnostics team already has months of committed feature work for .NET 6 and a hearty backlog of bugs. I’d guess this would land in the grey area where we might be able to get it within the release timeframe but it is far enough down that we shouldn’t be making any promises at this point.

Thoughts on all of this? In particular @gregg-miskelly I’d be curious to hear what you think feasibility/interest is for doing a meatier change here?

Just out of curiosity, is the impact measurably smaller if the exception is thrown/caught in non-user code and with Just My Code enabled?

On my machine, with JMC on:

  • if the exception is thrown from non-user code but still caught by user code, the slowdown drops from ~4500x to ~3500x.
  • if the exception is both thrown and caught by non-user code, the slowdown drops from ~4500x to ~10x.

Rough guess maybe a 70% savings if this all held up?
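The "non-user code" distinction in these numbers is the one Just My Code uses; code can be explicitly opted out of user-code status with an attribute. A sketch of measuring the non-user-code path (my example; in a real test the non-user code would live in a separate release-built assembly without loaded symbols, and the attribute here is a shortcut for that):

```csharp
using System;
using System.Diagnostics;

// With Just My Code enabled, code marked [DebuggerNonUserCode] is treated
// as non-user code, so a throw/catch entirely inside it can skip the
// "user first chance" notification path described earlier in the thread.
[DebuggerNonUserCode]
static void ThrowAndSwallowAsNonUserCode()
{
    try { throw new Exception(); } catch { }
}

var sw = Stopwatch.StartNew();
for (int i = 0; i < 100; i++)
{
    ThrowAndSwallowAsNonUserCode();
}
Console.WriteLine(sw.Elapsed);
```

Running this under F5 with JMC on versus off should show the gap the measurements above describe.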

I’ll take it 😄

We should be able to often eliminate the stack walk

Just to confirm: you are saying it would be OK if we defined a new exception callback that didn’t provide the ICorDebugFrame argument, or if we kept the current callback and returned null for that parameter? If you use that parameter for anything, we probably still have to do a stackwalk in DBI.

If we took out the stack walks entirely and reduced the number of stopping events from three to two, can you guess from the trace roughly what ‘x’ we’d be at? Is the main thing left the synchronization needed to send stopping events?

It is quite a bit more than I was anticipating we’d be able to cut. I’m a little surprised we hadn’t already cut those in the past : ) Rough guess maybe a 70% savings if this all held up? If we still need the stackwalk but can drop the first chance notification, that might be a 50% savings.

One nice part is that an API that disables debugger exception events is part of what we’d need to build to make the in-proc filtering case useful anyways.

I don’t have a repro that has as impressive a penalty as 4000x but I do have a real-world case.

It’s a champion scenario for the Azure EventHubs Processor SDK that has multi-threading and network calls.

In total the startup process throws ~180 exceptions (in fact 32 distinct exceptions, multiplied by layers of async state machines re-throwing them), and with JMC off the startup time is:

  • Without debugger: ~7.5 sec
  • With debugger: ~20 sec

What was surprising to me is how large the impact of the debugging is compared to actual networking. The app does 112 blob requests during startup + various TCP connections to EventHubs service so throwing an exception in an async method costs as much as making multiple real service calls over the network.
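The multiplication effect described above is easy to see by counting in-process first-chance notifications: when an exception propagates through async state machines, each awaiting frame’s GetResult() rethrows it, and every rethrow is another first-chance event. A sketch (my example, not the EventHubs app):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

int firstChanceCount = 0;
AppDomain.CurrentDomain.FirstChanceException +=
    (_, _) => Interlocked.Increment(ref firstChanceCount);

// One logical failure awaited through several async frames: the innermost
// frame throws once, and each outer await rethrows the same exception.
static async Task Layer(int depth)
{
    if (depth > 0) await Layer(depth - 1);
    else throw new Exception();
}

try { await Layer(5); } catch { }

Console.WriteLine($"1 logical throw became {firstChanceCount} first-chance notifications");
```

With a debugger attached, each of those rethrows pays the full notification cost, which is how 32 distinct exceptions can become ~180 debugger round-trips.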

I’m happy to provide the app with all connection strings ready or record a profile.

Two thoughts –

  1. Before we get too far along attempting to improve things, it may be worth making Stephen’s app a bit more “real world”:
     A. Add background thread(s), as I suspect the time to reach GC Suspend is much better on this trivial app than in a real one.
     B. Make the call stacks deeper, as I suspect stack walk time is going to be an important component of this, and a stack walk over one function may be artificially cheap.
  2. I haven’t looked at a trace, so these ideas may prove totally wrong, but I would guess improvements will fall into three buckets:
     A. Improvements the runtime can do alone. For example, making the IPCs more efficient.
     B. Improvements the runtime can do along with the debugger. For example, currently the runtime walks the stack up front. The runtime can fix this, but VS/vsdbg has never tried to optimize its stack handling code to avoid walking the entire stack, since the runtime had already done it. So this will require changes from both sides to see the benefit.
     C. Improvements that will require new ICorDebug APIs. I could imagine that getting down to the 10x-100x level would require a significantly different notification interface, perhaps one where the notification is delivered without synchronizing, or maybe even one that is delivered in-process.
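A sketch of what suggestion (1) above might look like, with spinning background threads standing in for real app work and 50-deep stacks standing in for real call chains (the thread count and depth are my guesses, not values from the thread):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Busy background threads: the runtime must suspend every thread before
// each debugger notification, so time-to-suspend shows up in the numbers.
for (int t = 0; t < 4; t++)
{
    new Thread(() => { while (true) Thread.SpinWait(1000); }) { IsBackground = true }.Start();
}

int caught = 0;
for (int iter = 0; iter < 3; iter++)
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100; i++)
    {
        try { ThrowDeep(50); } catch { caught++; }
    }
    Console.WriteLine(sw.Elapsed);
}

// Deeper stacks make the per-notification stackwalk cost visible.
static void ThrowDeep(int depth)
{
    if (depth > 0) ThrowDeep(depth - 1);
    else throw new Exception();
}
```

Comparing Ctrl-F5 against F5 on a shape like this should be closer to the real-world numbers reported earlier in the thread than the single-frame repro.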