runtime: [Proposal] Change the default behavior for StackOverflowException

I would like to propose an unspecified modification to the CLR which allows code to be executed in a manner where a StackOverflowException will not cause the process to terminate. To get the discussion started, here are some possibilities for alternative behavior:

  1. Terminate the current thread.
  2. Terminate the current AppDomain.
  3. Provide an attribute like HandleProcessCorruptedStateExceptionsAttribute which allows specialized handlers to be added.
  4. Provide a static event somewhere, which allows the exception handler to execute on a different thread.

The primary use case for this I’ve encountered is recursive-descent algorithms for creating and visiting parse trees, e.g. in ANTLR. In these cases, I am able to write all of the behavior for a background thread such that immediate termination of a thread does not result in an invalid execution state. For example, no operation on the thread calls Monitor.Enter. These algorithms are performance-sensitive, so stack depth checking throughout the code is not a desired solution. Providing the ability to handle an exceptional stack overflow situation without terminating the process (e.g. Visual Studio, if this was running as an extension) would be a much improved situation.

About this issue

  • State: open
  • Created 9 years ago
  • Reactions: 4
  • Comments: 36 (20 by maintainers)

Most upvoted comments

I want to resurrect this thread by adding that, besides the debuggability issue mentioned above, what’s more prominent for online services is that you would like the chance to dump some important information before the app crashes. You don’t always get to hook up a debugger and repro the issue.

One of the worst things about StackOverflowException being unrecoverable is that it seems to make it also un-debuggable. If you get one of these in the Visual Studio debugger, it will inform you that you raised a SOE, but it doesn’t provide a stack trace. Can we please at the very least fix whatever’s causing that?

The solution to this is to change your code/algorithm from recursive to iterative. This gives you complete control over conditions of when and what happens when you exhaust your limits.
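To make the recursive-to-iterative suggestion concrete, here is a minimal sketch of a tree walk that keeps its pending work on a heap-allocated `Stack<T>` rather than on the thread’s call stack. The `Node` type, the `CountNodes` method, and the `maxPending` limit are hypothetical names invented for illustration; they are not taken from ANTLR or any other library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parse-tree node, for illustration only.
public sealed class Node
{
    public string Label { get; }
    public List<Node> Children { get; } = new List<Node>();
    public Node(string label) { Label = label; }
}

public static class IterativeVisitor
{
    // Pre-order walk that stores pending nodes in a Stack<T> on the heap,
    // so tree depth is bounded by available memory (or by maxPending),
    // not by the size of the thread's call stack.
    public static int CountNodes(Node root, int maxPending = 1_000_000)
    {
        var pending = new Stack<Node>();
        pending.Push(root);
        int count = 0;

        while (pending.Count > 0)
        {
            // The caller decides what happens when the limit is hit;
            // here it surfaces as an ordinary, catchable exception.
            if (pending.Count > maxPending)
                throw new InvalidOperationException("Tree exceeds the configured limit.");

            Node current = pending.Pop();
            count++;

            // Push children in reverse so they are popped in source order.
            for (int i = current.Children.Count - 1; i >= 0; i--)
                pending.Push(current.Children[i]);
        }

        return count;
    }
}
```

The trade-off is the one raised elsewhere in this thread: the code is less direct than the recursive form, but exhausting the limit becomes an ordinary exception instead of process termination.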

That would allow recursive algorithms to at least throw a StackOverflowImminentException.

As a library developer you cannot protect your consumers from using you improperly. If the high level of stack use in your library is overly problematic, it’s likely that you’ll need to do the work to re-implement things iteratively. Unfortunately, there’s no magic here.

We’ve spent a great deal of effort over the years trying to make it possible for teams to be resilient to these kinds of failures (hosting API changes, CriticalFinalizerObject, etc.). It’s potentially possible that one could write a collection of objects that can handle these challenges, but it’s the classic “thar be dragons here”. IMO: In essentially all cases it’s better to re-architect your code than to expect this to work coherently.

While undergoing a stack overflow, your process is a hair’s width from doing all sorts of Really, Really bad things™.

This is a rather bad argument. One of the CLR’s greatest features is the way it uses processor exceptions for things like dynamic reallocation of the evaluation stack and omitting NullReferenceException checks. If we weren’t a hair’s width from Really, Really Bad Things™ we would all be bored.

how many user objects are really designed to operate correctly if their execution is terminated while manipulating state?

[Subjective content here:] This is the wrong question. The better question is whether or not developers are out there who could write objects that are designed for these scenarios. While individual programming languages are an ideal platform for eliminating features because they could be abused or are likely to be misunderstood, this is not that place. The CLR should only eliminate the ability to control application behavior if its own internal state would be corrupted and (potentially) unrecoverable, or if the implementation cost of a feature exceeds the benefits it would provide.

[Objective:] In this case, the CLR has already implemented behavior allowing applications to recover (in some manner) from a StackOverflowException. So, whether or not that behavior is ideal, it would be possible to simply change the default initial execution environment for a managed process to enable this handling instead of the default options used today.

I can confirm that stack overflow debugging does not work, as described in this thread. RuntimeHelpers.EnsureSufficientExecutionStack seems to do nothing in my case. I’m currently in a very difficult situation since I have essentially no way of proceeding to debug.
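For reference, the usual way to use `RuntimeHelpers.EnsureSufficientExecutionStack` is to call it at the top of every recursive method, so that a nearly exhausted stack surfaces as a catchable `InsufficientExecutionStackException`. The sketch below assumes that pattern; `GuardedRecursion`, `Depth`, and `TryDepth` are hypothetical names, and nothing here explains why the call had no effect in the case described above. One possible gap is recursion that passes through frames which never make the call.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class GuardedRecursion
{
    // Hypothetical recursive routine standing in for a recursive-descent visitor.
    public static int Depth(int remaining)
    {
        // Throws InsufficientExecutionStackException when the runtime judges
        // that too little stack remains for an average method call.
        RuntimeHelpers.EnsureSufficientExecutionStack();

        return remaining == 0 ? 0 : 1 + Depth(remaining - 1);
    }

    // Wraps the recursion so that the caller gets a failure result
    // instead of an exception when the stack check trips.
    public static bool TryDepth(int remaining, out int depth)
    {
        try
        {
            depth = Depth(remaining);
            return true;
        }
        catch (InsufficientExecutionStackException)
        {
            depth = -1;
            return false;
        }
    }
}
```

The check only protects frames that actually execute it, and the threshold is chosen by the runtime, so a single method with a very large frame could still overflow between two checks.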

Some ideas on how to make this better (includes ideas collected from other contributors):

  1. Add an opt-in 64 KB guard page region. This region would allow reasonably safe execution in case of stack overflow.
  2. Add an “I don’t care whatsoever” mode that just tries to limp along as far as possible ignoring all safety principles.
  3. Add an opt-in spare thread that can be activated in case of stack overflow. That thread would take care to at least write the stack out to the console or to a file.
  4. As stated above, fix the debugger. But that’s of no use in production.
  5. Print the stack to the console. Apparently, this feature exists but it never activates for me. I sometimes don’t even get the “Stack overflow” print message.
  6. Use an “alternate stack” as suggested in https://github.com/dotnet/runtime/issues/40302#issuecomment-1003840882.
  7. Add an “emergency mode” that probes for remaining stack space at each function call. If it falls below 64 KB, throw. This overhead would be tolerable as part of an emergency mode useful for production troubleshooting and unit testing.
  8. Terminate the current thread to allow the process to keep limping along (“emergency mode” only). (Suggestion from OP.)
  9. Provide an attribute like HandleProcessCorruptedStateExceptionsAttribute which allows specialized handlers to be added. (Suggestion from OP.)
  10. Provide a static event somewhere, which allows the exception handler to execute on a different thread. (Suggestion from OP.)

It seems that (1) would be a great way to get a production-friendly solution. If a production crash happens, the developer would enable the special mode to obtain actionable logging. And most importantly, this would fix the outage quickly!

@jcdickinson by the time MessageBox gets to run, the stack will have unwound, allowing the garbage collector to reclaim memory.

Not the garbage collector, though: it is the stack that gets unwound (the GC works on the heap only).

Still, what I’m saying is that you can never guarantee that a catch block will succeed. Never, for any exception! And yet we don’t forbid catch blocks because of this, do we?

GC is for heap only, but the stack unwinding will remove some GC roots.

Java has some level of support for this, right? Is there something .NET could learn from that? If not, what is wrong with what Java does in .NET’s view?

@tannergooding Not sure what to say. This is not what happens to me, on a vanilla VS2017 installation, when I get a StackOverflowException. I end up with no stack trace, no ability to evaluate locals, and no useful information, and have to track down the problem manually.

In many cases, this is actually not working. It is common for this exception to actually kill the debugger.

On Fri, Nov 17, 2017 at 9:17 PM, Tanner Gooding notifications@github.com wrote:

[image: image] https://user-images.githubusercontent.com/10487869/32964520-5f7701c2-cb88-11e7-9281-8fc202b96775.png

I’ve provided the picture above as a simple example.

The exception will be caught by a debugger and viewing either the Parallel Stacks or the Call Stack window will display the appropriate information.

However, the first place people may check ($exception.StackTrace) lists the stack trace as null. I agree with @masonwheeler that ensuring the exception stacktrace information is properly populated would be ideal.

tldr: We get this request a lot but we do not believe that it is a recoverable scenario. Further, making it “easy” to do the wrong thing seemed more harmful than tearing down from a dangerous situation.

While undergoing a stack overflow, your process is a hair’s width from doing all sorts of Really, Really bad things™. For instance, causing any more stack to be consumed can start scribbling over heap space (since you’ve lost the hard guard page at the end of the stack). In addition, as it’s basically an asynchronous exception, object state needs to be assumed trashed (how many user objects are really designed to operate correctly if their execution is terminated while manipulating state? And for value types it’s even worse, since their copy is non-atomic).

“Just re-add the guard page and move on”, you may say. However, it’s not possible to restore the guard page until after you’ve decided how you’re dispatching the exception so you can unwind some frames and do your work. When you get the exception, you’re already in the last page and you can’t restore the frame until you’ve taken stuff off. And you still have to run user code in the guard page to figure out how to handle it. Sure, that could be solved by ignoring filters during Stack Overflow, but you’re just building special case on top of special case at that point. (Stuff gets really weird when you have managed->native transitions there as well.) And what about any other SEH handlers that have registered on your thread?

For scenarios that want to run code beyond this exception, the path forward is “host clr from native” or better, “move the disruptive chunk of code to another process”.
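As a rough illustration of the “another process” option, the sketch below launches a worker executable and treats a non-zero exit code as failure. With this arrangement, a stack overflow in the worker tears down only the worker. The `IsolatedWorker` class, the `TryRun` method, and the idea of a separate worker executable are assumptions made for this example; how input and results cross the process boundary is left out.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;

public static class IsolatedWorker
{
    // Runs a hypothetical worker executable and reports whether it exited cleanly.
    // If the worker dies (for example from a stack overflow), only the child
    // process is lost and the parent observes a non-zero exit code.
    public static bool TryRun(string workerPath, string arguments, out string output)
    {
        output = string.Empty;

        var startInfo = new ProcessStartInfo(workerPath, arguments)
        {
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
        };

        try
        {
            using (var child = Process.Start(startInfo))
            {
                if (child == null)
                    return false;

                // Read all output before waiting, so the child cannot block
                // on a full stdout pipe.
                output = child.StandardOutput.ReadToEnd();
                child.WaitForExit();
                return child.ExitCode == 0;
            }
        }
        catch (Win32Exception)
        {
            // The executable could not be started at all.
            return false;
        }
    }
}
```

A caller such as an editor extension could then report “parse failed” and carry on, at the cost of process startup time and of serializing the work across the boundary.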