runtime: DoStackSnapshot after ICorProfilerInfo10::SuspendRuntime E_FAIL(-es) almost always for simple CPU consuming app (w/o switches)

Hi!

After discussing some issues with async stack walking, I decided to try ICorProfilerInfo10::SuspendRuntime() followed by DoStackSnapshot(). It works good on all examples I have tested (Yay!), but I found the following:

DoStackSnapshot() returns E_FAIL for 90% samples if the program calculates something in a loop (prime numbers, for ex.) and doesn’t use anything else:

using System;
using System.Threading;

namespace calculate_prime
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Current ManagedThreadId: {0}", Thread.CurrentThread.ManagedThreadId);
            Console.WriteLine(getNthPrime(30000));
        }

        private static uint getNthPrime(uint n)
        {
            for (uint i = 0; ; ++i)
                if (isPrime(i))
                    if (n-- == 0)
                        return i;
        }

        private static bool isPrime(uint n)
        {
            for (uint i = 2; i < n / 2; ++i)
                if (n % i == 0)
                    return false;
            return true;
        }
    }
}

In this case, DoStackSnapshot() got called 1694 times (counting the main thread only), but failed in 1527 cases with E_FAIL.

I tried to debug it a bit, and it seems for me that in (almost?) all of these cases there was a RedirectedThreadFrame on the top, and StackFrameIterator::Init() (here https://github.com/dotnet/coreclr/blob/aed84886b8d777168f53fae4da15c7af5f7572da/src/vm/stackwalk.cpp#L956) failed to initialize, because StackFrameIterator::NextRaw() failed to step out of it there: https://github.com/dotnet/coreclr/blob/aed84886b8d777168f53fae4da15c7af5f7572da/src/vm/stackwalk.cpp#L2666

clrstack at that moment:

(lldb) clrstack
OS Thread Id: 0x522e (1)
        Child SP               IP Call Site
00007FFE1D00AD80 00007f71a70699f3 [RedirectedThreadFrame: 00007ffe1d00ad80]
00007FFE1D00C550 00007F712BEE99C1 calculate_prime.Program.isPrime(UInt32)
00007FFE1D00C580 00007F712BEE9900 calculate_prime.Program.getNthPrime(UInt32) [/home/user/p/test/hello_csharp/calculate_prime/Program.cs @ 17]
00007FFE1D00C5B0 00007F712BEC7FCE calculate_prime.Program.Main(System.String[]) [/home/user/p/test/hello_csharp/calculate_prime/Program.cs @ 11]
00007FFE1D00CA70 00007f71a5089cc3 [GCFrame: 00007ffe1d00ca70]
00007FFE1D00D318 00007f71a5089cc3 [GCFrame: 00007ffe1d00d318]
(lldb) bt
* thread dotnet/coreclr#1, name = 'dotnet'
  * frame #0: 0x00007f71a70699f3 libpthread.so.0`__pthread_cond_wait at futex-internal.h:88
    frame dotnet/coreclr#1: 0x00007f71a70699d8 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:502
    frame dotnet/coreclr#2: 0x00007f71a70698f8 libpthread.so.0`__pthread_cond_wait(cond=0x0000000001524228, mutex=0x0000000001524200) at pthread_cond_wait.c:655
    frame dotnet/coreclr#3: 0x00007f71a55d0c75 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x0000000001524200, dwTimeout=4294967295, ptwrWakeupReason=0x00007ffe1d00a4b4, pdwSignaledObject=0x00007ffe1d00a4b0) at synchmanager.cpp:479:28
    frame dotnet/coreclr#4: 0x00007f71a55cff4e libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000001533d00, pthrCurrent=0x0000000001523fc0, dwTimeout=4294967295, fAlertable=false, fIsSleep=false, ptwrWakeupReason=0x00007ffe1d00a7b0, pdwSignaledObject=0x00007ffe1d00a7dc) at synchmanager.cpp:302:22
    frame dotnet/coreclr#5: 0x00007f71a55e20e7 libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x0000000001523fc0, nCount=1, lpHandles=0x00007ffe1d00a988, bWaitAll=NO, dwMilliseconds=4294967295, bAlertable=NO, bPrioritize=NO) at wait.cpp:640:45
    frame dotnet/coreclr#6: 0x00007f71a55e2c94 libcoreclr.so`::WaitForSingleObjectEx(hHandle=0x0000000000000098, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:139:13
    frame dotnet/coreclr#7: 0x00007f71a501b940 libcoreclr.so`CLREventWaitHelper2(handle=0x0000000000000098, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:377:12
    frame dotnet/coreclr#8: 0x00007f71a501b7ea libcoreclr.so`CLREventWaitHelper(this=0x00007ffe1d00aa80, pParam=0x00007ffe1d00aa90)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:402:26
    frame dotnet/coreclr#9: 0x00007f71a501a06d libcoreclr.so`CLREventWaitHelper(handle=0x0000000000000098, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:404:5
    frame dotnet/coreclr#10: 0x00007f71a5019fbe libcoreclr.so`CLREventBase::WaitEx(this=0x0000000001585130, dwMilliseconds=4294967295, mode=WaitMode_None, syncState=0x0000000000000000) at synch.cpp:471:20
    frame dotnet/coreclr#11: 0x00007f71a5019dff libcoreclr.so`CLREventBase::Wait(this=0x0000000001585130, dwMilliseconds=4294967295, alertable=NO, syncState=0x0000000000000000) at synch.cpp:417:12
    frame dotnet/coreclr#12: 0x00007f71a51b3d75 libcoreclr.so`GCEvent::Impl::Wait(this=0x0000000001585130, timeout=4294967295, alertable=false) at gcenv.os.cpp:1215:24
    frame dotnet/coreclr#13: 0x00007f71a51b3135 libcoreclr.so`GCEvent::Wait(this=0x00000000014d6610, timeout=4294967295, alertable=false) at gcenv.os.cpp:1293:20
    frame dotnet/coreclr#14: 0x00007f71a52dd310 libcoreclr.so`WKS::GCHeap::WaitUntilGCComplete(this=0x00000000014e1120, bConsiderGCStart=false) at gcee.cpp:312:40
    frame dotnet/coreclr#15: 0x00007f71a5021327 libcoreclr.so`Thread::RareDisablePreemptiveGC(this=0x0000000001589130) at threadsuspend.cpp:2576:60
    frame dotnet/coreclr#16: 0x00007f71a4cb24fe libcoreclr.so`Thread::DisablePreemptiveGC(this=0x0000000001589130) at threads.h:2036:13
    frame dotnet/coreclr#17: 0x00007f71a5022869 libcoreclr.so`Thread::PulseGCMode(this=0x0000000001589130) at threadsuspend.cpp:2961:9
    frame dotnet/coreclr#18: 0x00007f71a5027788 libcoreclr.so`HandleGCSuspensionForInterruptedThread(interruptedContext=0x00007ffe1d00bad0) at threadsuspend.cpp:6714:18
    frame dotnet/coreclr#19: 0x00007f71a551dcb3 libcoreclr.so`inject_activation_handler(code=34, siginfo=0x00007ffe1d00c0f0, context=0x00007ffe1d00bfc0) at signal.cpp:650:13
    frame dotnet/coreclr#20: 0x00007f71a706e890 libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
    frame dotnet/coreclr#21: 0x00007f712bee99c1
    frame dotnet/coreclr#22: 0x00007f712bee9900
    frame dotnet/coreclr#23: 0x00007f712bec7fce
    frame dotnet/coreclr#24: 0x00007f71a5089cc3 libcoreclr.so`CallDescrWorkerInternal at unixasmmacrosamd64.inc:866
    frame dotnet/runtime#3858: 0x00007f71a4e78fb2 libcoreclr.so`CallDescrWorkerWithHandler(pCallDescrData=0x00007ffe1d00c938, fCriticalCall=NO) at callhelpers.cpp:70:5
    frame dotnet/runtime#3859: 0x00007f71a4e7a4e8 libcoreclr.so`MethodDescCallSite::CallTargetWorker(this=0x00007ffe1d00caa0, pArguments=0x00007ffe1d00ca30, pReturnValue=0x0000000000000000, cbReturnValue=0) at callhelpers.cpp:604:9
    frame dotnet/coreclr#27: 0x00007f71a4cbeb83 libcoreclr.so`MethodDescCallSite::Call(this=0x00007ffe1d00caa0, pArguments=0x00007ffe1d00ca30) at callhelpers.h:468:9
    frame dotnet/runtime#3860: 0x00007f71a50b675e libcoreclr.so`RunMain(this=0x00007ffe1d00cbf8, pParam=0x00007ffe1d00ccf0)::$_1::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const::'lambda'(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*)::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const at assembly.cpp:1556:25
    frame dotnet/runtime#3861: 0x00007f71a50b3b19 libcoreclr.so`RunMain(this=0x00007ffe1d00cce0, __EXparam=0x00007ffe1d00ccf0)::$_1::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const at assembly.cpp:1571:5
    frame dotnet/runtime#3862: 0x00007f71a50b392c libcoreclr.so`RunMain(pFD=0x00007f712c1197d8, numSkipArgs=1, piRetVal=0x00007ffe1d00cdec, stringArgs=0x00007ffe1d00d2d8) at assembly.cpp:1571:5
    frame dotnet/runtime#3863: 0x00007f71a50b3dc2 libcoreclr.so`Assembly::ExecuteMainMethod(this=0x00000000016e2100, stringArgs=0x00007ffe1d00d2d8, waitForOtherThreads=YES) at assembly.cpp:1681:18
    frame dotnet/coreclr#32: 0x00007f71a4cb8095 libcoreclr.so`CorHost2::ExecuteAssembly(this=0x00000000014adca0, dwAppDomainId=1, pwzAssemblyPath=u"/home/user/p/test/hello_csharp/calculate_prime/bin/Debug/netcoreapp3.0/calculate_prime.dll", argc=0, argv=0x0000000000000000, pReturnValue=0x00007ffe1d00d490) at corhost.cpp:460:39
    frame dotnet/runtime#3864: 0x00007f71a4c4cccf libcoreclr.so`::coreclr_execute_assembly(hostHandle=0x00000000014adca0, domainId=1, argc=0, argv=0x0000000000000000, managedAssemblyPath="/home/user/p/test/hello_csharp/calculate_prime/bin/Debug/netcoreapp3.0/calculate_prime.dll", exitCode=0x00007ffe1d00d490) at unixinterface.cpp:412:24
    frame dotnet/runtime#3865: 0x00007f71a5c931c3 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
    frame dotnet/coreclr#35: 0x00007f71a5c93a1c libhostpolicy.so`corehost_main + 204
    frame dotnet/coreclr#36: 0x00007f71a5ef7b10 libhostfxr.so`___lldb_unnamed_symbol383$$libhostfxr.so + 1536
    frame dotnet/runtime#3866: 0x00007f71a5ef66a3 libhostfxr.so`___lldb_unnamed_symbol381$$libhostfxr.so + 691
    frame dotnet/coreclr#38: 0x00007f71a5ef16f3 libhostfxr.so`hostfxr_main_startupinfo + 147
    frame dotnet/runtime#3867: 0x000000000040c441 dotnet`___lldb_unnamed_symbol106$$dotnet + 785
    frame dotnet/runtime#3868: 0x000000000040c9fe dotnet`___lldb_unnamed_symbol107$$dotnet + 142
    frame dotnet/coreclr#41: 0x00007f71a6149b97 libc.so.6`__libc_start_main(main=(dotnet`___lldb_unnamed_symbol107$$dotnet), argc=2, argv=0x00007ffe1d00d9d8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffe1d00d9c8) at libc-start.c:310
    frame dotnet/coreclr#42: 0x0000000000402f40 dotnet`_start + 41

my debug output:

StackFrameIterator::Init m_crawl.pFrame = 00007FFE1D00AD80
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 frameState SFITER_INITIAL_NATIVE_CONTEXT <-- first iteration
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 fStop 0 NextRaw() = ...
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 m_crawl.pFunc = 0000000000000000
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 frameState SFITER_FRAME_FUNCTION <-- second iteration
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 fStop 0 NextRaw() = ...
StackFrameIterator::NextRaw m_crawl.pFrame 00007FFE1D00AD80 uFrameAttribs 8 m_crawl.isFirst = 1 adr = 00007F712BEE99C1 <-- and now it will fail to avoid deadlock
StackFrameIterator::Filter fStop 0 NextRaw() = 2
DoStackSnapshotHelper pThreadToSnapshot=0000000001589130[curr 0]} pctxSeed=0000000000000000 hr=80004005 <-- E_FAIL

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 19 (19 by maintainers)

Most upvoted comments

Thanks @k15tfu for the callstacks. I haven’t forgotten about this, I’ve just been busy with some other things lately. I hope to get to this relatively soon

@davmason One more thing. SuspendRuntime sometimes returns error codes:

* CORPROF_E_RUNTIME_UNINITIALIZED -- probably because I call it at a very early stage. What's the moment when I can start doing this? The first AppDomainCreationStarted or so?

We return CORPROF_E_RUNTIME_UNINITIALIZED when the runtime isn’t initialized yet. The specific check we do is for g_fEEStarted which gets set to true shortly after setting up the default appdomain. If you start sampling at the first appdomain created notification you’ll probably get a few CORPROF_E_RUNTIME_UNINITIALIZED before you can start sampling.

There’s nothing wrong with calling SuspendRuntime before initialization though, it won’t cause any issues. I don’t see any reason to try to synchronize it, just call SuspendRuntime and once it stops returning CORPROF_E_RUNTIME_UNINITIALIZED your app is up and running. Just ignore the error before that.

* CORPROF_E_SUSPENSION_IN_PROGRESS -- Does it happen because of GC? If so, is it indended to do DoStackSnapshot() for all threads in this case, or I should just skip this sample and go futher?

That means something else is already attempting to suspend the runtime. It could be the GC, or we have various other things that can cause a runtime suspension, although they are far more rare than GCs. You can either ignore that sample or block in RuntimeSuspendStarted when you get notified about the other suspension to do your sampling. That would effectively hijack the suspension. I don’t think there should be any issue with hijacking the suspension that way, other than making the GC/whatever else take longer than it would normally.

Fixed in master with dotnet/coreclr#26985 and ported to 3.1 with dotnet/coreclr#27300

@k15tfu if you have time, can you check that the change in dotnet/coreclr#26985 fixes this issue for you? When I run your sample program locally it works now.