runtime: DoStackSnapshot after ICorProfilerInfo10::SuspendRuntime E_FAIL(-es) almost always for simple CPU consuming app (w/o switches)
Hi!
After discussing some issues with async stack walking, I decided to try ICorProfilerInfo10::SuspendRuntime() followed by DoStackSnapshot(). It works good on all examples I have tested (Yay!), but I found the following:
DoStackSnapshot() returns E_FAIL for 90% samples if the program calculates something in a loop (prime numbers, for ex.) and doesn’t use anything else:
using System;
using System.Threading;
namespace calculate_prime
{
class Program
{
static void Main(string[] args)
{
Console.WriteLine("Current ManagedThreadId: {0}", Thread.CurrentThread.ManagedThreadId);
Console.WriteLine(getNthPrime(30000));
}
private static uint getNthPrime(uint n)
{
for (uint i = 0; ; ++i)
if (isPrime(i))
if (n-- == 0)
return i;
}
private static bool isPrime(uint n)
{
for (uint i = 2; i < n / 2; ++i)
if (n % i == 0)
return false;
return true;
}
}
}
In this case, DoStackSnapshot() got called 1694 times (counting the main thread only), but failed in 1527 cases with E_FAIL.
I tried to debug it a bit, and it seems for me that in (almost?) all of these cases there was a RedirectedThreadFrame
on the top, and StackFrameIterator::Init() (here https://github.com/dotnet/coreclr/blob/aed84886b8d777168f53fae4da15c7af5f7572da/src/vm/stackwalk.cpp#L956) failed to initialize, because StackFrameIterator::NextRaw() failed to step out of it there: https://github.com/dotnet/coreclr/blob/aed84886b8d777168f53fae4da15c7af5f7572da/src/vm/stackwalk.cpp#L2666
clrstack
at that moment:
(lldb) clrstack
OS Thread Id: 0x522e (1)
Child SP IP Call Site
00007FFE1D00AD80 00007f71a70699f3 [RedirectedThreadFrame: 00007ffe1d00ad80]
00007FFE1D00C550 00007F712BEE99C1 calculate_prime.Program.isPrime(UInt32)
00007FFE1D00C580 00007F712BEE9900 calculate_prime.Program.getNthPrime(UInt32) [/home/user/p/test/hello_csharp/calculate_prime/Program.cs @ 17]
00007FFE1D00C5B0 00007F712BEC7FCE calculate_prime.Program.Main(System.String[]) [/home/user/p/test/hello_csharp/calculate_prime/Program.cs @ 11]
00007FFE1D00CA70 00007f71a5089cc3 [GCFrame: 00007ffe1d00ca70]
00007FFE1D00D318 00007f71a5089cc3 [GCFrame: 00007ffe1d00d318]
(lldb) bt
* thread dotnet/coreclr#1, name = 'dotnet'
* frame #0: 0x00007f71a70699f3 libpthread.so.0`__pthread_cond_wait at futex-internal.h:88
frame dotnet/coreclr#1: 0x00007f71a70699d8 libpthread.so.0`__pthread_cond_wait at pthread_cond_wait.c:502
frame dotnet/coreclr#2: 0x00007f71a70698f8 libpthread.so.0`__pthread_cond_wait(cond=0x0000000001524228, mutex=0x0000000001524200) at pthread_cond_wait.c:655
frame dotnet/coreclr#3: 0x00007f71a55d0c75 libcoreclr.so`CorUnix::CPalSynchronizationManager::ThreadNativeWait(ptnwdNativeWaitData=0x0000000001524200, dwTimeout=4294967295, ptwrWakeupReason=0x00007ffe1d00a4b4, pdwSignaledObject=0x00007ffe1d00a4b0) at synchmanager.cpp:479:28
frame dotnet/coreclr#4: 0x00007f71a55cff4e libcoreclr.so`CorUnix::CPalSynchronizationManager::BlockThread(this=0x0000000001533d00, pthrCurrent=0x0000000001523fc0, dwTimeout=4294967295, fAlertable=false, fIsSleep=false, ptwrWakeupReason=0x00007ffe1d00a7b0, pdwSignaledObject=0x00007ffe1d00a7dc) at synchmanager.cpp:302:22
frame dotnet/coreclr#5: 0x00007f71a55e20e7 libcoreclr.so`CorUnix::InternalWaitForMultipleObjectsEx(pThread=0x0000000001523fc0, nCount=1, lpHandles=0x00007ffe1d00a988, bWaitAll=NO, dwMilliseconds=4294967295, bAlertable=NO, bPrioritize=NO) at wait.cpp:640:45
frame dotnet/coreclr#6: 0x00007f71a55e2c94 libcoreclr.so`::WaitForSingleObjectEx(hHandle=0x0000000000000098, dwMilliseconds=4294967295, bAlertable=NO) at wait.cpp:139:13
frame dotnet/coreclr#7: 0x00007f71a501b940 libcoreclr.so`CLREventWaitHelper2(handle=0x0000000000000098, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:377:12
frame dotnet/coreclr#8: 0x00007f71a501b7ea libcoreclr.so`CLREventWaitHelper(this=0x00007ffe1d00aa80, pParam=0x00007ffe1d00aa90)::$_1::operator()(CLREventWaitHelper(void*, unsigned int, int)::Param*) const at synch.cpp:402:26
frame dotnet/coreclr#9: 0x00007f71a501a06d libcoreclr.so`CLREventWaitHelper(handle=0x0000000000000098, dwMilliseconds=4294967295, alertable=NO) at synch.cpp:404:5
frame dotnet/coreclr#10: 0x00007f71a5019fbe libcoreclr.so`CLREventBase::WaitEx(this=0x0000000001585130, dwMilliseconds=4294967295, mode=WaitMode_None, syncState=0x0000000000000000) at synch.cpp:471:20
frame dotnet/coreclr#11: 0x00007f71a5019dff libcoreclr.so`CLREventBase::Wait(this=0x0000000001585130, dwMilliseconds=4294967295, alertable=NO, syncState=0x0000000000000000) at synch.cpp:417:12
frame dotnet/coreclr#12: 0x00007f71a51b3d75 libcoreclr.so`GCEvent::Impl::Wait(this=0x0000000001585130, timeout=4294967295, alertable=false) at gcenv.os.cpp:1215:24
frame dotnet/coreclr#13: 0x00007f71a51b3135 libcoreclr.so`GCEvent::Wait(this=0x00000000014d6610, timeout=4294967295, alertable=false) at gcenv.os.cpp:1293:20
frame dotnet/coreclr#14: 0x00007f71a52dd310 libcoreclr.so`WKS::GCHeap::WaitUntilGCComplete(this=0x00000000014e1120, bConsiderGCStart=false) at gcee.cpp:312:40
frame dotnet/coreclr#15: 0x00007f71a5021327 libcoreclr.so`Thread::RareDisablePreemptiveGC(this=0x0000000001589130) at threadsuspend.cpp:2576:60
frame dotnet/coreclr#16: 0x00007f71a4cb24fe libcoreclr.so`Thread::DisablePreemptiveGC(this=0x0000000001589130) at threads.h:2036:13
frame dotnet/coreclr#17: 0x00007f71a5022869 libcoreclr.so`Thread::PulseGCMode(this=0x0000000001589130) at threadsuspend.cpp:2961:9
frame dotnet/coreclr#18: 0x00007f71a5027788 libcoreclr.so`HandleGCSuspensionForInterruptedThread(interruptedContext=0x00007ffe1d00bad0) at threadsuspend.cpp:6714:18
frame dotnet/coreclr#19: 0x00007f71a551dcb3 libcoreclr.so`inject_activation_handler(code=34, siginfo=0x00007ffe1d00c0f0, context=0x00007ffe1d00bfc0) at signal.cpp:650:13
frame dotnet/coreclr#20: 0x00007f71a706e890 libpthread.so.0`___lldb_unnamed_symbol1$$libpthread.so.0 + 1
frame dotnet/coreclr#21: 0x00007f712bee99c1
frame dotnet/coreclr#22: 0x00007f712bee9900
frame dotnet/coreclr#23: 0x00007f712bec7fce
frame dotnet/coreclr#24: 0x00007f71a5089cc3 libcoreclr.so`CallDescrWorkerInternal at unixasmmacrosamd64.inc:866
frame dotnet/runtime#3858: 0x00007f71a4e78fb2 libcoreclr.so`CallDescrWorkerWithHandler(pCallDescrData=0x00007ffe1d00c938, fCriticalCall=NO) at callhelpers.cpp:70:5
frame dotnet/runtime#3859: 0x00007f71a4e7a4e8 libcoreclr.so`MethodDescCallSite::CallTargetWorker(this=0x00007ffe1d00caa0, pArguments=0x00007ffe1d00ca30, pReturnValue=0x0000000000000000, cbReturnValue=0) at callhelpers.cpp:604:9
frame dotnet/coreclr#27: 0x00007f71a4cbeb83 libcoreclr.so`MethodDescCallSite::Call(this=0x00007ffe1d00caa0, pArguments=0x00007ffe1d00ca30) at callhelpers.h:468:9
frame dotnet/runtime#3860: 0x00007f71a50b675e libcoreclr.so`RunMain(this=0x00007ffe1d00cbf8, pParam=0x00007ffe1d00ccf0)::$_1::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const::'lambda'(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*)::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const at assembly.cpp:1556:25
frame dotnet/runtime#3861: 0x00007f71a50b3b19 libcoreclr.so`RunMain(this=0x00007ffe1d00cce0, __EXparam=0x00007ffe1d00ccf0)::$_1::operator()(RunMain(MethodDesc*, short, int*, REF<PtrArray>*)::Param*) const at assembly.cpp:1571:5
frame dotnet/runtime#3862: 0x00007f71a50b392c libcoreclr.so`RunMain(pFD=0x00007f712c1197d8, numSkipArgs=1, piRetVal=0x00007ffe1d00cdec, stringArgs=0x00007ffe1d00d2d8) at assembly.cpp:1571:5
frame dotnet/runtime#3863: 0x00007f71a50b3dc2 libcoreclr.so`Assembly::ExecuteMainMethod(this=0x00000000016e2100, stringArgs=0x00007ffe1d00d2d8, waitForOtherThreads=YES) at assembly.cpp:1681:18
frame dotnet/coreclr#32: 0x00007f71a4cb8095 libcoreclr.so`CorHost2::ExecuteAssembly(this=0x00000000014adca0, dwAppDomainId=1, pwzAssemblyPath=u"/home/user/p/test/hello_csharp/calculate_prime/bin/Debug/netcoreapp3.0/calculate_prime.dll", argc=0, argv=0x0000000000000000, pReturnValue=0x00007ffe1d00d490) at corhost.cpp:460:39
frame dotnet/runtime#3864: 0x00007f71a4c4cccf libcoreclr.so`::coreclr_execute_assembly(hostHandle=0x00000000014adca0, domainId=1, argc=0, argv=0x0000000000000000, managedAssemblyPath="/home/user/p/test/hello_csharp/calculate_prime/bin/Debug/netcoreapp3.0/calculate_prime.dll", exitCode=0x00007ffe1d00d490) at unixinterface.cpp:412:24
frame dotnet/runtime#3865: 0x00007f71a5c931c3 libhostpolicy.so`___lldb_unnamed_symbol111$$libhostpolicy.so + 755
frame dotnet/coreclr#35: 0x00007f71a5c93a1c libhostpolicy.so`corehost_main + 204
frame dotnet/coreclr#36: 0x00007f71a5ef7b10 libhostfxr.so`___lldb_unnamed_symbol383$$libhostfxr.so + 1536
frame dotnet/runtime#3866: 0x00007f71a5ef66a3 libhostfxr.so`___lldb_unnamed_symbol381$$libhostfxr.so + 691
frame dotnet/coreclr#38: 0x00007f71a5ef16f3 libhostfxr.so`hostfxr_main_startupinfo + 147
frame dotnet/runtime#3867: 0x000000000040c441 dotnet`___lldb_unnamed_symbol106$$dotnet + 785
frame dotnet/runtime#3868: 0x000000000040c9fe dotnet`___lldb_unnamed_symbol107$$dotnet + 142
frame dotnet/coreclr#41: 0x00007f71a6149b97 libc.so.6`__libc_start_main(main=(dotnet`___lldb_unnamed_symbol107$$dotnet), argc=2, argv=0x00007ffe1d00d9d8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffe1d00d9c8) at libc-start.c:310
frame dotnet/coreclr#42: 0x0000000000402f40 dotnet`_start + 41
my debug output:
StackFrameIterator::Init m_crawl.pFrame = 00007FFE1D00AD80
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 frameState SFITER_INITIAL_NATIVE_CONTEXT <-- first iteration
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 fStop 0 NextRaw() = ...
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 ...
StackFrameIterator::ProcessCurrentFrame m_crawl.pFrame 00007FFE1D00AD80 m_crawl.pFunc = 0000000000000000
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 frameState SFITER_FRAME_FUNCTION <-- second iteration
StackFrameIterator::Filter m_crawl.pFrame 00007FFE1D00AD80 fStop 0 NextRaw() = ...
StackFrameIterator::NextRaw m_crawl.pFrame 00007FFE1D00AD80 uFrameAttribs 8 m_crawl.isFirst = 1 adr = 00007F712BEE99C1 <-- and now it will fail to avoid deadlock
StackFrameIterator::Filter fStop 0 NextRaw() = 2
DoStackSnapshotHelper pThreadToSnapshot=0000000001589130[curr 0]} pctxSeed=0000000000000000 hr=80004005 <-- E_FAIL
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 1
- Comments: 19 (19 by maintainers)
Thanks @k15tfu for the callstacks. I haven’t forgotten about this, I’ve just been busy with some other things lately. I hope to get to this relatively soon
We return
CORPROF_E_RUNTIME_UNINITIALIZED
when the runtime isn’t initialized yet. The specific check we do is forg_fEEStarted
which gets set to true shortly after setting up the default appdomain. If you start sampling at the first appdomain created notification you’ll probably get a fewCORPROF_E_RUNTIME_UNINITIALIZED
before you can start sampling.There’s nothing wrong with calling
SuspendRuntime
before initialization though, it won’t cause any issues. I don’t see any reason to try to synchronize it, just callSuspendRuntime
and once it stops returningCORPROF_E_RUNTIME_UNINITIALIZED
your app is up and running. Just ignore the error before that.That means something else is already attempting to suspend the runtime. It could be the GC, or we have various other things that can cause a runtime suspension, although they are far more rare than GCs. You can either ignore that sample or block in
RuntimeSuspendStarted
when you get notified about the other suspension to do your sampling. That would effectively hijack the suspension. I don’t think there should be any issue with hijacking the suspension that way, other than making the GC/whatever else take longer than it would normally.Fixed in master with dotnet/coreclr#26985 and ported to 3.1 with dotnet/coreclr#27300
@k15tfu if you have time, can you check that the change in dotnet/coreclr#26985 fixes this issue for you? When I run your sample program locally it works now.