runtime: Test hang in System.Diagnostics.Tracing.EventProvider.EventUnregister

      Half-way through waiting for remote process.
      Memory load: 14
      
      Image Name                     PID Services                                    
      ========================= ======== ============================================
      svchost.exe                    620 BrokerInfrastructure, DcomLaunch, LSM,      
                                         PlugPlay, Power, SystemEventsBroker         
      svchost.exe                    652 RpcEptMapper, RpcSs                         
      svchost.exe                    776 Dhcp, EventLog, lmhosts, vmictimesync,      
                                         Wcmsvc                                      
      svchost.exe                    808 AeLookupSvc, Appinfo, BITS, CertPropSvc,    
                                         gpsvc, iphlpsvc, LanmanServer, ProfSvc,     
                                         sacsvr, Schedule, SENS, SessionEnv,         
                                         ShellHWDetection, Themes, Winmgmt           
      svchost.exe                    888 EventSystem, FontCache, netprofm, nsi,      
                                         W32Time, WinHttpAutoProxySvc                
      svchost.exe                    976 CryptSvc, Dnscache, LanmanWorkstation,      
                                         NlaSvc, WinRM                               
      svchost.exe                    592 BFE, DPS, MpsSvc, pla                       
      svchost.exe                   1136 AppHostSvc                                  
      svchost.exe                   1164 DiagTrack                                   
      svchost.exe                   1328 TrkWks, UALSVC, UmRdpService,               
                                         vmickvpexchange, vmicshutdown, vmicvss      
      svchost.exe                   1352 W3SVC, WAS                                  
      svchost.exe                   2632 TermService                                 
      svchost.exe                   2672 vmicheartbeat, vmicrdv                      
      Timed out at 4/25/2020 1:55:40 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\h\w\AE3609A5\w\B05309CA\uploads\3476.zuzob0j0.l20.dmp
      	Process ID: 3476
      	Handle: 1004
      	Name: dotnet
      	MainModule: C:\h\w\AE3609A5\p\dotnet.exe
      	StartTime: 4/25/2020 1:54:40 AM
      	TotalProcessorTime: 00:00:00.2343750
      	Threads:
      		Thread #1 (OS 0x560)   
      			[InlinedCallFrame] (System.Environment._Exit)
      			[InlinedCallFrame] (System.Environment._Exit)
      			System.Environment.Exit(Int32)
      			Microsoft.DotNet.RemoteExecutor.Program.Main(System.String[])
      		Thread #2 (OS 0xA38) [Finalizer] [Background] [MTA]
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			[InlinedCallFrame] (Interop+Advapi32.EventUnregister)
      			System.Diagnostics.Tracing.EtwEventProvider.System.Diagnostics.Tracing.IEventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.EventUnregister(Int64)
      			System.Diagnostics.Tracing.EventProvider.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventProvider.Dispose()
      			System.Diagnostics.Tracing.EventSource.Dispose(Boolean)
      			System.Diagnostics.Tracing.EventListener.DisposeOnShutdown(System.Object, System.EventArgs)
      			System.AppContext.OnProcessExit()
      		Thread #3 (OS 0x700) [Thread pool worker] [Background] [MTA]
      		Thread #4 (OS 0xE50) [Thread pool worker] [Background] [MTA]

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 23 (22 by maintainers)

Most upvoted comments

I have stepped through the Event Pipe IPC shutdown sequence. I believe that there is a clear HANDLE use-after-close bug:

Handle that is being closed here:

00 CoreCLR!IpcStream::DiagnosticsIpc::Close+0xbb [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 213] 
01 CoreCLR!IpcStreamFactory::ConnectionState::Close+0x42 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.h @ 38] 
02 CoreCLR!IpcStreamFactory::CloseConnections+0x71 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 115] 
03 CoreCLR!DiagnosticServer::Shutdown+0x1e8 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 229] 
04 CoreCLR!EEShutDownHelper+0x1b6 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1212] 
05 CoreCLR!EEShutDown+0x346 [D:\runtime\src\coreclr\src\vm\ceemain.cpp @ 1677] 
06 CoreCLR!CorHost2::UnloadAppDomain2+0xfb [D:\runtime\src\coreclr\src\vm\corhost.cpp @ 942] 

is being waited on - or about to be waited on - here:

00 KERNELBASE!WaitForMultipleObjects [minkernel\kernelbase\synch.c @ 1401] 
01 CoreCLR!IpcStream::DiagnosticsIpc::Poll+0x31f [D:\runtime\src\coreclr\src\debug\debug-pal\win\diagnosticsipc.cpp @ 311] 
02 CoreCLR!IpcStreamFactory::GetNextAvailableStream+0x1c2 [D:\runtime\src\coreclr\src\vm\ipcstreamfactory.cpp @ 162] 
03 CoreCLR!DiagnosticServer::DiagnosticsServerThread+0x379 [D:\runtime\src\coreclr\src\vm\diagnosticserver.cpp @ 50] 
04 KERNEL32!BaseThreadInitThunk+0x14 [base\win32\client\thread.c @ 64] 
05 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153] 

How this explains the hang:

  • The handle value gets reused for an auto-reset event allocated by OS loader code
  • The Wait operation in IpcStream will reset the auto-reset event
  • The OS loader code will then hang forever waiting for the event, because it was already reset (see the sketch below)

Why is it more likely to repro on Win7/8:

  • Differences in the OS scheduler and differences in the OS handle allocation algorithm

The shutdown sequence of the event pipe needs careful review. It would also be a good idea to minimize the work done during shutdown to only what’s absolutely necessary; e.g., these event handles do not really need to be closed, since the OS will close them anyway as part of process shutdown.
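A minimal standalone sketch of the failure mode described above (illustration only, not the CoreCLR code; whether the OS actually recycles the handle value is not guaranteed, so the sketch prints whether it happened):

// Handle-reuse hazard: a closed handle value is handed back out for an
// unrelated auto-reset event, and a stale wait on the old value steals
// that event's signal.
#include <windows.h>
#include <cstdio>

int main()
{
    // Handle the diagnostics server thread intends to wait on.
    HANDLE stalePipeEvent = CreateEventW(nullptr, /*bManualReset*/ TRUE, FALSE, nullptr);

    // The shutdown path closes it while a wait on it is still pending elsewhere.
    CloseHandle(stalePipeEvent);

    // Unrelated code (think: OS loader) allocates an auto-reset event and may
    // get the same numeric handle value back from the OS.
    HANDLE loaderEvent = CreateEventW(nullptr, /*bManualReset*/ FALSE, FALSE, nullptr);
    printf("handle value recycled: %s\n", loaderEvent == stalePipeEvent ? "yes" : "no");

    SetEvent(loaderEvent);

    // The stale waiter wakes on the recycled value; because the event is
    // auto-reset, the satisfied wait also resets it, consuming the signal.
    WaitForSingleObject(stalePipeEvent, 0);

    // The legitimate owner never sees its signal. A timeout is used here so
    // the demo terminates; the real loader waits with INFINITE and hangs.
    DWORD rc = WaitForSingleObject(loaderEvent, 1000);
    printf("owner wait returned %lu (WAIT_TIMEOUT == %lu)\n", rc, (unsigned long)WAIT_TIMEOUT);
    return 0;
}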

Reverting in #35767

Can we revert the offending PR? This is causing at least one leg in the majority of PRs to fail. That makes it much more likely that other regressions slip in: we’re having to rerun lots of legs and merge with failures that might be masking other failures, and devs become more likely to assume a failure isn’t theirs even when it is in fact caused by their PR. Thanks!

The dump at \\jkotas9\drops\issue-35451 confirms this: pHandles[0] on thread 3 is 0x168, and CriticalSection.LockSemaphore on thread 2 is also 0x168.

By “ping” I meant that it literally made a ping sound from the console. I had stepped away from my computer for a while, and it started beeping. It seems like after 1000 loops most of the runs are taking >10 seconds! The xunit output claims the tests ran in 5 seconds (still a big increase over the 0.6-second runtime from the first 900 loops), but the RunTests.cmd script didn’t finish in under 10 seconds. The logs also say that the tests passed. I’ll dig into this more tomorrow with some better logging and see if it actually means anything (and mute my computer before I call it a day 😝).

This only happens on Windows 7 and Windows 8.1; I tried to repro locally on Windows 10, and after 150 runs it didn’t repro.

Interesting!

I’ll take a look at this today. Thanks for triaging things @safern! I’m not sure what would be causing this issue, but based on the versions being hit, I’m guessing it might be something with Overlapped IO in old versions of Windows. I’ll have to do some investigating.

Attaching windbg showed the loader lock was unowned:

0:002> dx -r1 ((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)
((ntdll!_RTL_CRITICAL_SECTION *)0x7ffaa438e8a8)                 : 0x7ffaa438e8a8 [Type: _RTL_CRITICAL_SECTION *]
    [+0x000] DebugInfo        : 0x7ffaa438ebe0 [Type: _RTL_CRITICAL_SECTION_DEBUG *]
    [+0x008] LockCount        : 0 [Type: long]
    [+0x00c] RecursionCount   : 0 [Type: long]
    [+0x010] OwningThread     : 0x0 [Type: void *]
    [+0x018] LockSemaphore    : 0x168 [Type: void *]
    [+0x020] SpinCount        : 0x4000000 [Type: unsigned __int64]

Best guess: the loader lock was owned earlier, causing the OS to switch the thread out; later the lock was released, but for whatever reason the OS hasn’t yet switched the thread back in.

Jan’s suggestion of #13610 sounds very plausible to me. If we wanted more data on this failure (or failures of this class), we’d probably need to upgrade the telemetry to determine whether the thread is making progress and what is happening machine-wide. A better set of telemetry to capture for a suspected hung test would be:

  1. process dump
  2. ETW trace including stack sampling and context switches machine-wide for ~30 sec
  3. another process dump

This would clearly show whether progress was being made, as well as what else the machine is doing (a minimal dump-capture sketch follows).
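For the two process-dump steps, the dump could be captured programmatically with dbghelp (the ETW trace in step 2 would come from existing tooling such as WPR or PerfView). A minimal sketch, assuming the target PID is passed on the command line and the output file name hang.dmp is arbitrary:

// Minimal dump-capture sketch for steps 1 and 3.
#include <windows.h>
#include <dbghelp.h>
#include <cstdio>
#include <cstdlib>
#pragma comment(lib, "dbghelp.lib")

int main(int argc, char** argv)
{
    if (argc < 2) { printf("usage: %s <pid>\n", argv[0]); return 1; }
    DWORD pid = (DWORD)atoi(argv[1]);

    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid);
    if (process == nullptr) { printf("OpenProcess failed: %lu\n", GetLastError()); return 1; }

    HANDLE file = CreateFileW(L"hang.dmp", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) { printf("CreateFile failed: %lu\n", GetLastError()); return 1; }

    // A full-memory dump is enough to walk both native and managed stacks offline.
    BOOL ok = MiniDumpWriteDump(process, pid, file, MiniDumpWithFullMemory,
                                nullptr, nullptr, nullptr);
    printf("MiniDumpWriteDump %s\n", ok ? "succeeded" : "failed");

    CloseHandle(file);
    CloseHandle(process);
    return ok ? 0 : 1;
}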

Closing this as a duplicate of #13610.