runtime: Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout

From https://github.com/dotnet/corefx/pull/41753

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213282&resultId=145033&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/421e5238-fb5c-4d4f-8d24-a8e79abd46eb/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================

C:\helix\work\workitem>"C:\helix\work\correlation\dotnet.exe" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonwindowstests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing  
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries [FAIL]
      Timed out at 10/18/2019 11:24:33 AM after 60000ms waiting for remote process.
      Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp
      	Process ID: 4708
      	Handle: 912
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(71,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 118.055s
----- end Fri 10/18/2019 11:25:06.32 ----- exit code 1 ----------------------------------------------------------

System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open

https://dev.azure.com/dnceng/public/_build/results?buildId=393033&view=ms.vss-test-web.build-test-results-tab&runId=12213340&paneView=debug

https://helix.dot.net/api/2019-06-17/jobs/82ca2546-9df2-4b32-ab1f-d00835fcbe01/workitems/System.Text.RegularExpressions.Tests/console

===========================================================================================================
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec ~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
  Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  System.Text.RegularExpressions.Tests (found 135 test cases)
  Starting:    System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
    System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw [FAIL]
      Timed out at 10/18/2019 11:07:51 AM after 60000ms waiting for remote process.
      	Process ID: 2283
      	Handle: 892
      
      Stack Trace:
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
        /_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
        /_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(106,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw()
  Finished:    System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
   System.Text.RegularExpressions.Tests  Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 77.472s
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
----- end Fri Oct 18 11:07:57 UTC 2019 ----- exit code 1 ----------------------------------------------------------

cc @ViktorHofer

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 97 (96 by maintainers)

Most upvoted comments

Kudos to @safern and @ericstj for quickly helping narrow it to 2 commits.

The diagnostic server PR https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde is the obvious culprit. It is closely coupled with the code where this hang occurs.

It is likely that the diagnostic server eventpipe code is doing something bad during shutdown, and that corruption is leading to this hang. We had similar intermittent shutdown bugs in eventpipe before, so it is not that surprising that a new one popped up after a significant eventpipe change.

Let’s use this issue to track the original problem with machine oversubscription, and reactivate #35451 to track the hang in EventUnregister.

cc @josalem

The frequency of this issue seems too high to be an existing issue. I suspect a recent regression. It also seems to be observed in WinForms, starting around 4/20 when they took a runtime update from 4.20218.3 to 4.20220.15. So we should look in this time window for a commit that could have introduced this hang.

Based on this info, we were able to track down that the SHAs in between those two package versions are c409bd0d3d80d6c674bb99b3ec0b2c15a891112c and 44ed52a8cd71d4dedc58943725a1fe3a14f4b38d
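One way to enumerate the commits in that range locally (plain git range syntax, nothing specific to this repo) would be:

    git log --oneline c409bd0d3d80d6c674bb99b3ec0b2c15a891112c..44ed52a8cd71d4dedc58943725a1fe3a14f4b38d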

Then we looked at Kusto data to find the first build that hit this issue: https://dev.azure.com/dnceng/public/_build/results?buildId=608569&view=results – which used https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde merged with master to check out the repo for the build. The builds started failing on April 21st.

So based on this I can only think of two commits in question: https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde and https://github.com/dotnet/runtime/commit/aa5b2041471d7687dbddb1d275ea8a93943c43bf

@jkotas @noahfalk do you see anything that pops up there?

I hope we get some quick-to-capture data first, like at least recording whether high memory is normal/frequent/correlated. Things like that are really low cost, so we can go to town collecting all manner of stuff at a fairly high level, and then dig down incrementally as the data indicates which direction to go.
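As an illustration of how cheap such a probe could be, a sketch like the following (a hypothetical helper using only standard .NET APIs, not an existing Helix or xunit hook) would record the machine state at a given point so the hangs can later be correlated with memory/CPU pressure:

    // Hypothetical low-cost probe; not part of any existing test infrastructure.
    using System;

    public static class MachineStateLogger
    {
        public static void Log(string context)
        {
            // GetGCMemoryInfo gives a rough view of machine-wide memory load.
            GCMemoryInfo mem = GC.GetGCMemoryInfo();
            Console.WriteLine(
                $"[{DateTime.UtcNow:O}] {context}: " +
                $"processors={Environment.ProcessorCount}, " +
                $"workingSet={Environment.WorkingSet / (1024 * 1024)} MB, " +
                $"totalAvailable={mem.TotalAvailableMemoryBytes / (1024 * 1024)} MB, " +
                $"memoryLoad={mem.MemoryLoadBytes / (1024 * 1024)} MB");
        }
    }

Calling something like MachineStateLogger.Log("before RemoteExecutor") around the spots that hang would be enough to tell whether high memory is normal, frequent, or correlated with the failures.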

Sorry, I do not have cycles to deal with this.

No problem at all. Thought it was worth asking. @Anipik this might be something worth picking up as it directly contributes to the CI pass rate. cc @ericstj

-maxthreads 1 sounds reasonable to me to start with.

Does -parallel collections have any effect when you specify -maxthreads 1?
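For reference, the runner invocation from the console log above could pin both knobs explicitly; -parallel and -maxthreads are documented xunit.console options (trait filters omitted here for brevity, so this is illustrative rather than the exact Helix command line):

    dotnet exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -parallel collections -maxthreads 1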

There is still a chance that we find the machine to be in a fine state when the test is starting, and then the expensive background services kick in while the test is running. It may be better to throttle this in xunit itself to be more agile.

It makes me think: we have seen these hangs with the remote executor, so maybe we should add the throttling to the remote executor itself (see the sketch after this list):

  • When the machine is oversubscribed, wait until other remote executor processes finish before starting a new one (i.e. do not have more than one running)
  • Otherwise, if the machine is still oversubscribed, wait 1 second before starting the remote executor process
  • Make sure to log diagnostics in any of these cases
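A minimal sketch of that idea, assuming a hypothetical ThrottledRemoteExecutor wrapper around RemoteExecutor.Invoke and a crude memory-based oversubscription check (the class name, threshold, and heuristic are illustrative, not the actual Microsoft.DotNet.RemoteExecutor implementation):

    // Hypothetical wrapper; not the real Microsoft.DotNet.RemoteExecutor code.
    using System;
    using System.Threading;
    using Microsoft.DotNet.RemoteExecutor;

    public static class ThrottledRemoteExecutor
    {
        // Illustrative threshold: treat the machine as oversubscribed when less
        // than ~512 MB appears to be free. A real implementation would tune this.
        private const long LowMemoryBytes = 512L * 1024 * 1024;

        // Serializes launches that go through this wrapper. A real implementation
        // would hold the gate until the remote process exits (handle disposal),
        // not just until it has been started.
        private static readonly SemaphoreSlim s_gate = new SemaphoreSlim(1, 1);

        public static RemoteInvokeHandle Invoke(Action method)
        {
            if (!IsOversubscribed())
            {
                return RemoteExecutor.Invoke(method);
            }

            Console.WriteLine($"[throttle] {DateTime.UtcNow:O} machine oversubscribed; waiting for other remote processes");
            s_gate.Wait();
            try
            {
                if (IsOversubscribed())
                {
                    // Still oversubscribed: give background services a second to settle.
                    Console.WriteLine($"[throttle] {DateTime.UtcNow:O} still oversubscribed; delaying launch by 1s");
                    Thread.Sleep(TimeSpan.FromSeconds(1));
                }
                return RemoteExecutor.Invoke(method);
            }
            finally
            {
                s_gate.Release();
            }
        }

        private static bool IsOversubscribed()
        {
            // Crude heuristic based on the GC's view of machine-wide memory.
            GCMemoryInfo info = GC.GetGCMemoryInfo();
            return info.TotalAvailableMemoryBytes - info.MemoryLoadBytes < LowMemoryBytes;
        }
    }

The console writes cover the third bullet, the gate approximates the first (only one throttled remote process at a time), and the extra one-second delay corresponds to the second.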

I ran a few instances of our images that we are using and watched them execute.

For the first 15 minutes or so, Windows appears to be doing some standard first-run tasks. Memory and CPU usage varies, but is fairly high for this entire period (memory usage is between 3 and 5 GB, and CPU ranges from 50% to 100%). Processes and services are spinning up and down pretty heavily, so it doesn’t look like it’s any one specific thing that I would point to.

Once that levels off, the usage drops sharply and stays low. This would align with Matt’s observations that these failures seem to happen in the first few items run on a machine.

Rebooting the machine while it’s doing these things doesn’t appear to help much. It simply resumes what it was doing when it starts up again, adding the reboot cycle time to the overall stabilization time.

While the resource usage is high, it’s not at 100%, so the machine is still capable of doing useful work during this period, just a bit more slowly than once it hits its stride.