runtime: Test failed: System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries fails with Timeout
From https://github.com/dotnet/corefx/pull/41753
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries netcoreapp-Windows_NT-Debug-x64-(Windows.Nano.1809.Amd64.Open)
===========================================================================================================
C:\helix\work\workitem>"C:\helix\work\correlation\dotnet.exe" exec --runtimeconfig System.Text.RegularExpressions.Tests.runtimeconfig.json xunit.console.dll System.Text.RegularExpressions.Tests.dll -xml testResults.xml -nologo -nocolor -notrait category=nonnetcoreapptests -notrait category=nonwindowstests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing
Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.Text.RegularExpressions.Tests (found 135 test cases)
Starting: System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries [FAIL]
Timed out at 10/18/2019 11:24:33 AM after 60000ms waiting for remote process.
Wrote mini dump to: C:\helix\work\workitem\uploads\4708.wrrjax4w.c3b.dmp
Process ID: 4708
Handle: 912
Stack Trace:
/_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
/_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
/_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(71,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Promote_entries()
Finished: System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
System.Text.RegularExpressions.Tests Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 118.055s
----- end Fri 10/18/2019 11:25:06.32 ----- exit code 1 ----------------------------------------------------------
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw netcoreapp-Linux-Debug-x64-RedHat.7.Amd64.Open
===========================================================================================================
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec ~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
Discovering: System.Text.RegularExpressions.Tests (method display = ClassAndMethod, method display options = None)
Discovered: System.Text.RegularExpressions.Tests (found 135 test cases)
Starting: System.Text.RegularExpressions.Tests (parallel test collections = on, max threads = 2)
System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw [FAIL]
Timed out at 10/18/2019 11:07:51 AM after 60000ms waiting for remote process.
Process ID: 2283
Handle: 892
Stack Trace:
/_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(152,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose(Boolean disposing)
/_/src/Microsoft.DotNet.RemoteExecutor/src/RemoteInvokeHandle.cs(55,0): at Microsoft.DotNet.RemoteExecutor.RemoteInvokeHandle.Dispose()
/_/src/System.Text.RegularExpressions/tests/Regex.Cache.Tests.cs(106,0): at System.Text.RegularExpressions.Tests.RegexCacheTests.Ctor_Cache_Uses_dictionary_linked_list_switch_does_not_throw()
Finished: System.Text.RegularExpressions.Tests
=== TEST EXECUTION SUMMARY ===
System.Text.RegularExpressions.Tests Total: 1454, Errors: 0, Failed: 1, Skipped: 0, Time: 77.472s
~/work/82ca2546-9df2-4b32-ab1f-d00835fcbe01/Work/93c6bbb0-68cc-493d-8eb4-85722fafeac6/Exec
----- end Fri Oct 18 11:07:57 UTC 2019 ----- exit code 1 ----------------------------------------------------------
cc @ViktorHofer
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 97 (96 by maintainers)
Commits related to this issue
- CI TEST: Check typical Helix machines memory utilization Investigating https://github.com/dotnet/coreclr/issues/27305 — committed to jkotas/corefx by jkotas 5 years ago
- Add logging of service process details on timeout Contributes to https://github.com/dotnet/coreclr/issues/27305 — committed to dotnet/arcade by jkotas 5 years ago
- Add logging of service process details on timeout (#4253) Contributes to https://github.com/dotnet/coreclr/issues/27305 — committed to dotnet/arcade by jkotas 5 years ago
Kudos to @safern and @ericstj for quickly helping narrow it to 2 commits.
The diagnostic server PR https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde is the obvious culprit. It is closely coupled with the code where this hang occurs.
It is likely that the diagnostic server eventpipe code is doing something bad during shutdown, and that corruption is leading to this hang. We had similar intermittent shutdown bugs in eventpipe before, so it is not that surprising that a new one popped up after significant eventpipe change.
Let’s use this issue to track the original problem with machine oversubscription, and reactivate #35451 to track the hang in
EventUnregister
.cc @josalem
Based on this info, we were able to track down that the shas in between those two package versions are: c409bd0d3d80d6c674bb99b3ec0b2c15a891112c and 44ed52a8cd71d4dedc58943725a1fe3a14f4b38d
Then we looked at Kusto data to find the first build that hit this issue and it was: https://dev.azure.com/dnceng/public/_build/results?buildId=608569&view=results – which used: https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde to merge with master to checkout the repo for the build. The builds started failing in april 21st.
So based on this I can only thing of 2 commits in question: https://github.com/dotnet/runtime/commit/629dba55058b58a126d08fc37db30b0cfa844dde and https://github.com/dotnet/runtime/commit/aa5b2041471d7687dbddb1d275ea8a93943c43bf
@jkotas @noahfalk do you see anything that pops up there?
I hope we get some quick to capture data first, like at least recording if high memory is normal/frequent/correlated. Things like that are really low cost, so we can go to town collecting all manner of stuff at sort of higher level, and then we can dig down incrementally as we get data indicating which direction to go.
No problem at all. Thought it’s worth asking. @Anipik this might be something worth to pick up as it directly contributes to CI pass rate. cc @ericstj
-maxthread 1
sounds reasonable to me to start with.Does
-parallel collections
have any affect when you specify-maxthread 1
?There is still a chance that we find the machine to be in fine state when the test is starting, and then the expensive background services kick in while the test is running. It may be better to throttle this in the xunit itself to be more agile.
It makes me think: We have seen these hangs with remote executor. Maybe we should add the throttling to the remote executor:
I ran a few instances of our images that we are using and watched them execute.
For the first 15 minutes or so, windows appears to be doing some standard first run tasks. Memory and CPU usage varies, but is fairly high for this entire period (Memory usage is between 3 and 5 Gb, and CPU from 50% - 100%). Processes and services are spinning up and down pretty heavily, so it doesn’t look like it’s any one specific thing that I would point to.
Once that levels off, the usage drops sharply and stays low. This would align with Matt’s observations that these failures seem to happen in the first few items run on a machine.
Rebooting the machine while it’s doing these things doesn’t appear to help much. It just resumes what it was doing when it starts up again, and just adds the reboot cycle time to the overall stabilization time.
While the resource usage is high, it’s not 100%, so the machine is still capable of doing useful work during this point, just a bit more slowly than once they hit their stride.