arcade: [iOS] Infrastructure is failing to report successful test run
Build
Build leg reported
System.Runtime.Tests.WorkItemExecution
Pull Request
https://github.com/dotnet/runtime/pull/78288
Action required for the engineering services team
To triage this issue (First Responder / @dotnet/dnceng):
- Open the failing build above and investigate
- Add a comment explaining your findings
If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:
- Add the label “Known Build Error”
- Edit this issue and add an error string in the Json below that can help us match this issue with future build breaks. You should use the known issues documentation
{
  "ErrorMessage": "[TCP tunnel] Xamarin.Hosting: Failed to connect to port",
  "BuildRetry": true
}
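As a rough illustration of how an entry like this could be matched against a build log, here is a minimal Python sketch. It assumes plain substring matching on each log line; the function name `match_known_issue` and the sample log line with a port number are made up for the example and are not taken from the Build Analysis implementation.

```python
import json

# The known-issue JSON from the issue body above.
KNOWN_ISSUE = json.loads("""
{
  "ErrorMessage": "[TCP tunnel] Xamarin.Hosting: Failed to connect to port",
  "BuildRetry": true
}
""")


def match_known_issue(log_text, issue):
    """Return (matched, retry).

    Assumption: a log line matches when it contains ErrorMessage verbatim.
    """
    needle = issue["ErrorMessage"]
    matched = any(needle in line for line in log_text.splitlines())
    retry = matched and bool(issue.get("BuildRetry", False))
    return matched, retry


# Hypothetical log: the first line is quoted from the issue, the second is invented.
sample_log = "\n".join([
    "Tests run: 50259 Passed: 50025 Inconclusive: 0 Failed: 0 Ignored: 133 Skipped: 101",
    "[TCP tunnel] Xamarin.Hosting: Failed to connect to port 12345",
])

print(match_known_issue(sample_log, KNOWN_ISSUE))
```

With "BuildRetry" set to true, a match in this sketch also signals that the build is a candidate for an automatic retry.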
Additional information about the issue reported
System.Runtime.Tests are reported as failing. It looks like a failure in the reporting infrastructure. System.Runtime.Tests succeeded according to the net.dot.System.Runtime.Tests.log file:
Tests run: 50259 Passed: 50025 Inconclusive: 0 Failed: 0 Ignored: 133 Skipped: 101
Killing process 95249 as it was cancelled
[TerminateWithSuccess]
Report
| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 89138 | dotnet/runtime | System.Runtime.Tests.WorkItemExecution | dotnet/runtime#78593 |
Summary
| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 1 | 1 | 1 |
About this issue
- State: closed
- Created 2 years ago
- Comments: 34 (34 by maintainers)
Seems like the new fixes are already working! The TCP problems are now being detected.
More importantly, the retries that now happen for these cases have dropped the overall failure rate of AppleTV jobs to 0.
cc @AlitzelMendez @ilyas1974
We’ve tried variations of this with the same result, so no tbh. I think, as you pointed out, it’s an issue w/ mlaunch/xcode/osx or something within runtime. I don’t believe it’s the latter.
When this was done in the past, no one could reproduce the issue locally. @akoeplinger suspects the TCP session is done over wifi as opposed to usb-max. I don’t think that’s something we’ve done locally, and it’s worth a try.
It happens in this order:
Now… XHarness doesn’t have visibility into the device, so it doesn’t know whether the app started well. The app can crash at this point, for instance, so it never connects. The app might also fail to connect to the TCP tunnel. For XHarness it’s all the same: if the app doesn’t connect within a specific time (argument --launch-timeout), it times out.
Now from the point of view of the app: the app can start, and as an argument (or envvar) it receives the port where the tunnel should be. If it can’t connect there, it logs to its stdout instead of the TCP tunnel (this is the TestRunner link I sent).
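The two sides described above can be sketched in Python as follows. This is only an illustration using plain sockets on localhost, not the XHarness or mlaunch code; the function names `wait_for_app` and `open_result_stream` and the timeout values are assumptions made for the example.

```python
import socket
import sys


def wait_for_app(listener, launch_timeout_s):
    """Host side: True if something connected before the timeout, else False.

    The host cannot tell *why* nothing connected (crash, broken tunnel, ...).
    """
    listener.settimeout(launch_timeout_s)
    try:
        conn, _ = listener.accept()
    except socket.timeout:
        return False
    conn.close()
    return True


def open_result_stream(host, port, timeout_s):
    """App side: try the port it was given, fall back to stdout on failure."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout_s)
        return sock.makefile("w"), "tcp"
    except OSError:
        return sys.stdout, "stdout"


# Demo: a listener that nobody connects to, so the host side times out.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
connected = wait_for_app(listener, launch_timeout_s=0.2)
listener.close()

# Demo: the port is closed now, so the app side falls back to stdout.
stream, kind = open_result_stream("127.0.0.1", port, timeout_s=0.2)
print(connected, kind)
```

The point of the sketch is that a timeout on the host side looks identical whether the app crashed or the tunnel was unreachable, which is why the stdout fallback on the app side is useful for diagnosis.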
There’s also this diagram for this - https://github.com/dotnet/arcade/issues/11700
To me it seems that in the last few months mlaunch opens the tunnel but then has trouble keeping it open. If you check the logs, it flips from “awaiting connection on port XY” to “failed to open the tunnel” again. With a recent change we try to detect this state and categorize it as a TCP failure.
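One way to picture this kind of detection is a small log scan that counts how often the tunnel goes from waiting to failed. The sketch below is in Python and is not the actual XHarness change; the regular expressions are placeholders built from the phrases quoted above, and real mlaunch log lines may be worded differently.

```python
import re

# Placeholder patterns; the exact wording in real logs is an assumption.
AWAITING = re.compile(r"awaiting connection on port", re.IGNORECASE)
FAILED = re.compile(r"failed to (open the tunnel|connect to port)", re.IGNORECASE)


def count_tunnel_flaps(lines):
    """Count transitions from an 'awaiting' line to a 'failed' line."""
    flaps = 0
    state = None
    for line in lines:
        if AWAITING.search(line):
            state = "awaiting"
        elif FAILED.search(line):
            if state == "awaiting":
                flaps += 1
            state = "failed"
    return flaps


# Invented sample log for illustration only.
sample = [
    "Awaiting connection on port 1234",
    "Failed to open the tunnel",
    "Awaiting connection on port 1234",
    "Failed to open the tunnel",
    "Tests run: 10 Passed: 10",
]

flap_count = count_tunnel_flaps(sample)
print("TCP failure suspected" if flap_count > 0 else "no tunnel flapping seen")
```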
When making the last change, we did give the app a wrong port purposefully to test the TCP tunnel not being there.
The problem is that the TCP tunnel sometimes only starts after a while, and then things can work fine. Example: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-arcade-refs-heads-release-60-ac03203fd7924eddad/zipped-apps/1/console.3dbd4769.log?helixlogtype=result
These are the main things:
Something must have changed in mlaunch or in macOS, because we didn’t see this behaviour ~6 months ago and running device tests was actually super smooth. I am happy to meet in the office and we can talk about this in person as it’s quite complicated.
It’s actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn’t use to have this many issues with TCP, but it’s possible something regressed between mlaunch and new macOS versions.
This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26
@kotlarmilos there are still issues with the tcp tunnel even though your PR may have contributed to the spike of failures.
We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.
We can enable retries and see if they help. I thought we added iphones to the rolling build. I’ll definitely do that if it’s not currently.
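To make the retry idea concrete, here is a minimal Python sketch of retrying a job only when it ends with a dedicated TCP exit code. The constant `TCP_FAILURE_EXIT_CODE`, its value, and the helper `run_with_retries` are hypothetical and do not reflect the real XHarness exit codes or the Helix retry mechanism.

```python
# Hypothetical exit code standing in for "TCP tunnel problem detected".
TCP_FAILURE_EXIT_CODE = 200


def run_with_retries(run_once, max_attempts=3, retry_on=(TCP_FAILURE_EXIT_CODE,)):
    """Call run_once() until it returns a non-retryable code or attempts run out.

    Returns (last_exit_code, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        code = run_once()
        if code not in retry_on or attempts >= max_attempts:
            return code, attempts


# Fake job for illustration: fails twice with the TCP code, then succeeds.
outcomes = iter([TCP_FAILURE_EXIT_CODE, TCP_FAILURE_EXIT_CODE, 0])
result = run_with_retries(lambda: next(outcomes))
print(result)
```

Keeping the retry list limited to the TCP code means genuine test failures are still reported on the first attempt instead of being masked by retries.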