arcade: [iOS] Infrastructure is failing to report successful test run

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=89051

Build leg reported

System.Runtime.Tests.WorkItemExecution

Pull Request

https://github.com/dotnet/runtime/pull/78288

Action required for the engineering services team

To triage this issue (First Responder / @dotnet/dnceng):

  • Open the failing build above and investigate
  • Add a comment explaining your findings

If this is an issue that is causing build breaks across multiple builds and would get benefit from being listed on the build analysis check, follow the next steps:

  1. Add the label “Known Build Error”
  2. Edit this issue and add an error string in the Json below that can help us match this issue with future build breaks. You should use the known issues documentation
{
   "ErrorMessage" : "[TCP tunnel] Xamarin.Hosting: Failed to connect to port",
   "BuildRetry": true
}

Additional information about the issue reported

System.Runtime.Tests are reported as failing. It looks like a failure in the reporting infrastructure. System.Runtime.Tests succeeded according to the net.dot.System.Runtime.Tests.log file:

Tests run: 50259 Passed: 50025 Inconclusive: 0 Failed: 0 Ignored: 133 Skipped: 101
Killing process 95249 as it was cancelled
[TerminateWithSuccess]

Report

Build Definition Test Pull Request
89138 dotnet/runtime System.Runtime.Tests.WorkItemExecution dotnet/runtime#78593

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 1

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 34 (34 by maintainers)

Most upvoted comments

Seems like the new fixes are already working! There are the TCP problems being detected:

image

But what is more important is that the retries happening for these cases now dropped the overall failure rate of AppleTV jobs to 0:

image

cc @AlitzelMendez @ilyas1974

Thank you. Do you think that the following is something worthy of a try?

We’ve tried variations of this with the same result, so no tbh. I think, as you pointed out, it’s an issue w/ mlaunch/xcode/osx or something within runtime. I don’t believe it’s the latter.

Anyway, these findings suggest that there may be an issue with the OS/Xcode or runtime network stack. To investigate this further and potentially repro the issue, it may be helpful to run an iOS app using the same mlaunch flow that Helix is using to run it, and collect the output over TCP locally.

When this was done in the past, no one could reproduce the issue locally. @akoeplinger suspect’s the tcp session is done over wifi as opposed to usb-max. I don’t think that’s something we’ve done locally and is worthy of a try.

Probably I didn’t fully understand your comment at first. Are you saying that if a TCP connection is not established, XHarness will not launch the app but instead hit a timeout? According to the logs, it seems that the app had started, and during its execution a timeout occured.

It happens in this order:

  1. XHarness starts, checks the app
  2. Finds a device (via mlaunch)
  3. Installs app on the device (via mlaunch)
  4. In the background, starts another instance of mlaunch to create the TCP tunnel between MacOS and the device
  5. Starts a TCP listener on the port it gave mlaunch in the previous step
  6. Waits for a signal log line from the TCP mlaunch (step 4) and when it says “TCP tunnel waiting for connections”, proceeds next.
  7. Starts the app on the phone (also mlaunch)

Now… XHarness doesn’t have visibility into the device so it doesn’t know if the app started well. The app can now crash for instance so it never connects. The app also might fail to connect to the TCP tunnel. For XHarness it’s all the same. If it doesn’t connect in a specific time (argument --launch-timeout), it times out.

Now from the point of view of the app, the app can start and as argument (or envvar) it receives a port where the tunnel should be. If it can’t connect there, it will log in it’s stdout instead of the TCP tunnel (this is the TestRunner link I sent).

There’s also this diagram for this - https://github.com/dotnet/arcade/issues/11700

To me, it seems, that in the last few months, mlaunch opens the tunnel but then has issues keeping it open. If you check the logs, it flops from “awaiting connection on port XY” to “failed to open the tunnel” again. We try to detect this state with a recent change and categorize this as TCP failure.

  • Create a sample PR that intentionally breaks the test application during startup, and update the CI to display the corresponding > error logs instead of TCP failure

When making the last change, we did give the app a wrong port purposefully to test the TCP tunnel not being there.

  • Confirm that the XHarness don’t to run an application if a TCP connection failure occurs

The problem is that the TCP tunnel starts in a while and then things can work fine. Example: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-arcade-refs-heads-release-60-ac03203fd7924eddad/zipped-apps/1/console.3dbd4769.log?helixlogtype=result

These are the main things:


### APP INSTALLATION

[15:46:40] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --list-crash-reports=/tmp/helix/working/B2D609CE/t/tmpcE9ZNU.tmp --devname DNCENGTVOS-026 -v -v -v -v -v
[15:46:40] dbug: Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:40] dbug: Xamarin.Hosting: Device discovery started
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0187767
[15:46:40] dbug: Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0178724
[15:46:40] dbug: Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:40] dbug: Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0014046
[15:46:41] dbug: Process mlaunch exited with 0

### STARTING TCP TUNNEL

[15:46:41] dbug: Starting TCP tunnel between mac port: 55899 and device port 55899
[15:46:41] dbug: [TCP tunnel]
[15:46:41] dbug: [TCP tunnel] Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app --tcp-tunnel=55899:55899 --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -v -v -v -v -v
[15:46:41] dbug: [TCP tunnel] Using Xcode 12.4 found in /Applications/Xcode124.app
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery started
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:7e:84:2cfff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (fff44:c6:5d:82:ac:e7fff)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-02 (fff44:c6:5d:7e:84:2cfff) in 00:00:00.0186957
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to Vsm-xam-appletv-01 (fff44:c6:5d:82:ac:e7fff) in 00:00:00.0155840
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Device discovery event: Connected (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b)
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Connected to DNCENGTVOS-026 (93c1ae4069a5601dc0d7e8725f9848815c5f3f7b) in 00:00:00.0011462
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Creating USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:41] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61

...

[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:45] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:46] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:46] dbug: [TCP tunnel] Please connect the device 'Tcp tunnel started on device port 55899 to host port 55899 for device 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b'...
[15:46:46] dbug: TCP tunnel created on port 55899

### WE SAW THAT TCP TUNNEL STARTED WELL
### STARTING THE APP


[15:46:46] dbug: Starting the application
[15:46:46] dbug: 
[15:46:46] dbug: Running /private/tmp/helix/working/B2D609CE/p/microsoft.dotnet.xharness.cli/1.0.0-prerelease.22327.1/tools/net6.0/any/../../../runtimes/any/native/mlaunch/bin/mlaunch --sdkroot /Applications/Xcode124.app -setenv=NUNIT_AUTOEXIT=true -setenv=NUNIT_HOSTPORT=55899 -setenv=NUNIT_ENABLE_XML_OUTPUT=true -setenv=NUNIT_XML_VERSION=xUnit -setenv=NUNIT_HOSTNAME=10.105.153.129,169.254.177.29,fe80::1%1,fe80::14ee:7dbb:3d47:cae1%4,2001:4898:9008:60:10a3:9ee6:c47a:ea01,2001:4898:9008:60:6131:89e0:ab6b:c9ac,2001:4898:9008:60:6071:4e8:3e9b:5ec1,2001:4898:9008:60:f449:e1d5:b345:31d6,2001:4898:9008:60:5dee:4dd2:58f:174,2001:4898:9008:60:bd42:319b:cd83:da0d,2001:4898:9008:60:4535:ba78:97ae:15cb,fe80::aede:48ff:fe00:1122%5,fe80::4f7:69d5:88c2:9f52%6,fe80::dcaf:aaff:fe16:42a1%10,fe80::dcaf:aaff:fe16:42a1%11,fe80::51a3:d95b:3171:8da0%17,fe80::a003:cdd2:b0b5:f493%18 --disable-memory-limits --devname 93c1ae4069a5601dc0d7e8725f9848815c5f3f7b -setenv=USE_TCP_TUNNEL=true --launchdevbundleid net.dot.System.Buffers.Tests --wait-for-exit -v -v -v -v -v


### NOW TCP TUNNEL BREAKS AGAIN

[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Failed to connect to port 55899 (23514) on device (error: 61)
[15:46:47] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 61
...

### NOW IT WORKS AGAIN AND APP CONNECTS ACTUALLY

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Attempting USB tunnel between the port 55899 on the device and the port 55899 (23514) on the mac: 0
[15:46:50] dbug: Test log server listening on: localhost:55899

###

[15:46:50] dbug: [TCP tunnel] Xamarin.Hosting: Created USB tunnel between the port 55899 on the device and the port 55899 on the mac.
[15:46:50] dbug: Connection from 127.0.0.1:55899 saving logs to /tmp/helix/working/B2D609CE/w/AA2A0964/uploads/test-tvos-device-20230411_154640.log
[15:46:50] dbug: Test execution started

### THE TEST FINISHES FINE AFTER THIS

Something must have changed in mlaunch or in MacOS but we didn’t see this behaviour ~6 months ago and running device tests was actually super smooth. I am happy to meet in the office and we can talk about this in person as it’s quite complicated.

  • I don’t understand why the TCP connection seems to be unstable
  • XHarness could print a stack trace and error message when an error occurs for better diagnostics
  • It might be possible that multiple TCP connection attempts being made within a short period of time could overload a device

It’s actually not XHarness doing the TCP but mlaunch - a tool used by XHarness and VS to talk to Apple devices/simulators. We didn’t use to have this many issues with TCP but it’s possible something regressed between mlaunch and new MacOS versions.

  • If an app never connects over TCP, XHarness shouldn’t execute tests to avoid timeouts

This happens in the TestRunner - https://github.com/dotnet/xharness/blob/389c851b0dc1d2c50d03e4aad000b7802d0ebed6/src/Microsoft.DotNet.XHarness.TestRunners.Common/iOSApplicationEntryPointBase.cs#L26

@kotlarmilos there are still issues with the tcp tunnel even though your PR may have contributed to the spike of failures.

We have agreed that we could make XHarness recognize TCP issues and return some extra exit code and then have a known build error for that case.

I don’t think we should create “known issue” for this as it might cover potential problems. I think enabling retries is the best next step. I will add open a PR for that. I checked the telemetry and there are not many APP_CRASH results (just 3 for iPhones in the last week or so).

Regardless… seems like we are not using the new iPhones though - there are 50 iPhones with no work - we should also enable tests for iPhones too probably, not just AppleTVs?

We can enable retries and see if they help. I thought we added iphones to the rolling build. I’ll definitely do that if it’s not currently.