runtime: InvalidCastException thrown after TimerQueueTimer.CallCallback

Description

After upgrading my iOS app from classic Xamarin.iOS to .NET 7, I started observing frequent InvalidCastExceptions in my crash reporting tool (AppCenter). I haven’t changed much dependency-wise, and I am unable to reproduce this locally.

Reproduction Steps

Unfortunately, I don’t know what part of my code causes this behavior. For the first stack trace, I assume that my code (or a dependency’s code) is creating a CancellationTokenSource with a delay parameter. For the second stack trace below, I assume that some call to Task.Delay is causing the crash.

Expected behavior

No crash 😃

Actual behavior

SIGABRT: Arg_InvalidCastException

System.Threading.CancellationTokenSource.TimerCallback(Object )
System.Threading.TimerQueueTimer.CallCallback(Boolean )
System.Threading.TimerQueueTimer.Fire(Boolean )
System.Threading.TimerQueue.FireNextTimers()
System.Threading.TimerQueue.System.Threading.IThreadPoolWorkItem.Execute()
System.Threading.ThreadPoolWorkQueue.DispatchItemWithAutoreleasePool(Object , Thread )

This is here: https://github.com/dotnet/runtime/blob/215b39abf947da7a40b0cb137eab4bceb24ad3e3/src/libraries/System.Private.CoreLib/src/System/Threading/CancellationTokenSource.cs#L35

SIGABRT: Arg_InvalidCastException

System.Threading.Tasks.Task.DelayPromise.TimerCallback(Object )
System.Threading.TimerQueueTimer.CallCallback(Boolean )
System.Threading.TimerQueueTimer.Fire(Boolean )
System.Threading.TimerQueue.FireNextTimers()
System.Threading.TimerQueue.System.Threading.IThreadPoolWorkItem.Execute()
System.Threading.ThreadPoolWorkQueue.DispatchItemWithAutoreleasePool(Object , Thread )

This is here: https://github.com/dotnet/runtime/blob/26b58b99aeef79c48b8e29463a1e6f1855174e1b/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs#L5632

Both crashes originate from here: https://github.com/dotnet/runtime/blob/26b58b99aeef79c48b8e29463a1e6f1855174e1b/src/libraries/System.Private.CoreLib/src/System/Threading/Timer.cs#L706

The “_state” variable is nullable, and based on the reports, the crashes are caused by the variable actually being null.

Regression?

I did not see those crashes when using “old” Xamarin.iOS.

Known Workarounds

No response

Configuration

net7-ios. The crash happens on various iPhone models and iOS versions.

Other information

No response

About this issue

  • State: open
  • Created a year ago
  • Reactions: 1
  • Comments: 44 (19 by maintainers)

Most upvoted comments

Thanks for the detailed explanation (and I missed that we changed this code relatively recently - that explains why Xamarin wasn’t affected).

I’m going to try a local build of #93006 to verify that it makes the crash go away. (It’s quite infrequent for me under Xcode, so if I don’t see it after a couple hundred launches, I’m going to consider it resolved.)

So where are we with this issue? Will we have to move to net8-ios to fix it (when it’s released) or will there be a patch to net7-ios at some point?

We have never seen this issue in the past when running on the Xamarin framework, only since our latest release, where we changed to net7-ios. In both cases we are using LLVM. Is that expected? It seems to contradict what @lateralusX is saying.

Xamarin runs on Mono branch 2020-2, and the critical changes around this race were introduced later in both mono/mono and dotnet/runtime. Running on a later mono/mono branch or on a dotnet/runtime branch will therefore hit it, but the branch used by legacy Xamarin will not. That might explain what you have been seeing. On the product where I originally hit and analyzed the issue, we didn’t see it until after upgrading to a later mono/mono commit that is not included in Mono’s 2020-2 branch.

Yes, if you have got past method_init and still see NULL values in the GOT slots used by that method, that probably means you hit the race where that thread exited its call to method_init before the stores into the needed GOT slots had happened or become visible to the other thread. This is an LLVM-only issue, meaning that it won’t reproduce when running without LLVM. When I originally investigated that issue, it caused rare random crashes that could occur at any point during the app’s lifetime, because a method calls its method_init only on the first call to the method, and it needs to race with another thread doing the same thing to expose the potential race.
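To make the ordering problem described above concrete, here is a minimal C# sketch of the general lazy-initialization pattern. It is an illustration only: LazySlots, Init and Get are hypothetical names and are not the Mono runtime’s method_init or GOT code. The assumption is that the safe ordering fills every slot first and only then publishes the "initialized" flag with a release store, so a thread that observes the flag also observes the slot contents.

```csharp
using System.Threading;

// Illustration only: hypothetical names, not the Mono runtime's
// method_init or GOT data structures.
public sealed class LazySlots
{
    private readonly object[] _slots = new object[2];
    private int _initialized; // 0 = not initialized, 1 = slots published

    // Safe ordering: fill every slot first, then publish the flag.
    // Volatile.Write is a release store, so the slot stores cannot be
    // reordered after it.
    public void Init()
    {
        // Volatile.Read is an acquire load: if it observes 1, the slot
        // stores made before the matching Volatile.Write are visible.
        if (Volatile.Read(ref _initialized) == 1)
            return;

        // Two racing threads may both get here; the stores are idempotent.
        _slots[0] = "first";
        _slots[1] = "second";

        Volatile.Write(ref _initialized, 1);
    }

    // Mirrors "a method calls its init only on first use".
    public object Get(int index)
    {
        Init();
        return _slots[index];
    }
}
```

If the flag were published before the slots were filled, a second thread could skip initialization and read a still-empty slot, which is the kind of NULL-in-a-used-slot symptom discussed above.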

This fix has been implemented and used in a downstream repo running in some large apps, installed and executed in very large quantities for over a year, and it eliminated the infrequent crashes previously seen by those apps and didn’t cause any regression (x64). It has been in dotnet/runtime main for over a year, so the fix has been around for some time, and it has been applied to both the mono/mono and dotnet/runtime repos. I believe backporting it to net7 should carry only a small risk, especially since we see issues potentially affected by it and it’s LLVM only.

I wonder if it’s worth changing your repro so it waits for the entire HTTP response before continuing, and deliberately downloading a very large file, so the cancellation token actually fires while your app is backgrounded? This would simulate a poor internet connection situation.

To answer your questions @lambdageek,

  1. We use the parameterless constructor for HttpClient, so I believe that means NSUrlSessionHandler
  2. Yes, we pass a token to the HttpClient method. Here’s the actual code we use to fire off most requests in our app:
var cts = new System.Threading.CancellationTokenSource();
cts.CancelAfter(Timeout); // Timeout is always 20 seconds
httpResponse = await httpClient.SendAsync(message, HttpCompletionOption.ResponseHeadersRead, cts.Token).ConfigureAwait(false); // 2020/11/06 Added ConfigureAwait as it was necessary on Android, see https://github.com/xamarin/xamarin-android/issues/5264

I hope this is helpful for the repro.

Not that I’m aware of, no.

As I noted earlier in this thread, I also see this crash reported for apps that do a lot of concurrent HTTP requests. While these questions are to @divil5000, I’ll try to add my input as well, as I’d really like to see this problem resolved.

  • Are you using NSUrlSessionHandler or HttpClientHandler or something custom?

I am using NSUrlSessionHandler (default constructor)

  • When you say “We only use CancellationTokenSource (directly) when configuring an HttpClient” what exactly do you mean? You mean passing a cancellation token to httpClient.GetAsync() or something else?

I am passing a cancellation token in some situations, but when the crashes happen (I sometimes experienced them myself), they seem to happen right when the HTTP request is being sent out, before the token is cancelled or request timeouts would trigger. I am pretty sure the crashes also happen with a plain HttpClient.SendAsync(), without passing a cancellation token.

What does your repro look like at the moment? We’re only seeing a few exceptions per thousands of runs of our software in the wild. I can tell you the following, though:

  • We only use CancellationTokenSource (directly) when configuring an HttpClient, so that’s where I would focus testing
  • The relative lack of complaints coming back from customers makes me think this is happening when our app has been backgrounded by iOS

If I were trying to develop a repro, I would fire off lots of HttpClient requests for large files, all with cancellation triggered after a certain timeout (we always use 20 seconds, in case that helps focus efforts), then press the Home button on the iPhone hardware in question to see if it reproduces after a little while.
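A rough sketch of what such a stress loop could look like is below. It is not a confirmed repro of the crash. TimeoutStress, RunAsync and FetchAsync are hypothetical names, and the URL, request count and timeout are whatever the caller supplies. It uses HttpCompletionOption.ResponseContentRead so each request waits for the entire body, which gives the timeout a chance to fire mid-download.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stress helper; not taken from any app in this thread.
public static class TimeoutStress
{
    // Starts requestCount concurrent GETs, each with its own
    // CancellationTokenSource that fires after `timeout`.
    // Returns how many requests ended by cancellation.
    public static async Task<int> RunAsync(
        HttpClient client, Uri url, int requestCount, TimeSpan timeout)
    {
        var pending = new List<Task<bool>>(requestCount);
        for (var i = 0; i < requestCount; i++)
            pending.Add(FetchAsync(client, url, timeout));

        bool[] completed = await Task.WhenAll(pending).ConfigureAwait(false);
        return completed.Count(ok => !ok);
    }

    // Returns true if the full response arrived, false if it was cancelled.
    // Other failures (for example HttpRequestException) propagate.
    private static async Task<bool> FetchAsync(
        HttpClient client, Uri url, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource();
        cts.CancelAfter(timeout);
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        try
        {
            // ResponseContentRead waits for the whole body, not just headers.
            using var response = await client
                .SendAsync(request, HttpCompletionOption.ResponseContentRead, cts.Token)
                .ConfigureAwait(false);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

A call such as TimeoutStress.RunAsync(new HttpClient(), largeFileUri, 50, TimeSpan.FromSeconds(20)) would mirror the 20-second timeout mentioned above; largeFileUri is a placeholder for a large download you control, and backgrounding the app still has to be done by hand on the device.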

We are also encountering this problem since “upgrading” from Xamarin to net7-ios. We have an install base of tens of thousands, and are seeing dozens of these reports coming in each day, having fully deployed our latest and greatest version. Rolling back would be extremely difficult so this puts us in an unfortunate position since the whole app crashes. Come on Microsoft, a little support?

MonoTouch: 16.5.0 iOS: 16.6.1 Hardware: iPhone14,2

System.InvalidCastException: Arg_InvalidCastException
at System.Threading.CancellationTokenSource.TimerCallback(Object )
at System.Threading.TimerQueueTimer.CallCallback(Boolean )
at System.Threading.TimerQueueTimer.Fire(Boolean )
at System.Threading.TimerQueue.FireNextTimers()
at System.Threading.TimerQueue.System.Threading.IThreadPoolWorkItem.Execute()
at System.Threading.ThreadPoolWorkQueue.DispatchItemWithAutoreleasePool(Object , Thread )
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

@durandt Can you please elaborate on how exactly you fixed this issue?

@suchithm I think I was over-optimistic. I could find which piece of code was crashing, and there was a risk that the code would sometimes have retained an old CancellationTokenSource and maybe called Cancel() on it. I fixed that, and it seems that the crashes happen less frequently in this scenario, but they still do happen.
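For readers wondering what "retaining an old CancellationTokenSource" can look like, here is one hedged way to avoid it. This is not the code from the app above, and it is not a fix for the crash in this issue. RequestScope, Restart and Stop are hypothetical names. The idea is to swap the current source atomically, so the previous one is cancelled and disposed exactly once and no stale reference is kept around.

```csharp
using System;
using System.Threading;

// Hypothetical holder that owns at most one live CancellationTokenSource.
public sealed class RequestScope
{
    private CancellationTokenSource _current; // null when nothing is active

    // Creates a new source with the given timeout, cancels and disposes
    // the previous one, and returns the new token.
    public CancellationToken Restart(TimeSpan timeout)
    {
        var next = new CancellationTokenSource();
        next.CancelAfter(timeout);
        // Take the token before publishing, so a concurrent Restart that
        // disposes `next` cannot race with reading next.Token.
        var token = next.Token;

        var previous = Interlocked.Exchange(ref _current, next);
        CancelAndDispose(previous);
        return token;
    }

    // Cancels and disposes the active source, if any.
    public void Stop()
    {
        CancelAndDispose(Interlocked.Exchange(ref _current, null));
    }

    private static void CancelAndDispose(CancellationTokenSource source)
    {
        if (source == null)
            return;
        try { source.Cancel(); }
        finally { source.Dispose(); }
    }
}
```

Because Interlocked.Exchange hands each source to exactly one caller, no code path can call Cancel() on a source that another path has already replaced and disposed.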

I have also reported this issue to the AppCenter folks as it seems this crash gets underreported. (something in the AppCenter libs may be silencing the crash, when it occurred I could not find the crashlog file on the iOS device).

@lambdageek I tested my apps on both simulator and physical device numerous times, and I was never able to observe this myself (edit: I now am, using a TestFlight app and repeatedly starting it). The crash is not super-common, but also not super-rare. I updated both my apps ~1 week ago. Since then, one of the two apps in which I am observing this has been installed on ~5000 devices with around 2000 app sessions per day, and the crash occurred 7 times (on 7 different devices). The other app is installed on ~1000 devices with around 300 app sessions per day, and the crash occurred 3 times (on 2 different devices).