firebase-ios-sdk: Firestore.terminate hangs, and never calls callback
Description
There are certain scenarios where I’ll call Firestore’s terminate function, and it will hang, never calling the callback. I haven’t been able to create a standalone reproducer, but I can consistently get my product’s code into this state (not shareable for obvious reasons).
This appears to happen when there’s an overload of Firestore document changes. I can see that the com.google.firebase.firestore queue’s target queue is com.apple.root.default-qos.overcommit, though I’m not sure what this means. It looks like the queue starts out with this value as its target queue, so it could be a red herring.
Here’s a screenshot of the backtrace of where Firestore hangs. It looks like it’s unable to close the gRPC watch stream for the RemoteStore, and is stuck on WaitUntilOffQueue.
I’m basically stuck trying to debug this to see where things go wrong and what I can do to avoid this. I’m very unfamiliar with this codebase, so I don’t even know what to look for to find where the code is getting stuck. Any help with pointers on where to look or what to check would be very appreciated.
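One way to at least detect this state, rather than wait forever, is to put a deadline on the completion callback. Below is a minimal C++ sketch of that idea. CompletesWithin and Completion are hypothetical names introduced here, not part of any Firebase API, and in an app the start function would wrap the real terminate call. The sketch only detects the hang; it does not unblock whatever is stuck.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical shared state between the caller and the completion callback.
struct Completion {
  std::mutex mutex;
  std::condition_variable cv;
  bool done = false;
};

// Hypothetical helper: starts a callback-based operation and reports whether
// its callback fired before `timeout` elapsed. `start` receives a `done`
// callback that it should invoke when the operation finishes.
bool CompletesWithin(
    const std::function<void(std::function<void()>)>& start,
    std::chrono::milliseconds timeout) {
  // shared_ptr keeps the state alive even if the callback fires after this
  // function has already returned.
  auto state = std::make_shared<Completion>();

  start([state] {
    {
      std::lock_guard<std::mutex> lock(state->mutex);
      state->done = true;
    }
    state->cv.notify_all();
  });

  std::unique_lock<std::mutex> lock(state->mutex);
  // Returns the value of the predicate once it is true or the timeout hits.
  return state->cv.wait_for(lock, timeout, [&state] { return state->done; });
}
```

A false return means the callback never fired within the deadline, which is the symptom described above.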
Reproducing the issue
Can’t share reproduction steps, as I can’t share the relevant code.
Firebase SDK Version
10.22
Xcode Version
15.3
Installation Method
CocoaPods
Firebase Product(s)
Firestore
Targeted Platforms
iOS
Relevant Log Output
There are no relevant logs.
If using Swift Package Manager, the project’s Package.resolved
None
If using CocoaPods, the project’s Podfile.lock
Can’t share
About this issue
- State: open
- Created 4 months ago
- Reactions: 1
- Comments: 28 (26 by maintainers)
@sergiocampama you deserve an award for this investigation.
This is amazing. Thank you for sharing the debugging process, it is very informative! I will bring this to our team discussion and keep you updated.
Created a gRPC++ issue here: https://github.com/grpc/grpc/issues/36115
Figured out how to enable gRPC tracing, and was able to see that we are indeed getting a RST_STREAM from the backend with code 0.
Here are the trace logs: https://gist.githubusercontent.com/sergiocampama/c7ebecd21f42ca948358cd45bc1d5a77/raw/0fd3b4e2e8ff8c5639c9c3bca48be1be9a7d8f1e/gistfile1.txt
What I was able to find is that the watch stream does indeed get a RST_STREAM message, and then we restart that stream; as you can see, there are multiple SEND_INITIAL_METADATA{:path: /google.firestore.v1.Firestore/Listen events.
What I think is happening is that, on overload, the gRPC++ batch processor continues sending messages to the backend after the stream has closed. It is not detecting that event and cancelling the subsequent messages, which generates a discardable error, but an error nonetheless. That error is then taken as a (wrong) signal that the stream is broken and needs to be closed, which in turn puts Firestore into a broken state that it cannot recover from.
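To make that hypothesis concrete, here is a toy C++ model of it. Everything in it is hypothetical (WatchStreamModel, the generation counter, the method names) and none of it mirrors Firestore’s or gRPC’s real classes. It only illustrates how a write failure that belongs to an already-closed stream could be discarded instead of being treated as a failure of the restarted stream.

```cpp
// Toy model only; not Firestore's or gRPC's actual implementation.
class WatchStreamModel {
 public:
  int generation() const { return generation_; }
  bool healthy() const { return healthy_; }

  // The backend reset the stream (RST_STREAM), so a new stream is opened.
  // Writes still queued for the previous generation are now stale.
  void RestartAfterReset() {
    ++generation_;
    healthy_ = true;
  }

  // Reports a failed write that was issued on `write_generation`.
  // Returns true if the failure was discarded as stale.
  bool OnWriteFailed(int write_generation) {
    if (write_generation != generation_) {
      return true;  // Failure belongs to a closed stream: discardable.
    }
    healthy_ = false;  // Failure on the live stream: a real problem.
    return false;
  }

 private:
  int generation_ = 0;
  bool healthy_ = true;
};
```

In this model, a late failure from the old stream leaves the restarted stream healthy, whereas treating every write failure as fatal would mark the new stream broken as well.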
Just tested that I could continue to use the write stream by creating new Firestore documents after getting the “operation of type 2 failed” message, so this is specifically about Datastore’s WatchStream getting completely stuck when it fails to write a message, and Firestore failing to recover from this state even after attempting to terminate the Firestore instance. Hopefully this is enough information for you to create a test that simulates this condition (just the failure to write on the watch stream) to reproduce it. Ideally, we should be able to find the root cause of the write failure, but sans that, allowing Firestore to recover from this state without requiring an app restart should be the higher priority IMO.