firebase-ios-sdk: Firestore.terminate hangs, and never calls callback

Description

There are certain scenarios where I call Firestore’s terminate function and it hangs, never calling the callback. I haven’t been able to create a standalone reproducer, but I can consistently get my product’s code into this state (the code is not shareable, for obvious reasons).

This appears to happen when there’s an overload of Firestore document changes. I can see that the com.google.firebase.firestore queue’s target queue is com.apple.root.default-qos.overcommit, though I’m not sure what this means. The queue appears to start out with this value set as its target queue, so it could be a red herring.

The backtrace of where Firestore hangs suggests that it’s unable to close the gRPC watch stream for the RemoteStore and is stuck on WaitUntilOffQueue.

I’m basically stuck trying to debug this to see where things go wrong and what I can do to avoid this. I’m very unfamiliar with this codebase, so I don’t even know what to look for to find where the code is getting stuck. Any help with pointers on where to look or what to check would be very appreciated.
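Until the root cause is known, one way to at least detect the hang is to put a watchdog around the callback-style shutdown call. The following is a minimal, language-neutral sketch in C++, not Firestore code: `AsyncShutdown`, `Completion`, and `ShutdownWithWatchdog` are hypothetical names, and the real terminate call would be wrapped in the `shutdown` argument. It only reports that the callback did not fire in time; it does not unblock the stuck queue.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>

// Hypothetical stand-ins for a callback-style shutdown such as terminate.
using Completion = std::function<void()>;
using AsyncShutdown = std::function<void(Completion)>;

enum class ShutdownResult { kCompleted, kTimedOut };

// Starts `shutdown` and waits at most `timeout` for its completion callback.
// Assumes the callback is invoked at most once.
ShutdownResult ShutdownWithWatchdog(const AsyncShutdown& shutdown,
                                    std::chrono::milliseconds timeout) {
  // shared_ptr keeps the promise alive if the callback fires after we return.
  auto done = std::make_shared<std::promise<void>>();
  std::future<void> finished = done->get_future();

  shutdown([done] { done->set_value(); });

  return finished.wait_for(timeout) == std::future_status::ready
             ? ShutdownResult::kCompleted
             : ShutdownResult::kTimedOut;
}
```

A caller that gets `kTimedOut` can log the event and capture a backtrace at that moment, which is when the stuck WaitUntilOffQueue frame should be visible.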

Reproducing the issue

Can’t share reproduction steps, as I can’t share the relevant code.

Firebase SDK Version

10.22

Xcode Version

15.3

Installation Method

CocoaPods

Firebase Product(s)

Firestore

Targeted Platforms

iOS

Relevant Log Output

There are no relevant logs.

If using Swift Package Manager, the project’s Package.resolved

None

If using CocoaPods, the project’s Podfile.lock

Can’t share

About this issue

State: open
Created 4 months ago
Reactions: 1
Comments: 28 (26 by maintainers)

Most upvoted comments

@sergiocampama you deserve an award for this investigation.

ehsannas on Mar 14, 2024

This is amazing. Thank you for sharing the debugging process, it is very informative! I will bring this to our team discussion and keep you updated.

milaGGL on Mar 14, 2024

Created a gRPC++ issue here: https://github.com/grpc/grpc/issues/36115

sergiocampama on Mar 13, 2024

Figured out how to enable gRPC tracing, and was able to see that we are indeed getting a RST_STREAM from the backend with code 0.

Here are the trace logs: https://gist.githubusercontent.com/sergiocampama/c7ebecd21f42ca948358cd45bc1d5a77/raw/0fd3b4e2e8ff8c5639c9c3bca48be1be9a7d8f1e/gistfile1.txt

What I found is that the watch stream does indeed get a RST_STREAM message, after which we restart that stream: the trace shows multiple SEND_INITIAL_METADATA{:path: /google.firestore.v1.Firestore/Listen events.

What I think is happening is that, on overload, the gRPC++ batch processor continues sending messages to the backend after the stream has closed; it doesn’t detect that event and cancel the subsequent messages. That generates a discardable error, but an error nonetheless, which is taken as a (wrong) signal that the stream is broken and needs to be closed. That in turn puts Firestore into a broken state that it cannot recover from.
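To make the hypothesis concrete, here is a toy model of it. None of these types exist in gRPC or Firestore; `ToyStream` and `FailuresAfterReset` are made up for illustration, and the model only shows how writes queued before a reset can surface as failures afterwards unless they are cancelled with the stream.

```cpp
#include <deque>
#include <string>
#include <utility>

// Toy model only: not a gRPC or Firestore type.
struct ToyStream {
  bool open = true;
  std::deque<std::string> pending;  // writes queued by the batch processor
  int failed_writes = 0;            // write errors surfaced to the owner

  void Enqueue(std::string msg) { pending.push_back(std::move(msg)); }

  // Models the backend's RST_STREAM arriving. When `cancel_pending` is true,
  // queued writes are dropped together with the stream.
  void OnRemoteReset(bool cancel_pending) {
    open = false;
    if (cancel_pending) pending.clear();
  }

  // Drains the queue; each write attempted on a closed stream is a failure.
  void Flush() {
    while (!pending.empty()) {
      if (!open) ++failed_writes;
      pending.pop_front();
    }
  }
};

// Queues `queued` writes, resets the stream, then flushes. Returns how many
// write failures the owner would see.
int FailuresAfterReset(int queued, bool cancel_pending) {
  ToyStream stream;
  for (int i = 0; i < queued; ++i) stream.Enqueue("listen-request");
  stream.OnRemoteReset(cancel_pending);
  stream.Flush();
  return stream.failed_writes;
}
```

In this model, `FailuresAfterReset(3, false)` surfaces three failures and `FailuresAfterReset(3, true)` surfaces none, which is the difference between the owner seeing spurious "broken stream" signals and seeing nothing.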

sergiocampama on Mar 13, 2024

Just tested that I could continue to use the write stream by creating new Firestore documents after getting the operation of type 2 failed, so this is specifically about Datastore’s WatchStream getting completely stuck when it fails to write a message, and Firestore failing to recover from this state even after attempting to terminate the Firestore instance. Hopefully this is enough information for you to create a test that simulates this condition (just the failure to write on the watch stream) and reproduces it.

Ideally, we should be able to find the root cause of the write failure, but failing that, allowing Firestore to recover from this state without requiring an app restart should be the higher priority, IMO.
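As a rough idea of the kind of test I mean, here is a sketch using a fake stream with an injectable write failure. `FakeWatchStream` and `WriteWithRecovery` are hypothetical names and not part of Firestore's test suite; the real WatchStream is considerably more involved, so this only shows the shape of the check (a failed write leaves the stream stuck until something restarts it).

```cpp
#include <string>

// Hypothetical test double; not a Firestore or gRPC type.
class FakeWatchStream {
 public:
  // Arms a single injected failure for the next Write call.
  void FailNextWrite() { fail_next_write_ = true; }

  // Returns false once a write has failed, and keeps failing until Restart.
  bool Write(const std::string& /*message*/) {
    if (fail_next_write_) {
      fail_next_write_ = false;
      broken_ = true;
    }
    return !broken_;
  }

  // Models tearing the stream down and opening a fresh one.
  void Restart() {
    broken_ = false;
    ++restarts_;
  }

  bool broken() const { return broken_; }
  int restarts() const { return restarts_; }

 private:
  bool fail_next_write_ = false;
  bool broken_ = false;
  int restarts_ = 0;
};

// Writes `message`; on failure, restarts the stream once and retries.
bool WriteWithRecovery(FakeWatchStream& stream, const std::string& message) {
  if (stream.Write(message)) return true;
  stream.Restart();
  return stream.Write(message);
}
```

A test built this way would inject one write failure, confirm that later plain writes keep failing, and then confirm that a recovery path brings the stream back without recreating everything around it.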

sergiocampama on Mar 13, 2024