firebase-admin-node: All firestore queries hang for almost exactly 1 hour
[REQUIRED] Step 2: Describe your environment
- Operating System version: Alpine 3.16
- Firebase SDK version: 12.0.0
- Firebase Product: Firestore (we also use Auth)
- Node.js version: v18.11.0, also observed on v20.11.1
- NPM version: 8.19.2
[REQUIRED] Step 3: Describe the problem
Periodically, reads from Firestore via the Admin SDK hang for an extended period. We’ve isolated the hang to reading a single, fairly small document out of a particular collection, which happens on every request; the document simply doesn’t come back. Once a process gets into this state, it stays “stuck” for almost exactly either 55 minutes or an hour. Auth succeeds, but subsequent reads from Firestore hang. Then, with no user intervention required, it just starts working properly again.
Steps to reproduce:
We’ve been unable to identify consistent steps to reproduce, but we believe the trigger is simply concurrent auths, database reads, and possibly a small number of database writes (to different docs) for multiple users at the same time.
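Purely as an illustration of the kind of concurrency we mean (the user count, uids, and doc contents below are made up; the real traffic is per-user HTTP requests using the flow shown under "Relevant Code" further down):

// Hypothetical repro sketch: N simulated users doing concurrent reads of small
// per-user docs, plus a few concurrent writes to different docs. NUM_USERS and
// the uids are invented for illustration only.
async function simulateConcurrentLoad() {
  const NUM_USERS = 20;
  await Promise.all(
    Array.from({ length: NUM_USERS }, async (_, i) => {
      const uid = `test-user-${i}`;
      // concurrent reads
      await admin.firestore().collection("users").doc(uid).get();
      // a small number of concurrent writes, each to a different doc
      if (i % 5 === 0) {
        await admin.firestore().collection("orgs").doc().set({ createdBy: uid });
      }
    })
  );
}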
We were able to reproduce it with the following environment variables set (though we still don’t know exactly what triggered it):
GRPC_VERBOSITY=debug
GRPC_TRACE=all
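For anyone else trying to capture this, these can presumably also be set from inside the process instead of the deployment environment; a minimal sketch, assuming they are set before @grpc/grpc-js is first loaded (firebase-admin pulls it in via @google-cloud/firestore, so load order matters):

// Assumption: grpc-js reads these variables when the module is loaded, so set
// them before requiring firebase-admin rather than afterwards.
process.env.GRPC_VERBOSITY = "debug";
process.env.GRPC_TRACE = "all";

const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });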
I notice that in the bad state, every query ends up logging

load_balancing_call | [8682] Pick result: QUEUE subchannel: null status: undefined undefined

whereas in the happy state we see Pick result: COMPLETE, a subchannel ID, and an IP address.
I’ve uploaded two log files: the first runs from shortly before the bad period started to shortly after it began, and the second from shortly before things started to recover to shortly after. I find the timing very suspicious; almost exactly one hour seems too good to be true. There’s a lot of logging going on, and pulling it all from GCP and stripping out our app logs isn’t a great experience, so I didn’t upload the full hour of output, but these files should be fairly representative. Key timestamps:
- Last successful read: 2024-03-13T20:42:24.614Z (first file)
- First hanging read: 2024-03-13T20:42:25.901Z (first file)
- Change in behavior: 2024-03-13T21:37:12.490Z (second file) - a bunch of what looks like connection cleanup happens here
- Things fully resolve: ~2024-03-13T21:42:??? (we observed the application acting healthy by this point, but it could have been 5 minutes earlier, as we’d stopped paying careful attention by then)

Attachments: bad-times-begin.txt, bad-times-end.txt
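If anyone needs to capture this again with less noise, grpc-js also accepts a comma-separated list of tracer names in GRPC_TRACE instead of all. The exact tracer names vary by @grpc/grpc-js version; the ones below are guesses based on the prefixes visible in our own debug output (e.g. load_balancing_call), so treat them as assumptions:

// Hedged sketch: limit gRPC tracing to the load-balancing/subchannel machinery
// rather than GRPC_TRACE=all, to keep the log volume manageable.
process.env.GRPC_VERBOSITY = "debug";
process.env.GRPC_TRACE = "load_balancing_call,subchannel,pick_first,connectivity_state";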
Another thing we’ve noticed, and which I have a hard time believing is more than a coincidence, is that this generally happens at roughly 45 minutes past the hour.
When this occurs, we see a DEADLINE_EXCEEDED from any writes we’re doing concurrently (creating a doc in a collection).
Relevant Code:
With some abstractions removed, the flow is:
// once
const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });
// every request: exchange the user's refresh token for an ID token
// (`Post` is one of our HTTP abstractions)
const idToken = await Post(`https://securetoken.googleapis.com/v1/token?key=${apiKey}`, { refresh_token, grant_type: "refresh_token" })
  .then(response => response.data.id_token);
const decodedToken = await admin.auth().verifyIdToken(idToken);

// hangs here
const userDoc = await admin.firestore().collection("users").doc(decodedToken.uid).get()
  .then(doc => doc.data());
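As an aside, one way to surface the hang instead of blocking the request indefinitely is to race the read against a timer; a minimal sketch (the 15-second limit and the helper name are arbitrary, not anything from the SDK):

// Hypothetical helper: reject if the Firestore read takes longer than `ms`,
// so the hang shows up as a loggable error rather than a silent stall.
// Note: the underlying get() keeps running; this only bounds how long we wait.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const userDoc = await withTimeout(
  admin.firestore().collection("users").doc(decodedToken.uid).get(),
  15000,
  "users doc read"
).then(doc => doc.data());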
When the process is in the hanging state, the following write triggers a DEADLINE_EXCEEDED:
const doc = admin.firestore().collection("orgs").doc();
// triggers an error with {"code":4,"details":"Deadline exceeded","metadata":{},"note":"Exception occurred in retry method that was not classified as transient"}
const result = await doc.set({/* some data */}).then(() => doc.id);
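For completeness, this is roughly how that error could be caught and logged for correlation with the gRPC trace output; the field names are just what the thrown error exposes in the message above, and the handler itself is only a sketch:

// Hedged sketch: log the structured gRPC error fields alongside a timestamp so
// the failure can be lined up against the GRPC_TRACE logs.
try {
  await doc.set({ /* some data */ });
} catch (err) {
  console.error("Firestore write failed", {
    at: new Date().toISOString(),
    code: err.code,        // 4 === DEADLINE_EXCEEDED
    details: err.details,  // "Deadline exceeded"
    note: err.note,
  });
  throw err;
}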
About this issue
- State: open
- Created 4 months ago
- Reactions: 2
- Comments: 17
Yes! (also internally tracked b/331664537)
Just adding to Jacob’s comments above: is there a better way to pinpoint what may be hitting a platform limit? Some kind of telemetry, or a log?
We are able to recreate this every few hours, but on only one of our 6 Kubernetes replicas…
Also, there are some reasons I don’t think it’s related to the other issues you mentioned, Ehsan.
Any further ideas…?
Hmm… I see a lot of “deadline exceeded” errors. What does your usage pattern look like? Are you reaching the maximum writes per second?
Also, please check this and this.