firebase-admin-node: All firestore queries hang for almost exactly 1 hour
[REQUIRED] Step 2: Describe your environment
- Operating System version: Alpine 3.16
- Firebase SDK version: 12.0.0
- Firebase Product: Firestore (we also use Auth)
- Node.js version: v18.11.0, also observed on v20.11.1
- NPM version: 8.19.2
[REQUIRED] Step 3: Describe the problem
Periodically, reads from Firestore via the Admin SDK hang for an extended period. We’ve isolated the hang to reading a single, fairly small document out of a particular collection, which happens on every request; the document simply doesn’t come back. Once a process gets into this state, it stays “stuck” for almost exactly either 55 minutes or an hour. Auth succeeds, but subsequent reads from Firestore hang. Then, with no user intervention required, it just starts working properly again.
Steps to reproduce:
We’ve been unable to identify consistent steps to reproduce, but we believe the trigger is simply concurrent auths, database reads, and possibly a small number of database writes (to different docs) for multiple users at the same time.
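Purely as an illustration of the kind of concurrency we mean (the user count, uids, and doc contents below are made up; the real traffic is per-user HTTP requests using the flow shown under "Relevant Code" further down):

// Hypothetical repro sketch: N simulated users doing concurrent reads of small
// per-user docs, plus a few concurrent writes to different docs. NUM_USERS and
// the uids are invented for illustration only.
async function simulateConcurrentLoad() {
  const NUM_USERS = 20;
  await Promise.all(
    Array.from({ length: NUM_USERS }, async (_, i) => {
      const uid = `test-user-${i}`;
      // concurrent reads
      await admin.firestore().collection("users").doc(uid).get();
      // a small number of concurrent writes, each to a different doc
      if (i % 5 === 0) {
        await admin.firestore().collection("orgs").doc().set({ createdBy: uid });
      }
    })
  );
}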
We were able to reproduce it with the following environment variables set (though we still don’t know exactly what triggered it):
GRPC_VERBOSITY=debug
GRPC_TRACE=all
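For anyone else trying to capture this, these can presumably also be set from inside the process instead of the deployment environment; a minimal sketch, assuming they are set before @grpc/grpc-js is first loaded (firebase-admin pulls it in via @google-cloud/firestore, so load order matters):

// Assumption: grpc-js reads these variables when the module is loaded, so set
// them before requiring firebase-admin rather than afterwards.
process.env.GRPC_VERBOSITY = "debug";
process.env.GRPC_TRACE = "all";

const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });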
I notice that in the bad state, every query ends up logging

load_balancing_call | [8682] Pick result: QUEUE subchannel: null status: undefined undefined

whereas in the happy state we see Pick result: COMPLETE, a subchannel ID, and an IP address.
I’ve uploaded two log files: the first runs from shortly before the bad period started to shortly after it began, and the second from shortly before things started to recover to shortly after. I find the timing very suspicious; almost exactly one hour seems too good to be true. There’s a lot of logging going on, and pulling it all from GCP and stripping out our app logs isn’t a great experience, so I didn’t upload the full hour of output, but these files should be fairly representative. Key timestamps:
- Last successful read: 2024-03-13T20:42:24.614Z (first file)
- First hanging read: 2024-03-13T20:42:25.901Z (first file)
- Change in behavior: 2024-03-13T21:37:12.490Z (second file) - a bunch of what looks like connection cleanup happens here
- Things fully resolve: ~2024-03-13T21:42:??? (we observed the application acting healthy by this point, but it could have been 5 minutes earlier, as we’d stopped paying careful attention by then)

Attachments: bad-times-begin.txt, bad-times-end.txt
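If anyone needs to capture this again with less noise, grpc-js also accepts a comma-separated list of tracer names in GRPC_TRACE instead of all. The exact tracer names vary by @grpc/grpc-js version; the ones below are guesses based on the prefixes visible in our own debug output (e.g. load_balancing_call), so treat them as assumptions:

// Hedged sketch: limit gRPC tracing to the load-balancing/subchannel machinery
// rather than GRPC_TRACE=all, to keep the log volume manageable.
process.env.GRPC_VERBOSITY = "debug";
process.env.GRPC_TRACE = "load_balancing_call,subchannel,pick_first,connectivity_state";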
Another thing we’ve noticed, and which I have a hard time believing is more than a coincidence, is that this generally happens at roughly 45 minutes past the hour.
When this occurs, we see a DEADLINE_EXCEEDED from any writes we’re doing concurrently (creating a doc in a collection).
Relevant Code:
With some abstractions removed, the flow is:
// once
const admin = require("firebase-admin");
admin.initializeApp({ credential: admin.credential.applicationDefault() });
// every request: exchange the user's refresh token for an ID token
// (`Post` is one of our HTTP abstractions)
const idToken = await Post(`https://securetoken.googleapis.com/v1/token?key=${apiKey}`, { refresh_token, grant_type: "refresh_token" })
  .then(response => response.data.id_token);
const decodedToken = await admin.auth().verifyIdToken(idToken);

// hangs here
const userDoc = await admin.firestore().collection("users").doc(decodedToken.uid).get()
  .then(doc => doc.data());
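As an aside, one way to surface the hang instead of blocking the request indefinitely is to race the read against a timer; a minimal sketch (the 15-second limit and the helper name are arbitrary, not anything from the SDK):

// Hypothetical helper: reject if the Firestore read takes longer than `ms`,
// so the hang shows up as a loggable error rather than a silent stall.
// Note: the underlying get() keeps running; this only bounds how long we wait.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const userDoc = await withTimeout(
  admin.firestore().collection("users").doc(decodedToken.uid).get(),
  15000,
  "users doc read"
).then(doc => doc.data());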
When the process is in the hanging state, the following write triggers a DEADLINE_EXCEEDED:
const doc = admin.firestore().collection("orgs").doc();
// triggers an error with {"code":4,"details":"Deadline exceeded","metadata":{},"note":"Exception occurred in retry method that was not classified as transient"}
const result = await doc.set({/* some data */}).then(() => doc.id);
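For completeness, this is roughly how that error could be caught and logged for correlation with the gRPC trace output; the field names are just what the thrown error exposes in the message above, and the handler itself is only a sketch:

// Hedged sketch: log the structured gRPC error fields alongside a timestamp so
// the failure can be lined up against the GRPC_TRACE logs.
try {
  await doc.set({ /* some data */ });
} catch (err) {
  console.error("Firestore write failed", {
    at: new Date().toISOString(),
    code: err.code,        // 4 === DEADLINE_EXCEEDED
    details: err.details,  // "Deadline exceeded"
    note: err.note,
  });
  throw err;
}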
About this issue
- State: open
- Created 4 months ago
- Reactions: 2
- Comments: 17
Yes! (also internally tracked b/331664537)
Just adding to Jacob’s comments above: is there a better way to pinpoint what may be hitting a platform limit? Some kind of telemetry, or a log?
We are able to recreate this every few hours, but on only one of our 6 Kubernetes replicas…
Also, there are some reasons I don’t think it’s related to the other issues you mentioned, Ehsan.
Any further ideas…?
Hmm… I see a lot of “deadline exceeded” errors. What does your usage pattern look like? Are you reaching the maximum writes per second?
Also, please check this and this.