firebase-functions: Exception occurred in retry method that was not classified as transient

Related issues

https://github.com/firebase/firebase-functions/issues/522

[REQUIRED] Version info

  "dependencies": {
    "@google-cloud/firestore": "^2.2.1",
    "firebase-admin": "^8.2.0",
    "firebase-functions": "^3.0.1",
  },
  "engines": {
    "node": "8"
  }

node: 8

firebase-functions: 3.0.1

firebase-tools: 7.0.1

firebase-admin: 8.2.0

Steps to reproduce

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();
db.settings({ timestampsInSnapshots: true });


const users = db.collection('users');

users.doc('myUserId').update({ fieldToUpdate: 'newValue' })

The update method throws this error:

{ Error
    at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:101:45)
    at emitOne (events.js:121:20)
    at Http2CallStream.emit (events.js:211:7)
    at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22)
    at _combinedTickCallback (internal/process/next_tick.js:132:7)
    at process._tickDomainCallback (internal/process/next_tick.js:219:9)
  code: 13,
  details: '',
  metadata: Metadata { options: undefined, internalRepr: Map {} },
  note: 'Exception occurred in retry method that was not classified as transient' }

Were you able to successfully deploy your functions?

The deployment displays no errors.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 81
  • Comments: 164 (6 by maintainers)

Most upvoted comments

I am having the exact same problem

Our backend team believes that they know what the root cause is, but it might take quite a while for the issue to be fixed in all production environments.

For everyone: One of our engineers has narrowed down a behavior change between GRPC and GRPC-JS, which may be related to the issue you are seeing. We don’t have a fix yet and will keep this issue updated.

Why do I need to “warm up” a function?! What is it, a car engine?

I have to add that, in my experience, the problem is also present with HTTP calls and not only with triggers. It seems to be especially present when an instance/function has gone to sleep and then has a cold start.

Any update? I am facing the same issue.

I’m having the same problem. I’m receiving this error.

Error at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:96:45) at emitOne (events.js:121:20) at Http2CallStream.emit (events.js:211:7) at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22) at _combinedTickCallback (internal/process/next_tick.js:132:7) at process._tickDomainCallback (internal/process/next_tick.js:219:9)

Mine seems to happen randomly in onCreate and onWrite functions. The functions with this error are triggered daily and the error has only occurred once; I’ve had them fire multiple times after the error occurred and it has yet to return. These errors started appearing once I updated firebase-functions from version 2.3.0 to 3.0.1 and firebase-admin from 7.0.0 to 8.0.0.

Having same error

does this occur on every function invocation?

No. [The following statement might be wrong:] this error happens only when an idle function is invoked after a longer period of time.

Is this only happening for a subset of projects/developers?
I don’t understand why this is not a bigger issue for people. This is still happening every 2-3 days for me. I am just about to start another project that requires accurate aggregation for availability, and I am not confident this can work in Firestore.

Thanks for the update @sambecker seems like there is not a lot of point in me upgrading yet.

As of now, we don’t believe that this is a project specific issue. Our current theory is that this is related to how the GRPC library handles connections that have become unresponsive.

Also started getting these recently 😦

I am getting it with Pub/Sub functions. I also think it is related to cold starts

We are seeing the problem periodically too - a fix would be great.

"firebase-admin": "^8.4.0",
"firebase-functions": "^3.2.0",

Hi there, it seems like the original bug has been fixed and we’re getting more reports of possibly more than one bug. Collecting them in the wrong old issue can hurt our ability to triage and get your issues taken care of, so I’m going to close it and let you open new bugs that can be resolved with more specific conversations.

The original bug was due to a networking issue that can happen when a server is idle: the connection gets reset and the next request may fail. Originally this was a problem with the gRPC library because it wasn’t handling a clean connection reset. This problem also happens more generally when the FIN packet isn’t sent across the internet due to a number of reasons involving performance and security. If the library isn’t already aware the connection is invalid (e.g. the FIN packet was dropped or the library isn’t handling FIN correctly) the next request will fail. Thanks to the Two Generals Problem, it’s impossible to know if a request failed before or after the server got your request. The library can retry if it knows the request is idempotent (e.g. GET) but it can’t necessarily retry if the request isn’t (e.g. POST). Fortunately, you might know that your code is idempotent. In fact, our guidance is that all cloud functions should be idempotent because you may get more than one invocation. So a retry at the application level should be safe.

Normally you can retry with a simple try/catch. Diving through some of this bug and the internal ones as well, it looks like you couldn’t always catch an error in the gRPC library. If that’s (still) the case, it’s an issue and someone should file a new bug against the gRPC repo or possibly gax-nodejs that exceptions cannot be caught. A foolproof way to handle exceptions anywhere in your codebase is to turn on retries in your functions. This adds a risk that a crash loop will cause indefinite executions, so you’ll need to find some way to drop events-of-death.
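
For what it’s worth, here is a minimal sketch of what such an application-level retry can look like for an idempotent Firestore write; the helper name, attempt count, and backoff values below are made up for illustration and are not part of any Firebase API:

import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

// Hypothetical helper: retry an idempotent operation a few times with exponential backoff.
async function withRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      // wait 200 ms, 400 ms, 800 ms, ... before trying again
      await new Promise((resolve) => setTimeout(resolve, 200 * 2 ** i));
    }
  }
  throw lastErr;
}

// The update below is idempotent, so retrying it at the application level is safe.
export function markProcessed(userId: string) {
  return withRetry(() =>
    db.collection('users').doc(userId).update({ processed: true })
  );
}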

I can guarantee you that the event type you’re listening to has no impact on this issue. It’s happening because your function was idle this whole time and we didn’t garbage collect the container so that you could avoid a cold start. Crashes popping up in the Firestore/Datastore library should probably be filed against those SDKs (nodejs-firestore and nodejs-datastore). If you get an obvious networking error, you could also consider filing a bug against the gRPC library instead. You can of course file a bug against this repo as a starting point, but you just might have a slower response as we find the right people and move your bug to the right location. You’re our customers and we care about your experience; this repo just isn’t where the exception lies, so it’s not where the fix will come from.

I cannot tell you how many days I burned on this issue. In my case, I left a curly bracket out of a document address.

.document('a/{docA}/chat/{docB') // missing final bracket

The error was occurring in a totally different function.

We are still trying to isolate the root cause. We do hear you loud and clear though. Sorry for the trouble and sluggish responses.

Any progress on that? This is a really severe issue ;/

As per docs https://firebase.google.com/docs/functions/retries

Cloud Functions guarantees at-least-once execution of a background function for each event emitted by an event source.

Which is no longer true. On my backend, this leads to more and more inconsistent data, as I’m getting random errors from triggered functions that would normally run without any problems 😦

Has anyone tried enabling retries of a function to defend against this error? Will that work for system-level errors?

@damienix, @bottleneck-admin, @jaycosaur, @spoxies, @lamstutz:

Would you mind sending your project IDs and an approximate time window for these errors (including your timezone) to samstern@google.com? Thanks

The issue arises randomly for me. I’m seeing it once/twice a month (out of ~ 10k in a month).

This hits us hard too. The good news is that, having been around since the betas, we concluded early on that all Cloud Function triggers must run through a work queue: for each function invocation we write a secondary document, which we listen to and delete once the function run is done.

This way, to check whether all functions have run successfully, one can simply check whether the work queue still contains any documents. If it does, update the document and the work is rerun.

Any other way has proven to be too unreliable so far.
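
A minimal sketch of that pattern, with made-up collection and function names (the actual processing is left as a stub):

import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();

// Stub for the real work done for one queue entry.
async function processJob(data: FirebaseFirestore.DocumentData): Promise<void> {
  // ... actual work ...
}

// Fires for every new work queue entry; the entry is deleted only after the work
// succeeds, so any document left in the collection marks work that must be rerun.
export const onWorkQueued = functions.firestore
  .document('work-queue/{jobId}')
  .onCreate(async (snap) => {
    await processJob(snap.data());
    await snap.ref.delete();
  });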

If you keep your function hot using the cron scheduler then you won’t get this error anymore. It’s easy if you are using an HTTP function or Pub/Sub. For HTTP, use the cron scheduler to send a DELETE call to the endpoint; check the method of the incoming request and, if it’s DELETE, just return and end the function. I am waking my function every minute.
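
On the function side that looks roughly like the sketch below (the endpoint name is made up, and the Cloud Scheduler job that sends the DELETE is configured separately):

import * as functions from 'firebase-functions';

export const myEndpoint = functions.https.onRequest(async (req, res) => {
  // The scheduler hits this endpoint with DELETE once a minute purely to keep it warm.
  if (req.method === 'DELETE') {
    res.status(200).send('warm');
    return;
  }

  // ... normal request handling ...
  res.status(200).send('ok');
});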

I’ve seen this the last two days in functions triggered on asia-northeast1. First execution fails and then all subsequent executions run correctly.

Hi, same problem here with an onUpdate function; the problem occurs randomly…

at Http2CallStream.call.on (/srv/node_modules/@grpc/grpc-js/build/src/client.js:101:45) at emitOne (events.js:121:20) at Http2CallStream.emit (events.js:211:7) at process.nextTick (/srv/node_modules/@grpc/grpc-js/build/src/call-stream.js:71:22) at _combinedTickCallback (internal/process/next_tick.js:132:7) at process._tickDomainCallback (internal/process/next_tick.js:219:9)

Just thought I’d give some input here. My first instance of this error was on the 15th of July and I now get it regularly (but not consistently) across all our functions.

We have a logging system implemented on our functions that essentially tells us whether a function was cold started or not (we ping them every minute to keep them warm), and it has been running since 2017, so I have a lot of logs on this. Prior to the 15th of July (i.e. when these errors started happening to me), cloud functions would delete themselves at approximately 3-5 minute intervals from first creation, making the next invocation a cold start. Since the 15th of July this has increased substantially to greater than 5 hours(!!!), and today we saw a function stay warm for 28 hours (causing a lot of issues for our caching). My guess would be that a connection that previously only had to live for a short time is now having to cope with these much, much longer alive periods.

Now, unfortunately, we do not ping over the weekends for cost-reduction reasons, but on the 12th (and for the last 12+ months) it was cold starting every 3-5 minutes, and on the 15th it no longer cold starts for 5+ hours. If this is a new ‘feature’ of cloud functions it is amazing, by the way! It almost means functions never have to hit cold starts if the keep-warm invocations are done right.

I ran into this problem when working with Firestore Point-in-time recovery (PITR) (which is an awesome beta feature! 🎉).

For me, the solution was to specify a timestamp in the transaction that actually resolves to a whole hour exactly, i.e.

const q = firestore.collectionGroup("trips");
const querySnapshot = await firestore.runTransaction(
  (t) => t.get(q),
  { readOnly: true, readTime: new Timestamp(1696827600, 0) }
);

✅ works, but

const q = firestore.collectionGroup("trips");
const querySnapshot = await firestore.runTransaction(
  (t) => t.get(q),
  { readOnly: true, readTime: new Timestamp(1696827601, 0) }
);

❌ fails.

This is not mentioned in the docs, will leave a comment there. ⛑️
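
If the whole-hour restriction above is real, one way to stay on the safe side is to round the desired read time down to the hour before passing it in. This is only a sketch based on the observation above, not documented behaviour:

import { Firestore, Timestamp } from '@google-cloud/firestore';

const firestore = new Firestore();

async function queryTripsAtWholeHour() {
  // Round the current time down to a whole hour, in seconds.
  const wholeHourSeconds = Math.floor(Date.now() / 1000 / 3600) * 3600;

  const q = firestore.collectionGroup('trips');
  return firestore.runTransaction(
    (t) => t.get(q),
    { readOnly: true, readTime: new Timestamp(wholeHourSeconds, 0) }
  );
}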

For me I was seeing this error as well as “No connection” and “Deadline exceeded” when performing large numbers of writes individually. Refactoring to use batched writes solved it for me.

I started hitting this error today. Not even on a function, just on a nodejs script with firebase-admin I was running against firestore to do some DB updates.

@bcoe for visibility.

FYI: I will note my experience here for the record.

What may be causing error code 13

  • Cold start
  • onCall
  • Batch set with FieldValue.increment
  batch.set(
    ref,
    {
      rating: FieldValue.increment(10),
      rated: FieldValue.increment(1)
    },
    { merge: true }
  )
  • a lot of async/await
  • a certain number of imports

Workaround

  • Pre-call the next function to warm it up.
  • Avoid using FieldValue.increment; do normal reads and writes where possible (see the sketch after this list).
  • Don’t use async/await that much.
  • Watch out for unnecessary imports from files that have unrelated imports of their own.
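
For the FieldValue.increment point, the usual “plain read and write” alternative is a transaction; a minimal sketch with made-up collection and field names:

import * as admin from 'firebase-admin';

const db = admin.firestore(); // assumes admin.initializeApp() has run elsewhere

// Increment counters with an explicit read-then-write instead of FieldValue.increment.
async function addRating(docId: string, stars: number): Promise<void> {
  const ref = db.collection('ratings').doc(docId);
  await db.runTransaction(async (t) => {
    const snap = await t.get(ref);
    const current = snap.data() ?? { rating: 0, rated: 0 };
    t.set(ref, { rating: current.rating + stars, rated: current.rated + 1 }, { merge: true });
  });
}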

I am still facing this in local development on the latest release and really hope it will be better in prod 😉

FWIW, we haven’t had a prod failure since implementing retries (i.e. we had several “events” with this error, but they were successfully retried). I think in general it’s a good idea to implement retries outside of the library code in any case, because network calls are inherently unreliable (though the library could probably retry these particular errors automatically too, depending on how easy/safe that is).

Hi @dinvlad, thank you for sharing this solution. I also have the same problem with several functions and it is really very problematic, so, as your solution seems to work, I will try to implement it. Rather than reinventing the wheel, may I ask you for a little more detail, possibly a skeleton of one of your functions? For example, when you say ‘The body of every function goes inside a “catch all” wrapper’, how do you do that in practice? Thank you in advance for your help; very best. Edit: https://cloud.google.com/blog/products/serverless/cloud-functions-pro-tips-using-retries-to-build-reliable-serverless-systems https://cloud.google.com/blog/products/serverless/cloud-functions-pro-tips-building-idempotent-functions

Our team decided to implement a general retry mechanism, based (incidentally) on Firestore, to address this and other reliability issues in our background functions, and to simultaneously prevent concurrent processing of the same event (in case of its double-triggering).

  1. We enable retries for the functions.
  2. The body of every function goes inside a “catch all” wrapper, described in the following steps.
  3. If the time since event.timestamp is past a certain threshold, we log the occasion and “successfully” return from the function early, to prevent indefinite retries. Otherwise, continue.
  4. Similarly to what @swftvsn suggested above, we have a special “event” collection for background functions.
  5. In a transaction:
    • Check if a doc with eventId exists in this collection.
    • If the doc exists, we check the status property of that doc:
      • If status === 'running', that means another invocation of the function for this event is already running, so we “successfully” return from the function early, to avoid an extra retry.
      • If status === 'failed', that means a previous invocation failed, so we set status: 'running' and continue function execution (i.e. a retry in this case).
    • If the doc doesn’t exist, we create it with status: 'running'.
  6. Function implementation runs, either successfully or with an error.
  7. In another transaction:
    • Set status: 'failed' if there was an error, and exit with that error.
    • Otherwise, delete the doc with eventId, and exit successfully.
    • retry this transaction unconditionally, if there were any Firestore errors during it (otherwise, the function could “fail” with status still set to running, preventing further retries).

As a result, if the function execution “fails” at any point, incl. its onEvent hook or implementation, it should be retried by GCF runtime automatically (up to the timestamp cut-off). As another bonus, we get the full record of all “permanently failed” invocations in the “event” collection (and with only a short-term storage of “successful” events). This setup also prevents concurrent execution of function implementation for the same eventId. Still, the implementation must be idempotent within a single invocation, to enable safe retries.

As a best practice, one could also store the attempt index in the doc, and wait before returning on each attempt with an exponential backoff + jitter, based on the attempt index.
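
A condensed sketch of that wrapper (the status values, the “events” collection name, and the age cut-off are assumptions taken from the description above; the exponential backoff and jitter live in the gist linked later in this thread):

import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();
const db = admin.firestore();

const MAX_EVENT_AGE_MS = 10 * 60 * 1000; // threshold for step 3 (value is an assumption)

type Handler<T> = (data: T, context: functions.EventContext) => Promise<void>;

function withEventLock<T>(handler: Handler<T>): Handler<T> {
  return async (data, context) => {
    // Step 3: drop events that are too old, so retries cannot loop forever.
    if (Date.now() - Date.parse(context.timestamp) > MAX_EVENT_AGE_MS) {
      console.warn('Dropping stale event', context.eventId);
      return;
    }

    const eventRef = db.collection('events').doc(context.eventId);

    // Step 5: claim the event, or bail out if another invocation is already running it.
    const shouldRun = await db.runTransaction(async (t) => {
      const snap = await t.get(eventRef);
      if (snap.exists && snap.get('status') === 'running') return false;
      t.set(eventRef, { status: 'running' });
      return true;
    });
    if (!shouldRun) return;

    try {
      // Step 6: the actual (idempotent) implementation.
      await handler(data, context);
      // Step 7: success, so remove the event doc.
      await db.runTransaction(async (t) => { t.delete(eventRef); });
    } catch (err) {
      // Step 7: failure, so record it and let the GCF retry have another go.
      // (The real version also retries this cleanup transaction unconditionally.)
      await db.runTransaction(async (t) => { t.set(eventRef, { status: 'failed' }); });
      throw err;
    }
  };
}

// Usage with a Firestore trigger (retries must also be enabled for the function):
export const onUserWrite = functions.firestore
  .document('users/{userId}')
  .onWrite(
    withEventLock<functions.Change<functions.firestore.DocumentSnapshot>>(
      async (change, context) => {
        // idempotent business logic goes here
      }
    )
  );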

@swftvsn We made the decision in regards to versioning based on the functionality that we invoke in @grpc/grpc-js. We believe (or at least believed at the time) that the subset of the functionality that we rely upon meets our expectations, as well as the expectations that we conveyed by including it in a production-ready library. We understand now that we are not always meeting this bar, and hope that we can close the gap soon.

As you are probably all aware, the GRPC layer we used before was not without issues either. It frequently failed to install (since it required either a prebuilt binary or compilation from source) and caused significant startup delays in GCF. On top of that, its sheer size made it impossible to use in some environments that are package-size constrained. With that said, I am hesitant to downgrade our dependency for all of our customers.

That being said, if the old networking layer works for you, you can still upgrade to Firebase Admin 8 and override the network stack:

const admin = require("firebase-admin");
const grpc = require("grpc");

admin.initializeApp();

const firestore = admin.firestore();
firestore.settings({grpc});

After upgrading to the latest Firebase, we’re hit with this: https://github.com/grpc/grpc-node/issues/1027 - which is a direct result of fixing this issue.

@schmidt-sebastian Why on earth does this library, which is production quality, suddenly start depending on a lib that states “This library is currently incomplete and experimental”? I think if you do that, the 3.x line should be marked as experimental too, right?

@david-arteaga Yes, we always write 2 docs using a batch, be it an update or a create, if we need to invoke a cloud function. One is the actual doc, the second one is the work queue entry. Cloud Function triggers are never attached to the main document, only ever to the work queue doc. The Cloud Function deletes the work queue entry once it is complete.

As we have some cloud functions doing heavy stuff, we also use this method to build work chains: for example, when a new invoice is created a work queue entry is placed:

// firestore path: work-queue/${clientId}/send-invoice-start/${workQueueId}
{
  invoicePath: 'clients/clientIdOne/invoices/invoiceId',
  method: 'electronic'
}

The first part only checks a couple of things, adds counters, etc. Once this is ready, the work queue entry is deleted from that location and moved to the next one in a batch. The next part is to create the PDF from the textual invoice data:

// firestore path: work-queue/${clientId}/send-invoice-create-pdf/${workQueueId}
{
  invoicePath: 'clients/clientIdOne/invoices/invoiceId',
  method: 'electronic'
}

After that is ready, the work queue entry is next moved according to the sending gateway:

// firestore path: work-queue/${clientId}/send-invoice-gateway-email/${workQueueId} OR
// firestore path: work-queue/${clientId}/send-invoice-gateway-electronic/${workQueueId} OR
// firestore path: work-queue/${clientId}/send-invoice-gateway-print/${workQueueId} OR
// firestore path: work-queue/${clientId}/send-invoice-gateway-snailmail/${workQueueId} OR
// ...
{
  invoicePath: 'clients/clientIdOne/invoices/invoiceId',
  method: 'electronic'
}

This also enables us to monitor which of our integration partners have trouble processing our sends. As all functions are idempotent, we can retry as many times as we need, and we always have a log there to see whether a Firestore trigger was silently never called. And that happens. Or at least, we don’t have any errors in the logs, yet we see work queue entries that are never touched until someone intervenes manually.
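
Roughly, the batch write for one stage looks like the sketch below (the paths mirror the examples above; everything else is made up):

import * as admin from 'firebase-admin';

const db = admin.firestore(); // assumes admin.initializeApp() has run elsewhere

// Create the invoice and its work queue entry in a single atomic batch.
async function createInvoice(clientId: string, invoice: FirebaseFirestore.DocumentData) {
  const invoiceRef = db.collection(`clients/${clientId}/invoices`).doc();
  const workRef = db.collection(`work-queue/${clientId}/send-invoice-start`).doc();

  const batch = db.batch();
  batch.set(invoiceRef, invoice);
  batch.set(workRef, { invoicePath: invoiceRef.path, method: 'electronic' });
  await batch.commit();
}

// Move a work queue entry to the next stage, again atomically.
async function moveToNextStage(
  entry: FirebaseFirestore.QueryDocumentSnapshot,
  clientId: string,
  nextStage: string // e.g. 'send-invoice-create-pdf'
) {
  const nextRef = db.collection(`work-queue/${clientId}/${nextStage}`).doc(entry.id);

  const batch = db.batch();
  batch.delete(entry.ref);
  batch.set(nextRef, entry.data());
  await batch.commit();
}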

AT LEAST READ BELOW:

And finally: Firestore is GA, Cloud Functions are GA but Cloud Function Firestore triggers are still in BETA. So this is to be expected, but we all have to learn it the hard way.

At least I missed for a long time that those are in beta and as such may change and have bugs. See https://firebase.google.com/docs/functions/firestore-events and notice the beta markings there.

We got this error today “Exception occurred in retry method that was not classified as transient”

Update: This error seems to happen with Cloud Pub/Sub and the @grpc/grpc-js libraries. Fix: Unfortunately, I needed to add the Pub/Sub Editor role to my Firebase Admin SDK service account and then it started working again. I’m not sure why the role is suddenly required, as it was not the case earlier. Did they fix the earlier issue, leading to this stricter check, or is this a new issue? @thechenky @mdietz94

@madmacc if you are using version control and have a .lock file you should be able to get the same set of packages and their dependencies by checking out the version before the upgrade and running npm ci. Then run a build of your functions and deploy again. If that doesn’t work it sounds like an error unrelated to the upgrade.

Hope that is helpful.

I think there are probably a lot more developers experiencing this issue, as it took me a couple of months to even find this thread. For me, if the error is thrown it appears in the Cloud Functions log without the “note” portion (note: 'Exception occurred in retry method that was not classified as transient' }), which makes it very difficult to search for. Only when I caught the error and logged it did I finally see that message.
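
For anyone else hitting this, the catch-and-log looks roughly like the sketch below (the collection and field names are just examples); logging the error’s code and note fields explicitly is what makes it searchable:

import * as admin from 'firebase-admin';

const db = admin.firestore(); // assumes the app is already initialized

export async function safeUpdate(userId: string): Promise<void> {
  try {
    await db.collection('users').doc(userId).update({ fieldToUpdate: 'newValue' });
  } catch (err) {
    const e = err as { code?: number; note?: string };
    // Surfaces code: 13 and the "not classified as transient" note in the function logs.
    console.error('Firestore update failed:', e.code, e.note, err);
    throw err;
  }
}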

@sambecker "@google-cloud/firestore": "^2.2.4", "firebase-admin": "^8.0.0", "firebase-functions": "^3.0.0"

@iamnels1 @swftvsn sorry for the delay - here’s our example implementation for retries with exponential backoff and jitter: https://gist.github.com/dinvlad/a280c5a44165b24960b3442e5205ab30

@dinvlad Really thank you, meanwhile I had set up something that approached that, but with great difficulty and for a less satisfactory result; if you allow it I will use your solution 😄

Hi @thechenky, any updates?

I’m facing the same error as I use Datastore (not Firestore) right now, and I’d like to know the last stable version that is still available in order to do a rollback. It’s very severe and has a huge impact on me.

[Screenshot of the error log, taken 2019-10-04 at 15:38:42]

My code is below. I’m sure it had been working until I updated the dependencies.

import { Datastore } from '@google-cloud/datastore'
const ds = new Datastore({})
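// KIND_USER_DATA (the Datastore kind) and the Data type are defined elsewhere in this project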

export function saveUserData(id: number, user_data: Data) {
    const key = ds.key([KIND_USER_DATA, ds.int(id)])
    const entity = {
        key: key,
        data: user_data
    }
    ds.save(entity, (err) => {
        if (err != null) {
            console.error(err) // Every invocation gets here after I updated dependencies !!!
        } else {
            console.log("Saved data: " + key.path)
        }
    })
}

package.json

  "dependencies": {
    "@firebase/app": "^0.4.14",
    "@firebase/app-types": "^0.4.3",
    "@google-cloud/datastore": "^4.2.0",
    "express": "4.16.4",
    "firebase-admin": "^8.3.0",
    "firebase-functions": "^3.2.0"
  },

@bottleneck-admin What versions are you using?

@dinvlad Thank you for sharing your solution! I think we want to move to similar setup in the near future.

This issue has wreaked havoc on our application. The Firebase team should be communicating this issue to project owners rather than owners having to find this thread on their own.

Until a permanent fix is found it seems like the change should be reverted so that people do not upgrade and find out about this the hard way. Especially since it was acknowledged here that:

Our backend team believes that they know what the root cause is, but it might take quite a while for the issue to be fixed in all production environments.

@h-ARTS Do you mind creating a new GitHub issue? This particular issue already has a lot of activity and I suspect that your issue is unrelated. I can help you over there.

In the new issue, can you include the results of a documentReference.get() and a documentReference.update() call for that particular document? Thanks!

@sambecker Included with the latest version of firebase-admin.

Thank you all for your effort to report and investigate this problem.

We are hitting the same issue but from a different client - @google-cloud/datastore. The exception logs “code: 13” and “method that was not classified as transient”. The stack trace looks the same.

Does it make sense to apply the same hotfix as https://github.com/googleapis/nodejs-firestore/commit/a22ac248b67d9c469af468f1f11e38ff9232dcb9 to datastore_client_config.json ?

See here: https://cloud.google.com/functions/docs/bestpractices/tips#use_global_variables_to_reuse_objects_in_future_invocations

With that said, if you are using admin.firestore() as your entry point, then you are already using Firestore from an essentially global scope, since we cache the instance for you. We recommend that you re-use an existing Firestore instance (either by keeping a reference manually or by calling admin.firestore()) so that all Firestore operations share the same GRPC channel.
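
In code, that recommendation amounts to keeping the client at module scope and reusing it inside every handler; a minimal sketch (trigger and collection names are just examples):

import * as functions from 'firebase-functions';
import * as admin from 'firebase-admin';

admin.initializeApp();

// Created once per container instance, so every invocation reuses the same
// Firestore client and its underlying GRPC channel.
const db = admin.firestore();

export const onMessageCreated = functions.firestore
  .document('messages/{messageId}')
  .onCreate(async (snap) => {
    await db.collection('stats').doc('messages').set({ last: snap.id }, { merge: true });
  });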

Our current theory (well, mostly just mine) is that this might be related to changes in our networking layer as we switched from a GRPC binary to a pure JavaScript implementation. If you can, do you mind downgrading to firebase-admin 7 and letting us know if this issue continues to reproduce? firebase-admin v7 uses @google-cloud/firestore 1.x, which relies on the binary GRPC layer.

@sambecker That seems to be the issue from the details I’ve seen in this thread, I’m just watching the thread and hoping for a resolution also since I’ve seen the issue crop up in my projects as well.

Sure you might as well try it. At least it might help reduce the error rates.

Keeping a single instance of your function warm will only reduce the chances of hitting the issue: if you get multiple simultaneous requests, Firebase will spin up new instances to run your function and you will get a cold start.

Also, writing new functions that are kept warm will not work AFAIK, since each function gets its own runtime environment/instance.

const { region } = require('firebase-functions');

exports.keepAlive = region('your-region').https.onRequest((req, res) => {
    doYourDbThingHere(); // placeholder for whatever DB call keeps the connection warm
    res.status(200).end(); // terminate the HTTP response instead of just returning
});

Got the same message today from a promise that failed.

I see the same problem on cold start, then everything is fine on subsequent calls. europe-west1 region. I started encountering this problem on 13th August 2019.

Really looking forward to a fix *pray*

I’m also experiencing this issue. I have a cloud function triggered by document create, that returns a transaction to write to a different collection. While testing the function after long periods of time (cold start) the function will fail with the errors in the original post. Subsequent attempts to trigger these functions succeed with no error.

Is there a timeline on this @schmidt-sebastian ? This is a pretty severe issue…