nodejs-pubsub: DEADLINE_EXCEEDED makes application not receiving messages at all

Environment details

Node.js version: v12.7.0 npm version: 6.10.0 @google-cloud/pubsub version: “^1.0.0”,

Error:

insertId: "gnr3q1fz7eerd" jsonPayload: { level: "error" message: "unhandledRejection" originalError: { ackIds: [1] code: 4 details: "Deadline exceeded" } }

After receiving this error, the app does not receive messages anymore and we have to exit the application to recreate the kubernetes pod.

Any help would be appreciated!

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 12
  • Comments: 99 (29 by maintainers)

Most upvoted comments

@pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub I recommend one of the following workarounds:

Using grpc, rather than @grpc/grpc-js

  1. inside your project which uses @google-cloud/pubsub, add the dependency grpc (this is the old gRPC transport layer).
  2. use the following code when creating your library:
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});

☝️ this same approach can be used for other libraries that use gRPC, e.g.,

const grpc = require('grpc');
const {Logging} = require('@google-cloud/logging');
const logger = new Logging({grpc});

Using the workaround recommended by @Redgwell


const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    initSubscriber();  });
};

initSubscriber();

We potentially have a reproduction of the issue described in this thread (thanks @Redgwell for pointing us in the right direction), and will hopefully have a fix out soon that makes either of these workaround unnecessary.

I can confirm this after upgrading to PubSub ^1.0.0, all our services stop sending pubsubs after the error occurs.

The full stacktrace is

Retry total timeout exceeded before any response was received Error: Retry total timeout exceeded before any response was received
    at repeat (/app/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:80:31)
    at Timeout.setTimeout [as _onTimeout] (/app/node_modules/@google-cloud/pubsub/node_modules/google-gax/build/src/normalCalls/retries.js:113:25)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10) 

Can I suggest raising the priority on this issue?

We used to see these error messages, we now see these errors in all our projects that use it:

Error: Failed to connect to channel. Reason: Failed to connect before the deadline
    at MessageStream._waitForClientReady (/src/node_modules/@google-cloud/pubsub/build/src/message-stream.js:318:19)

This is an issue hitting a percentage of our users, but is not hitting 100% of library users

this comment makes no sense from optics and I totally agree with @MatthieuLemoine release a proper fix for this or one with this ‘workaround’ built in, asking customers to change their production code with some speculative ‘fix’ is irresponsible

what happens when this is actually fixed and this ends up causing more problems later

the updated documentation does not even mention what kind of workloads would be better to use native or not making the mere suggestion of using it even more confusing and bug prone potentially

I just wanted to give an update before the weekend, we do have a version of @grpc/grpc-js (0.6.9), that all signs are indicating is stable:

  • the timeout issue, that we had managed to reproduce on one system, are no longer occurring.
  • folks running this version of the API have, so far, indicated that they’re not seeing any issues.
  • my colleague @murgatroid99 is feeling confident that he addressed the issues that were leading to the known behavior in this thread.

The reason I was holding off on this update, was that we were doing more stress testing on the system that we had managed to reproduce this issue on.


If anyone is still bumping into issues on 0.6.9, please:

  1. open a new issue on PubSub, so that we can debug your issue in isolation (just in case there’s more than one thing being debugged in this thread).

  2. run your environment with the following environment variables set:

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG
  1. provide your logs to us with the gRPC output immediately before an error occurred

So far, with debug information, @murgatroid99 has been able to address issues almost immediately.

If you do not want to share your logs publicly (understandably) you can open an issue through our issue tracker, and also email me (bencoe [at] google.com so that I can make sure it’s escalated immediately):

https://issuetracker.google.com/savedsearches/559741


Now, if folks start using @grpc/grpc-js@0.6.9, and it becomes apparent that it is not in fact stable, I will take steps to move us back to grpc immediately in @google-cloud/pubsub (until such time that we are confident).

I’d like to 👍 this going back to a P1. Providing a workaround isn’t an acceptable response. Indeed, we have implemented the workaround and rolled out to Prod to see that while it successfully mitigated the lost pub/sub connection, it also introduces a memory leak that requires periodic restart of our k8s pods regardless.

Nope. That didn’t work. 😭 still getting this.

GoogleError: Retry total timeout exceeded before any response was received
    at repeat (/var/www/app/node_modules/google-gax/src/normalCalls/retries.ts:98:23)
    at Timeout._onTimeout (/var/www/app/node_modules/google-gax/src/normalCalls/retries.ts:140:13)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7) {
  code: 4
}

@pwrkpop @npomfret @xoraingroup, rather than rolling back to 0.29.1 of @google-cloud/pubsub I recommend one of the following workarounds:

Using grpc, rather than @grpc/grpc-js

  1. inside your project which uses @google-cloud/pubsub, add the dependency grpc (this is the old gRPC transport layer).
  2. use the following code when creating your library:
const {PubSub} = require('@google-cloud/pubsub');
const grpc = require('grpc');
const pubsub = new PubSub({grpc});

this same approach can be used for other libraries that use gRPC, e.g.,

const grpc = require('grpc');
const {Logging} = require('@google-cloud/logging');
const logger = new Logging({grpc});

Using the workaround recommended by @Redgwell

const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    initSubscriber();  });
};

initSubscriber();

We potentially have a reproduction of the issue described in this thread (thanks @Redgwell for pointing us in the right direction), and will hopefully have a fix out soon that makes either of these workaround unnecessary.

Cant find anyting on creating PubSub with options includeing grpc? I use projectId, and tried to add grpc, but typescript rejects this.

https://googleapis.dev/nodejs/pubsub/1.1.1/PubSub.html

@bcoe

  1. Really low throughput. Probably less than 10 a day at this point.
  2. We just released our service last night, and when we checked the logs this morning, we saw this error. The service was happy for about 3-4 hours before the error showed up in the logs.

I did a quick google and landed here. Locally, I was able to replicate the same error message and behaviour (that is, no longer receiving messages after the error) by:

  1. Start app, subscribe, verify messages are being received.
  2. Disconnect from internet, and wait for the deadline exceeded error
  3. Reconnect to internet, publish messages

To make this a little faster to replicate, I set the streamingOptions timeout value in the pubsub options object to have a shorter streaming connection timeout:

   streamingOptions: {
      timeout: 5000
    }

One thing that might be useful for fixing this issue - if you look at my workaround, inside the initSubscriber function, I create a new PubSub() object. I found that if I didn’t do this inside the initSubscriber function, and instead did it just once when the app starts, the application just kept getting deadline exceeded messages over and over again, even once I’d reconnected to the internet.

That makes me thing there is some state being set in the PubSub object that puts it in a faulted state until it’s recreated. Sorry, I don’t have the time right now to dig into that assertion any further.

@google-cloud/pubsub@1.1.1 definitely not working for me. I regularly see {"code":4,"details":"Failed to connect before the deadline","metadata":{"internalRepr":{},"options":{}}} and then my entire app stops working. The subscribers stop processing messages.

Is there an agreed on workaround in any of the above?

[EDIT]. I’m also seeing this error from time to time

{"ackIds":["KlgRTgQhIT4wPkVTRFAGFixdRkhRNxkIaFEOT14jPzUgKEUSASBuFSFCXhliaFxcdQdQC00geTQnYltFVQhCUnRfcysvV1tbdAVRDR56e2Z0aF8XCSr75KDd7KSXWUZgTbTgwcVHXbKv4JoiZh49WxJLLD5-MDxFQV5AEkw7CURJUytDCw"],"code":4,"details":"Deadline exceeded","metadata":{"internalRepr":{},"options":{}}}

Don’t know if it’s related though.

Errors started happening at 2019-10-09 17:35:28.252 BST (Wed Oct 09 16:35:28 UTC)

Hello, We hit the same problem here since 10/02. We tried upgrading to 0.32.1, and even to 1.1.0. Didn’t solve a thing. We are running in App Engine, so when one of the instances starts hitting the error, it snowballs and errors flow like crazy until the instance gets killed and another instance starts. Then, errors stop flowing for a bit.

Non of our services using pubsub is working anymore either. We are using version 1.1.0 Getting this:

Error: Retry total timeout exceeded before any response was received
    at repeat (/var/www/app/node_modules/google-gax/build/src/normalCalls/retries.js:80:31)
    at Timeout.setTimeout [as _onTimeout] (/var/www/app/node_modules/google-gax/build/src/normalCalls/retries.js:113:25)
    at ontimeout (timers.js:436:11)
    at tryOnTimeout (timers.js:300:5)
    at listOnTimeout (timers.js:263:5)
    at Timer.processTimers (timers.js:223:10)

And this:

Error: Failed to connect to channel. Reason: Failed to connect before the deadline
  File "/var/www/app/node_modules/@google-cloud/pubsub/build/src/message-stream.js", line 318, col 19, in MessageStream._waitForClientReady
    throw new ChannelError(e);

We have to restart our services every 10 minutes because of that.

It also seems til it is storing more and more to disk as disk usage goes up over time.

Hey @bcoe, first time after upgrading googlecloud/pubsub to ^1.0.0. Any workaround to recreate subsciption after this error?

@npomfret I think this “@google-cloud/pubsub”: “0.29.1”, has done magic. I have had the same issue. Only 0.29.1 is the working version.

Okay, I grabbed the latest as you said - here’s the pubsub dependency tree from npm ls:

├─┬ @google-cloud/pubsub@1.1.2
│ ├─┬ @google-cloud/paginator@2.0.1
│ │ ├── arrify@2.0.1 deduped
│ │ └── extend@3.0.2 deduped
│ ├── @google-cloud/precise-date@1.0.1
│ ├── @google-cloud/projectify@1.0.1
│ ├── @google-cloud/promisify@1.0.2
│ ├── @sindresorhus/is@1.2.0
│ ├─┬ @types/duplexify@3.6.0
│ │ └── @types/node@12.7.12
│ ├── @types/long@4.0.0
│ ├── arrify@2.0.1
│ ├── async-each@1.0.3
│ ├── extend@3.0.2
│ ├─┬ google-auth-library@5.4.1
│ │ ├── arrify@2.0.1 deduped
│ │ ├── base64-js@1.3.1
│ │ ├── fast-text-encoding@1.0.0
│ │ ├─┬ gaxios@2.0.1
│ │ │ ├── abort-controller@3.0.0 deduped
│ │ │ ├── extend@3.0.2 deduped
│ │ │ ├─┬ https-proxy-agent@2.2.2
│ │ │ │ ├─┬ agent-base@4.3.0
│ │ │ │ │ └─┬ es6-promisify@5.0.0
│ │ │ │ │   └── es6-promise@4.2.8
│ │ │ │ └── debug@3.2.6 deduped
│ │ │ └── node-fetch@2.6.0 deduped
│ │ ├─┬ gcp-metadata@3.2.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ └─┬ json-bigint@0.3.0
│ │ │   └── bignumber.js@7.2.1
│ │ ├─┬ gtoken@4.1.0
│ │ │ ├── gaxios@2.0.1 deduped
│ │ │ ├─┬ google-p12-pem@2.0.2
│ │ │ │ └── node-forge@0.9.1
│ │ │ ├── jws@3.2.2 deduped
│ │ │ └── mime@2.4.4
│ │ ├─┬ jws@3.2.2
│ │ │ ├─┬ jwa@1.4.1
│ │ │ │ ├── buffer-equal-constant-time@1.0.1
│ │ │ │ ├─┬ ecdsa-sig-formatter@1.0.11
│ │ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ │ └── safe-buffer@5.2.0 deduped
│ │ │ └── safe-buffer@5.2.0 deduped
│ │ └─┬ lru-cache@5.1.1
│ │   └── yallist@3.1.1
│ ├─┬ google-gax@1.7.4
│ │ ├─┬ @grpc/grpc-js@0.6.8
│ │ │ └── semver@6.3.0 deduped
│ │ ├─┬ @grpc/proto-loader@0.5.2
│ │ │ ├── lodash.camelcase@4.3.0
│ │ │ └── protobufjs@6.8.8 deduped
│ │ ├─┬ abort-controller@3.0.0
│ │ │ └── event-target-shim@5.0.1
│ │ ├─┬ duplexify@3.7.1
│ │ │ ├─┬ end-of-stream@1.4.4
│ │ │ │ └── once@1.4.0 deduped
│ │ │ ├── inherits@2.0.4 deduped
│ │ │ ├── readable-stream@2.3.6 deduped
│ │ │ └── stream-shift@1.0.0
│ │ ├── google-auth-library@5.4.1 deduped
│ │ ├── is-stream-ended@0.1.4 deduped
│ │ ├── lodash.at@4.6.0
│ │ ├── lodash.has@4.5.2
│ │ ├── node-fetch@2.6.0
│ │ ├── protobufjs@6.8.8 deduped
│ │ ├─┬ retry-request@4.1.1
│ │ │ ├─┬ debug@4.1.1
│ │ │ │ └── ms@2.1.2 deduped
│ │ │ └─┬ through2@3.0.1
│ │ │   └── readable-stream@2.3.6 deduped
│ │ ├── semver@6.3.0 deduped
│ │ └── walkdir@0.4.1
│ ├── is-stream-ended@0.1.4
│ ├── lodash.snakecase@4.1.1
│ ├── p-defer@3.0.0
│ └─┬ protobufjs@6.8.8
│   ├── @protobufjs/aspromise@1.1.2
│   ├── @protobufjs/base64@1.1.2
│   ├── @protobufjs/codegen@2.0.4
│   ├── @protobufjs/eventemitter@1.1.0
│   ├─┬ @protobufjs/fetch@1.1.0
│   │ ├── @protobufjs/aspromise@1.1.2 deduped
│   │ └── @protobufjs/inquire@1.1.0 deduped
│   ├── @protobufjs/float@1.0.2
│   ├── @protobufjs/inquire@1.1.0
│   ├── @protobufjs/path@1.1.2
│   ├── @protobufjs/pool@1.1.0
│   ├── @protobufjs/utf8@1.1.0
│   ├── @types/long@4.0.0 deduped
│   ├── @types/node@10.14.21
│   └── long@4.0.0

I tried my test loop again - with this code:


const initSubscriber = () => {
  const pubsub = new PubSub();
  const subscription = pubsub.subscription(topic, options);
  subscription.on('message', handler.handleMessage);
  subscription.on('error', e => {
    console.error('Subscriber error handler caught exception. Restarting subscription.', e);
    //initSubscriber(); <------ commented out this, because I'm expecting that the fix means we don't need to do this anymore right?
  });
};

initSubscriber();

So with that running, I pushed some messages, saw the the app was handling them, then disconnected wifi and waited for the deadline exceeded error. I reconnected wifi, pushed some more messages and waited for about 2 minutes. The app didn’t process the messages. After restarting the app, it immediately processed the messages.

So unfortunately, it doesn’t seem that this has fixed things, at least for my test loop of disconnecting the internet connection.

@Redgwell, could I bother you to test with @grpc/grpc-js@0.6.8, if you delete your node_modules and package-lock.json, and run npm install @google-cloud/pubsub, you should get it now. (you can run npm ls post install to confirm).

We believe we’ve potentially addressed the bug requiring you to resubscribe.

@gae123 there’s a chance we have a reproduction, we’re working towards testing on a few @grpc/grpc-js versions, so that we can better isolate where the breakage occurs. Will keep you updated.