nodejs-pubsub: Many parallel subscriptions in one process => DEADLINE_EXCEEDED upon ack

Environment details

  • OS: GKE
  • Node.js version: 10.13.0
  • @google-cloud/pubsub version: 0.28.1

Steps to reproduce

This is a follow-up report to #240; the symptoms are the same, but I think I have a better understanding of what might be going wrong this time.

I have made another attempt at finally upgrading the Node Pub/Sub library from v0.18.0 to v0.28.1.

  • I am currently running a couple of pods that subscribe to several pubsub topics.
  • Most of the pods subscribe to only one or two topics; those work fine with the updated library.
  • I have one pod that subscribes to many subscriptions; this one shows problems, see below.

Here’s what the problematic pod does (a minimal sketch follows the list):

  • Starts up by initializing 23 different (pre-existing) subscriptions on 23 different topics.
  • Starts pulling the messages that sit in these subscriptions. These are low-frequency subscriptions, so we’re talking about 0–500 messages sitting in the subscriptions initially.
  • Processes those messages in batches and ack()s all of them; processing a batch takes ~500 ms.
  • After a few seconds (I’ve seen 5-12), I receive a DEADLINE_EXCEEDED error event on one of the subscriptions where acks have been sent (and I crash the process).
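In code, the setup looks roughly like the sketch below. This is a minimal illustration with placeholder subscription names, not the actual production code; it assumes the v0.28.x API where new PubSub() is the entry point and messages are acked via message.ack():

    import {PubSub} from '@google-cloud/pubsub';

    // One client for the whole process, 23 pre-existing subscriptions.
    const pubsub = new PubSub();
    const subscriptionNames = ['sub-01', 'sub-02', /* ... */ 'sub-23']; // placeholder names

    for (const name of subscriptionNames) {
      const subscription = pubsub.subscription(name);

      subscription.on('message', message => {
        // The real workload collects messages into batches (~500 ms of
        // processing per batch) and then acks every message in the batch.
        message.ack();
      });

      subscription.on('error', err => {
        // This is where the DEADLINE_EXCEEDED error event surfaces after a few seconds.
        console.error(`subscription ${name} failed`, err);
        process.exit(1);
      });
    }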

Here’s the error:

Error: Failed to "acknowledge" for 6 message(s). Reason: 4 DEADLINE_EXCEEDED: Deadline Exceeded
    at AckQueue.<anonymous> (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:162:23)
    at Generator.throw (<anonymous>)
    at rejected (/app/node_modules/@google-cloud/pubsub/build/src/message-queues.js:20:65)

For the affected subscriptions, the num_undelivered_messages Stackdriver metric just keeps rising:

[Screenshot 2019-03-13 22 25 11: num_undelivered_messages rising for the affected subscription]

Stackdriver also indicates that there are indeed streaming_pull_message_operation_count operations (they keep rising because my pod keeps crashing and restarting, thus pulling the growing backlog of unacked messages over and over again):

[Screenshot 2019-03-13 22 26 44: streaming_pull_message_operation_count rising]

However, Stackdriver doesn’t show any pull_ack_message_operation_count operations (ignore the spike on the right; that is where I started the pod with only one subscription, where it worked):

[Screenshot 2019-03-13 22 30 30: no pull_ack_message_operation_count operations, apart from the spike on the right]

Just for comparison: version v0.18.0 of this client library, which I had been using before, was still using streamingAcks (see the discussion in #240), and everything was working well, even with the 23 parallel subscriptions in one Node process:

[Screenshot 2019-03-13 22 32 18: ack operations showing up as expected with v0.18.0]

All screenshots above show the same subscription and the same timeframe; at ~4:30pm I deployed the new version, which is the point where the number of undelivered messages starts rising.

To me it looks like the number of parallel subscriptions determines whether acks work as expected or whether I consistently receive DEADLINE_EXCEEDED errors in response to the ack batches. It’s interesting that acks either work for all subscriptions handled by the affected process or for none.

I doubt that it is CPU related, because the process does basically only I/O (it pipes the messages from Pub/Sub to BigQuery), and the GKE CPU graphs also show no significant CPU usage. CPU usage of the same process with v0.18.0 of the library was also <1%.
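For context, the per-batch work is essentially a streaming insert into BigQuery followed by acks, roughly along the lines of this sketch (the dataset and table names are placeholders and the row mapping is purely illustrative, not the actual pipeline code):

    import {BigQuery} from '@google-cloud/bigquery';

    const bigquery = new BigQuery();

    // Illustrative batch handler: parse the buffered Pub/Sub messages into rows,
    // stream them into BigQuery, then ack every message in the batch.
    // 'my_dataset' and 'my_table' are placeholder names.
    async function flushBatch(messages: Array<{data: Buffer; ack(): void}>): Promise<void> {
      const rows = messages.map(m => JSON.parse(m.data.toString()));
      await bigquery.dataset('my_dataset').table('my_table').insert(rows);
      for (const m of messages) {
        m.ack();
      }
    }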

Is the number of parallel subscriptions that can be handled in a single node process (all subscription message handlers registered with a single new PubSub() client instance) somehow limited?

Any pointer on which direction I should be looking in would be highly appreciated!

(The next thing I’d like to try is to see whether I can narrow down the number of parallel subscriptions in one process at which things start to break… However, due to this npm registry issue I currently have trouble deploying new code to the GKE cluster: https://status.npmjs.org/incidents/3p30y0pkblsp)

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 24 (18 by maintainers)

Most upvoted comments

@sduskis totally fine with a safe limit of 50. Just give us whatever error message is more descriptive than DEADLINE_EXCEEDED and throws immediately, not only after a while when the first acks start to fail… 😉

I furthermore suggest moving option 4 into a separate issue and keeping this one open only until the error message has been added.

@callmehiphop @bcoe I’ve extended my reproduction case a bit to show that:

  1. The limit is indeed per new PubSub() instance. If I create a new PubSub() instance for each subscription, I can subscribe to 20 subscriptions (or more) with maxStreams: 5 (the default), which equals >= 100 concurrent gRPC streams (a rough sketch of this follows the list).
  2. Publishing on the same new PubSub() instance doesn’t seem to have any effect on the observed subscription behavior. Whether or not I have previously published on the same new PubSub() instance, I can spawn exactly 99 concurrent gRPC streams, and as soon as I spawn the 100th I get DEADLINE_EXCEEDED errors.
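For reference, the per-instance workaround from point 1 boils down to something like the sketch below. Subscription names are placeholders, and the underlying assumption (based on the behavior described above, not on documented guarantees) is that each PubSub client gets its own gRPC channel with a limit of roughly 100 concurrent streams:

    import {PubSub} from '@google-cloud/pubsub';

    const subscriptionNames = ['sub-01', 'sub-02' /* , ... */]; // placeholder names

    for (const name of subscriptionNames) {
      // A separate client per subscription, so each one gets its own gRPC
      // channel. With the default of 5 streaming-pull streams per subscription,
      // a single shared client reaches ~100 concurrent streams at 20 subscriptions.
      const pubsub = new PubSub();
      const subscription = pubsub.subscription(name);

      subscription.on('message', message => message.ack());
      subscription.on('error', err => console.error(`subscription ${name} failed`, err));
    }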