nodejs-pubsub: Subscription errors and does not reconnect when connection is cancelled by server
After upgrading @google-cloud/pubsub
to escape a cpu issue in @grpc/grpc-js
, we found that services using a long-running pubsub subscription began to crash periodically during local development and in GKE.
Issue
When the pubsub service closes the underlying http2 connecton (this happens after 15 minutes), the connected subscription instance emits an error
event and does not reconnect.
In a previous version of the library, the subscription instance would reconnect (and wouldn’t emit an error
event). However, it looks like the CANCELLED
status was removed from the RETRY_CODES
list found in src/pull-retry.ts
in this commit, which means we skip the the retry block and move on to destroy the stream with a StatusError
in src/message-stream.ts
here
Since the pubsub service reserves the right to cancel a connection at any time, I would expect this library to handle reconnecting when the pubsub service cancels said connection and not emit an error
message on the subscription instance.
The simplest workaround I’ve found so far is to manually add the retry code for CANCELLED
back into the RETRY_CODES
list before requiring @google-cloud/pubsub
proper:
const { RETRY_CODES } = require('@google-cloud/pubsub/build/src/pull-retry');
RETRY_CODES.push(1);
const { PubSub } = require('@google-cloud/pubsub');
Environment details
- OS: GKE
- Node.js version: 13.11.0
- package versions:
@google-cloud/pubsub
:1.7.2
google-gax
:1.15.2
@grpc/grpc-js
:0.7.9
Steps to reproduce
The simplest way to reproduce this issue is to create a subscription to a topic using the pubsub-emulator and wait 15 minutes.
Example reproduction code:
const { PubSub } = require('@google-cloud/pubsub');
const options = {
projectId: 'local-dev',
apiEndpoint: 'localhost:8085',
};
const pubsub = new PubSub(options);
const topicName = 'test-topic';
const subscriptionName = 'test-topic-subscription';
const run = async () => {
const topicDefinition = pubsub.topic(topicName);
await topicDefinition.get({ autoCreate: true });
const subscriptionDefinition = topicDefinition.subscription(subscriptionName);
const [subscription] = await subscriptionDefinition.get({ autoCreate: true });
subscription.on('message', (message) => {
console.log('message received', { message });
});
subscription.on('error', (error) => {
console.error(`Error in pubsub subscription: ${error.message}`, {
error, topicName, subscriptionName,
});
});
subscription.on('close', () => {
console.error('Pubsub subscription closed', {
topicName, subscriptionName,
});
});
console.log('handlers added to subscription');
};
run();
Running the above script without sending any messages to the topic yielded this error after 15 minutes:
Error in pubsub subscription: Call cancelled {
error: StatusError: Call cancelled
at MessageStream._onEnd ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:234:26)
at MessageStream._onStatus ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:271:18)
at ClientDuplexStreamImpl.<anonymous> ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:143:44)
at Object.onceWrapper (events.js:422:26)
at ClientDuplexStreamImpl.emit (events.js:315:20)
at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client.js:391:24)
at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:303:181)
at Http2CallStream.outputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:114:27)
at Http2CallStream.maybeOutputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:153:22)
at Http2CallStream.endCall ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:140:18) {
code: 1,
details: 'Call cancelled',
metadata: Metadata { internalRepr: Map(0) {}, options: {} }
},
topicName: 'test-topic',
subscriptionName: 'test-topic-subscription'
}
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 3
- Comments: 37 (9 by maintainers)
Hey, good(?) news!
I only saw this with grpc-js, which is also an interesting data point. I’ll be bugging a grpc person about this probably.
We are currently experiencing this same issue using gcloud-sdk docker image. What’s the possible fix ? @feywind
Thank you for working so hard on this issue!
I do not have something easy to share. I’m just seeing this in my application running locally but it’s fairly vanilla just using SubscriberClient. Something like the sample code at https://googleapis.github.io/google-cloud-dotnet/docs/Google.Cloud.PubSub.V1/ will work to repro it and you can see it stops listening after the StartAsync call:
After 15 minutes I see the output
Thanks!
Hello, we experienced the same problem in our application. Any news on that?
Sorry to be that guy but we’re still having this issue. Disregard my earlier comment on being more stable, it’s been the same, with the exception of now seeing a
console.log
ofConnection to [object Object] rejected by server because of excess pings
but still no event being emitted that would at least let me reconnect manually.@jeffijoe 2.6.0 should now be available, adding CANCELLED back to the retry codes. This is just an interim thing to get you going again. If the cancels are happening, it might be causing a secondary issue where the emulator becomes unhappy about too many connections (after a long time). I’m still investigating why it’s happening in the first place in #1135.
@feywind here are the full logs as discussed. Looking through them myself around the end, I noticed
Which AFAIK is
CANCELLED
.log.txt
Same, it appears it works when using the Emulator on the host machine directly but not through Docker.
I was able to reproduce this with the nodejs and .NET pubsub client actually with the same setup as you @feywind. It doesn’t happen when I use the pubsub emulator directly in the CLI. Only difference is I am on 10.15.3.
Hi everyone,
We’ve been looking at this today, and it seems like it’s likely to be something specific to the Node library. Unfortunately I’ve also been unable to reproduce it locally, across 3 different Node versions, which leads me to wonder if there’s something going on between Node and GKE.
Re: prioritzation, that is indeed happening right now. If this is actively blocking you, you might try the workaround suggested in the original post (adding CANCELLED back to the retry list), but I can’t really recommend that as a long term fix. I think we were just papering over an issue before.