nodejs-pubsub: Subscription errors and does not reconnect when connection is cancelled by server

After upgrading @google-cloud/pubsub to escape a cpu issue in @grpc/grpc-js, we found that services using a long-running pubsub subscription began to crash periodically during local development and in GKE.

Issue

When the pubsub service closes the underlying http2 connecton (this happens after 15 minutes), the connected subscription instance emits an error event and does not reconnect.

In a previous version of the library, the subscription instance would reconnect (and wouldn’t emit an error event). However, it looks like the CANCELLED status was removed from the RETRY_CODES list found in src/pull-retry.ts in this commit, which means we skip the the retry block and move on to destroy the stream with a StatusError in src/message-stream.ts here

Since the pubsub service reserves the right to cancel a connection at any time, I would expect this library to handle reconnecting when the pubsub service cancels said connection and not emit an error message on the subscription instance.

The simplest workaround I’ve found so far is to manually add the retry code for CANCELLED back into the RETRY_CODES list before requiring @google-cloud/pubsub proper:

const { RETRY_CODES } = require('@google-cloud/pubsub/build/src/pull-retry');
RETRY_CODES.push(1);

const { PubSub } = require('@google-cloud/pubsub');

Environment details

  • OS: GKE
  • Node.js version: 13.11.0
  • package versions:
    • @google-cloud/pubsub: 1.7.2
    • google-gax: 1.15.2
    • @grpc/grpc-js: 0.7.9

Steps to reproduce

The simplest way to reproduce this issue is to create a subscription to a topic using the pubsub-emulator and wait 15 minutes.

Example reproduction code:

const { PubSub } = require('@google-cloud/pubsub');

const options = {
  projectId: 'local-dev',
  apiEndpoint: 'localhost:8085',
};

const pubsub = new PubSub(options);

const topicName = 'test-topic';
const subscriptionName = 'test-topic-subscription';

const run = async () => {
  const topicDefinition = pubsub.topic(topicName);

  await topicDefinition.get({ autoCreate: true });

  const subscriptionDefinition = topicDefinition.subscription(subscriptionName);

  const [subscription] = await subscriptionDefinition.get({ autoCreate: true });

  subscription.on('message', (message) => {
    console.log('message received', { message });
  });

  subscription.on('error', (error) => {
    console.error(`Error in pubsub subscription: ${error.message}`, {
      error, topicName, subscriptionName,
    });
  });

  subscription.on('close', () => {
    console.error('Pubsub subscription closed', {
      topicName, subscriptionName,
    });
  });

  console.log('handlers added to subscription');
};

run();

Running the above script without sending any messages to the topic yielded this error after 15 minutes:

Error in pubsub subscription: Call cancelled {
  error: StatusError: Call cancelled
      at MessageStream._onEnd ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:234:26)
      at MessageStream._onStatus ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:271:18)
      at ClientDuplexStreamImpl.<anonymous> ([...]/Development/temp/testing-pubsub/node_modules/@google-cloud/pubsub/build/src/message-stream.js:143:44)
      at Object.onceWrapper (events.js:422:26)
      at ClientDuplexStreamImpl.emit (events.js:315:20)
      at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client.js:391:24)
      at Object.onReceiveStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:303:181)
      at Http2CallStream.outputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:114:27)
      at Http2CallStream.maybeOutputStatus ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:153:22)
      at Http2CallStream.endCall ([...]/Development/temp/testing-pubsub/node_modules/@grpc/grpc-js/build/src/call-stream.js:140:18) {
    code: 1,
    details: 'Call cancelled',
    metadata: Metadata { internalRepr: Map(0) {}, options: {} }
  },
  topicName: 'test-topic',
  subscriptionName: 'test-topic-subscription'
}

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 37 (9 by maintainers)

Most upvoted comments

Hey, good(?) news!

2020-10-16T21:50:36.408Z | call_stream | [2] HTTP/2 stream closed with code 8
2020-10-16T21:50:36.408Z | call_stream | [2] ended with status: code=1 details="Call cancelled"
2020-10-16T21:50:36.409Z | subchannel | 127.0.0.1:8085 callRefcount 5 -> 4
STREAM unpipe 2020-10-16T21:50:36.409Z was running for 00:14:58.988

I only saw this with grpc-js, which is also an interesting data point. I’ll be bugging a grpc person about this probably.

We are currently experiencing this same issue using gcloud-sdk docker image. What’s the possible fix ? @feywind

Quick update, that conversation is happening now.

Thank you for working so hard on this issue!

Also @giuliano-barberi-tf do you have a .NET snippet that you’re using for reproducing it? That might be a helpful data point too.

I do not have something easy to share. I’m just seeing this in my application running locally but it’s fairly vanilla just using SubscriberClient. Something like the sample code at https://googleapis.github.io/google-cloud-dotnet/docs/Google.Cloud.PubSub.V1/ will work to repro it and you can see it stops listening after the StartAsync call:

await subscriber.StartAsync(...)
Console.WriteLine("he's dead jim");

After 15 minutes I see the output

Thanks!

Hello, we experienced the same problem in our application. Any news on that?

Sorry to be that guy but we’re still having this issue. Disregard my earlier comment on being more stable, it’s been the same, with the exception of now seeing a console.log of Connection to [object Object] rejected by server because of excess pings but still no event being emitted that would at least let me reconnect manually.

@jeffijoe 2.6.0 should now be available, adding CANCELLED back to the retry codes. This is just an interim thing to get you going again. If the cancels are happening, it might be causing a secondary issue where the emulator becomes unhappy about too many connections (after a long time). I’m still investigating why it’s happening in the first place in #1135.

@feywind here are the full logs as discussed. Looking through them myself around the end, I noticed

D1008 17:40:01.637550000 4599324096 call.cc:734]                       {"created":"@1602193201.637495000","description":"Error received from peer ipv6:[::1]:10050","file":"../deps/grpc/src/core/lib/surface/call.cc","file_line":1055,"grpc_message":"Received RST_STREAM with error code 8","grpc_status":1}

Which AFAIK is CANCELLED.

log.txt

Same, it appears it works when using the Emulator on the host machine directly but not through Docker.

I was able to reproduce this with the nodejs and .NET pubsub client actually with the same setup as you @feywind. It doesn’t happen when I use the pubsub emulator directly in the CLI. Only difference is I am on 10.15.3.

Hi everyone,

We’ve been looking at this today, and it seems like it’s likely to be something specific to the Node library. Unfortunately I’ve also been unable to reproduce it locally, across 3 different Node versions, which leads me to wonder if there’s something going on between Node and GKE.

Re: prioritzation, that is indeed happening right now. If this is actively blocking you, you might try the workaround suggested in the original post (adding CANCELLED back to the retry list), but I can’t really recommend that as a long term fix. I think we were just papering over an issue before.