grpc-node: Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted

Problem description

We use the @google-cloud/datastore dependency in our code. Since some version of @grpc/grpc-js (we’re currently on 0.6.9), we have started to receive the following error in our production and staging backends, as well as in our cron jobs that stream over ~100K / 1M records in Datastore (sometimes after ~5 minutes, sometimes after ~30 minutes). Error details as seen in our Sentry:

Error: 8 RESOURCE_EXHAUSTED: Bandwidth exhausted
    at Object.callErrorFromStatus (/root/repo/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Http2CallStream.<anonymous> (/root/repo/node_modules/@grpc/grpc-js/build/src/client.js:96:33)
    at Http2CallStream.emit (events.js:215:7)
    at Http2CallStream.EventEmitter.emit (domain.js:476:20)
    at /root/repo/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22
    at processTicksAndRejections (internal/process/task_queues.js:75:11) {
  code: 8,
  details: 'Bandwidth exhausted',
  metadata: Metadata { internalRepr: Map {}, options: {} },
  note: 'Exception occurred in retry method that was not classified as transient'
}

Reproduction steps

It’s very hard to give reproduction steps. The stack trace is not “async”, in the sense that it doesn’t link to the exact place in our code where the call was made (as it would have with return await). We know that in the backend service we make all kinds of Datastore calls, but we do NOT stream. In the cron jobs we DO stream, as well as make other API calls (get, save).
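
For context, this is the difference (an illustration only, not code from the issue; datastore.get stands in for any awaited gRPC call, and the function names are made up):

    const { Datastore } = require('@google-cloud/datastore');
    const datastore = new Datastore();

    // With a bare `return`, the promise rejects after this frame is gone,
    // so the stack trace shows only grpc-js internals.
    async function loadEntity(key) {
      return datastore.get(key);
    }

    // With `return await`, this frame stays on the async stack trace
    // (enabled by default since Node 12), pointing back to the call site.
    async function loadEntityTraceable(key) {
      return await datastore.get(key);
    }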

Environment

  • Backend service runs on App Engine (was on Node 10, now on the Node 12 beta, which runs Node 12.4.0)
  • @grpc/grpc-js@0.6.9

We definitely did NOT see this error in 0.5.x, but I don’t remember exactly in which version of 0.6.x it started to appear.

Additional context

The error happens quite seldom, maybe ~1–2 times a day on a backend service that serves ~1M requests a day. But when it fails, it fails hard: it’s impossible to try/catch such an error, and one “occurrence” usually fails multiple requests from our clients. For example, last night it failed in our staging environment while it was running e2e tests (many browsers open in parallel), which produced ~480 errors in one spike. So it looks like this error does not “recover the connection” very quickly.

Another annoying thing about this error is that if it happens inside a long-running cron job that streams some table, we have no way to recover, and the whole cron job ends up “failed in the middle” (imagine a DB migration that fails halfway through, non-transactionally). So if our cron job needs to run for ~3 hours and fails after 2 hours, we have no choice but to restart it from the very beginning (paying all the Datastore costs again).
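
One possible mitigation (a sketch, not something the issue provides): checkpoint the Datastore query cursor after each page, so an interrupted job can resume from the last checkpoint instead of restarting. The kind name and the persistCursor helper here are hypothetical:

    const { Datastore } = require('@google-cloud/datastore');
    const datastore = new Datastore();

    // Page through a kind, saving the cursor after each batch so a failed
    // run can resume from the last checkpoint instead of from scratch.
    async function processAll(kind, startCursor) {
      let cursor = startCursor; // e.g. read back from a checkpoint file
      for (;;) {
        let query = datastore.createQuery(kind).limit(500);
        if (cursor) query = query.start(cursor);
        const [entities, info] = await datastore.runQuery(query);
        for (const entity of entities) {
          // ... process one entity ...
        }
        if (info.moreResults === Datastore.NO_MORE_RESULTS) break;
        cursor = info.endCursor;
        // persistCursor(cursor); // hypothetical checkpoint write
      }
    }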

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 1
  • Comments: 40 (16 by maintainers)

Most upvoted comments

I’ve upgraded to Node v13 and haven’t had any luck in suppressing these errors. Is this the case for anyone else? Running @grpc/grpc-js@0.7.9, for reference.

After moving from node 12 to node 13, we no longer have the issue. It’s most likely related to https://github.com/nodejs/node/commit/18a1796e3cbcf2bcf5303d21de7ff5a2a6fa3bb1 and https://github.com/nodejs/node/commit/8a4a1931b8b98242abb590936c31f0c20dd2e08f

The fix is scheduled for v10.17.1 and v12.13.2. https://github.com/nodejs/node/pull/30684

Had the same problem. Fixed by the following code:

    const grpc = require('@grpc/grpc-js');

    // Raise the per-session memory limit (in MB; the old default was 10).
    const server = new grpc.Server({
        'grpc-node.max_session_memory': Number.MAX_SAFE_INTEGER
    });

I have published grpc-js 1.7.2 with Number.MAX_SAFE_INTEGER as the default value for that option on the server too.

PR #1666 has been published in grpc-js version 1.3.0. It adds a channel option grpc-node.max_session_memory. Setting that to a value larger than 10 may stop these RESOURCE_EXHAUSTED errors from happening in some cases. The specific value to set probably depends on your specific workload.
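
For a plain grpc-js client, the option can be passed as a channel option when constructing the client (a minimal sketch; the target address and the value of 100 are illustrative, not from the issue):

    const grpc = require('@grpc/grpc-js');

    // The option value is in megabytes; tune it for your workload.
    const client = new grpc.Client(
      'datastore.googleapis.com:443',
      grpc.credentials.createSsl(),
      { 'grpc-node.max_session_memory': 100 }
    );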

We started to observe the same issue yesterday. We have a client using grpc-js that we updated from 0.5.2 to 0.6.11 about 3 weeks ago, and a server using grpc-go that we updated from v1.21.1 to v1.25.1 about 3 weeks ago. What is strange is that we didn’t update anything in the past few days, yet the issue only started to show now.

We get a lot of 8 RESOURCE_EXHAUSTED: Bandwidth exhausted and a few 13 INTERNAL errors on the client after running for a few hours.

We have no error logs on the backend that could help us track down this issue. We used to have a keepalive enforcement policy with a 25-second MinTime; it has been temporarily disabled.

UPD: I don’t want to complicate this GitHub issue with different errors, but running the same cron job multiple times now gives me a different error. I don’t know how it relates to the original one; tell me if I should open another issue for it. Stack trace:

Error: 13 INTERNAL: 
    at Object.callErrorFromStatus (/Users/kirill/Idea/NCBackend3/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call.js:30:26)
    at Http2CallStream.<anonymous> (/Users/kirill/Idea/NCBackend3/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/client.js:96:33)
    at Http2CallStream.emit (events.js:215:7)
    at Http2CallStream.EventEmitter.emit (domain.js:476:20)
    at /Users/kirill/Idea/NCBackend3/node_modules/google-gax/node_modules/@grpc/grpc-js/build/src/call-stream.js:75:22
    at processTicksAndRejections (internal/process/task_queues.js:75:11) {
  code: 13,
  details: '',
  metadata: Metadata { internalRepr: Map {}, options: {} },
  note: 'Exception occurred in retry method that was not classified as transient'
}

I currently have a script that reproduced this error after ~7 minutes, 3 times in a row (I will try more runs, but it looks “consistently reproducible”). All this script does is open 16 streams in parallel, reading data from Datastore and saving it (streaming) into a gzipped file; a rough sketch is below.

This is when running on my local macOS machine.
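
The script itself isn’t attached, but its shape is roughly the following (a sketch; the kind name and output paths are placeholders):

    const { Datastore } = require('@google-cloud/datastore');
    const { Transform, pipeline } = require('stream');
    const { createGzip } = require('zlib');
    const { createWriteStream } = require('fs');

    const datastore = new Datastore();

    // Open 16 parallel Datastore streams, serialize each entity to a JSON
    // line, and pipe the result through gzip into a file.
    for (let i = 0; i < 16; i++) {
      pipeline(
        datastore.runQueryStream(datastore.createQuery('SomeKind')),
        new Transform({
          writableObjectMode: true,
          transform(entity, _enc, cb) {
            cb(null, JSON.stringify(entity) + '\n');
          },
        }),
        createGzip(),
        createWriteStream(`dump-${i}.jsonl.gz`),
        (err) => { if (err) console.error(`stream ${i} failed`, err); }
      );
    }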

yarn list @grpc/grpc-js

├─ @grpc/grpc-js@0.6.10
└─ google-gax@1.7.5
   └─ @grpc/grpc-js@0.6.9