grpc-node: Potential memory leak in resolver-dns

Problem Description

Previously, we had an issue where upgrading @grpc/grpc-js from 1.3.x to 1.5.x introduced a channelz memory leak (fixed in this issue for 1.5.10).

Upgrading to 1.5.10 locally seems to be fine and I have noticed no issues. However, when we upgraded our staging/production environments, the memory leak appears to have returned, with the only change being the update from @grpc/grpc-js 1.3.x to 1.5.10.

Looking at Datadog's continuous profiler, I am not sure whether this is the root cause, but the heap is definitely growing.

Again, we are running a production service with a single grpc-js server that creates multiple grpc-js clients. The clients are created and destroyed using lightning-pool.

Channelz is disabled when we initialize both the server and the clients by passing 'grpc.enable_channelz': 0 in the channel options.
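
For reference, a minimal sketch of how that option is passed; grpc.Client, the insecure credentials, and the localhost address are placeholders standing in for our generated clients and real targets:

```ts
import * as grpc from '@grpc/grpc-js';

// Channelz disabled on the server via channel options.
const server = new grpc.Server({ 'grpc.enable_channelz': 0 });

// The same option is passed when constructing each client.
// grpc.Client is the generic base class; the address and insecure
// credentials are placeholders for our real setup.
const client = new grpc.Client(
  'localhost:50051',
  grpc.credentials.createInsecure(),
  { 'grpc.enable_channelz': 0 },
);
```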

Reproduction Steps

The reproduction steps are the same as before; the only difference I can think of is that this time the service is under staging/production load.

Create a single grpc-js server that calls grpc-js clients as needed from a pool resource, with channelz disabled. In our case, the server is running and, when requests come in, we acquire a client via the pool (whose factory is created once as a singleton) to make the outbound request. The setup needs to handle multiple concurrent requests.
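
Roughly, the lifecycle the pool factory puts each client through looks like the sketch below. The lightning-pool acquire/release wiring is omitted, and the target address and base grpc.Client usage are placeholders for our generated clients:

```ts
import * as grpc from '@grpc/grpc-js';

// Placeholder target; the real service talks to generated service clients.
const TARGET = 'my-upstream:50051';

// What the pool factory does when it creates a resource.
function createClient(): grpc.Client {
  return new grpc.Client(TARGET, grpc.credentials.createInsecure(), {
    'grpc.enable_channelz': 0,
  });
}

// What the pool factory does when it destroys a resource. Closing the
// client should release the underlying channel, its DNS resolver, and
// any backoff timers.
function destroyClient(client: grpc.Client): void {
  client.close();
}

// Per-request usage: acquire (here: create), use, then release (here: destroy).
const client = createClient();
client.waitForReady(Date.now() + 5000, (err) => {
  // ...issue the RPC through the generated stub here...
  destroyClient(client);
});
```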

Environment

  • OS Name: macOS (local testing) and AWS EKS clusters (production)
  • Node Version: 14.16.0
  • Package Name and Version: @grpc/grpc-js@1.5.10

Additional Context

Looking at the profiler's Heap Live Size view, there appears to be growing heap usage attributed to backoff-timeout.js, resolver-dns.js, load-balancer-child-handler.js, load-balancer-round-robin.js, and channel.ts. I let the service run for about 2.5 hours and compared the heap profiles from the first 30 minutes with those from the last 30 minutes to see what changed. In the equivalent profiles on @grpc/grpc-js@1.3.x, these files do not appear to retain memory at all.
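
In case it helps anyone reproduce the comparison without Datadog, heap snapshots can also be written with Node's built-in v8.writeHeapSnapshot (available on Node 14) and diffed in Chrome DevTools; a minimal sketch:

```ts
import { writeHeapSnapshot } from 'v8';

// Write one snapshot shortly after startup, then another every 30 minutes
// while the service is under load, mirroring the comparison described above.
writeHeapSnapshot(`heap-${Date.now()}.heapsnapshot`);

setInterval(() => {
  writeHeapSnapshot(`heap-${Date.now()}.heapsnapshot`);
}, 30 * 60 * 1000);
```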

I see that 1.6.x made some updates to some timers; I was wondering if that could be related.

Happy to provide more context or help as needed.

NOTE: To clarify the graph: the problem starts and ends within the highlighted intervals. Everything outside them is from a different process and from rolling the package back.

[Screenshot: Heap Live Size graph from the Datadog profiler, 2022-04-05 3:29 PM]

[Screenshot: detail view of the other red section from above, 2022-04-05 3:41 PM]

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (12 by maintainers)

Most upvoted comments

The requested tests have been added in #2105.

@sam-la-compass Can you check if the latest version of grpc-js fixes the original bug for you?