TorchSharp: Memory leak with requires_grad

I have the following code:

using System;
using TorchSharp;

namespace MemoryLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            var device = torch.CUDA;

            while (true) {
                using (var _ = torch.NewDisposeScope()) {
                    var data = torch.randn(32, 1, 32, 32).to(device).requires_grad_(true);
                }
            }
        }
    }
}

This crashes quite quickly with:

Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 9.77 GiB total capacity; 9.46 GiB already allocated; 3.31 MiB free; 9.46 GiB reserved in total by PyTorch)

The equivalent code in PyTorch doesn't crash:

import torch

device = torch.device("cuda")

while True:
    data = torch.randn(32, 1, 32, 32).to(device).requires_grad_(True)

Also, if I change the requires_grad_ line to the following:

var data = torch.randn(32, 1, 32, 32).requires_grad_(true).to(device);

it doesn't seem to crash, but maybe it leaks CPU memory instead?
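
One way to check whether this reordered version leaks CPU memory is to watch the process working set while the loop runs. A minimal sketch (the reporting interval is arbitrary):

using System;
using System.Diagnostics;
using TorchSharp;

class Program
{
    static void Main()
    {
        var device = torch.CUDA;
        long iteration = 0;

        while (true) {
            using (var _ = torch.NewDisposeScope()) {
                var data = torch.randn(32, 1, 32, 32).requires_grad_(true).to(device);
            }

            // Report the process working set periodically; steady growth here would
            // indicate that this version leaks CPU memory instead of GPU memory.
            if (++iteration % 1000 == 0)
                Console.WriteLine($"{iteration}: {Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024)} MiB");
        }
    }
}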

Most upvoted comments

@NiklasGustafsson Oh, of course, you have control over that method. I thought it was implemented in at:: (i.e., inside libtorch itself).

Yeah… no, the culprit is the return ResultTensor(res); call, together with the fact that the managed-code caller ignores the returned handle except to check for errors.
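
For illustration only, a minimal sketch of that pattern; the P/Invoke declaration, library name, and method body below are reconstructions for the sake of the example, not the actual TorchSharp source:

using System;
using System.Runtime.InteropServices;

static class LeakPatternSketch
{
    // Hypothetical native entry point: it returns a handle to a NEW native tensor
    // (the one created by ResultTensor(res) on the C++ side).
    [DllImport("LibTorchSharp")]
    static extern IntPtr THSTensor_set_requires_grad(IntPtr tensor, bool requires_grad);

    // The leaky pattern: the returned handle is only inspected to detect failure and
    // then dropped, so the native tensor it refers to is never disposed.
    static void SetRequiresGradLeaky(IntPtr tensorHandle)
    {
        var res = THSTensor_set_requires_grad(tensorHandle, true);
        if (res == IntPtr.Zero)
            throw new InvalidOperationException("native call failed");
        // 'res' goes out of scope here without being wrapped in a managed Tensor or
        // registered with a DisposeScope, so the native reference is never released.
    }
}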

Yup, I believe that was it: the C++ implementation of 'set_requires_grad()' essentially created a native alias for the tensor, which was never recognized by the managed code that keeps track of disposables. As a result, the native reference count is kept up and never reaches 0, so the memory is never freed. I'm sure the rewrite you had wastes CPU memory instead. Anyway, it should work after the next release, but it will still be far slower than setting requires_grad and device when creating the tensor, when that's possible.
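
For reference, a sketch of what setting requires_grad and device at creation time looks like, assuming the randn factory overload that accepts device and requires_grad as named arguments:

using TorchSharp;

class Program
{
    static void Main()
    {
        var device = torch.CUDA;

        while (true) {
            using (var _ = torch.NewDisposeScope()) {
                // Pass device and requires_grad at creation instead of calling
                // .to(device).requires_grad_(true) afterwards, so no extra native
                // copies or aliases are created behind the dispose scope's back.
                var data = torch.randn(new long[] { 32, 1, 32, 32 }, device: device, requires_grad: true);
            }
        }
    }
}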

No, CATCH doesn't assume allocation:

#define CATCH(x) \
  try { \
    torch_last_err = 0; \
    x \
  } catch (const c10::Error e) { \
      torch_last_err = strdup(e.what()); \
  } catch (const std::runtime_error e) { \
      torch_last_err = strdup(e.what()); \
  }

Even if it allocated, the 'Dispose()' call should take care of it. There's something else going on… Feel free to debug in parallel! If you do, I suggest setting the batch size to something much larger than 32. With N=30000 and three channels, it takes 21 iterations to fail with 8 GB of GPU memory (my machine).
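
That is, changing the allocation in the original repro to something along these lines (each iteration then allocates roughly 350 MiB of float32 data):

var data = torch.randn(30000, 3, 32, 32).to(device).requires_grad_(true);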