TorchSharp: Memory leak with requires_grad
I have the following code:
using System;
using TorchSharp;

namespace MemoryLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            var device = torch.CUDA;
            while (true) {
                using (var _ = torch.NewDisposeScope()) {
                    var data = torch.randn(32, 1, 32, 32).to(device).requires_grad_(true);
                }
            }
        }
    }
}
This crashes quite quickly with:
Unhandled exception. System.Runtime.InteropServices.ExternalException (0x80004005): CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 9.77 GiB total capacity; 9.46 GiB already allocated; 3.31 MiB free; 9.46 GiB reserved in total by PyTorch)
The equivalent code in PyTorch doesn’t crash:
import torch

device = torch.device("cuda")
while True:
    data = torch.randn(32, 1, 32, 32).to(device).requires_grad_(True)
Also, if I change the requires_grad line to the following:
var data = torch.randn(32, 1, 32, 32).requires_grad_(true).to(device);
It doesn’t seem to crash, but maybe it leaks CPU memory instead?
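One way to check that hypothesis would be to sample the process working set every so often while running this variant. A minimal sketch using standard .NET diagnostics (the reporting interval and the choice of WorkingSet64 as the metric are just illustrative):

using System;
using System.Diagnostics;
using TorchSharp;

class Program
{
    static void Main()
    {
        var device = torch.CUDA;
        var proc = Process.GetCurrentProcess();
        for (long i = 0; ; i++) {
            using (var _ = torch.NewDisposeScope()) {
                var data = torch.randn(32, 1, 32, 32).requires_grad_(true).to(device);
            }
            if (i % 10000 == 0) {
                proc.Refresh();
                // If nothing leaks on the CPU side, the working set should stay roughly flat.
                Console.WriteLine($"iter {i}: working set = {proc.WorkingSet64 / (1024 * 1024)} MiB");
            }
        }
    }
}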
About this issue
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
Commits related to this issue
- Fix to issue #1057 — committed to NiklasGustafsson/TorchSharp by NiklasGustafsson a year ago
- Merge pull request #1058 from NiklasGustafsson/bugs Fix to issue #1057 — committed to dotnet/TorchSharp by NiklasGustafsson a year ago
Yeah… no, the culprit is the
return ResultTensor(res);
and the fact that the managed-code caller ignores the returned handle except to check for errors.

Yup, I believe that was it – the C++ implementation of 'set_requires_grad()' essentially created a native alias for the tensor, which was never recognized by the managed code that keeps track of disposables. As a result, the native reference count is kept up, never reaches 0, and the memory is never freed. I'm sure the rewrite you had wastes CPU memory instead. Anyway, it should work after the next release, but it will still be far slower than setting requires_grad and device when creating the tensor, when possible.
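For reference, a minimal sketch of what "setting requires_grad and device when creating the tensor" could look like, assuming a randn factory overload that accepts device and requires_grad arguments (the named parameters here are an assumption, not copied from the issue):

using TorchSharp;

class Program
{
    static void Main()
    {
        var device = torch.CUDA;
        while (true) {
            using (var _ = torch.NewDisposeScope()) {
                // Allocate directly on the GPU with the gradient flag set at creation,
                // instead of chaining .to(device) and requires_grad_(true) afterwards.
                var data = torch.randn(32, 1, 32, 32, device: device, requires_grad: true);
            }
        }
    }
}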
No, catch doesn't assume allocation: even if it allocated, the 'Dispose()' call should take care of it. There's something else going on… Feel free to debug in parallel! If you do, I suggest setting the batch size to something much larger than 32. With N=30000 and three channels, it takes 21 iterations for it to fail with 8 GB of GPU memory (my machine).
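If it helps, here is a sketch of that larger repro (the 32×32 spatial size is carried over from the original snippet; only the batch and channel counts come from the comment above):

using System;
using TorchSharp;

class Program
{
    static void Main()
    {
        var device = torch.CUDA;
        for (int i = 0; ; i++) {
            using (var _ = torch.NewDisposeScope()) {
                // ~350 MiB per tensor (30000 x 3 x 32 x 32 floats), so a leaked
                // native alias exhausts an 8 GB GPU after roughly 21 iterations.
                var data = torch.randn(30000, 3, 32, 32).to(device).requires_grad_(true);
            }
            Console.WriteLine($"completed iteration {i}");
        }
    }
}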