MONAI: HausdorffDTLoss leads to GPU memory leak.

Describe the bug
Using this loss with the Trainer from the transformers library (PyTorch) and with YOLOv8 (PyTorch) causes training to crash shortly after it starts due to a CUDA out-of-memory error. The GPU has 16 GB of memory, the batch size is 1, and the images are 128×128. Training crashes after roughly 100 iterations.

Environment

Kaggle Notebook, Python 3.10.12, latest MONAI version from pip.

I also reproduced this bug under Windows 11 with the following example code:

%%time

import torch
import numpy as np
from monai.losses.hausdorff_loss import HausdorffDTLoss
from monai.networks.utils import one_hot

for i in range(0, 30):
    B, C, H, W = 16, 5, 512, 512
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    self = HausdorffDTLoss(include_background=True, reduction='none', softmax=True)
    loss = self(input.to('cuda'), target.to('cuda'))
    assert np.broadcast_shapes(loss.shape, input.shape) == input.shape

It consumed about 5 GB of GPU memory; on the GPU consumption graph this looks like a flat line with several step-like rises.
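
As a stopgap until a proper fix lands, the sketch below mirrors the repro loop but explicitly drops the loss tensor and forces garbage collection every iteration, following the gc.collect() observation discussed further down in this thread; this is an illustrative workaround, not a confirmed fix:

import gc

import torch
from monai.losses.hausdorff_loss import HausdorffDTLoss
from monai.networks.utils import one_hot

# Build the loss once instead of once per iteration.
loss_fn = HausdorffDTLoss(include_background=True, reduction='none', softmax=True)

for i in range(30):
    B, C, H, W = 16, 5, 512, 512
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    loss = loss_fn(input.to('cuda'), target.to('cuda'))
    # ... use the loss here, e.g. loss.mean().backward() in a real training step ...
    del loss                  # drop the last reference to the loss tensor
    gc.collect()              # collect anything kept alive only by reference cycles
    torch.cuda.empty_cache()  # optional: return cached blocks to the allocator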


Most upvoted comments

Hi @johnzielke, thanks for the detailed report. Your findings are insightful and indeed point to an interaction between CuPy, garbage collection, and memory deallocation, which could be the root cause of the memory leak. I agree that looking into this could both resolve the leak and potentially offer a substantial performance boost by allowing the loss calculation to run on the GPU.

I tried the 1.3.0 container and the latest container with cupy-cuda12x installed, and both work well; the issue only appears after uninstalling cupy-cuda12x. So the problem might come from the interdependencies between CUDA and the CuPy library: when CuPy is installed, it links against specific CUDA libraries to perform GPU computations, and when CuPy is uninstalled, some CUDA operations may not be performed correctly because they rely on CuPy to access the GPU. I currently don't have time to take a deep look at this issue.
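
A quick way to check which situation you are in (a sketch using MONAI's optional_import helper; it only verifies that the optional GPU dependencies are importable, not which code path the loss actually takes):

from monai.utils import optional_import

# optional_import returns (module, bool); the bool tells whether the import succeeded.
cp, has_cupy = optional_import("cupy")
cucim, has_cucim = optional_import("cucim")
print(f"cupy available:  {has_cupy}")
print(f"cucim available: {has_cucim}")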

I meant test it on your setup, but I would guess so. You should also get a nice ~10x performance boost in the loss calculation with both cupy and cucim installed, since it will then run on the GPU. There should probably be a warning, or at least some more documentation, explaining how to get the calculation onto the GPU.
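
For anyone looking for the install step, this is roughly what it looks like in a notebook; the package names here are an assumption on my part (cupy-cuda12x is the one mentioned above for CUDA 12.x), so check the CuPy and RAPIDS cuCIM install guides for your CUDA version and platform:

# cuCIM ships Linux-only wheels, and the exact package name may differ per CUDA version.
!pip install cupy-cuda12x cucim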

Chiming in here, since I worked on https://github.com/Project-MONAI/MONAI/pull/7008 trying to make the HausdorffLoss work with cucim.

@SarthakJShetty-path I did switch to the ShapeLoss from #4205, but I'm not sure it brings the expected results.

Can you try running this piece of code and posting the results?

import numpy as np
import torch
import matplotlib.pyplot as plt
from monai.networks.utils import one_hot
from monai.losses.hausdorff_loss import HausdorffDTLoss

gpu_consumption = []
steps = []

for i in range(0, 100):
    B, C, H, W = 16, 5, 512, 512
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    self = HausdorffDTLoss(include_background=True, reduction="none", softmax=True)
    loss = self(input.to("cuda"), target.to("cuda"))
    assert np.broadcast_shapes(loss.shape, input.shape) == input.shape
    memory_consumption = torch.cuda.max_memory_allocated(device=None) / (1e9)
    gpu_consumption.append(memory_consumption)
    steps.append(i)
    print(f"GPU max memory allocated: {memory_consumption} GB")

plt.plot(steps, gpu_consumption)
plt.title("GPU consumption (in GB) vs. Steps")
plt.show()
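
One caveat about the measurement itself: torch.cuda.max_memory_allocated() reports the peak since process start (or since the last reset), so its curve can only rise. A variant of the loop above (a sketch) that resets the peak counter each step and also prints torch.cuda.memory_allocated() separates the per-step peak from what is actually still held afterwards:

import torch
from monai.losses.hausdorff_loss import HausdorffDTLoss
from monai.networks.utils import one_hot

B, C, H, W = 16, 5, 512, 512
loss_fn = HausdorffDTLoss(include_background=True, reduction="none", softmax=True)

for i in range(100):
    torch.cuda.reset_peak_memory_stats()  # start a fresh per-step peak
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    loss = loss_fn(input.to("cuda"), target.to("cuda"))
    step_peak = torch.cuda.max_memory_allocated() / 1e9
    still_held = torch.cuda.memory_allocated() / 1e9  # memory not yet released after the step
    print(f"step {i}: peak {step_peak:.2f} GB, still allocated {still_held:.2f} GB")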

It looks like @KumoLiu got a very different graph from the one I received.

I just tested this myself, and I also get an increase in GPU memory usage with each step when running your script (Windows 11, WSL2, MONAI 1.3.0, PyTorch 2.2.1+cuda12.1, Python 3.11.8). If I add a gc.collect() after the assert, though, the memory usage stays constant, i.e. the script becomes:

import gc

import numpy as np
import torch
import matplotlib.pyplot as plt
from monai.networks.utils import one_hot
from monai.losses.hausdorff_loss import HausdorffDTLoss

gpu_consumption = []
steps = []

for i in range(0, 10):
    B, C, H, W = 16, 5, 512, 512
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    self = HausdorffDTLoss(include_background=True, reduction="none", softmax=True)
    loss = self(input.to("cuda"), target.to("cuda"))
    assert np.broadcast_shapes(loss.shape, input.shape) == input.shape
    gc.collect()
    memory_consumption = torch.cuda.max_memory_allocated(device=None) / (1e9)
    gpu_consumption.append(memory_consumption)
    steps.append(i)
    print(f"GPU max memory allocated: {memory_consumption} GB")

plt.plot(steps, gpu_consumption)
plt.title("GPU consumption (in GB) vs. Steps")
plt.show()

This seems to indicate some problem with recognizing unused tensors. Maybe there is an issue with cupy/cucim interoperability and the weakrefs it creates? I tried to debug this using the PyTorch profiler, but unfortunately, as soon as you enable stack traces, there seems to be a bug in PyTorch/Kineto profiling that writes invalid JSON traces. For reference, this is what I used to test:

import gc

import numpy as np
import matplotlib.pyplot as plt
from monai.networks.utils import one_hot
from monai.losses.hausdorff_loss import HausdorffDTLoss
from torch.profiler import profile, ProfilerActivity
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data

gpu_consumption = []
steps = []


def calculate():
    B, C, H, W = 16, 5, 512, 512
    input = torch.rand(B, C, H, W)
    target_idx = torch.randint(low=0, high=C - 1, size=(B, H, W)).long()
    target = one_hot(target_idx[:, None, ...], num_classes=C)
    self = HausdorffDTLoss(include_background=True, reduction="none", softmax=True)
    loss = self(input.to("cuda"), target.to("cuda"))
    assert np.broadcast_shapes(loss.shape, input.shape) == input.shape
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # Enabling with stack creates bad traces
    # with_stack=True,
    profile_memory=True,
    record_shapes=True,
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./logs/memleak'),
) as prof:
    for i in range(0, 50):
        prof.step()
        calculate()
        # Adding this line fixes the memory leak
        # gc.collect()
        memory_consumption = torch.cuda.max_memory_allocated(device=None) / (1e9)
        gpu_consumption.append(memory_consumption)
        steps.append(i)
        print(f"GPU max memory allocated: {memory_consumption} GB")
prof.export_chrome_trace("trace.json")
plt.plot(steps, gpu_consumption)
plt.title("GPU consumption (in GB) vs. Steps")
plt.show()

When running it without stack traces and without the gc.collect() call, the profiler memory view looks like this: [memory-view screenshot]
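
Another way to narrow down what is being kept alive (a sketch, not something from the thread above): enumerate the CUDA tensors the garbage collector still tracks after a step, before calling gc.collect(); anything listed here that should already have been freed is a candidate for a reference cycle:

import gc

import torch

def live_cuda_tensors():
    """List (type, shape, dtype) of CUDA tensors still tracked by the garbage collector."""
    found = []
    for obj in gc.get_objects():
        try:
            if isinstance(obj, torch.Tensor) and obj.is_cuda:
                found.append((type(obj).__name__, tuple(obj.shape), obj.dtype))
        except Exception:
            # Some gc-tracked objects raise on attribute access; skip them.
            pass
    return found

for entry in live_cuda_tensors():
    print(entry)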

Hi @SarthakJShetty-path, I used the same code you shared; the graph looks like this: [screenshot]

What are your PyTorch and MONAI versions?