vision: problem with RAM allocation in FasterRCNN

Hello. First of all, I'd like to thank you for adding object detection models to torchvision; it's a great help for the community.

However, I encountered a problem while trying to use them. I copied the example code from https://github.com/pytorch/vision/blob/3d5610391eaef38ae802ffe8b693ac17b13bd5d1/torchvision/models/detection/faster_rcnn.py#L102-L140 into a Jupyter notebook and noticed that each execution of model(x) (on CPU) grabs more than 2 GB of RAM that is not released afterwards. Running del model does not release the RAM; only restarting the kernel does.

I ran into the same problem with a model defined in the following way:

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
# num_classes is the number of classes in your dataset (including background)
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

as stated in https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html

What can I do to get rid of this problem? Thanks in advance.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 22 (11 by maintainers)

Most upvoted comments

@buus2 this is not a leak, and you shouldn’t face OOM errors because of that.

As workarounds, use torch.no_grad() for inference, and consider using jemalloc when running your programs.
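
For inference, the no_grad pattern is simply the following (a minimal sketch; model and x stand for the detection model and the list of input images from the original post):

import torch

model.eval()                 # inference mode for the detection heads
with torch.no_grad():        # no autograd graph is kept, so activations are freed right away
    predictions = model(x)   # a list of dicts with 'boxes', 'labels' and 'scores'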

Ok, I think I’ve found the problem (and the solution).

Funnily enough, the same thing happened to me four years ago in https://github.com/torch/torch7/issues/229, and the solution from that thread (https://github.com/torch/torch7/issues/229#issuecomment-102870888) applies here.

“malloc’ed memory is not always released back to the OS, especially for small object sizes. (Larger allocations will use mmap() / munmap() directly.) The memory can be reused in the same process, though.” - Tudor

“There are better memory allocators out there, such as our own Jason Evans’s jemalloc”, “(jemalloc does make a good faith effort to release memory back to the OS if possible; shrinking the data segment is hard because you can’t defragment without a GC, so jemalloc uses mmap() rather than sbrk())”.

So, if we use jemalloc instead of the default malloc, things should work fine.

Here is an example with malloc:

(segmentation) fmassa@devfair0163:~/work/video_loader$ python -m memory_profiler leak.py
Filename: leak.py

Line #    Mem usage    Increment   Line Contents
================================================
     4  169.598 MiB  169.598 MiB   @profile
     5                             def run():
     6  371.188 MiB  201.590 MiB       model = torchvision.models.resnet50(pretrained=True).eval()
     7  389.578 MiB   18.391 MiB       x = torch.rand(32, 3, 224, 224)
     8 2499.969 MiB 2110.391 MiB       model(x)
     9 2994.383 MiB  494.414 MiB       model(x)
    10 3484.387 MiB  490.004 MiB       model(x)
    11 3925.387 MiB  441.000 MiB       model(x)
    12 4292.887 MiB  367.500 MiB       model(x)
    13 4611.383 MiB  318.496 MiB       model(x)
    14 4856.387 MiB  245.004 MiB       model(x)
    15 5101.383 MiB  244.996 MiB       model(x)
    16 5370.887 MiB  269.504 MiB       model(x)
    17 5319.277 MiB    0.000 MiB       model(x)

and we see that the memory seems to increase over time. Now let’s use jemalloc:

(segmentation) fmassa@devfair0163:~/work/video_loader$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 python -m memory_profiler leak.py
Filename: leak.py

Line #    Mem usage    Increment   Line Contents
================================================
     4  171.730 MiB  171.730 MiB   @profile
     5                             def run():
     6  374.852 MiB  203.121 MiB       model = torchvision.models.resnet50(pretrained=True).eval()
     7  393.215 MiB   18.363 MiB       x = torch.rand(32, 3, 224, 224)
     8 1222.484 MiB  829.270 MiB       model(x)
     9 1227.727 MiB    5.242 MiB       model(x)
    10 1226.277 MiB    0.000 MiB       model(x)
    11 1230.836 MiB    4.559 MiB       model(x)
    12 1225.773 MiB    0.000 MiB       model(x)
    13 1231.008 MiB    5.234 MiB       model(x)
    14 1210.758 MiB    0.000 MiB       model(x)
    15 1230.848 MiB   20.090 MiB       model(x)
    16 1215.566 MiB    0.000 MiB       model(x)
    17 1213.285 MiB    0.000 MiB       model(x)

And now we see that the memory usage stays roughly constant.

TL;DR: this is not a leak in PyTorch nor torchvision, but instead a known (and unintuitive) behavior of malloc.
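
For reference, here is what leak.py looks like, reconstructed from the profiler output above (the trailing run() call is an assumption; the profile decorator is injected as a builtin when the script is run through python -m memory_profiler, so it needs no import):

import torch
import torchvision


@profile  # provided by `python -m memory_profiler leak.py`
def run():
    model = torchvision.models.resnet50(pretrained=True).eval()
    x = torch.rand(32, 3, 224, 224)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)
    model(x)


if __name__ == "__main__":
    run()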

@buus2 as I mentioned before, just running the model forward a number of times didn't show any leak for me, even without torch.no_grad(), so maybe you are holding references to the outputs in your code? Note that if you do

outputs = []
for i in range(10):
    # each appended output keeps its autograd graph alive, so memory grows every iteration
    outputs.append(model(input))

this will hold the full computational graph of every forward pass in memory, and will look like a memory leak.

I'm closing this as it doesn't seem to be an issue with the model itself, but let me know if you still face problems.

@fmassa you are right, with no_grad the RAM usage no longer grows. Could you please suggest a workaround for the forward pass during training?

@buus2 can you try

with torch.no_grad():
    output = model(x)

and report back? There might be a reference that's kept in the forward pass that might need to be fixed somewhere.

The example model in the documentation is not optimized for runtime at all, and indeed uses a lot of memory in the rpn_head because it has a huge convolution there (1280 input channels and 1280 output channels, applied to a large input!). You need to use a different rpn_head to make it more memory efficient (for example, one whose convolution goes from 1280 channels down to 128 or so), as sketched below.
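
A rough sketch of what that could look like, following the documented MobileNetV2 example but with a slimmer RPN head (the SmallRPNHead class and the 128-channel figure are illustrative, not part of torchvision's API):

import torch
from torch import nn
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Same interface as torchvision's RPNHead, but the 3x3 conv reduces the 1280
# backbone channels to 128 before predicting objectness and box deltas.
class SmallRPNHead(nn.Module):
    def __init__(self, in_channels, mid_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls_logits = nn.Conv2d(mid_channels, num_anchors, kernel_size=1)
        self.bbox_pred = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        logits, bbox_reg = [], []
        for feature in features:  # one entry per feature map
            t = torch.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg

backbone = torchvision.models.mobilenet_v2(pretrained=True).features
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
num_anchors = anchor_generator.num_anchors_per_location()[0]

roi_pooler = MultiScaleRoIAlign(featmap_names=['0'],  # use [0] on older torchvision releases
                                output_size=7, sampling_ratio=2)

model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   rpn_head=SmallRPNHead(1280, 128, num_anchors),
                   box_roi_pool=roi_pooler)

Whether a 128-channel head still gives acceptable accuracy is something you'd have to verify empirically; this only addresses the memory footprint of the RPN.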

The detection / instance segmentation models in general use a minimum image size of 800 pixels, and if you are running them on the CPU, they could dispatch to inefficient CPU kernels for the convolutions, depending on how you installed PyTorch.

If you need to run it on smaller devices, try reducing the image size via min_size / max_size (see https://github.com/pytorch/vision/blob/3d5610391eaef38ae802ffe8b693ac17b13bd5d1/torchvision/models/detection/faster_rcnn.py#L57-L58), for example as shown below.
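
For example (a sketch; the sizes below are just illustrative), the extra keyword arguments are forwarded to the underlying FasterRCNN and its transform:

import torchvision

# min_size / max_size control how images are resized before the backbone;
# smaller values mean less memory and compute, at the cost of accuracy.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True, min_size=320, max_size=640).eval()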

I’m closing this issue, but let me know if you still face the same problems.