vision: Torchvision decode_jpeg memory leak
🐛 Describe the bug
nvJPEG leaks GPU memory and eventually fails with an OOM after ~1-2k images:
```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):  # increase to your liking till the GPU OOMs (:
    img_u8 = read_file('lena.jpg')               # raw JPEG bytes as a uint8 tensor
    img_nv = decode_jpeg(img_u8, device='cuda')  # decodes on the GPU via nvJPEG
```
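The leak appears to live inside nvJPEG itself rather than in PyTorch's caching allocator, so `torch.cuda.memory_allocated()` alone may not show it; watching the device's free memory does. A minimal sketch for observing it (assuming a PyTorch build recent enough to have `torch.cuda.mem_get_info`):

```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):
    img_u8 = read_file('lena.jpg')
    img_nv = decode_jpeg(img_u8, device='cuda')
    if i % 100 == 0:
        # free device memory shrinks steadily even though the caching
        # allocator's counter stays roughly flat
        free_b, total_b = torch.cuda.mem_get_info()
        print(f"iter {i}: free {free_b / 2**20:.0f} MiB, "
              f"caching allocator {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
```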
Probably related to the first response to https://github.com/pytorch/vision/issues/3848. `RuntimeError: nvjpegDecode failed: 5` is exactly the message you get after the OOM.
Versions
```
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 11.1.0
Clang version: 12.0.1
CMake version: version 3.21.1
Libc version: glibc-2.33

Python version: 3.8.7 (default, Jan 19 2021, 18:48:37) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-5.13.8-arch1-1-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce GTX 1080

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.2.2
/usr/lib/libcudnn_adv_infer.so.8.2.2
/usr/lib/libcudnn_adv_train.so.8.2.2
/usr/lib/libcudnn_cnn_infer.so.8.2.2
/usr/lib/libcudnn_cnn_train.so.8.2.2
/usr/lib/libcudnn_ops_infer.so.8.2.2
/usr/lib/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] adabelief-pytorch==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.4.5
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect
```
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 26 (6 by maintainers)
I just checked whether this was fixed in the PyTorch nightly with CUDA 11.6, but I'm still experiencing a memory leak.

```
python -m pip install torch torchvision --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu116
```
It seems that this problem has been solved. My environment is as follows:

- System: Ubuntu 22.04
- NVIDIA-SMI: 535.86.05
- Driver Version: 535.86.05
- CUDA Version: 12.2
- torch version: 2.0.1+cu118
- torchvision version: 0.15.2+cu118
Finally, after waiting for over a year. 😃
Memory still leaks on torchvision 0.14.0+cu117 (`torchvision-0.14.0+cu117-cp37-cp37m-win_amd64.whl`). When will this be fixed?
Easy to reproduce:
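A minimal loop of the kind that triggers it (a sketch along the lines of the original report; the filename is a placeholder):

```python
import torch
from torchvision.io import read_file, decode_jpeg

# 'test.jpg' is a placeholder; any JPEG triggers it. Device memory grows
# steadily with each iteration until the process OOMs.
data = read_file('test.jpg')
for i in range(10_000):
    img = decode_jpeg(data, device='cuda')
```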
Memory leaks didn't happen when using pynvjpeg 0.0.13, which seems to be built with CUDA 10.2.
Hi,

I am using:
- pytorch 1.11.0+cu113
- ubuntu 20.04 LTS
- python 3.9

I did replace `libnvjpeg.90286a3c.so.11` with the `.so` from CUDA 11.6.2; however, the memory keeps growing indefinitely.
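When swapping the library like this, it may be worth confirming which `libnvjpeg` the process actually mapped, since the old one can still be picked up. A Linux-only sketch (the filename is a placeholder; any JPEG works):

```python
import torch
from torchvision.io import read_file, decode_jpeg

# Force the nvJPEG-backed code path so the library gets loaded.
img = decode_jpeg(read_file('lena.jpg'), device='cuda')

# Print every mapped segment whose path mentions nvjpeg; the path shows
# which copy of the library the dynamic linker actually resolved.
with open('/proc/self/maps') as f:
    for line in f:
        if 'nvjpeg' in line:
            print(line.strip())
```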
It seems that there is a small multithreading confusion here: https://github.com/pytorch/vision/blob/82f9a187681dade1620bc06b85f317a11aea40dc/torchvision/csrc/io/image/cuda/decode_jpeg_cuda.cpp#L74

The `nvjpeg_handle_creation_flag` should be global, not local: a flag that is recreated on every call means the handle-creation code runs on every call instead of exactly once, so each call leaks a fresh nvJPEG handle.
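As a purely illustrative Python analogue of that pattern (the real code is the C++ `std::call_once` in `decode_jpeg_cuda.cpp`, and `create_handle` below is a hypothetical stand-in for the nvJPEG handle creation):

```python
import threading

def create_handle():
    # Hypothetical stand-in for nvJPEG handle creation; imagine each
    # call allocating device resources that are never freed.
    return object()

def decode_buggy():
    created = False                # flag recreated on every call...
    if not created:
        handle = create_handle()   # ...so this runs every call: a leak
        created = True

# Giving the flag/handle global lifetime makes initialization run exactly
# once, no matter how many calls or threads there are.
_lock = threading.Lock()
_handle = None

def decode_fixed():
    global _handle
    with _lock:
        if _handle is None:
            _handle = create_handle()  # runs exactly once
    return _handle
```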
I had a chance to look at this more: this is an nvJPEG bug, and unfortunately I'm not sure we can do much about it. It was fixed in CUDA 11.6, but I'm still observing the leak with 11.0-11.5.
A temporary fix for Linux users is to download the 11.6 `libnvjpeg.so` (e.g. from here) and tell `ld` to use it instead of whatever you currently have installed, using `LD_LIBRARY_PATH`, `LD_PRELOAD`, or something else (e.g. something along the lines of `LD_PRELOAD=/path/to/cuda-11.6/libnvjpeg.so python your_script.py`).

@NicolasHug @fmassa Also having this issue. Tried loading images in a loop using `decode_jpeg` directly to the GPU. It was able to run a number of images, but GPU memory keeps growing until it runs out and fails. Memory is fine on the CPU. Was wondering if there is a timeline for when this will be fixed. Hoping it will be fixed ASAP, as loading directly to the GPU is crucial to getting speeds fast enough to run in real time.
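Until a fixed nvJPEG is available everywhere, one possible stopgap (not an official workaround) is to decode on the CPU, where memory is fine per the comments above, and move the result to the GPU afterwards; slower, but memory stays flat. A minimal sketch, reusing the `lena.jpg` input from the original repro:

```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):
    img_u8 = read_file('lena.jpg')
    img = decode_jpeg(img_u8)   # CPU decode: no nvJPEG, no observed leak
    img = img.to('cuda')        # pay a host-to-device copy instead
```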