vision: Torchvision decode_jpeg memory leak
🐛 Describe the bug
nvJPEG leaks GPU memory and eventually fails with an OOM after ~1-2k images:
```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):  # increase to your liking till the GPU OOMs (:
    img_u8 = read_file('lena.jpg')               # raw JPEG bytes as a uint8 tensor
    img_nv = decode_jpeg(img_u8, device='cuda')  # decodes on the GPU via nvJPEG
```
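The leak appears to live inside nvJPEG itself rather than in PyTorch's caching allocator, so `torch.cuda.memory_allocated()` alone may not show it; watching the device's free memory does. A minimal sketch for observing it (assuming a PyTorch build recent enough to have `torch.cuda.mem_get_info`):

```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):
    img_u8 = read_file('lena.jpg')
    img_nv = decode_jpeg(img_u8, device='cuda')
    if i % 100 == 0:
        # free device memory shrinks steadily even though the caching
        # allocator's counter stays roughly flat
        free_b, total_b = torch.cuda.mem_get_info()
        print(f"iter {i}: free {free_b / 2**20:.0f} MiB, "
              f"caching allocator {torch.cuda.memory_allocated() / 2**20:.0f} MiB")
```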
Probably related to the first response to https://github.com/pytorch/vision/issues/3848. `RuntimeError: nvjpegDecode failed: 5` is exactly the message you get after the OOM.
Versions
```
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Arch Linux (x86_64)
GCC version: (GCC) 11.1.0
Clang version: 12.0.1
CMake version: version 3.21.1
Libc version: glibc-2.33

Python version: 3.8.7 (default, Jan 19 2021, 18:48:37) [GCC 10.2.0] (64-bit runtime)
Python platform: Linux-5.13.8-arch1-1-x86_64-with-glibc2.2.5
Is CUDA available: True
CUDA runtime version: 11.4.48
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce GTX 1080

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/libcudnn.so.8.2.2
/usr/lib/libcudnn_adv_infer.so.8.2.2
/usr/lib/libcudnn_adv_train.so.8.2.2
/usr/lib/libcudnn_cnn_infer.so.8.2.2
/usr/lib/libcudnn_cnn_train.so.8.2.2
/usr/lib/libcudnn_ops_infer.so.8.2.2
/usr/lib/libcudnn_ops_train.so.8.2.2
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] adabelief-pytorch==0.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] pytorch-lightning==1.4.5
[pip3] torch==1.9.0+cu111
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect
```
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 2
- Comments: 26 (6 by maintainers)
I just checked whether this was fixed in the PyTorch nightly with CUDA 11.6, but I'm still experiencing a memory leak.

```
python -m pip install torch torchvision --pre --extra-index-url https://download.pytorch.org/whl/nightly/cu116
```
It seems that this problem has been solved. My environment is as follows:

- System: Ubuntu 22.04
- NVIDIA-SMI: 535.86.05
- Driver Version: 535.86.05
- CUDA Version: 12.2
- torch version: 2.0.1+cu118
- torchvision version: 0.15.2+cu118
Finally, after waiting for over a year. 😃
Memory still leaks on torchvision 0.14.0+cu117 (`torchvision-0.14.0+cu117-cp37-cp37m-win_amd64.whl`). When will this be fixed?
Easy to reproduce:
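A minimal loop of the kind that triggers it (a sketch along the lines of the original report; the filename is a placeholder):

```python
import torch
from torchvision.io import read_file, decode_jpeg

# 'test.jpg' is a placeholder; any JPEG triggers it. Device memory grows
# steadily with each iteration until the process OOMs.
data = read_file('test.jpg')
for i in range(10_000):
    img = decode_jpeg(data, device='cuda')
```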
Memory leaks didn't happen when using pynvjpeg 0.0.13, which seems to be built with CUDA 10.2.
Hi,

I am using:
- pytorch 1.11.0+cu113
- ubuntu 20.04 LTS
- python 3.9

I did replace `libnvjpeg.90286a3c.so.11` with the `.so` from CUDA 11.6.2; however, the memory keeps growing indefinitely.
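When swapping the library like this, it may be worth confirming which `libnvjpeg` the process actually mapped, since the old one can still be picked up. A Linux-only sketch (the filename is a placeholder; any JPEG works):

```python
import torch
from torchvision.io import read_file, decode_jpeg

# Force the nvJPEG-backed code path so the library gets loaded.
img = decode_jpeg(read_file('lena.jpg'), device='cuda')

# Print every mapped segment whose path mentions nvjpeg; the path shows
# which copy of the library the dynamic linker actually resolved.
with open('/proc/self/maps') as f:
    for line in f:
        if 'nvjpeg' in line:
            print(line.strip())
```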
It seems that there is a small multithreading confusion here: https://github.com/pytorch/vision/blob/82f9a187681dade1620bc06b85f317a11aea40dc/torchvision/csrc/io/image/cuda/decode_jpeg_cuda.cpp#L74

The `nvjpeg_handle_creation_flag` should be global, not local: a flag that is recreated on every call means the handle-creation code runs on every call instead of exactly once, so each call leaks a fresh nvJPEG handle.
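As a purely illustrative Python analogue of that pattern (the real code is the C++ `std::call_once` in `decode_jpeg_cuda.cpp`, and `create_handle` below is a hypothetical stand-in for the nvJPEG handle creation):

```python
import threading

def create_handle():
    # Hypothetical stand-in for nvJPEG handle creation; imagine each
    # call allocating device resources that are never freed.
    return object()

def decode_buggy():
    created = False                # flag recreated on every call...
    if not created:
        handle = create_handle()   # ...so this runs every call: a leak
        created = True

# Giving the flag/handle global lifetime makes initialization run exactly
# once, no matter how many calls or threads there are.
_lock = threading.Lock()
_handle = None

def decode_fixed():
    global _handle
    with _lock:
        if _handle is None:
            _handle = create_handle()  # runs exactly once
    return _handle
```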
I had a chance to look at this more: this is an nvJPEG bug, and unfortunately I'm not sure we can do much about it. It was fixed in CUDA 11.6, but I'm still observing the leak with 11.0-11.5.
A temporary fix for Linux users is to download the 11.6 `libnvjpeg.so` (e.g. from here) and tell `ld` to use it instead of whatever you currently have installed, using `LD_LIBRARY_PATH`, `LD_PRELOAD`, or something else (e.g. something along the lines of `LD_PRELOAD=/path/to/cuda-11.6/libnvjpeg.so python your_script.py`).

@NicolasHug @fmassa Also having this issue. Tried loading images in a loop using `decode_jpeg` directly to the GPU. It was able to run a number of images, but GPU memory keeps growing until it runs out and fails. Memory is fine on the CPU. Was wondering if there is a timeline for when this will be fixed. Hoping it will be fixed ASAP, as loading directly to the GPU is crucial to getting speeds fast enough to run in real time.
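Until a fixed nvJPEG is available everywhere, one possible stopgap (not an official workaround) is to decode on the CPU, where memory is fine per the comments above, and move the result to the GPU afterwards; slower, but memory stays flat. A minimal sketch, reusing the `lena.jpg` input from the original repro:

```python
import torch
from torchvision.io import read_file, decode_jpeg

for i in range(1000):
    img_u8 = read_file('lena.jpg')
    img = decode_jpeg(img_u8)   # CPU decode: no nvJPEG, no observed leak
    img = img.to('cuda')        # pay a host-to-device copy instead
```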