CuPy: Multithreaded cuFFT memory leak
Description
The cuFFT plan cache does not appear to free GPU memory when a thread is cleaned up (GC). If I call cp.fft.fft on a worker thread and then exit/join that thread, some GPU memory remains allocated.
To Reproduce
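```python
from concurrent.futures import ThreadPoolExecutor

import cupy as cp


def task():
    # Each call runs on a worker thread and creates a cuFFT plan there.
    data = cp.ones(2**20)
    return cp.fft.fft(data)


while True:
    # Leaving the `with` block joins all worker threads.
    with ThreadPoolExecutor(max_workers=3) as executor:
        tasks = [executor.submit(task) for _ in range(10)]
    # Drop the futures so the FFT results no longer pin pool memory.
    del tasks
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    # Clears the plan cache of the calling (main) thread.
    cp.fft.config.get_plan_cache().clear()
    input("Check nvidia-smi memory usage... Press Enter to run again")
```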
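The per-thread scope of the cache is the crux here. CuPy documents `cupy.fft.config.get_plan_cache()` as returning the per-thread, per-device plan cache, so the `clear()` call in the reproduction loop only reaches the main thread's cache, not the caches created on the worker threads. The pure-Python toy below sketches that situation without a GPU. `ToyPlanCache`, `get_plan_cache`, `task`, `run`, and the 16 MiB plan size are hypothetical stand-ins, and the module-level `_registry` list assumes, as the report suggests, that each worker's cache outlives its thread; it is not CuPy's actual implementation. The `clear_before_exit` path mirrors a workaround available in real code: calling `cp.fft.config.get_plan_cache().clear()` inside `task()` so each worker releases its own plans before returning.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

PLAN_NBYTES = 16 * 2**20  # hypothetical device memory held by one plan


class ToyPlanCache:
    """Hypothetical stand-in for a per-thread, per-device FFT plan cache."""

    def __init__(self):
        self.plans = {}

    def get_plan(self, shape):
        # Reuse a cached plan for this shape, or "allocate" a new one.
        if shape not in self.plans:
            self.plans[shape] = PLAN_NBYTES
        return self.plans[shape]

    def memsize(self):
        return sum(self.plans.values())

    def clear(self):
        self.plans.clear()


_local = threading.local()
# Assumption: something keeps every cache alive after its thread exits.
# This list plays that role and lets us inspect the leftovers.
_registry = []
_registry_lock = threading.Lock()


def get_plan_cache():
    # Like the real API, this returns the *calling thread's* cache.
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = ToyPlanCache()
        _local.cache = cache
        with _registry_lock:
            _registry.append(cache)
    return cache


def task(clear_before_exit):
    get_plan_cache().get_plan((2**20,))
    if clear_before_exit:
        # Workaround: free this thread's plans while still on this thread.
        get_plan_cache().clear()


def run(clear_before_exit):
    """Return the bytes still held by plan caches after the pool shuts down."""
    with _registry_lock:
        _registry.clear()
    with ThreadPoolExecutor(max_workers=3) as executor:
        for _ in range(10):
            executor.submit(task, clear_before_exit)
    # Mirrors the reproduction: only the main thread's cache is cleared.
    get_plan_cache().clear()
    with _registry_lock:
        return sum(cache.memsize() for cache in _registry)


if __name__ == "__main__":
    print("left over, no workaround:", run(clear_before_exit=False))
    print("left over, clear in worker:", run(clear_before_exit=True))
```

With `clear_before_exit=False` the leftover total is a multiple of `PLAN_NBYTES`, one per worker thread the pool actually started (at most three); with `True` it is zero.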
Installation
Wheel (pip install cupy-***)
Environment
OS : Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.29
Python Version : 3.8.10
CuPy Version : 10.0.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 1.21.5
SciPy Version : 1.7.3
Cython Build Version : 0.29.24
Cython Runtime Version : None
CUDA Root : /usr/local/cuda
nvcc PATH : None
CUDA Build Version : 11040
CUDA Driver Version : 11040
CUDA Runtime Version : 11040
cuBLAS Version : (available)
cuFFT Version : 10502
cuRAND Version : 10205
cuSOLVER Version : (11, 2, 0)
cuSPARSE Version : (available)
NVRTC Version : (11, 4)
Thrust Version : 101201
CUB Build Version : 101201
Jitify Build Version : 60e9e72
cuDNN Build Version : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version : 21104
NCCL Runtime Version : 21104
cuTENSOR Version : None
cuSPARSELt Build Version : None
Device 0 Name : Quadro RTX 5000
Device 0 Compute Capability : 75
Device 0 PCI Bus ID : 0000:01:00.0
Additional Information
The runtime is the Docker image nvidia/cuda:11.4.2-runtime-ubuntu20.04.
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (15 by maintainers)
OK. I tested it.
cupy==10.4.0 runs out of memory with the script above, but your patched branch does not. I don't have a real-world test because I already patched my projects to work around this issue. Thanks @leofang!

I tried to test it, but I am having trouble compiling! I keep getting compiler errors (related to syntax, not to linking or missing libraries). Maybe I should just clone the conda-forge cupy-feedstock and build in their Docker container!
@leofang minimal working example.