cupy: Multithreaded cufft memory leak

Description

The cuFFT plan cache does not appear to deallocate GPU memory when a thread is cleaned up (garbage collected). If I launch cp.fft.fft on a thread and then exit/join the thread, residual memory remains allocated on the GPU.
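As far as I can tell, the plan cache is per thread and per device, so each worker thread ends up with its own cache that the main thread cannot reach. A quick sketch (my assumption about how to check this, not something from the repro itself) to confirm that a worker sees a different cache object than the main thread:

from concurrent.futures import ThreadPoolExecutor

import cupy as cp

# get_plan_cache() returns the plan cache for the current thread and device,
# so the object obtained on a worker thread is distinct from the main
# thread's cache.
main_cache = cp.fft.config.get_plan_cache()
with ThreadPoolExecutor(max_workers=1) as ex:
    worker_cache = ex.submit(cp.fft.config.get_plan_cache).result()
print(main_cache is worker_cache)  # expected: False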

To Reproduce

from concurrent.futures import ThreadPoolExecutor
import cupy as cp

def task():
    data = cp.ones(2**20)
    return cp.fft.fft(data)

while True:
    # Run the FFTs on worker threads; the executor joins them on exit.
    with ThreadPoolExecutor(max_workers=3) as executor:
        tasks = [executor.submit(task) for _ in range(10)]

    # Free everything reachable from the main thread. Note that
    # get_plan_cache() is per thread, so this clears only the main
    # thread's cache, not the caches created by the workers.
    cp.get_default_memory_pool().free_all_blocks()
    cp.get_default_pinned_memory_pool().free_all_blocks()
    cp.fft.config.get_plan_cache().clear()

    input("Check nvidia-smi memory usage... Press Enter to run again")
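To put numbers on this without watching nvidia-smi, a helper along these lines (a rough sketch; memGetInfo() reports free/total device memory in bytes) can replace the input() prompt:

def report_memory():
    # Device memory as seen by the CUDA runtime, plus what CuPy's default
    # memory pool is still holding; usage that survives free_all_blocks()
    # and the plan-cache clear above is the residual leak.
    free, total = cp.cuda.runtime.memGetInfo()
    pool = cp.get_default_memory_pool()
    print(f"device used: {(total - free) / 2**20:.1f} MiB, "
          f"pool held: {pool.total_bytes() / 2**20:.1f} MiB")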

Installation

Wheel (pip install cupy-***)

Environment

OS                           : Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.29
Python Version               : 3.8.10
CuPy Version                 : 10.0.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.21.5
SciPy Version                : 1.7.3
Cython Build Version         : 0.29.24
Cython Runtime Version       : None
CUDA Root                    : /usr/local/cuda
nvcc PATH                    : None
CUDA Build Version           : 11040
CUDA Driver Version          : 11040
CUDA Runtime Version         : 11040
cuBLAS Version               : (available)
cuFFT Version                : 10502
cuRAND Version               : 10205
cuSOLVER Version             : (11, 2, 0)
cuSPARSE Version             : (available)
NVRTC Version                : (11, 4)
Thrust Version               : 101201
CUB Build Version            : 101201
Jitify Build Version         : 60e9e72
cuDNN Build Version          : (not loaded; try `import cupy.cuda.cudnn` first)
cuDNN Version                : (not loaded; try `import cupy.cuda.cudnn` first)
NCCL Build Version           : 21104
NCCL Runtime Version         : 21104
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : Quadro RTX 5000
Device 0 Compute Capability  : 75
Device 0 PCI Bus ID          : 0000:01:00.0

Additional Information

Running inside the Docker image nvidia/cuda:11.4.2-runtime-ubuntu20.04.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

OK. I tested it. cupy=10.4.0 runs out of memory with the script above, but your patched branch does not run out of memory. I don't have a real-world test because I already patched my projects to work around this issue. Thanks @leofang! πŸ˜„

I tried to test it, but I am having trouble compiling! I keep getting compiler errors (syntax related, not linking or missing libraries). Maybe I should just clone the conda-forge cupy-feedstock and build in their Docker container! πŸ˜†

@leofang Here is a minimal working example:

from concurrent.futures import ThreadPoolExecutor

import cupy as cp

def allocate(index):
    return cp.random.rand(5, 256, 256)

def do_fft(x, index):
    return cp.fft.fft2(x)

def gather(x, index):
    return cp.asnumpy(x)

def main():

    num_device = 2
    indices = list(range(num_device))

    # Run one of the two loops at a time to compare.

    # Runs out of memory: the plan caches created on the pool's worker
    # threads are never released.
    for _ in range(1000000):
        with ThreadPoolExecutor(num_device) as pool:
            inputs = pool.map(allocate, indices)
            transformed = pool.map(do_fft, inputs, indices)
            outputs = list(pool.map(gather, transformed, indices))

    # Will not run out of memory: the same work on the main thread.
    # list() forces the lazy built-in map to actually execute.
    for _ in range(1000000):
        inputs = map(allocate, indices)
        transformed = map(do_fft, inputs, indices)
        outputs = list(map(gather, transformed, indices))


if __name__ == "__main__":
    main()
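
For reference, one possible per-thread workaround (a sketch, and an assumption that paying the plan re-creation cost is acceptable) is to have each worker clear its own thread-local plan cache before the thread is torn down:

def do_fft(x, index):
    out = cp.fft.fft2(x)
    # Workaround sketch: clear (or disable entirely via set_size(0)) this
    # thread's own plan cache so cached cuFFT plans do not outlive the
    # worker thread.
    cp.fft.config.get_plan_cache().clear()
    return out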