cupy: Get segmentation fault with FFT callback function

pr.py

import numpy as np
import cupy as cp

code = r'''
__device__ cufftComplex CB_ConvertInput(
    void *dataIn,
    size_t offset,
    void *callerInfo,
    void *sharedPtr) {
            return make_float2(0.0, 0.1);
}

__device__ cufftCallbackLoadC d_loadCallbackPtr = CB_ConvertInput;
'''

in_arr = cp.random.rand(12, dtype=np.float32)

with cp.fft.config.set_cufft_callbacks(cb_load=code):
    out_arr = cp.fft.fft(in_arr)

print(out_arr)

Run with

CUDA_VISIBLE_DEVICES=0 python pr.py

Then I got

[1]    410392 segmentation fault (core dumped)  CUDA_VISIBLE_DEVICES=0 python pr.py

My system information:

OS                           : Linux-5.12.8-arch1-1-x86_64-with-glibc2.33
Python Version               : 3.9.5
CuPy Version                 : 9.1.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 1.20.0
SciPy Version                : 1.6.3
Cython Build Version         : 0.29.23
Cython Runtime Version       : 0.29.23
CUDA Root                    : /opt/cuda
nvcc PATH                    : /opt/cuda/bin/nvcc
CUDA Build Version           : 11030
CUDA Driver Version          : 11030
CUDA Runtime Version         : 11030
cuBLAS Version               : 11402
cuFFT Version                : 10402
cuRAND Version               : 10204
cuSOLVER Version             : (11, 1, 1)
cuSPARSE Version             : 11500
NVRTC Version                : (11, 3)
Thrust Version               : 101100
CUB Build Version            : 101100
Jitify Build Version         : 60e9e72
cuDNN Build Version          : 8200
cuDNN Version                : 8200
NCCL Build Version           : 20906
NCCL Runtime Version         : 20906
cuTENSOR Version             : None
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GeForce GTX 1080
Device 0 Compute Capability  : 61
Device 0 PCI Bus ID          : 0000:17:00.0
Device 1 Name                : NVIDIA Quadro K600
Device 1 Compute Capability  : 30
Device 1 PCI Bus ID          : 0000:65:00.0

About this issue

State: closed
Created 3 years ago
Reactions: 1
Comments: 23 (12 by maintainers)

Most upvoted comments

@leofang My system is Arch Linux. CuPy is installed from archlinuxcn/python-cupy. After deleting the .cupy folder, I got a warning before the segmentation fault:

nvprune warning : No device code that matched architecture, so stripped out all device code

szsdk on May 31, 2021

@leofang Thanks, everything works now on the 980 workstation.

szsdk on Jun 3, 2021

@leofang I tested both of my 1080 machines, and the new patch (cf78fbf86eec6a8a4baef2d5628bb96130c4879) works on both. Cheers. However, I cannot compile cupy from source on my 980 machine: the build complains that cudnn.h is missing. The dependency on cudnn is optional, isn’t it?

szsdk on Jun 2, 2021

Sure. This is my output.

# make
>>> GCC Version is greater or equal to  5.3.0 <<<
/opt/cuda/bin/nvcc -ccbin g++ -I../../common/inc -m64 -dc -std=c++11 --threads 0 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o simpleCUFFT_callback.o -c simpleCUFFT_callback.cu
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/opt/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o simpleCUFFT_callback simpleCUFFT_callback.o -L/usr/lib/x86_64-linux-gnu -L/usr/lib64 -lcufft_static -lculibos
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp simpleCUFFT_callback ../../bin/x86_64/linux/release

# ./simpleCUFFT_callback
[simpleCUFFT_callback] is starting...
GPU Device 0: "Pascal" with compute capability 6.1

Transforming signal cufftExecC2C
Transforming signal back cufftExecC2C

szsdk on Jun 2, 2021

Let me clarify. This is another NVIDIA 1080 card. I installed cupy with pip install cupy. The version of cupy is 9.1.0, without your patch. This is my test code.

import numpy as np
import cupy as cp

code = r'''
__device__ cufftComplex CB_ConvertInput(
    void *dataIn,
    size_t offset,
    void *callerInfo,
    void *sharedPtr) {
            return make_float2(1.0, 0.0);
}

__device__ cufftCallbackLoadC d_loadCallbackPtr = CB_ConvertInput;
'''

in_arr = cp.random.rand(12, dtype=np.float32)

with cp.fft.config.set_cufft_callbacks(cb_load=code):
    out_arr = cp.fft.fft(in_arr)

print(cp.__version__)
print(out_arr)
print(cp.fft.fft(cp.ones(12)))

This is the output.

/home/szsdk/anaconda3/lib/python3.8/site-packages/cupy/fft/_fft.py:149: UserWarning: cuFFT plan cache is disabled on CUDA 11.1 due to a known bug, so performance may be degraded. The bug is fixed on CUDA 11.2+.
  cache = get_plan_cache()
9.1.0
[0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j 0.+0.j
 0.+0.j 0.+0.j]
[12.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j  0.+0.j
  0.+0.j  0.+0.j  0.+0.j]

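I do not think I get the correct answer. From my understanding, the load callback replaces every input element with 1+0j, so the first result should match the last line (the FFT of an all-ones array). I am new to callback functions, so please correct me if I am wrong.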
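The expectation above can be checked on the CPU. This is a minimal sketch that uses NumPy instead of cuFFT and assumes a load callback simply replaces every element the FFT reads with a constant; `emulate_load_callback` is a hypothetical helper written for illustration, not part of CuPy or cuFFT.

```python
import numpy as np

n = 12
rng = np.random.default_rng(0)
in_arr = rng.random(n).astype(np.float32)


def emulate_load_callback(data, value):
    # Hypothetical CPU stand-in for the device callback: the input data is
    # ignored and every element is replaced by the constant `value`.
    return np.full(data.shape, value, dtype=np.complex64)


# Callback returning make_float2(1.0, 0.0): same as the FFT of an all-ones array.
expected_ones = np.fft.fft(emulate_load_callback(in_arr, 1.0 + 0.0j))

# Callback returning make_float2(0.0, 0.1), as in the original report.
expected_imag = np.fft.fft(emulate_load_callback(in_arr, 0.1j))

print(expected_ones)
print(expected_imag)
```

Under that assumption only the first (DC) bin is non-zero, so an all-zero result from the callback run indicates that the callback output never reached the transform.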

szsdk on Jun 1, 2021

@leofang I have access to another NVIDIA 1080 machine. I get the same warning (cupy 9.1.0), but no segfault.

szsdk on Jun 1, 2021

Thanks, @szsdk. I will send a patch to fix this.

On CUDA 11.3:

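import os
import subprocess
import cupy as cp


# Try to prune the static cuFFT library for every architecture NVRTC supports.
sm_list = cp.cuda.nvrtc.getSupportedArchs()
print(sm_list)
for i in sm_list:
    print(f"on sm_{i}...")
    # Output is not captured, so any nvprune warning goes straight to the terminal.
    p = subprocess.run(['nvprune', '-arch', f'sm_{i}', '/usr/local/cuda-11.3/lib64/libcufft_static.a', '-o', 'temp.a'], env=os.environ)
    # p.stderr is always None here (nothing is captured), so "OK" follows every run.
    if p.stderr:
        print(p.stderr)
    else:
        print("OK")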

Output:

(35, 37, 50, 52, 53, 60, 61, 62, 70, 72, 75, 80, 86)
on sm_35...
OK
on sm_37...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_50...
OK
on sm_52...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_53...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_60...
OK
on sm_61...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_62...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_70...
OK
on sm_72...
nvprune warning : No device code that matched architecture, so stripped out all device code
OK
on sm_75...
OK
on sm_80...
OK
on sm_86...
OK

leofang on May 31, 2021
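The nvprune output above shows that the static cuFFT library in CUDA 11.3 carries device code only for some architectures (35, 50, 60, 70, 75, 80, 86), which is why pruning for sm_61 on a GTX 1080 strips everything. The sketch below illustrates one way to detect that case and pick a usable architecture. It is not the actual CuPy patch: `has_device_code` and `pick_fallback_arch` are hypothetical helpers, and the fallback rule (highest available architecture with the same major version that does not exceed the device's compute capability) is an assumption based on CUDA's binary compatibility within a major version.

```python
import subprocess

NO_MATCH = "No device code that matched architecture"


def has_device_code(lib_path, arch, out_path="temp.a"):
    """Return True if nvprune keeps device code for sm_<arch>.

    Requires nvprune on PATH. Output is captured so the warning can be
    inspected; both streams are checked because the thread does not say
    which one nvprune writes to.
    """
    p = subprocess.run(
        ["nvprune", "-arch", f"sm_{arch}", lib_path, "-o", out_path],
        capture_output=True, text=True)
    return NO_MATCH not in (p.stdout + p.stderr)


def pick_fallback_arch(device_cc, archs_with_code):
    """Hypothetical rule: highest arch <= device_cc with the same major version."""
    candidates = [a for a in archs_with_code
                  if a // 10 == device_cc // 10 and a <= device_cc]
    return max(candidates) if candidates else None


# Architectures that produced no warning in the output above.
archs_with_code = [35, 50, 60, 70, 75, 80, 86]

print(pick_fallback_arch(61, archs_with_code))  # 60
print(pick_fallback_arch(86, archs_with_code))  # 86
```

With this rule a compute capability 6.1 device would be pruned for sm_60 instead of sm_61, while a device with no matching major version (for example compute capability 3.0) gets `None` and has to be handled separately.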