whisper.cpp: CUDA an illegal memory access was encountered

When I try to start any large model I have this error:

CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered

my GPU is 1080 Ti (11GB vRAM). model base.en works.

openai-whisper with large-v2 works without problem using GPU.

I also tried to run quantize model… q5_0 have same problem… q4_0 start but I don’t have any output 😕

on dmesg:

[12337.297750] NVRM: Xid (PCI:0000:04:00): 31, pid=167607, name=main, Ch 00000008, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_8 faulted @ 0x7ffb_7fe03000. Fault is of type FAULT_PDE ACCESS_TYPE_READ

example run:

# ./main -debug -m models/ggml-large-v3.bin -l pl   janina.wav 
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080 Ti, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 30 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'janina.wav' (50387302 samples, 3149.2 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = pl, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0

About this issue

  • Original URL
  • State: open
  • Created 7 months ago
  • Reactions: 4
  • Comments: 22 (12 by maintainers)

Most upvoted comments

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

That fixed the issue for my GTX 1060. Thanks!

Any luck with the latest version on master? There have been some changes to the CUDA backend which might have fixed the issue

Testing on Kaggle with Nvidia P100

Latest commit as of today i.e. - f0efd02

The issue seems to affect all models probably (tested large-v3, large-v2, medium, base, tiny) in different ways. For the large models it gives CUDA 700 error and for the smaller models its gives gibberish output.

I tested in Kaggle so that others without access to older gen GPUs can also reproduce the issue.

For model - large-v3

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

Same result for large-v2

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.50 MB
whisper_model_load: model size    = 3117.02 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   30.92 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered
current device: 0

The error is not present in medium, base and tiny models but gibberish outputs all cases -

Medium model-

whisper_init_from_file_with_params_no_state: loading model from '/kaggle/working/whisper.cpp/models/ggml-medium.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head  = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1024
whisper_model_load: n_text_head   = 16
whisper_model_load: n_text_layer  = 24
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 4 (medium)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  1551.93 MB
whisper_model_load: model size    = 1551.57 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  132.12 MB
whisper_init_state: kv cross size =  147.46 MB
whisper_init_state: compute buffer (conv)   =   25.54 MB
whisper_init_state: compute buffer (encode) =  170.21 MB
whisper_init_state: compute buffer (cross)  =    7.78 MB
whisper_init_state: compute buffer (decode) =   98.25 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:30.000]  ,,,,,,,,,,.,,,,.,,,,,,,,,,,.,,.,,Inaudible,,,,,,,,,...,,,,,,,,,,,,,,,,, now,,,,,,.,.,,,,,,,,,..,,,,,,,.,,,,,.ING,,,,,,,....,,,,,.,,,,,,,,,,,,,,,,,,,,..,,,,,,,,,,,,,,,.,,,,,,,,,,,,,..,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,:,,,.,,,


whisper_print_timings:     load time =  1267.49 ms
whisper_print_timings:     fallbacks =   5 p /   2 h
whisper_print_timings:      mel time =    28.92 ms
whisper_print_timings:   sample time =  6398.09 ms /  6248 runs (    1.02 ms per run)
whisper_print_timings:   encode time =   234.05 ms /     1 runs (  234.05 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time = 22076.14 ms /  6236 runs (    3.54 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time = 30023.55 ms

Base model -

[00:00:00.000 --> 00:00:30.000]   I I T I I B I I To I I

Tiny model -

[00:00:00.500 --> 00:00:30.000]   ' " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " "

I am hitting this with large-v1 in f16 and q5_1:

CUDA error 700 at ggml-cuda.cu:8335: an illegal memory access was encountered

I do not hit this with q8_0 but I get gibberish. This model works fine on 4774d2fe (with -arch=native for modern CUDA). Will bisect.

edit: ec7a6f04f9c32adec2e6b0995b8c728c5bf56f35 works fine on my P40. b0502836b82944d444f30ea1b5217f69ff6da71b (#1472) either hangs (apparently doing nothing with 100% GPU usage), or crashes with the above assertion failure on line 8224.

Yeah, I’m sorry. Without being able to reproduce, I won’t be able to fix it. It works on my old GTX 1660 and I can’t rent an older GPU to test

./main -m models/ggml-large-v3.bin -f samples/jfk.wav

whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1
whisper_backend_init: using CUDA backend
whisper_model_load:     CUDA buffer size =  3117.87 MB
whisper_model_load: model size    = 3117.39 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   32.36 MB
whisper_init_state: compute buffer (encode) =  212.36 MB
whisper_init_state: compute buffer (cross)  =    9.32 MB
whisper_init_state: compute buffer (decode) =   99.17 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


CUDA error 700 at ggml-cuda.cu:8303: an illegal memory access was encountered
current device: 0
Wed Nov 22 10:31:40 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   34C    P0              51W / 210W |     15MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+