llama.cpp: CLBlast fails on context lengths above 2048 after merging #4256
After the commit that merged https://github.com/ggerganov/llama.cpp/pull/4256, inference with CLBlast fails with a segfault on context sizes above 2k when all GPU layers are offloaded.
Command line:
C:\test\llama-b1601-bin-win-clblast-x64>main.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 4096 -b 512 -n 32 -ngl 33 -f C:\test\test.txt
main: build = 1601 (5a7d312)
main: built with MSVC 19.37.32826.1 for x64
main: seed = 1701534899
ggml_opencl: selecting platform: 'NVIDIA CUDA'
ggml_opencl: selecting device: 'NVIDIA GeForce RTX 2060'
ggml_opencl: device FP16 support: false
Result: Prompt processing starts and then segfaults partway through, around the 2k-token mark, before generation begins. It only appears to work if the prompt is short enough (fewer than 2k tokens).
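A note on the failure pattern: working below a threshold context length and segfaulting above it is characteristic of a read whose offset grows with the number of processed tokens running past a fixed-size allocation. The following minimal sketch illustrates that class of bug; the names and sizes are hypothetical and not taken from ggml.c:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: a scratch buffer sized for a fixed 2048 positions
   while the loop walks the full context length. Reads stay in bounds for
   n_ctx <= 2048 and run past the allocation above that, which can fault
   partway through prompt processing. */
#define SCRATCH_POSITIONS 2048

static float sum_positions(int n_ctx) {
    float *scratch = calloc(SCRATCH_POSITIONS, sizeof(float));
    if (!scratch) return 0.0f;
    float acc = 0.0f;
    for (int i = 0; i < n_ctx; i++) {
        acc += scratch[i]; /* out-of-bounds read once i >= SCRATCH_POSITIONS */
    }
    free(scratch);
    return acc;
}

int main(void) {
    printf("%f\n", sum_positions(4096)); /* 4096 > 2048: undefined behavior */
    return 0;
}

Whether the overrun faults immediately or only corrupts nearby heap memory depends on the allocator layout, which is consistent with the crash landing somewhere around the 2k-token mark rather than at a fixed point.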
About this issue
- State: closed
- Created 7 months ago
- Reactions: 1
- Comments: 15 (10 by maintainers)
Commits related to this issue
- fixed segfault with clblast by reversing commit in issue https://github.com/ggerganov/llama.cpp/issues/4296 — committed to LostRuins/koboldcpp by LostRuins 7 months ago
No problem - thank you very much for reporting this issue
Sorry I couldn’t help more with the debugging. Anyway https://github.com/ggerganov/llama.cpp/pull/4307 seems to work for me. The segfault no longer occurs.
Please confirm that #4307 works
@AlpinDale When running with ASAN, you need to add this env variable to get past the bogus errors on init:
ASAN_OPTIONS=protect_shadow_gap=0 ./main ..
Doing that, I now get the following sanitizer errors, confirming a bug in ggml.c that I introduced in #4256.
I'm able to reproduce - looking into it
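For context on the sanitizer workflow above: protect_shadow_gap=0 is a standard AddressSanitizer runtime option. GPU runtimes (including the NVIDIA OpenCL and CUDA drivers) commonly map memory into the region ASan reserves as its shadow gap, which is what produces the bogus errors on init. Below is a self-contained demo of the kind of heap-buffer-overflow report ASan produces for an out-of-bounds read; it is an illustration only, unrelated to the actual ggml.c code:

/* Build: cc -g -fsanitize=address oob_demo.c -o oob_demo
   Run:   ASAN_OPTIONS=protect_shadow_gap=0 ./oob_demo
   (the env variable is only needed when a GPU runtime is loaded) */
#include <stdlib.h>

int main(void) {
    float *buf = malloc(2048 * sizeof(float));
    if (!buf) return 1;
    /* Reading one element past the end: ASan aborts here with a
       heap-buffer-overflow report naming this file and line. */
    float x = buf[2048];
    free(buf);
    return (int) x;
}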
Can confirm this happens for me too. Same command and prompt as @LostRuins. Hardware is an RTX 2070S and an Intel i7-8700, and I'm running Linux 6.5.9. It happens with -ngl 0 and -ngl 99. The error I get is:
A different error, followed by a segfault, with -ngl 32 (7B GGUF model):