llama.cpp: [SYCL] Segmentation fault after #5411
System: Arch Linux, CPU: Intel Core i3 (12th gen), GPU: Intel Arc A750, RAM: 16 GB
llama.cpp version: b2134
Previously the build failed with -DLLAMA_SYCL_F16=ON; that was fixed in #5411. However, running the resulting build now crashes with a segmentation fault.
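For context, this is a sketch of how such a build is typically configured, based on the usual llama.cpp SYCL build steps; the exact configure command used here isn't shown and the oneAPI install path may differ:

```sh
# Set up the oneAPI environment (install path may differ)
source /opt/intel/oneapi/setvars.sh

# Configure the SYCL backend with FP16 enabled, using the Intel compilers
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON \
         -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j
```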
logs:
bin/main -m ~/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "hello " -n 1000 -ngl 99
Log start
main: build = 2134 (099afc62)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed = 1707789832
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16: yes
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
Device 0: Intel(R) Arc(TM) A750 Graphics, compute capability 1.3,
max compute_units 448, max work group size 1024, max sub group size 32, global mem size 8096681984
Device 1: Intel(R) FPGA Emulation Device, compute capability 1.2,
max compute_units 4, max work group size 67108864, max sub group size 64, global mem size 16577347584
Device 2: 12th Gen Intel(R) Core(TM) i3-12100F, compute capability 3.0,
max compute_units 4, max work group size 8192, max sub group size 64, global mem size 16577347584
Device 3: Intel(R) Arc(TM) A750 Graphics, compute capability 3.0,
max compute_units 448, max work group size 1024, max sub group size 32, global mem size 8096681984
Using device 0 (Intel(R) Arc(TM) A750 Graphics) as main device
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from /home/tensorblast/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 15
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q4_K: 135 tensors
llama_model_loader: - type q6_K: 21 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 636.18 MiB (4.85 BPW)
llm_load_print_meta: general.name = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 2 '</s>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: buffer size = 601.02 MiB
llm_load_tensors: CPU buffer size = 35.16 MiB
.....................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: KV buffer size = 11.00 MiB
llama_new_context_with_model: KV self size = 11.00 MiB, K (f16): 5.50 MiB, V (f16): 5.50 MiB
llama_new_context_with_model: CPU input buffer size = 5.01 MiB
zsh: segmentation fault (core dumped) bin/main -m -p "hello " -n
~The build without -DLLAMA_SYCL_F16=ON works.~
Confirmed: This crash started happening after #5411
Commits related to this issue
- fix(llama.cpp): downgrade to a known working version sycl support is broken otherwise. See upstream issue: https://github.com/ggerganov/llama.cpp/issues/5469 Signed-off-by: Ettore Di Giacinto <mu... — committed to mudler/LocalAI by mudler 5 months ago
- fix(llama.cpp): downgrade to a known working version (#1706) sycl support is broken otherwise. See upstream issue: https://github.com/ggerganov/llama.cpp/issues/5469 Signed-off-by: Ettore Di Gi... — committed to mudler/LocalAI by mudler 5 months ago
Seems unrelated to SYCL (although the symbols aren't properly loaded here). Please open a new issue.
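For reference, a minimal sketch of getting a properly symbolized backtrace, assuming a rebuild with debug info (the model path and flags below are taken from the report above):

```sh
# Rebuild with debug info so the crash frames resolve to symbols
cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo \
         -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . -j

# Run under gdb and print the backtrace at the crash
gdb --args bin/main -m ~/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "hello " -n 1000 -ngl 99
(gdb) run
(gdb) bt
```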
Works for me
@akarshanbiswas @channeladam please try https://github.com/ggerganov/llama.cpp/pull/5624/ and let us know if the issue persists.
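One way to test that PR locally is to fetch it as a branch and rebuild (a sketch, assuming `origin` points at the ggerganov/llama.cpp repository; the branch name pr-5624 is arbitrary):

```sh
# Fetch the PR head into a local branch and switch to it
git fetch origin pull/5624/head:pr-5624
git checkout pr-5624
```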
@abhilash1910 This time the build failed
That is quite weird. It actually works here; it does not only build. If you want to try to reproduce it, this is the LocalAI container image with llama.cpp pinned at https://github.com/ggerganov/llama.cpp/commit/f026f8120f97090d34a52b3dc023c82e0ede3f7d :
quay.io/go-skynet/local-ai@sha256:c6b5dfaff64c24a02f1be8f8e1cb5c0837b130b438753e49b349d70e3d6d1916
and it can do inferencing correctly. Note I'm testing this with an Intel Arc A770, so it might be related to that; however, using llama.cpp's current commit on main also fails with segfaults on my Arc A770. You can run phi-2 configured for SYCL (f32) to test it:
To double-check the version, you can run this in the container:
I am actually running this in Kubernetes; any images from master are pinned to that commit. I'm also leaving my deployment here for reference:
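As a rough sketch (not the original configuration or deployment), trying that pinned image directly with Docker might look like this, assuming the Intel GPU is exposed via /dev/dri and LocalAI's default API port of 8080:

```sh
# Pull and run the pinned LocalAI image; /dev/dri passes the Intel GPU into
# the container and 8080 is assumed to be LocalAI's default API port
docker run --rm -ti \
  --device /dev/dri \
  -p 8080:8080 \
  quay.io/go-skynet/local-ai@sha256:c6b5dfaff64c24a02f1be8f8e1cb5c0837b130b438753e49b349d70e3d6d1916
```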
@abhilash1910 Yes. It builds correctly but ends up in a segfault with or without -DLLAMA_SYCL_F16=ON. I am using a build from a commit before that, which seems to work well without SYCL_F16 enabled. With -DLLAMA_SYCL_F16=ON, the build fails with a compilation error before #5411. If you need any help with testing, please do ping me.
Thanks for the traceback. As @mudler confirmed, https://github.com/ggerganov/llama.cpp/commit/f026f8120f97090d34a52b3dc023c82e0ede3f7d seems to build correctly, and it already includes #5411. For the time being I would recommend rolling back to that commit until a fix is applied.
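Rolling back amounts to checking out that commit and rebuilding (a sketch, assuming a git checkout of llama.cpp and a oneAPI environment; paths may differ):

```sh
# Pin the tree to the known-working commit and rebuild the SYCL backend from scratch
git checkout f026f8120f97090d34a52b3dc023c82e0ede3f7d
rm -rf build && mkdir build && cd build
source /opt/intel/oneapi/setvars.sh   # install path may differ
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j
```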
Got better backtrace this time:
@abhilash1910 that commit fails here with a core dump.
Thanks @mudler, could you please check if this commit works? https://github.com/ggerganov/llama.cpp/commit/4a46d2b7923be83d6019251671ee63aa1fa0d6bc This should help resolve the issue quicker.
I can confirm here: JFYI, pinning to commit f026f8120f97090d34a52b3dc023c82e0ede3f7d seems to work for me (tested with an Intel Arc A770).