llama.cpp: [SYCL] Segmentation fault after #5411

System: Arch Linux, CPU: Intel i3 12th gen, GPU: Intel Arc A750, RAM: 16GB

llama.cpp version: b2134

Previously the build was failing with -DLLAMA_SYCL_F16=ON, which was fixed in #5411. Running this build now crashes with a segmentation fault.
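
For reference, a minimal sketch of the build, assuming the SYCL build instructions of that era (oneAPI environment sourced, icx/icpx as compilers); treat it as an approximation rather than a verbatim copy of the reporter's commands:

# sketch: SYCL build with F16 enabled
source /opt/intel/oneapi/setvars.sh        # oneAPI DPC++ environment
mkdir -p build && cd build
cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build . --config Release -j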

logs:

bin/main -m ~/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  -p "hello " -n 1000 -ngl 99
Log start
main: build = 2134 (099afc62)
main: built with Intel(R) oneAPI DPC++/C++ Compiler 2024.0.0 (2024.0.0.20231017) for x86_64-unknown-linux-gnu
main: seed  = 1707789832
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   yes
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) A750 Graphics,	compute capability 1.3,
	max compute_units 448,	max work group size 1024,	max sub group size 32,	global mem size 8096681984
  Device 1: Intel(R) FPGA Emulation Device,	compute capability 1.2,
	max compute_units 4,	max work group size 67108864,	max sub group size 64,	global mem size 16577347584
  Device 2: 12th Gen Intel(R) Core(TM) i3-12100F,	compute capability 3.0,
	max compute_units 4,	max work group size 8192,	max sub group size 64,	global mem size 16577347584
  Device 3: Intel(R) Arc(TM) A750 Graphics,	compute capability 3.0,
	max compute_units 448,	max work group size 1024,	max sub group size 32,	global mem size 8096681984
Using device 0 (Intel(R) Arc(TM) A750 Graphics) as main device
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from /home/tensorblast/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 15
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q4_K:  135 tensors
llama_model_loader: - type q6_K:   21 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 636.18 MiB (4.85 BPW) 
llm_load_print_meta: general.name     = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:            buffer size =   601.02 MiB
llm_load_tensors:        CPU buffer size =    35.16 MiB
.....................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:        CPU input buffer size   =     5.01 MiB
zsh: segmentation fault (core dumped)  bin/main -m  -p "hello " -n

~The build without -DLLAMA_SYCL_F16=ON works.~

Confirmed: This crash started happening after #5411
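
For anyone trying to reproduce, a minimal sketch of capturing a backtrace for this crash, assuming the binary is rebuilt with debug info (-DCMAKE_BUILD_TYPE=RelWithDebInfo is an assumption, not part of the original report):

# rebuild with symbols (assumption), then run under gdb
cmake .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_F16=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build . -j
gdb --args bin/main -m ~/Public/Models/Weights/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "hello " -n 1000 -ngl 99
# inside gdb: run, then bt after the SIGSEGV to print the backtrace (as done later in this thread)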

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 30 (4 by maintainers)

Most upvoted comments

This seems unrelated to SYCL (although the symbols aren't properly loaded here). Please open a new issue.

Works for me

@abhilash1910 This time the build failed:

[ 42%] Building CXX object CMakeFiles/ggml.dir/ggml-sycl.cpp.o
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:1123:55: warning: cast from 'const void *' to 'unsigned char *' drops const qualifier [-Wcast-qual]
 1123 |                 auto it = m_map.upper_bound((byte_t *)ptr);
      |                                                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:3659:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const int' [-Wsign-compare]
 3659 |     if (item_ct1.get_group(0) < ne02) { // src0
      |         ~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:3701:31: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const int' [-Wsign-compare]
 3701 |         item_ct1.get_group(0) < ne02) {
      |         ~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:3700:46: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'const int' [-Wsign-compare]
 3700 |     if (nidx < ne00 && item_ct1.get_group(1) < ne01 &&
      |                        ~~~~~~~~~~~~~~~~~~~~~ ^ ~~~~
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8330:23: error: use of undeclared identifier 'nb1'
 8330 |     const size_t s1 = nb1 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8331:23: error: use of undeclared identifier 'nb2'
 8331 |     const size_t s2 = nb2 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8332:23: error: use of undeclared identifier 'nb3'
 8332 |     const size_t s3 = nb3 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8383:23: error: use of undeclared identifier 'nb1'
 8383 |     const size_t s1 = nb1 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8384:23: error: use of undeclared identifier 'nb2'
 8384 |     const size_t s2 = nb2 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8385:23: error: use of undeclared identifier 'nb3'
 8385 |     const size_t s3 = nb3 / ggml_element_size(dst);
      |                       ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8435:24: error: use of undeclared identifier 'ne0'; did you mean 'new'?
 8435 |         int nr0 = ne10/ne0;
      |                        ^~~
      |                        new
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8435:27: error: expected a type
 8435 |         int nr0 = ne10/ne0;
      |                           ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8436:24: error: use of undeclared identifier 'ne1'; did you mean 'new'?
 8436 |         int nr1 = ne11/ne1;
      |                        ^~~
      |                        new
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8436:27: error: expected a type
 8436 |         int nr1 = ne11/ne1;
      |                           ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8437:24: error: use of undeclared identifier 'ne2'
 8437 |         int nr2 = ne12/ne2;
      |                        ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8438:19: error: use of undeclared identifier 'ne13'
 8438 |         int nr3 = ne13/ne3;
      |                   ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8438:24: error: use of undeclared identifier 'ne3'
 8438 |         int nr3 = ne13/ne3;
      |                        ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8443:27: error: use of undeclared identifier 'ne0'; did you mean 'new'?
 8443 |         int64_t cne0[] = {ne0, ne1, ne2, ne3};
      |                           ^~~
      |                           new
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8443:30: error: expected a type
 8443 |         int64_t cne0[] = {ne0, ne1, ne2, ne3};
      |                              ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8444:45: error: use of undeclared identifier 'ne13'
 8444 |         int64_t cne1[] = {ne10, ne11, ne12, ne13};
      |                                             ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8445:26: error: use of undeclared identifier 'nb0'
 8445 |         size_t cnb0[] = {nb0, nb1, nb2, nb3};
      |                          ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8445:31: error: use of undeclared identifier 'nb1'
 8445 |         size_t cnb0[] = {nb0, nb1, nb2, nb3};
      |                               ^
/home/tensorblast/Public/Models/debug/llama.cpp/ggml-sycl.cpp:8445:36: error: use of undeclared identifier 'nb2'
 8445 |         size_t cnb0[] = {nb0, nb1, nb2, nb3};
      |                                    ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
4 warnings and 20 errors generated.
make[3]: *** [CMakeFiles/ggml.dir/build.make:132: CMakeFiles/ggml.dir/ggml-sycl.cpp.o] Error 1
make[2]: *** [CMakeFiles/Makefile2:758: CMakeFiles/ggml.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:2491: examples/main/CMakeFiles/main.dir/rule] Error 2
make: *** [Makefile:998: main] Error 2

@abhilash1910 Yes. It builds correctly but ends up in a segfault with or without -DLLAMA_SYCL_F16=ON. I am using a build from a commit before that, which seems to work well without SYCL_F16 enabled. With -DLLAMA_SYCL_F16=ON, the build fails with a compilation error before #5411.

If you need any help with testing, please do ping me.

That is quite weird. It actually works here; it does not only build. If you want to try to reproduce it, this is the LocalAI container image with llama.cpp pinned at https://github.com/ggerganov/llama.cpp/commit/f026f8120f97090d34a52b3dc023c82e0ede3f7d : quay.io/go-skynet/local-ai@sha256:c6b5dfaff64c24a02f1be8f8e1cb5c0837b130b438753e49b349d70e3d6d1916, and it can do inference correctly. Note that I'm testing this with an Intel Arc A770, so it might be related to that; however, the current llama.cpp commit on main also fails with segfaults on my Arc A770.

You can run phi-2 configured for SYCL (f32) with:

docker run -e DEBUG=true -ti -v $PWD/models:/build/models -p 8080:8080  -v /dev/dri:/dev/dri quay.io/go-skynet/local-ai@sha256:c6b5dfaff64c24a02f1be8f8e1cb5c0837b130b438753e49b349d70e3d6d1916 https://gist.githubusercontent.com/mudler/103de2576a8fd4b583f9bd53f4e4cefd/raw/9181d4add553326806b8fdbf4ff0cd65d2145bff/phi-2-sycl.yaml

to test it:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
          "model": "phi-2",
          "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
      }'

To double-check the version, you can run this in the container:

cat /build/Makefile  | grep CPPLLAMA_VERSION 
CPPLLAMA_VERSION?=f026f8120f97090d34a52b3dc023c82e0ede3f7d

I am actually running this in Kubernetes; any images from master are pinned to that commit. I'm also leaving my deployment here for reference:

apiVersion: v1
kind: Namespace
metadata:
  name: local-ai
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
  namespace: local-ai
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: local-ai
  namespace: local-ai
  labels:
    app: local-ai
spec:
  selector:
    matchLabels:
      app: local-ai
  replicas: 1
  template:
    metadata:
      labels:
        app: local-ai
      name: local-ai
    spec:
      containers:
        - env:
          - name: DEBUG
            value: "true"
          name: local-ai
          args:
          # phi-2 configuration
          - https://gist.githubusercontent.com/mudler/103de2576a8fd4b583f9bd53f4e4cefd/raw/9181d4add553326806b8fdbf4ff0cd65d2145bff/phi-2-sycl.yaml
          image: quay.io/go-skynet/local-ai:master-sycl-f32-core
          imagePullPolicy: Always
          resources:
            limits:
              gpu.intel.com/i915: 1
          volumeMounts:
            - name: models-volume
              mountPath: /build/models
      volumes:
        - name: models-volume
          persistentVolumeClaim:
            claimName: models-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: local-ai
  namespace: local-ai
spec:
  selector:
    app: local-ai
  type: NodePort
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
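
To use the manifest above, something like the following should work; the filename local-ai.yaml is just a placeholder:

kubectl apply -f local-ai.yaml                      # hypothetical filename for the manifest above
kubectl -n local-ai get pods -w                     # wait for the local-ai pod to become Ready
kubectl -n local-ai port-forward svc/local-ai 8080:8080
# then send the same curl request as above to http://localhost:8080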

Got a better backtrace this time:

(gdb) bt
#0  ggml_backend_sycl_buffer_type_name (
    buft=0xcd5f20 <ggml_backend_sycl_buffer_type::ggml_backend_sycl_buffer_types>)
    at /home/tensorblast/Public/Models/llama.cpp/ggml-sycl.cpp:14765
#1  0x0000000000519869 in ggml_backend_buft_name (
    buft=0xcd5f20 <ggml_backend_sycl_buffer_type::ggml_backend_sycl_buffer_types>)
    at /home/tensorblast/Public/Models/llama.cpp/ggml-backend.c:19
#2  0x0000000000517613 in ggml_gallocr_reserve_n (galloc=0x36ad020, 
    graph=0x7fffa420efe0, node_buffer_ids=0x39a7fe0)
    at /home/tensorblast/Public/Models/llama.cpp/ggml-alloc.c:707
#3  0x000000000051bcf3 in ggml_backend_sched_reserve (sched=0x7fffa4200010, 
    measure_graph=0x7fffc0200030)
    at /home/tensorblast/Public/Models/llama.cpp/ggml-backend.c:1563
#4  0x0000000000470293 in llama_new_context_with_model (model=0x3669dd0, params=...)
    at /home/tensorblast/Public/Models/llama.cpp/llama.cpp:11461
#5  0x000000000042f9a0 in llama_init_from_gpt_params (params=...)
    at /home/tensorblast/Public/Models/llama.cpp/common/common.cpp:1300
#6  0x000000000041bce9 in main (argc=<optimized out>, argv=0x7fffffffcfb8)
    at /home/tensorblast/Public/Models/llama.cpp/examples/main/main.cpp:198
    

Thanks for the traceback. As @mudler confirmed, https://github.com/ggerganov/llama.cpp/commit/f026f8120f97090d34a52b3dc023c82e0ede3f7d, which already includes #5411, seems to build and run correctly. For the time being I would recommend rolling back to that commit until a fix is applied.
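
A sketch of what rolling back looks like, assuming a source build with the same SYCL flags as above (adjust to your setup):

cd llama.cpp
git checkout f026f8120f97090d34a52b3dc023c82e0ede3f7d    # last known-good commit in this thread
rm -rf build && mkdir build && cd build
cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx    # add -DLLAMA_SYCL_F16=ON if desired
cmake --build . --config Release -j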

@abhilash1910 that commit fails here with a core-dump

I can confirm here; JFYI, pinning to commit f026f81 seems to work for me (tested with an Intel Arc A770).

Thanks @mudler, could you please check whether this commit works? https://github.com/ggerganov/llama.cpp/commit/4a46d2b7923be83d6019251671ee63aa1fa0d6bc This would help speed up the resolution.
