llama.cpp: Intel CPU and graphics card MacBook Pro: failed to create context with model './models/model.q4_k_s.gguf'

Issue: Error when loading model on MacBook Pro with Intel Core i7 and Intel Iris Plus

System Information:

  • Device: MacBook Pro
  • CPU: Quad-Core Intel Core i7
  • Graphics: Intel Iris Plus

Steps to Reproduce:

  • Cloned the latest version of the repository.
  • Executed make.
  • Created a models directory using mkdir models.
  • Within the models folder, downloaded the model using: wget https://huggingface.co/substratusai/Llama-2-13B-chat-GGUF/resolve/main/model.bin -O model.q4_k_s.gguf
  • In the llama.cpp folder, executed the following command: ./main -t 4 -m ./models/model.q4_k_s.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story\n### Response:"

Error Message:

ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7fb57ba06090 | th_max =  896 | th_width =   16
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "AIR builtin function was called but no definition was found." UserInfo={NSLocalizedDescription=AIR builtin function was called but no definition was found.}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/model.q4_k_s.gguf'
main: error: unable to load model

I would appreciate any guidance or advice on how to resolve this issue. Thank you!

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 3
  • Comments: 18 (2 by maintainers)

Most upvoted comments

@RobinWinters @nchudleigh @ro8inmorgan @ssainz @pkrmf @Bateoriginal

If you guys are still interested: I have found an acceptable workaround that will let you utilize your GPU and offload layers to it.

  • First, remove the current build: make clean
  • Make sure you have CLBlast installed; if not: brew update && brew install clblast
  • Now run a new build, with Metal disabled and CLBlast enabled: make LLAMA_CLBLAST=1 LLAMA_NO_METAL=1
  • That’s it. From now on you can run ./main or ./server with GPU acceleration, and even offload layers, for example like so:

./main -s 1 -m /Volumes/ext1tb/Models/13B/Samantha.gguf -p "I believe the purpose of life is" --ignore-eos -c 64 -n 128 -t 3 -ngl 10

Some of you should certainly benefit from layer offloading. In my case offloading layers doesn’t really give me any benefit, since my GPU (Radeon Pro 575) is about as fast as my CPU (FYI: I have tried offloading everything between 1 and 22 layers). The other aspect is the extra ~3 GB of VRAM, but that isn’t relevant for me either, since I have enough CPU RAM. The loading time, however, is about 20x faster now thanks to CLBlast:

Without CLBlast, -t 3

llama_print_timings:        load time = 17314.26 ms
llama_print_timings:      sample time =    86.12 ms /   128 runs   (    0.67 ms per token,  1486.28 tokens per second)
llama_print_timings: prompt eval time =   956.49 ms /     8 tokens (  119.56 ms per token,     8.36 tokens per second)
llama_print_timings:        eval time = 26066.80 ms /   127 runs   (  205.25 ms per token,     4.87 tokens per second)

It needs about 17 seconds until the first token.

With CLBlast, -t 3 -ngl 0

llama_print_timings:        load time =   920.24 ms
llama_print_timings:      sample time =    74.38 ms /   128 runs   (    0.58 ms per token,  1721.01 tokens per second)
llama_print_timings: prompt eval time =  1086.07 ms /     8 tokens (  135.76 ms per token,     7.37 tokens per second)
llama_print_timings:        eval time = 25951.83 ms /   127 runs   (  204.35 ms per token,     4.89 tokens per second)
llama_print_timings:       total time = 27143.61 ms

Now it needs under 1 second until the first token, and it’s even a little bit faster with --mlock:

With CLBlast, -t 3 -ngl 0 --mlock

llama_print_timings:        load time =   858.57 ms
llama_print_timings:      sample time =    74.57 ms /   128 runs   (    0.58 ms per token,  1716.44 tokens per second)
llama_print_timings: prompt eval time =   982.50 ms /     8 tokens (  122.81 ms per token,     8.14 tokens per second)
llama_print_timings:        eval time = 25761.31 ms /   127 runs   (  202.84 ms per token,     4.93 tokens per second)
llama_print_timings:       total time = 26850.29 ms

About 860 ms until the first token.

Same… My solution so far is to use -ngl 0

ggml_metal_init: allocating
ggml_metal_init: found device: AMD Radeon Pro 575
ggml_metal_init: picking default device: AMD Radeon Pro 575

# ...

ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7feed820ff50 | th_max = 1024 | th_width =   64
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model '/Volumes/ext1tb/Models/llama-2-7b-chat-codeCherryPop.Q5_K_M.gguf'
{"timestamp":1694485384,"level":"ERROR","function":"loadModel","line":265,"message":"unable to load model","model":"/Volumes/ext1tb/Models/llama-2-7b-chat-codeCherryPop.Q5_K_M.gguf"}

Unfortunately I can only give you personal recommendations based on my own trial-and-error experience. The llama.cpp documentation itself is not easy to keep track of, which I guess is why there isn’t much else to find on the internet at the moment. At least I don’t know of any other good references right now.

But this is not meant as criticism of the llama.cpp team, because one also has to remember that this is absolute bleeding-edge technology that is developing incredibly fast. If I were as skilled a developer as the folks behind llama.cpp and understood everything the moment I saw the code, my time would probably be too precious to write simple manuals and documentation as well ^^'. Okay, enough monological small talk, sorry.


These both seem to be MacBooks. You can’t upgrade the RAM, unfortunately, too bad. With the quad-core i7 you should not use more than 3 threads, so -t 3. That should give you the fastest results in most cases, because you always need to leave one core in “reserve” to orchestrate the rest and handle the system’s own work.
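
As a small illustration of that rule of thumb, here is a hypothetical little helper (not part of llama.cpp; it assumes Hyper-Threading, i.e. two logical cores per physical core) that picks a thread count while leaving one core free:

import os

def suggested_threads(hyperthreading: bool = True) -> int:
    # os.cpu_count() reports logical cores; halve it on hyperthreaded
    # Intel CPUs to approximate physical cores (an assumption), then
    # keep one core in reserve for the system.
    logical = os.cpu_count() or 1
    physical = logical // 2 if hyperthreading else logical
    return max(1, physical - 1)

print(suggested_threads())  # -> 3 on a quad-core i7 with Hyper-Threading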

About top-k: with this value you specify, for each word that should be generated next (strictly speaking a token, but let’s say word here), how big the “pot” of words is from which the next word is randomly picked. Concretely, top-k 1000 means that each time, after each word, the next one is picked out of 1000 possible candidates. But LLMs are similar to us humans and our brains in this respect: when we speak, we almost always have a near-100% idea of what the next word should be. Sometimes we still think very briefly about whether we want to take wording A or wording B. Sometimes we might even use wording C.

E.g. if I want to say “Because of this event I am quite… 1. disappointed… 2. sad… 3. heartsick”, then I am already relatively undecided. But I will never be so indecisive that I have to look at 1000 words before I can decide. That’s why, in my opinion, a maximum of --top-k 3 is quite sufficient.

Then it’s a matter of how “wildly” to decide between those words. If I am someone who prefers a conservative way of thinking and speaking, then I will almost certainly choose the most common word, in my case “disappointed”, and rarely or never would I venture something exotic like “heartsick” in that sentence. This corresponds roughly to a setting of --temp 0.2. With --temp 0.2 the first word is taken in most cases anyway, sometimes the second, and rarely the third. So 997 words were considered unnecessarily with --top-k 1000.

My personal approach is actually always to take --top-k 1, because that shows me the true core of a particular language model and leaves nothing to chance. I hope this helps in understanding and setting these hyperparameters.
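
If it helps, here is a minimal sketch in Python of what top-k plus temperature sampling does conceptually; it is not llama.cpp’s actual sampler, and the example words and logit values are made up:

import numpy as np

def sample_next_token(logits: np.ndarray, top_k: int = 3, temp: float = 0.8) -> int:
    # Keep only the top_k highest-scoring candidates (the "pot" of words).
    top_idx = np.argsort(logits)[-top_k:]
    top_logits = logits[top_idx]
    # Temperature controls how "wildly" we choose among them: a low temp
    # almost always picks the most likely word, a high temp adds variety.
    probs = np.exp((top_logits - top_logits.max()) / temp)
    probs /= probs.sum()
    return int(np.random.choice(top_idx, p=probs))

# Toy example: three candidate continuations for "... I am quite"
words = ["disappointed", "sad", "heartsick"]
logits = np.array([2.0, 1.2, 0.3])
print(words[sample_next_token(logits, top_k=3, temp=0.2)])  # almost always "disappointed"

With --top-k 1 (or a temperature approaching 0) this reduces to always taking the single most likely word.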

Yes, it is definitely worth trying the new quants. Quantization is something like compression: Q4 means that the parameters of the model have been “compressed” to 4-bit. In Q4_K_M, most layers of the model are in 4-bit, but some layers that serve certain key functions are quantized to 6-bit, giving better and smarter results than their q4_0 siblings.
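
As a rough back-of-envelope for what those bit-widths mean for file size (a sketch; the bits-per-weight figures are approximate assumptions, not exact llama.cpp numbers):

def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    # parameters * effective bits per weight / 8 = bytes, then to GB
    return n_params * bits_per_weight / 8 / 1e9

n_params = 13e9  # Llama-2-13B, approximate parameter count
print(f"q4_0-ish (~4.5 bpw): {approx_size_gb(n_params, 4.5):.1f} GB")  # ~7.3 GB
print(f"Q4_K_M   (~4.8 bpw): {approx_size_gb(n_params, 4.8):.1f} GB")  # ~7.8 GB
print(f"Q6_K     (~6.6 bpw): {approx_size_gb(n_params, 6.6):.1f} GB")  # ~10.7 GB

The overhead above exactly 4 bits comes mostly from the per-block scales and the few 6-bit layers.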

Your i9 machine is a great device! You probably won’t need GPU layer offloading there either. However, make sure to always leave at least one core free on that machine as well, so use at most -t 7.

For example ./server -ngl 0 -t 3 --host 0.0.0.0 -c 4096 -b 2048 --mlock -m /Volumes/ext1tb/Models/13B/Synthia-13B-q4M.gguf

If you set -ngl to zero, you are saying that no layer should be offloaded to the GPU. So -ngl 0 means that you don’t utilize the GPU at all.

And yes, I think it’s an issue with Macs and AMD in general (not only MacBooks, since I have an iMac 5K from 2017).