llama.cpp: MacBook Pro with Intel CPU and graphics: failed to create context with model './models/model.q4_k_s.gguf'
Issue: Error when loading model on MacBook Pro with Intel Core i7 and Intel Iris Plus
System Information:
- Device: MacBook Pro
- CPU: Quad-Core Intel Core i7
- Graphics: Intel Iris Plus
Steps to Reproduce:
- Cloned the latest version of the repository.
- Executed make.
- Created a models directory using mkdir models.
- Within the models folder, downloaded the model using:
wget https://huggingface.co/substratusai/Llama-2-13B-chat-GGUF/resolve/main/model.bin -O model.q4_k_s.gguf
- In the llama.cpp folder, executed the following command:
./main -t 4 -m ./models/model.q4_k_s.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story\n### Response:"
Error Message:
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32 0x7fb57ba06090 | th_max = 896 | th_width = 16
ggml_metal_init: loaded kernel_mul_mm_f16_f32 0x0 | th_max = 0 | th_width = 0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "AIR builtin function was called but no definition was found." UserInfo={NSLocalizedDescription=AIR builtin function was called but no definition was found.}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/model.q4_k_s.gguf'
main: error: unable to load model
I would appreciate any guidance or advice on how to resolve this issue. Thank you!
@RobinWinters @nchudleigh @ro8inmorgan @ssainz @pkrmf @Bateoriginal
If you guys are still interested, I have found an acceptable workaround that will let you utilize your GPU and offload layers to it:
make clean
brew update && brew install clblast
make LLAMA_CLBLAST=1 LLAMA_NO_METAL=1
./main -s 1 -m /Volumes/ext1tb/Models/13B/Samantha.gguf -p "I believe the purpose of life is" --ignore-eos -c 64 -n 128 -t 3 -ngl 10
Some of you should certainly benefit from layer offloading. In my case offloading layers doesn't really give me any benefit, since my GPU (Radeon Pro 575) is about as fast as my CPU (FYI: I have tried offloading everything between 1 and 22 layers; there is a quick sweep sketch below the timings if you want to try the same). The other aspect is the extra 3 GB of VRAM, but that isn't relevant for me either since I have enough CPU RAM. The loading time, however, is about 20x faster now thanks to CLBlast:
- Without CLBlast (-t 3): about 17 seconds until the first token.
- With CLBlast (-t 3 -ngl 0): under 1 second until the first token.
- With CLBlast and --mlock (-t 3 -ngl 0 --mlock): about 860 ms until the first token.
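If you want to find the sweet spot for your own GPU, here is a quick sketch for sweeping a few -ngl values and comparing the llama_print_timings summary that main prints at the end (the model path is the one from my command above, so adjust it and the layer counts to your setup):
for n in 0 5 10 15 20; do
  echo "== -ngl $n =="
  ./main -m /Volumes/ext1tb/Models/13B/Samantha.gguf -t 3 -ngl $n -n 64 -p "I believe the purpose of life is" 2>&1 | grep llama_print_timings
done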
Same… My solution so far is to use -ngl 0
Unfortunately I can only give you personal recommendations based on my own trial-and-error experience. The llama.cpp documentation itself is not easy to keep track of, which is probably why there is not much else to find on the internet at the moment; at least I don't know of any other good references right now.
But this is not meant as a criticism of the llama.cpp team, because one also has to remember that this is absolutely bleeding-edge technology that is developing incredibly fast. If I were as skilled a developer as the folks behind llama.cpp and understood everything at once as soon as I saw the code, my time would probably be too precious to write simple manuals and documentation as well ^^'. Okay, enough monological small talk, sorry.
These both seem to be MacBooks. Unfortunately you can't upgrade the RAM, too bad. With the quad-core i7 you should not use more than 3 threads, so -t 3; with that you should get the fastest results in most cases. That's because you always want to keep one core in reserve, which orchestrates the rest and remains available for the system's own work.
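If you're unsure how many physical cores your machine has, you can check on macOS with sysctl and then set -t to one less than that (just a quick sketch):
sysctl -n hw.physicalcpu   # physical cores, e.g. 4 on the quad-core i7
sysctl -n hw.logicalcpu    # logical cores including hyper-threading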
About top-k: with this value you specify, for each word to be generated next (strictly speaking a token, but let's say word), how big the "pot" of candidate words should be from which the next word is randomly selected. Concretely, --top-k 1000 means that each time, after each word, the next one is picked from 1000 possible words. But LLMs are similar to us humans and our brains in this respect: when we speak, we almost always know exactly what the next word should be. Sometimes we think very briefly about whether to take wording A or wording B, and occasionally we might even use wording C.
For example, if I want to say "Because of this event I am quite… 1. disappointed… 2. sad… 3. heartsick", then I am already relatively undecided. But I will never be so indecisive that I have to look at 1000 words before I can decide. That's why, in my opinion, it's quite sufficient to use at most --top-k 3.
Then it's a matter of how "wildly" to decide between those words. If I am someone who prefers a conservative way of thinking and speaking, then I will almost certainly choose the most common word, in my case "disappointed", and rarely or never would I venture something exotic like "heartsick" in that sentence. This corresponds roughly to a setting of --temp 0.2. With --temp 0.2 the first word is taken in most cases anyway, sometimes the second, and rarely the third. So with --top-k 1000, 997 words were considered unnecessarily.
My personal approach is actually always to take --top-k 1, because that shows me the true core of a particular language model and leaves nothing to chance. I hope this helps in understanding and setting these hyperparameters.
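To make this concrete, here is roughly what such a conservative sampling setup looks like on the command line (just a sketch; the model path is the one from the original post and the prompt is the example sentence from above, so adapt both):
./main -m ./models/model.q4_k_s.gguf -t 3 -c 2048 -n 64 --top-k 3 --temp 0.2 --repeat_penalty 1.1 -p "Because of this event I am quite"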
Yes, it is definitely worth trying the new quants. Quantization is something like compression: Q4 means that the parameters of the model have been "compressed" to 4 bits. In Q4_K_M, most layers of the model are in 4-bit, but some layers with certain key functions are quantized to 6-bit, giving better and smarter results than their q4_0 siblings.
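If you have an f16 GGUF of a model lying around, you can produce such a quant yourself with the quantize tool that is built alongside main (a sketch; the file names here are just placeholders):
./quantize ./models/model.f16.gguf ./models/model.q4_k_m.gguf Q4_K_M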
Your i9 machine is a great device! You probably won't need GPU layer offloading here either. However, make sure to always leave at least one core free here as well, so use at most -t 7. For example:
./server -ngl 0 -t 3 --host 0.0.0.0 -c 4096 -b 2048 --mlock -m /Volumes/ext1tb/Models/13B/Synthia-13B-q4M.gguf
If you set -ngl to zero, you are saying that no layers should be offloaded to the GPU. So -ngl 0 means you don't utilize the GPU at all.
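Once the server is running, you can test it with a plain HTTP request against its /completion endpoint (a sketch, assuming the default port 8080):
curl --request POST --url http://localhost:8080/completion --header "Content-Type: application/json" --data '{"prompt": "I believe the purpose of life is", "n_predict": 64}'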
And yes, I think it's an issue with Macs and AMD GPUs (not only MacBooks, since I have an iMac 5K from 2017).