private-gpt: Segmentation Fault on Intel Mac

I got a segmentation fault running the basic setup from the documentation. This may be an obvious issue I have simply overlooked, but I am guessing that if I have run into it, others will as well.

llama_new_context_with_model: n_ctx = 3900
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 487.50 MB
llama_build_graph: non-view tensors processed: 740/740
ggml_metal_init: allocating
ggml_metal_init: found device: Intel(R) UHD Graphics 630
ggml_metal_init: found device: AMD Radeon Pro 5500M
ggml_metal_init: picking default device: AMD Radeon Pro 5500M
ggml_metal_init: default.metallib not found, loading from source
make: *** [run] Segmentation fault: 11

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 19 (5 by maintainers)

Most upvoted comments

Hey guys - sorry for the late reply, I have been really busy at work lately.

Given the number of comments, I will write a single, grouped answer, hoping to help most/all of you (Apple folks ✌️).


Help

While we are doing our best to help you with your privateGPT installation, we are not maintaining these libraries. Please find below instructions to try to fix your existing setup. However, please know that there might already be a cleaner solution available for you in llama.cpp and llama-cpp-python. For their help, please refer to the links in the Additional Help section at the bottom of this message.

How to know which “acceleration mode” I am running in

By acceleration mode, I mean “which library is being used to do the computation”: Metal? *BLAS? etc.

By looking at the logs returned by llama.cpp (the lines that do not start with <time> [INFO ], which are Python logs; llama.cpp logs start with llm_load, llama_model_loader, etc.), you can tell what your installation is trying to use. The llama.cpp logs will look like the following:

llama_model_loader <general information on the model you loaded>
llama_model_loader: - tensor ... <the different layers and their properties in the deep neural network that you loaded>
...
llama_model_loader: - tensor ... <the different layers and their properties in the deep neural network that you loaded>
llm_load_vocab ...
llm_load_print_meta <metadata on the model you loaded - help you understand the memory requirements, the computation approximation, the prompt separators, etc>
...
llm_load_print_meta ... 
llm_load_tensors <the mem required to run this model>
llama_new_context_with_model <...>
llama_build_graph <...>
[SPECIFIC LOG NAME] <initialization logs of the acceleration framework being used>

For example, if your installation is configured to use Metal, you will see ggml_metal_init.
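
For instance, a quick way to surface only those lines (a sketch: make run is the launch command shown in the error log above, and the pattern only covers the log prefixes mentioned here) is:

# Start privateGPT and keep only the llama.cpp loading/acceleration logs
make run 2>&1 | grep -E "llama_model_loader|llm_load|ggml_metal_init|BLAS"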

How to disable Metal

CMAKE_ARGS="-DLLAMA_METAL=off" pip install --force-reinstall --no-cache-dir llama-cpp-python

Memory requirements

By default, privateGPT will try to run the computation of all the neural network layers on the GPU (i.e. it will load all the layers into GPU memory). For Apple chips (M1, M2, M3, etc.), given that the memory on these systems is unified, this means the “normal” RAM is being used (if I’m not mistaken). For Apple computers with Intel chips (and other GPUs), this means the model will be loaded into your graphics card memory (often called VRAM)!

That means that if your GPU only has 500 MB (or 1.5 GB) of VRAM and you are trying to run a model that is 4 GB in size, it will not work (and will return segmentation fault 11, etc., because you are trying to allocate more memory than you have).
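
On an Intel Mac, you can check how much VRAM your GPUs have with system_profiler (standard macOS tooling, independent of privateGPT):

# List the graphics cards and their VRAM on macOS
system_profiler SPDisplaysDataType | grep -iE "chipset model|vram"

In the log above, the AMD Radeon Pro 5500M was picked as the default Metal device, so its VRAM is the budget the model has to fit into.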

Solutions to run privateGPT

The following solutions are possible. You can try them one by one, and see which one suits you.

CPU Only - disable Metal

Do not change the privateGPT code; instead, change the configuration of the libraries it is using:

CMAKE_ARGS="-DLLAMA_METAL=off" pip install --force-reinstall --no-cache-dir llama-cpp-python

CPU Only - change how privateGPT loads the model

Change this line and make it model_kwargs={"n_gpu_layers": 0} to disable loading the model into the GPU. You can also try other values, such as 50 or 100, to still offload some layers to the GPU, but not all of them (-1 means all layers). See the sketch after the commands below.

# Your local git repository
cd privateGPT/

# Modify the file with your text editor or IDE of your choice private_gpt/components/llm/llm_component.py
vim private_gpt/components/llm/llm_component.py
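
If you are not sure which line to change, you can locate it first (a sketch; the exact line number depends on your checkout):

# Find where n_gpu_layers is set in the LLM component
grep -n "n_gpu_layers" private_gpt/components/llm/llm_component.py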

Run smaller models

If you are still seeing segmentation faults (trying to allocate more memory than you have), you can try to reduce the size of the model you are running (at the cost of getting answers that are less polished). For example, you can pick the smallest Mistral quantization (all of them are available on this page: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF), which is currently mistral-7b-instruct-v0.1.Q2_K.gguf.

Modify your settings to specify this model:

cd privateGPT/

# Modify the file with your text editor or IDE of your choice settings.yaml
vim settings.yaml

# Set the model you want to use in the file above; here, using the smallest one:
  llm_hf_model_file: mistral-7b-instruct-v0.1.Q2_K.gguf
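
After editing, you can double-check which model file the settings now point to (a sketch):

# Confirm the model file configured in settings.yaml
grep -n "llm_hf_model_file" settings.yaml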

Try with other versions of llama-cpp-python

Given that during the re-installation of llama-cpp-python we are not specifying a particular version, you might end up compiling a version of llama.cpp that has a bug that is not yet fixed. You can try to install a fixed (and older) version of llama-cpp-python by replacing llama-cpp-python with llama-cpp-python==X.Y.Z, where X.Y.Z is a version number from https://github.com/abetlen/llama-cpp-python/releases.

Example:

CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.17
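
Once the install finishes, you can confirm which version was actually picked up (llama_cpp exposes its version string):

# Print the installed llama-cpp-python version
python -c "import llama_cpp; print(llama_cpp.__version__)"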

Additional Help

You can find additional help directly in the libraries that privateGPT is using:

  • llama.cpp: https://github.com/ggerganov/llama.cpp
  • llama-cpp-python: https://github.com/abetlen/llama-cpp-python

Additional tips

Python Wheel compilation (pip install) in verbose mode

Add -vvv to your pip command; this will display the logs from the compilation of llama.cpp, showing you the framework that was used to compile your lib.
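
For example, combined with the reinstall command from above:

# Reinstall with verbose output to inspect the llama.cpp compilation logs
CMAKE_ARGS="-DLLAMA_METAL=off" pip install -vvv --force-reinstall --no-cache-dir llama-cpp-python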

Force CMAKE

While doing some reading to write this comment, I found that some articles recommend forcing CMake usage by setting the environment variable FORCE_CMAKE=1.
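
For example, combined with the Metal toggle shown earlier (whether it is needed depends on your setup):

# Force the CMake-based build of llama-cpp-python
CMAKE_ARGS="-DLLAMA_METAL=off" FORCE_CMAKE=1 pip install --force-reinstall --no-cache-dir llama-cpp-python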

Hacky Fix if Installed with Conda

I was able to fix this by:

  1. Navigating to the directory in the conda environment where llama-cpp was installed (note: I installed it with pip).
  2. Deleting ggml-metal.metal.
  3. Copying it into the current working directory of my script.

Overall, it seems that there is an issue with finding the correct path to ggml-metal.metal if installed normally.
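
If you want to try the same workaround, something along these lines should work (a sketch; the site-packages path is illustrative and depends on your environment and Python version):

# Locate the Metal shader shipped with llama-cpp-python (path varies per environment)
find "$CONDA_PREFIX" -name "ggml-metal.metal"

# Copy it next to the script you are running (use the path printed above)
cp "$CONDA_PREFIX/lib/python3.11/site-packages/llama_cpp/ggml-metal.metal" .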

Please compare the VRAM of your GPU with the VRAM required by the model.

Also, I found that the llama-cpp-python (i.e. llama.cpp) version that privateGPT is using does not work well in Metal mode on Apple devices that do not have Mx chips (i.e. it does not run well on Apple devices with Intel processors).

You can try to run it using a BLAS variant instead of Metal.

More information is available in the README of https://github.com/abetlen/llama-cpp-python.
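
For instance, one BLAS build documented there uses OpenBLAS (a sketch; it assumes OpenBLAS is already installed on your system, e.g. via Homebrew):

# Rebuild llama-cpp-python against OpenBLAS instead of Metal
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install --force-reinstall --no-cache-dir llama-cpp-python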