llama.cpp: nvcc fatal : Value 'native' is not defined for option 'gpu-architecture' when compiling cuBLAS

There seems to be an issue in the Makefile that assigns an invalid nvcc flag for gpu-architecture. I'm running Ubuntu 22.04 on an RTX 4090, starting from the Docker image nvidia/cuda:12.3.1-runtime-ubuntu22.04.

Running make produces the following error:

root@ZEPPELIN-01:/workspace/llama.cpp# make LLAMA_CUBLAS=1
expr: syntax error: unexpected argument '070100'
expr: syntax error: unexpected argument '080100'
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi
I NVCCFLAGS: -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 
I LDFLAGS:   -lcuda -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib -L/usr/local/cuda/targets/aarch64-linux/lib -L/usr/lib/wsl/lib 
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

nvcc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -I/usr/local/cuda/targets/aarch64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -use_fast_math --forward-unknown-to-host-compiler -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128  -Wno-pedantic -Xcompiler "-Wno-array-bounds" -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:429: ggml-cuda.o] Error 1

The same issue seems to happen in whisper.cpp, by the way – I also created an issue report there.

The issue seems to be resolved by simply changing the flag in the Makefile from 'native' to 'all'. However, I still cannot get llama.cpp to compile with the GPU for some reason.
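For context, the workaround boils down to something like the sketch below (it assumes the only occurrence of -arch=native in the Makefile is the nvcc gpu-architecture flag shown in the build output above):

sed -i 's/-arch=native/-arch=all/g' Makefile
make clean && make LLAMA_CUBLAS=1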

About this issue

  • State: closed
  • Created 5 months ago
  • Comments: 15

Most upvoted comments

If that doesn’t work, try make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all

(it seems from the install README that the cuBLAS flag changed from LLAMA_CUBLAS to LLAMA_CUDA, so that's why I sent the first message)

You are right, thank you, it worked after adding CUDA_DOCKER_ARCH=all.

When it was running I noticed a message saying that LLAMA_CUBLAS was deprecated and people should switch to LLAMA_CUDA in the future…

Many thanks

Are you compiling it with make @hiddengerbil? Try make LLAMA_CUDA=1 CUDA_DOCKER_ARCH=all

Try this:

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp.git

Build llama.cpp:

cd llama.cpp
sed -i 's/-arch=native/-arch=all/g' Makefile
make clean && LLAMA_CUBLAS=1 make -j

all means it'd build for all architectures that your CUDA version supports, which would take a long time and wouldn't be a good default. That Docker issue is known; you can set the value with the CUDA_DOCKER_ARCH environment variable. For a 4090 the correct value would be compute_89.
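For example, a build targeting just the RTX 4090 might look like this (a sketch, assuming the Makefile forwards CUDA_DOCKER_ARCH to nvcc's -arch option as described above):

make clean
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=compute_89 -j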