GPTQ-for-LLaMa: Multiple errors while compiling the kernel

Hello, while trying to run `python setup_cuda.py install`, I get this error:

```
(venv) C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa>python setup_cuda.py install
running install
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info\PKG-INFO
writing dependency_links to quant_cuda.egg-info\dependency_links.txt
writing top-level names to quant_cuda.egg-info\top_level.txt
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quant_cuda.egg-info\SOURCES.txt'
writing manifest file 'quant_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:358: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\utils\cpp_extension.py:387: UserWarning: The detected CUDA version (11.4) has a minor version mismatch with the version that was used to compile PyTorch (11.7). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'quant_cuda' extension
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" /EHsc /Tpquant_cuda.cpp /Fobuild\temp.win-amd64-cpython-310\Release\quant_cuda.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /EHsc -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0
quant_cuda.cpp
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```

Then after a long list of errors, I get this at the end:

```
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\bin\nvcc" -c quant_cuda_kernel.cu -o build\temp.win-amd64-cpython-310\Release\quant_cuda_kernel.obj -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\TH -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.4\include" -IC:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\include "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\include" "-IC:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\Include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.29.30133\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt" -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_75,code=compute_75 -gencode=arch=compute_75,code=sm_75 --use-local-env
quant_cuda_kernel.cu
C:/Users/Username/Documents/GitHub/GPTQ-for-LLaMa/venv/lib/site-packages/torch/include\c10/macros/Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(624): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here

C:\Users\Username\Documents\GitHub\GPTQ-for-LLaMa\venv\lib\site-packages\torch\include\pybind11\cast.h(717): error: too few arguments for template template parameter "Tuple"
          detected during instantiation of class "pybind11::detail::tuple_caster<Tuple, Ts...> [with Tuple=std::pair, Ts=<T1, T2>]"
(721): here

2 errors detected in the compilation of "quant_cuda_kernel.cu".
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.4\\bin\\nvcc.exe' failed with exit code 1
```

Any idea what could be causing this? I’ve tried installing CUDA Toolkit 11.3 and Torch 1.12.1, but they give the same error too.
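In the meantime, a quick way to see exactly which two versions the mismatch warning above is comparing (assuming `nvcc` is on your PATH):

```
# CUDA toolkit version that nvcc will use to compile the kernel
nvcc --version
# torch build and the CUDA version it was compiled against
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```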

About this issue

  • State: closed
  • Created a year ago
  • Comments: 34 (1 by maintainers)

Most upvoted comments

Finally I managed to get it running. (I still can’t compile it; thank you @Brawlence for providing the Windows wheel.) Here is the guide:

1. Install the latest version of text-generation-webui

2. Create the directory `text-generation-webui\repositories` and clone GPTQ-for-LLaMa into it

3. Stay in the same conda env and install [this wheel](https://github.com/oobabooga/text-generation-webui/files/10947842/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl.zip) containing the CUDA module: `pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl`

4. Copy the 4-bit model to the `models` folder and make sure its name follows this format (example: `llama-30b-4bit.pt`). You must still have the directory with the 8-bit model in HFv2 format.

5. Start the webui: `python .\server.py --model llama-30b --load-in-4bit --no-stream --listen`
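The same steps condensed into commands (a minimal sketch, assuming you start inside the `text-generation-webui` folder with its conda env active, and that you have unzipped the wheel linked in step 3):

```
# Sketch of steps 2-5: clone the kernel repo, install the prebuilt wheel, start the webui
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git
cd ..
pip install quant_cuda-0.0.0-cp310-cp310-win_amd64.whl
python .\server.py --model llama-30b --load-in-4bit --no-stream --listen
```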

Tested on Windows 11 with 30B model and RTX 4090.

Thank you! This actually worked; I’m now loading the 13B model at around 9 GB of VRAM. I noticed, though, that Linux is ridiculously faster than Windows: even 4-bit 13B on Windows is about half the speed of a normal 13B run on Linux… 😮

@lxe

`git clone https://github.com/zphang/transformers.git`

This repo only contains a readme.md:

> March 6th, 2019:
>
> Repository has been moved to https://github.com/zphang/bert_on_stilts

Should we use the one mentioned in the readme.md, which is also from March 2019? I doubt it. If not, which transformers repo should we install? The live one, or the one with the llama push, via `git clone --branch llama_push https://github.com/zphang/transformers.git`?
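For reference, trying the branch variant would look like this (a sketch only; the thread doesn’t confirm that `llama_push` is the right branch):

```
# Hypothetical: install the llama_push branch instead of the repo default
git clone --branch llama_push https://github.com/zphang/transformers.git
pip install ./transformers
```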

EDIT: This comment is linked from elsewhere. Here’s a more coherent guide: https://gist.github.com/lxe/82eb87db25fdb75b92fa18a6d494ee3c


I had to downgrade CUDA and torch, and then I was able to compile. Here’s my full process on Windows:

  1. Install Build Tools for Visual Studio 2019 (has to be 2019) here
  2. Install miniconda
  3. Open the “x64 Native Tools Command Prompt”
  4. Activate conda via `powershell -ExecutionPolicy ByPass -NoExit -Command "& 'C:\Users\lxe\miniconda3\shell\condabin\conda-hook.ps1' ; conda activate 'C:\Users\lxe\miniconda3' "`
  5. `conda create -n gptq`
  6. `conda activate gptq`
  7. `conda install cuda -c nvidia/label/cuda-11.3.0 -c nvidia/label/cuda-11.3.1`
  8. `conda install pip`
  9. `git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git`
  10. `git clone https://github.com/zphang/transformers.git`
  11. `pip install ./transformers`
  12. `pip install torch==1.12+cu113 -f https://download.pytorch.org/whl/torch_stable.html`
  13. `cd GPTQ-for-LLaMa`
  14. `$env:DISTUTILS_USE_SDK=1`
  15. `python setup_cuda.py install`
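Once step 15 finishes, a quick import check confirms the extension built and is visible to Python (`quant_cuda` is the extension name from the build output above; importing torch first makes sure its DLLs are loaded):

```
# Should print without an ImportError if the kernel compiled and installed correctly
python -c "import torch, quant_cuda; print('quant_cuda loaded OK')"
```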

When using the webui, make sure it runs in the same env. If it overwrites torch, you’ll have to reinstall it manually, as shown below.
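If torch does get overwritten, re-running the command from step 12 restores the cu113 build:

```
# Reinstall the CUDA 11.3 torch build (same command as step 12)
pip install torch==1.12+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```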