llama.cpp: WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found)

Solution: https://github.com/ggerganov/llama.cpp/issues/1230#issuecomment-1532562940

UPD: Confirmed working just fine on Windows.

The issue below happened only on WSL.

https://github.com/ggerganov/llama.cpp/pull/1207

First I pull and clean:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# git pull
Already up to date.
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make clean
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'

Then I build fresh with cuBLAS:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_mul_mat_use_blas’:
ggml.c:7921:36: warning: unused parameter ‘src0’ [-Wunused-parameter]
 7921 |         const struct ggml_tensor * src0,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:8520:31: warning: unused variable ‘y’ [-Wunused-variable]
 8520 |                 const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
      |                               ^
ggml.c: In function ‘ggml_compute_forward_alibi_f32’:
ggml.c:9104:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9104 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
ggml.c: In function ‘ggml_compute_forward_alibi_f16’:
ggml.c:9165:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9165 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

Then I try to load a model that worked before the update:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap
main: seed = 1682770733
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 8740093.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
CUDA error 2 at ggml-cuda.cu:359: out of memory

I haven’t updated the libllama.so for llama-cpp-python yet, so it still uses the previous version and works with this very model just fine. Something changed.
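
For context: llama-cpp-python ships its own copy of the library, which is why it still works. When I do rebuild it with cuBLAS support, it should look roughly like the following, judging by the llama-cpp-python README of the time (the exact flags are an assumption, so double-check them there):

# rebuild the bundled llama.cpp with cuBLAS and reinstall the wheel (flags are an assumption, verify against the README)
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python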

RTX 3050 8GB

UPD 2: The issue persists on WSL. I did a full clean, and it still doesn’t work after being built with the current version.

UPD 3: I found an older version of llama.cpp, did exactly the same thing, and everything worked fine. So I guess it’s not me being especially dumb today.

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 17

Most upvoted comments

AAAAAAAAAAAAAAAAAAAAAA

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ ./main -m /media/ggml-vic13b-q5_0.bin -b 512 -t 8
main: build = 488 (67c7779)
main: seed  = 1683098280
llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 10583.26 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

I believe the thing that did the trick for me is:

wsl.exe --update

Also I installed CUDA via:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

and I did it before installing Miniconda.

Some info for anyone who may fight this in the future:

Edition	Windows 11 Pro
Version	22H2
Installed on	‎5/‎2/‎2023
OS build	22621.1635
Experience	Windows Feature Experience Pack 1000.22641.1000.0

  NAME            STATE           VERSION
* Ubuntu-22.04    Running         2

C:\Users\Yuuru>wsl uname -r
5.15.90.1-microsoft-standard-WSL2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

(The CUDA path is set in ~/.bashrc with the line: export PATH=/usr/local/cuda/bin:$PATH)
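
To make that explicit, a minimal sketch of the relevant ~/.bashrc lines; the LD_LIBRARY_PATH line is my assumption and was not part of the original comment:

# expose the CUDA toolkit inside WSL (the second line is an assumption, not from the original comment)
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH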

The GPU driver is installed on the Windows host only, and nvidia-smi returns:

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ nvidia-smi
Wed May  3 00:26:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050         On | 00000000:2D:00.0  On |                  N/A |
|  0%   48C    P5                9W / 130W|   2774MiB /  8192MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

I honestly can’t get this to work. I tried everything you did and reinstalled WSL and CUDA like twice, but I still get the same error. Here’s my nvcc and nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070         On | 00000000:26:00.0  On |                  N/A |
| 30%   47C    P0               41W / 239W|   1751MiB /  8192MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Any clue as to what could be wrong? I’m literally loading the smallest model, offloading 20 layers of the 7B model, and no luck.

(PytorchEnv) yuicchi@DESKTOP-DJ3R5OF:/mnt/d/Yuicchi Text Model/llama.cpp$ ./main -ngl 20 --ctx_size 2048 -n 2048 -c 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 512 --repeat_penalty 1.17647 --seed 1685501956 --model "./models/7B/ggml-model-q4_0.bin" --threads 8 --n_predict 4096 --color --prompt "Write out a detailed step by step method on how to create a website" --no-mmap
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 610 (d8bd001)
main: seed  = 1685501956
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
WARNING: failed to allocate 3615.71 MB of pinned memory: out of memory
llama_model_load_internal: mem required  = 3235.84 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 20 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2171 MB
…
llama_init_from_file: kv self size  = 1024.00 MB
WARNING: failed to allocate 768.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 4096, n_keep = 0

Write out a detailed step by step method on how to create a websiteCUDA error 2 at ggml-cuda.cu:565: out of memory

Due to the current CUDA bug, I think you need to disable pinned memory via an environment variable. The command for it: export GGML_CUDA_NO_PINNED=1
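
For example, reusing the command from the original report, the workaround would look roughly like this (the model path is just the one from above, not a requirement):

# disable pinned (page-locked) host buffers in ggml's CUDA backend, then run as before
export GGML_CUDA_NO_PINNED=1
./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap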

Looks like it’s failing to allocate host pinned memory. I will add a patch to revert to normal pageable memory when this happens.

In the meantime, removing --no-mmap should work. You can also try adding more memory to WSL2 in .wslconfig.
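
For reference, a minimal sketch of that .wslconfig tweak (the file lives at %UserProfile%\.wslconfig on the Windows side; the sizes are illustrative assumptions, not recommendations):

[wsl2]
# give the WSL2 VM more RAM and swap so large models fit; adjust to your machine
memory=24GB
swap=8GB

After saving it, run wsl --shutdown from Windows so the new limits take effect.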

In case this is useful, I am using Ubuntu 22.04 in WSL2 under Windows 11, with an RTX 3080 and the latest drivers. That works for me.