LocalAI: CUDA inference doesn't work anymore!
LocalAI version: quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg@sha256:f868a3348ca3747843542eeb1391003def43c92e3fafa8d073af9098a41a7edd
I also tried to build the image myself; the behaviour is exactly the same.
Environment, CPU architecture, OS, and Version: Linux lxdocker 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) x86_64 GNU/Linux
It's a Proxmox LXC with Docker running inside it. CUDA inference did work in an earlier version of this project, and llama.cpp still works.
Describe the bug No matter how I configure the model, I can't get the inference to run on the GPU. The GPU is recognized, but its VRAM usage stays at 274MiB / 24576MiB.
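For reference, the kind of model config I have been trying looks roughly like this (a sketch only; the gpu_layers and f16 field names are what I understand from the docs for this version, so treat them as assumptions):
# Sketch of a model config that should trigger GPU offloading.
# The gpu_layers / f16 field names are assumptions based on the LocalAI docs
# for this version; adjust them if your version uses different names.
cat > /models/guanaco.yaml <<'EOF'
name: guanaco
parameters:
  model: guanaco-33B.ggmlv3.q4_0.bin
context_size: 1024
threads: 12
f16: true
gpu_layers: 60
EOF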
nvidia-smi does work inside the container.
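For reference, I start the container roughly like this (a sketch, assuming the standard NVIDIA Container Toolkit setup, which must be working since nvidia-smi runs inside the container; DEBUG=true is my assumption of the debug switch):
# Rough sketch of how the container is started; --gpus all exposes the GPU
# via the NVIDIA Container Toolkit, and DEBUG=true is assumed from the docs.
docker run -d --gpus all \
  -p 8080:8080 \
  -v /path/to/models:/models \
  -e DEBUG=true \
  quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg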
When starting the container, the following message appears:
model name : 12th Gen Intel(R) Core(TM) i7-1260P
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
@@@@@
7:23PM DBG no galleries to load
7:23PM INF Starting LocalAI using 12 threads, with models path: /models
7:23PM INF LocalAI version: 3f829f1 (3f829f11f57e33e44849282b3f0d123a7bf7ea87)
┌───────────────────────────────────────────────────┐
│ Fiber v2.48.0 │
│ http://127.0.0.1:8080 │
│ (bound on host 0.0.0.0 and port 8080) │
│ │
│ Handlers ............ 31 Processes ........... 1 │
│ Prefork ....... Disabled PID ................ 14 │
└───────────────────────────────────────────────────┘
When I make a completion call, only the CPU takes the load and the response comes back slowly, instead of the GPU being used and the response being fast.
Also, I always get the following message in the logs when I make the API call:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41227: connect: connection refused"
However, the API call still works; I just can't see what the backend is doing.
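For completeness, the completion call I make looks roughly like this (a sketch; the model name just matches the config sketch above):
# Sketch of the completion request against LocalAI's OpenAI-compatible API.
# "guanaco" is the hypothetical model name from the config sketch above.
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "guanaco",
        "prompt": "Building a website can be done in 10 simple steps:",
        "max_tokens": 128,
        "temperature": 0.7
      }'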
If I attach to the container, go into the go-llama directory, and make the test call from there:
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda-12.2/lib64" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10
I get the following output:
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama.cpp: loading model from /models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 128
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.14 MB
llama_model_load_internal: mem required = 19512.67 MB (+ 3124.00 MB per state)
llama_new_context_with_model: kv self size = 195.00 MB
Model loaded successfully.
As you can see, it finds the GPU but won't use it: unlike the llama.cpp output below, there is no "using CUDA for GPU acceleration" line and no layers are offloaded. When I send it a prompt, only the CPU is used.
In ./examples/main.go I found the ngl parameter for GPU layers; I tried it with 60 and 70 and it didn't help. Same behaviour!
Finally I ran this:
go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10 -ngl 70 -n 1024
I removed all the environment-variable prefixes and get exactly the same behaviour and output as above. It is as if go-llama simply doesn't use the GPU anymore.
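A quick way to confirm that nothing is being offloaded is to poll VRAM usage while a generation is running; for me it never moves above the idle ~274 MiB mentioned above:
# Poll GPU memory once per second during a generation; if layers were
# offloaded, memory.used should jump far above the idle ~274 MiB.
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv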
However, the interesting part is this:
If I make this call:
root@lxdocker:/build/go-llama/build/bin# ./main -m /models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 1024 -ngl 70
(I copied the whole bash line so the path can be seen too.) It works! The GPU is used and it's super fast, as expected!
Here is the output from llama.cpp:
main: build = 852 (294f424)
main: seed = 1690401176
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama.cpp: loading model from /models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 6656
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 52
llama_model_load_internal: n_layer = 60
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 0.14 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2215.41 MB (+ 3124.00 MB per state)
llama_model_load_internal: allocating batch_size x (768 kB + n_ctx x 208 B) = 436 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 60 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 63/63 layers to GPU
llama_model_load_internal: total VRAM used: 20899 MB
llama_new_context_with_model: kv self size = 780.00 MB
system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0
Building a website can be done in 10 simple steps:... (Model answers here to me)
As you can see, llama.cpp is able to use the GPU, but LocalAI somehow isn't. I've been trying to figure this out for several days now without success. Sadly I can't code in Go, so I don't really understand what's going on either, and the gRPC layer seems to throw errors but somehow still works.
I hope someone with more knowledge of how the whole backend is set up can help out; I've tried to gather as much information as I can.
I would really love to be able to use this project!
To Reproduce Simply try to run an inference via CUDA.
Expected behavior The GPU should be used, just as the plain llama.cpp binary inside the image is able to do.
About this issue
- State: closed
- Created a year ago
- Comments: 17 (6 by maintainers)
@mudler I love you man, now it works! I have built your update_rope branch and it works! ❤️
That was a local build with just cublas enabled. I had pulled the latest from git. I will try later with a different version tag/label to see if I can get it working there.
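For reference, the build I mean was roughly this (a sketch; BUILD_TYPE=cublas is the flag I understand the README to use for CUDA builds):
# Rough sketch of the local cuBLAS build; BUILD_TYPE=cublas is assumed from
# the project README of that time.
git clone https://github.com/go-skynet/LocalAI
cd LocalAI
make BUILD_TYPE=cublas build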
weird. I could finally reproduce in another box - I’ll try to have a look at it later today
I think I am able to replicate the issue with a fresh VM in GCP, a g2-standard-4 instance with 1x NVIDIA L4. The OS is common-gpu-debian-11-py310.
Output from trying to execute a model using GPU acceleration: