LocalAI: CUDA does not work anymore with llama backend
LocalAI version:
quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11
Environment, CPU architecture, OS, and Version:
Linux glados 6.2.0-26-generic #26-Ubuntu SMP PREEMPT_DYNAMIC Mon Jul 10 23:39:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
RTX 3090, Ubuntu 23.04
Describe the bug
Previously, I had the v1.18.0 image with cuda11 running correctly. Now, after updating the image to v1.22.0, I get the following error in the debug log when trying to do a chat completion with a llama-based model:
stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:2478: CUDA driver version is insufficient for CUDA runtime version
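For reference, a quick way to check whether the container can see the driver at all (these commands are illustrative, not from my setup; they assume the NVIDIA Container Toolkit is installed and that it mounts nvidia-smi into the container):

# On the host: confirm the driver and the CUDA version it reports (12.0 here).
$ nvidia-smi

# Inside the image: override the entrypoint so only nvidia-smi runs.
# If this fails while the host command works, the container runtime is not
# exposing the driver to the container.
$ docker run --rm --gpus all --entrypoint nvidia-smi \
    quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11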
To Reproduce
- Run the mentioned docker image on a system with an NVIDIA GPU, setting PRELOAD_MODELS to e.g. '[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16": true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' (a sketch of the full command is shown after these steps).
- Try a chat completion:
$ curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "How are you?"}], "temperature": 0.9 }' | jq
{ "error": { "code": 500, "message": "could not load model: rpc error: code = Unavailable desc = error reading from server: EOF", "type": "" } }
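For context, a minimal sketch of how such a container might be started; the --gpus flag, port mapping, and model volume path are assumptions, not my exact command:

# Hypothetical docker run; adjust paths and flags to your setup.
$ docker run -d --name local-ai \
    --gpus all \
    -p 8080:8080 \
    -v $PWD/models:/models \
    -e DEBUG=true \
    -e PRELOAD_MODELS='[{"url": "github:go-skynet/model-gallery/openllama_7b.yaml", "name": "gpt-3.5-turbo", "overrides": { "f16": true, "gpu_layers": 35, "mmap": true, "batch": 512 } } ]' \
    quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11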
Expected behavior
The completion result is returned.
Logs
5:12PM DBG Request received: {"model":"gpt-3.5-turbo","language":"","n":0,"top_p":0,"top_k":0,"temperature":0.9,"max_tokens":0,"echo":false,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"frequency_penalty":0,"tfz":0,"typical_p":0,"seed":0,"file":"","response_format":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"How are you?"}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null}
5:12PM DBG Configuration read: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
5:12PM DBG Parameters: &{PredictionOptions:{Model:open-llama-7b-q4_0.bin Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0} Name:gpt-3.5-turbo StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1024 F16:true NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:true MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
5:12PM DBG Prompt (before templating): How are you?
5:12PM DBG Template found, input modified to: Q: How are you?\nA:
5:12PM DBG Prompt (after templating): Q: How are you?\nA:
5:12PM DBG Loading model llama from open-llama-7b-q4_0.bin
5:12PM DBG Loading model in memory from file: /models/open-llama-7b-q4_0.bin
5:12PM DBG Loading GRPC Model llama: {backendString:llama modelFile:open-llama-7b-q4_0.bin threads:4 assetDir:/tmp/localai/backend_data context:0xc00003c088 gRPCOptions:0xc000c1a2d0 externalBackends:map[huggingface-embeddings:/build/extra/grpc/huggingface/huggingface.py]}
5:12PM DBG Loading GRPC Process%!(EXTRA string=/tmp/localai/backend_data/backend-assets/grpc/llama)
5:12PM DBG GRPC Service for open-llama-7b-q4_0.bin will be running at: '127.0.0.1:43913'
5:12PM DBG GRPC Service state dir: /tmp/go-processmanager1220113550
5:12PM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:43913: connect: connection refused"
5:12PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:43913): stderr 2023/07/30 17:12:57 gRPC Server listening at 127.0.0.1:43913
5:12PM DBG GRPC Service Ready
5:12PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:/models/open-llama-7b-q4_0.bin ContextSize:1024 Seed:0 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:35 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:}
5:12PM DBG GRPC(open-llama-7b-q4_0.bin-127.0.0.1:43913): stderr CUDA error 35 at /build/go-llama/llama.cpp/ggml-cuda.cu:2478: CUDA driver version is insufficient for CUDA runtime version
[10.0.1.8]:60326 500 - POST /v1/chat/completions
Additional context
About this issue
- State: closed
- Created a year ago
- Comments: 16 (1 by maintainers)
Okay, I found the problem in my case. I am using swarm mode, and it turns out I needed to explicitly set the environment variable NVIDIA_VISIBLE_DEVICES on the container. This variable is set explicitly in the official CUDA images, which explains why I don't have this problem in other OSS AI projects that use those official images as a base. It seems to me that all the other problems reported here have different causes, so I will close this issue. Feel free to open new issues as necessary.
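For anyone hitting the same thing in swarm mode, a sketch of the kind of change that fixed it for me; the service name and the "all" value are placeholders for your own stack, and NVIDIA_DRIVER_CAPABILITIES may or may not be needed depending on your runtime configuration:

# Add the env var to an existing swarm service (hypothetical service name).
$ docker service update \
    --env-add NVIDIA_VISIBLE_DEVICES=all \
    --env-add NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    local-ai

# Or set it when creating the service in the first place.
$ docker service create --name local-ai \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    -p 8080:8080 \
    quay.io/go-skynet/local-ai:v1.22.0-cublas-cuda11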