LocalAI: CUDA inference doesn't work anymore!

LocalAI version: quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg@sha256:f868a3348ca3747843542eeb1391003def43c92e3fafa8d073af9098a41a7edd

I also tried building the image myself; the behaviour is exactly the same.

Environment, CPU architecture, OS, and Version: Linux lxdocker 6.2.16-4-pve #1 SMP PREEMPT_DYNAMIC PVE 6.2.16-5 (2023-07-14T17:53Z) x86_64 GNU/Linux

It's a Proxmox LXC container with Docker running inside it. CUDA inference did work in an earlier version of this project, and llama.cpp still works.

Describe the bug No matter how I configure the model, I can't get inference to run on the GPU. The GPU is recognized, but its VRAM usage stays at 274MiB / 24576MiB.

nvidia-smi does work inside the container.
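
To double-check, I watch VRAM while firing a completion request (the model name below is just a placeholder for whatever the YAML config defines):

# terminal 1: watch VRAM usage inside the container
watch -n 1 nvidia-smi

# terminal 2: fire a completion request at LocalAI
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "guanaco-33b", "prompt": "Hello"}'

VRAM usage never moves from ~274MiB.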

When starting the container, the following message appears:

model name	: 12th Gen Intel(R) Core(TM) i7-1260P
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
@@@@@
7:23PM DBG no galleries to load
7:23PM INF Starting LocalAI using 12 threads, with models path: /models
7:23PM INF LocalAI version: 3f829f1 (3f829f11f57e33e44849282b3f0d123a7bf7ea87)
 ┌───────────────────────────────────────────────────┐ 
 │                   Fiber v2.48.0                   │ 
 │               http://127.0.0.1:8080               │ 
 │       (bound on host 0.0.0.0 and port 8080)       │ 
 │                                                   │ 
 │ Handlers ............ 31  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ................ 14 │ 
 └───────────────────────────────────────────────────┘ 

When I make the completion call, only the CPU takes the load and the response comes back slowly, instead of the GPU being used and the response being fast.

I also ALWAYS get the following message in the logs when I make the API call:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:41227: connect: connection refused"

However, the API call still works; I just can't see what the backend is doing. (Judging from the debug logs, LocalAI dials the backend's gRPC port before the server has finished binding, and a later retry succeeds.)

If I attach to the container, go into the go-llama directory, and make the test call from there:

CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda-12.2/lib64" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10

I get the following output:

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama.cpp: loading model from /models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.14 MB
llama_model_load_internal: mem required  = 19512.67 MB (+ 3124.00 MB per state)
llama_new_context_with_model: kv self size  =  195.00 MB
Model loaded successfully.

As you can see, it is able to find the GPU, but it won't use it. When I write anything to it, only the CPU is used.

In ./examples/main.go I found the -ngl parameter for GPU layers; I tried it with 60 and 70 and it didn't help. Same behaviour!

Finally I ran this:

go run ./examples -m "/models/guanaco-33B.ggmlv3.q4_0.bin" -t 10 -ngl 70 -n 1024

I removed the env-var prefix entirely, and I get the exact same behaviour with the exact same output as above. It is as if go-llama somehow no longer makes use of the GPU.
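
For what it's worth, the binding itself was built with cuBLAS enabled beforehand, roughly like this (BUILD_TYPE=cublas per the go-llama README, assuming that is still the right switch):

# inside the go-llama directory: rebuild the binding with cuBLAS support
BUILD_TYPE=cublas make libbinding.a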

However, the interesting part is this:

If I make this call:

root@lxdocker:/build/go-llama/build/bin# ./main -m /models/guanaco-33B.ggmlv3.q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 1024 -ngl 70

(I copied the whole bash line so the path can be seen too.) It works! The GPU is used and it's super fast, as expected!

Here is the model output from llama.cpp:

main: build = 852 (294f424)
main: seed  = 1690401176
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama.cpp: loading model from /models/guanaco-33B.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size =    0.14 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 2215.41 MB (+ 3124.00 MB per state)
llama_model_load_internal: allocating batch_size x (768 kB + n_ctx x 208 B) = 436 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 60 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 63/63 layers to GPU
llama_model_load_internal: total VRAM used: 20899 MB
llama_new_context_with_model: kv self size  =  780.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 1024, n_keep = 0


 Building a website can be done in 10 simple steps:... (Model answers here to me)

As you can see, llama.cpp is able to use the GPU, but LocalAI somehow isn't. I've been trying to figure this out for several days now with no luck. Sadly I can't code in Go, so I don't really understand what's going on either, and the gRPC layer seems to throw errors yet somehow still work.

I hope someone with more knowledge of how the whole backend is set up can help out. I tried to gather as much information as I could.

I would really love to be able to use this project!

To Reproduce Simply try to run an inference via CUDA; a minimal sketch of my setup follows.
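
(Paths, model file, and image tag are from my environment; gpu_layers and f16 are the fields I believe should trigger offloading.)

# model config dropped into the mounted models directory
cat > models/guanaco-33b.yaml <<EOF
name: guanaco-33b
backend: llama
parameters:
  model: guanaco-33B.ggmlv3.q4_0.bin
f16: true
gpu_layers: 60
EOF

# start LocalAI with GPU access, then make the completion call as above
docker run --rm --gpus all -p 8080:8080 -v $PWD/models:/models \
  quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg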

Expected behavior The GPU should be used, just as plain llama.cpp inside the same image is able to do.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 17 (6 by maintainers)

Most upvoted comments

@mudler I love you man, now it works! I have built your update_rope branch and it works! ❤️

@emakkus can you try to add REBUILD=true to the env vars and see if it persists?
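
(Applied to the docker invocation, that suggestion would look roughly like this; BUILD_TYPE=cublas is my assumption for a CUDA rebuild:)

docker run --rm --gpus all -p 8080:8080 -v $PWD/models:/models \
  -e REBUILD=true -e BUILD_TYPE=cublas \
  quay.io/go-skynet/local-ai:sha-72e3e23-cublas-cuda12-ffmpeg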

@larkinwc is that a precompiled binary or did you compile it locally?

That was a local build with just cublas enabled. I had pulled the latest from git. I will try later with a different version tag/label to see if I can get it working there.

Weird. I could finally reproduce it on another box - I'll try to have a look at it later today.

I think I am able to replicate the issue with a fresh VM in GCP: a g2-standard-4 instance with 1x NVIDIA L4. The OS image is common-gpu-debian-11-py310.

       _,met$$$$$gg.          l_williams_capone@gpu-test-2 
    ,g$$$$$$$$$$$$$$$P.       ---------------------------- 
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 11 (bullseye) x86_64 
 ,$$P'              `$$$.     Host: Google Compute Engine 
',$$P       ,ggs.     `$$b:   Kernel: 5.10.0-23-cloud-amd64 
`d$$'     ,$P"'   .    $$$    Uptime: 8 hours, 11 mins 
 $$P      d$'     ,    $$P    Packages: 704 (dpkg) 
 $$:      $$.   -    ,d$$'    Shell: bash 5.1.4 
 $$;      Y$b._   _,d$P'      Terminal: /dev/pts/0 
 Y$$.    `.`"Y$$$$P"'         CPU: Intel Xeon (4) @ 2.200GHz 
 `$$b      "-.__              GPU: NVIDIA 00:03.0 NVIDIA Corporation Device 27b8 
  `Y$$                        Memory: 448MiB / 16008MiB 
   `Y$$.
     `$$b.                                            
       `Y$$b.                                         
          `"Y$b._
              `"""

Output from trying to execute a model using GPU acceleration:

l_williams_capone@gpu-test-2:~/LocalAI$ sudo ./local-ai --address :8081 --debug 
1:49AM DBG no galleries to load
1:49AM INF Starting LocalAI using 4 threads, with models path: /home/l_williams_capone/LocalAI/models
1:49AM INF LocalAI version: v1.22.0-6-gc79ddd6 (c79ddd6fc4cbd6eb64ed2a8220176ce7cbf40b6e)
1:49AM DBG Model: baichuan-7b (config: {PredictionOptions:{Model: Language: N:0 TopP:0 TopK:0 Temperature:0 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name:baichuan-7b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 F16:false NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
1:49AM DBG Model: openllama-7b (config: {PredictionOptions:{Model:open-llama-7b-q4_0 Language: N:0 TopP:0.7 TopK:80 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:true IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name:openllama-7b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 F16:false NUMA:false Threads:0 Debug:false Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:})
1:49AM DBG Extracting backend assets files to /tmp/localai/backend_data

 ┌───────────────────────────────────────────────────┐ 
 │                   Fiber v2.48.0                   │ 
 │               http://127.0.0.1:8081               │ 
 │       (bound on host 0.0.0.0 and port 8081)       │ 
 │                                                   │ 
 │ Handlers ............ 32  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ............. 85347 │ 
 └───────────────────────────────────────────────────┘ 

1:49AM DBG Request received: 
1:49AM DBG `input`: &{PredictionOptions:{Model:openllama-7b Language: N:0 TopP:0 TopK:0 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Context:context.Background.WithCancel Cancel:0x4b9060 File: ResponseFormat: Size: Prompt:1 Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil>}
1:49AM DBG Parameter Config: &{PredictionOptions:{Model:open-llama-7b-q4_0 Language: N:0 TopP:0.7 TopK:80 Temperature:0.7 Maxtokens:0 Echo:false Batch:0 F16:true IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0} Name:openllama-7b StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:0 F16:false NUMA:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat:openllama-chat ChatMessage: Completion:openllama-completion Edit: Functions:} MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:35 MMap:false MMlock:false LowVRAM:false TensorSplit: MainGPU: ImageGenerationAssets: PromptCachePath: PromptCacheAll:false PromptCacheRO:false Grammar: PromptStrings:[1] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} SystemPrompt:}
1:49AM DBG Template found, input modified to: Q: Complete the following text: 1\nA: 

1:49AM DBG Loading model llama from open-llama-7b-q4_0
1:49AM DBG Loading model in memory from file: /home/l_williams_capone/LocalAI/models/open-llama-7b-q4_0
1:49AM DBG Loading GRPC Model llama: {backendString:llama modelFile:open-llama-7b-q4_0 threads:4 assetDir:/tmp/localai/backend_data context:0xc0000ae010 gRPCOptions:0xc000220ab0 externalBackends:map[]}
1:49AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama
1:49AM DBG GRPC Service for open-llama-7b-q4_0 will be running at: '127.0.0.1:40753'
1:49AM DBG GRPC Service state dir: /tmp/go-processmanager973188895
1:49AM DBG GRPC Service Started
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:40753: connect: connection refused"
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 2023/07/27 01:49:23 gRPC Server listening at 127.0.0.1:40753
1:49AM DBG GRPC Service Ready
1:49AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:/home/l_williams_capone/LocalAI/models/open-llama-7b-q4_0 ContextSize:0 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:35 MainGPU: TensorSplit: Threads:4 LibrarySearchPath:}
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr ggml_init_cublas: found 1 CUDA devices:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr   Device 0: NVIDIA L4, compute capability 8.9
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama.cpp: loading model from /home/l_williams_capone/LocalAI/models/open-llama-7b-q4_0
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: format     = ggjt v3 (latest)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_vocab    = 32000
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_ctx      = 512
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_embd     = 4096
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_mult     = 256
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_head     = 32
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_head_kv  = 32
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_layer    = 32
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_rot      = 128
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_gqa      = 1
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: n_ff       = 11008
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: freq_base  = 0.0
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: freq_scale = 5.60519e-44
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: ftype      = 2 (mostly Q4_0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: model size = 7B
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: ggml ctx size = 3615.73 MB
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: using CUDA for GPU acceleration
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: mem required  =  372.40 MB (+  512.00 MB per state)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 288 MB VRAM for the scratch buffer
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: offloading 32 repeating layers to GPU
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: offloading non-repeating layers to GPU
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: offloading v cache to GPU
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: offloading k cache to GPU
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: offloaded 35/35 layers to GPU
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_model_load_internal: total VRAM used: 4090 MB
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr llama_new_context_with_model: kv self size  =  512.00 MB
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr fatal error: unexpected signal during runtime execution
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr [signal SIGSEGV: segmentation violation code=0x1 addr=0x100 pc=0x7fc7e53c3ab8]
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime stack:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.throw({0x9aa8d8?, 0x0?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0x7ffcbf547748 sp=0x7ffcbf547718 pc=0x45587d
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.sigpanic()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/signal_unix.go:825 +0x3e9 fp=0x7ffcbf5477a8 sp=0x7ffcbf547748 pc=0x46bd29
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 38 [syscall]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.cgocall(0x8182d0, 0xc0001157d8)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/cgocall.go:157 +0x5c fp=0xc0001157b0 sp=0xc000115778 pc=0x42499c
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/go-llama%2ecpp._Cfunc_load_model(0x2971f10, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x23, 0x200, ...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     _cgo_gotypes.go:238 +0x4d fp=0xc0001157d8 sp=0xc0001157b0 pc=0x81110d
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/go-llama%2ecpp.New({0xc00011a680, 0x39}, {0xc000072b00, 0x5, 0x9011e0?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/go-llama/llama.go:26 +0x257 fp=0xc0001158e0 sp=0xc0001157d8 pc=0x8117f7
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/LocalAI/pkg/grpc/llm/llama.(*LLM).Load(0xc0000142a0, 0xc000196750)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/pkg/grpc/llm/llama/llama.go:52 +0x66d fp=0xc0001159a8 sp=0xc0001158e0 pc=0x814fed
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/LocalAI/pkg/grpc.(*server).LoadModel(0x97d740?, {0xc000196750?, 0x5dbdc6?}, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/pkg/grpc/server.go:42 +0x28 fp=0xc000115a10 sp=0xc0001159a8 pc=0x817248
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/LocalAI/pkg/grpc/proto._Backend_LoadModel_Handler({0x95ed40?, 0xc00005fd20}, {0xa428f0, 0xc00017ed80}, 0xc000161d50, 0x0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/pkg/grpc/proto/backend_grpc.pb.go:236 +0x170 fp=0xc000115a68 sp=0xc000115a10 pc=0x80e4f0
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).processUnaryRPC(0xc00017c1e0, {0xa45578, 0xc000082340}, 0xc0000ea000, 0xc00017ea20, 0xd85a10, 0x0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:1337 +0xdf3 fp=0xc000115e48 sp=0xc000115a68 pc=0x7f7393
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).handleStream(0xc00017c1e0, {0xa45578, 0xc000082340}, 0xc0000ea000, 0x0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:1714 +0xa36 fp=0xc000115f68 sp=0xc000115e48 pc=0x7fc4b6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).serveStreams.func1.1()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:959 +0x98 fp=0xc000115fe0 sp=0xc000115f68 pc=0x7f4d98
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000115fe8 sp=0xc000115fe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by google.golang.org/grpc.(*Server).serveStreams.func1
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:957 +0x18c
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 1 [IO wait]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc00018fb68 sp=0xc00018fb48 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.netpollblock(0xc00018fbf8?, 0x42402f?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/netpoll.go:527 +0xf7 fp=0xc00018fba0 sp=0xc00018fb68 pc=0x450f17
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.runtime_pollWait(0x7fc791cd3ef8, 0x72)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/netpoll.go:306 +0x89 fp=0xc00018fbc0 sp=0xc00018fba0 pc=0x482009
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*pollDesc).wait(0xc00015e280?, 0x0?, 0x0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 fp=0xc00018fbe8 sp=0xc00018fbc0 pc=0x4f0312
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*pollDesc).waitRead(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*FD).Accept(0xc00015e280)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_unix.go:614 +0x2bd fp=0xc00018fc90 sp=0xc00018fbe8 pc=0x4f5c1d
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*netFD).accept(0xc00015e280)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/net/fd_unix.go:172 +0x35 fp=0xc00018fd48 sp=0xc00018fc90 pc=0x607115
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*TCPListener).accept(0xc000012618)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/net/tcpsock_posix.go:148 +0x25 fp=0xc00018fd70 sp=0xc00018fd48 pc=0x61f985
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*TCPListener).Accept(0xc000012618)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/net/tcpsock.go:297 +0x3d fp=0xc00018fda0 sp=0xc00018fd70 pc=0x61ea7d
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).Serve(0xc00017c1e0, {0xa42180?, 0xc000012618})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:821 +0x475 fp=0xc00018fee8 sp=0xc00018fda0 pc=0x7f39b5
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr github.com/go-skynet/LocalAI/pkg/grpc.StartServer({0x7ffcbf5678b4?, 0xc000024190?}, {0xa44af0?, 0xc0000142a0})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/pkg/grpc/server.go:121 +0x125 fp=0xc00018ff50 sp=0xc00018fee8 pc=0x817de5
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr main.main()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/LocalAI/cmd/grpc/llama/main.go:22 +0x85 fp=0xc00018ff80 sp=0xc00018ff50 pc=0x817f45
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.main()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc00018ffe0 sp=0xc00018ff80 pc=0x4581a7
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc00018ffe8 sp=0xc00018ffe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 2 [force gc (idle)]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000048fb0 sp=0xc000048f90 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goparkunlock(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:387
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.forcegchelper()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:305 +0xb0 fp=0xc000048fe0 sp=0xc000048fb0 pc=0x458410
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000048fe8 sp=0xc000048fe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by runtime.init.6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:293 +0x25
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 3 [GC sweep wait]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000049780 sp=0xc000049760 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goparkunlock(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:387
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.bgsweep(0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgcsweep.go:278 +0x8e fp=0xc0000497c8 sp=0xc000049780 pc=0x4447ce
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gcenable.func1()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgc.go:178 +0x26 fp=0xc0000497e0 sp=0xc0000497c8 pc=0x439a86
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000497e8 sp=0xc0000497e0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by runtime.gcenable
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgc.go:178 +0x6b
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 4 [GC scavenge wait]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0xc000070000?, 0xa3b380?, 0x1?, 0x0?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000049f70 sp=0xc000049f50 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goparkunlock(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:387
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.(*scavengerState).park(0xdd1b20)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgcscavenge.go:400 +0x53 fp=0xc000049fa0 sp=0xc000049f70 pc=0x4426f3
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.bgscavenge(0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgcscavenge.go:628 +0x45 fp=0xc000049fc8 sp=0xc000049fa0 pc=0x442cc5
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gcenable.func2()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgc.go:179 +0x26 fp=0xc000049fe0 sp=0xc000049fc8 pc=0x439a26
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000049fe8 sp=0xc000049fe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by runtime.gcenable
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mgc.go:179 +0xaa
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 5 [finalizer wait]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0x1a0?, 0xdd2040?, 0x60?, 0x78?, 0xc000048770?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000048628 sp=0xc000048608 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.runfinq()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mfinal.go:193 +0x107 fp=0xc0000487e0 sp=0xc000048628 pc=0x438ac7
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000487e8 sp=0xc0000487e0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by runtime.createfing
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/mfinal.go:163 +0x45
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 35 [select]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0xc000271f00?, 0x2?, 0xc3?, 0x3a?, 0xc000271ed4?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000271d60 sp=0xc000271d40 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.selectgo(0xc000271f00, 0xc000271ed0, 0x629ea9?, 0x0, 0xc0000b2000?, 0x1)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/select.go:327 +0x7be fp=0xc000271ea0 sp=0xc000271d60 pc=0x4681be
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0000c2050, 0x1)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/controlbuf.go:418 +0x115 fp=0xc000271f30 sp=0xc000271ea0 pc=0x768e95
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc00022e2a0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/controlbuf.go:552 +0x91 fp=0xc000271f90 sp=0xc000271f30 pc=0x769611
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.NewServerTransport.func2()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:341 +0xda fp=0xc000271fe0 sp=0xc000271f90 pc=0x780ffa
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000271fe8 sp=0xc000271fe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by google.golang.org/grpc/internal/transport.NewServerTransport
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:338 +0x1bb3
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 36 [select]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0xc0000e0770?, 0x4?, 0x10?, 0x0?, 0xc0000e06c0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0000e0508 sp=0xc0000e04e8 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.selectgo(0xc0000e0770, 0xc0000e06b8, 0x0?, 0x0, 0x0?, 0x1)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/select.go:327 +0x7be fp=0xc0000e0648 sp=0xc0000e0508 pc=0x4681be
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc000082340)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:1155 +0x233 fp=0xc0000e07c8 sp=0xc0000e0648 pc=0x7886d3
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.NewServerTransport.func4()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:344 +0x26 fp=0xc0000e07e0 sp=0xc0000e07c8 pc=0x780ee6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0000e07e8 sp=0xc0000e07e0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by google.golang.org/grpc/internal/transport.NewServerTransport
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:344 +0x1bf8
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr 
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr goroutine 37 [IO wait]:
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.gopark(0x100000008?, 0xb?, 0x0?, 0x0?, 0x6?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000056aa0 sp=0xc000056a80 pc=0x4585d6
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.netpollblock(0x4d5745?, 0x42402f?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/netpoll.go:527 +0xf7 fp=0xc000056ad8 sp=0xc000056aa0 pc=0x450f17
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.runtime_pollWait(0x7fc791cd3e08, 0x72)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/netpoll.go:306 +0x89 fp=0xc000056af8 sp=0xc000056ad8 pc=0x482009
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*pollDesc).wait(0xc000094080?, 0xc0000aa000?, 0x0)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32 fp=0xc000056b20 sp=0xc000056af8 pc=0x4f0312
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*pollDesc).waitRead(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr internal/poll.(*FD).Read(0xc000094080, {0xc0000aa000, 0x8000, 0x8000})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/internal/poll/fd_unix.go:167 +0x299 fp=0xc000056bb8 sp=0xc000056b20 pc=0x4f16f9
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*netFD).Read(0xc000094080, {0xc0000aa000?, 0x1060100000000?, 0x8?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/net/fd_posix.go:55 +0x29 fp=0xc000056c00 sp=0xc000056bb8 pc=0x604f89
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*conn).Read(0xc0000a6000, {0xc0000aa000?, 0x50?, 0x0?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/net/net.go:183 +0x45 fp=0xc000056c48 sp=0xc000056c00 pc=0x616ac5
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr net.(*TCPConn).Read(0x800010601?, {0xc0000aa000?, 0x0?, 0xc000056ca8?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     <autogenerated>:1 +0x29 fp=0xc000056c78 sp=0xc000056c48 pc=0x629ba9
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr bufio.(*Reader).Read(0xc0000a00c0, {0xc0000c4040, 0x9, 0x0?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/bufio/bufio.go:237 +0x1bb fp=0xc000056cb0 sp=0xc000056c78 pc=0x57b97b
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr io.ReadAtLeast({0xa3ec00, 0xc0000a00c0}, {0xc0000c4040, 0x9, 0x9}, 0x9)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/io/io.go:332 +0x9a fp=0xc000056cf8 sp=0xc000056cb0 pc=0x4cf6ba
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr io.ReadFull(...)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/io/io.go:351
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr golang.org/x/net/http2.readFrameHeader({0xc0000c4040?, 0x9?, 0xc0000a4060?}, {0xa3ec00?, 0xc0000a00c0?})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/golang.org/x/net@v0.12.0/http2/frame.go:237 +0x6e fp=0xc000056d48 sp=0xc000056cf8 pc=0x7290ce
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr golang.org/x/net/http2.(*Framer).ReadFrame(0xc0000c4000)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/golang.org/x/net@v0.12.0/http2/frame.go:498 +0x95 fp=0xc000056df8 sp=0xc000056d48 pc=0x729915
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc000082340, 0x0?, 0x0?)
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/internal/transport/http2_server.go:642 +0x167 fp=0xc000056f10 sp=0xc000056df8 pc=0x784327
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).serveStreams(0xc00017c1e0, {0xa45578?, 0xc000082340})
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:946 +0x162 fp=0xc000056f80 sp=0xc000056f10 pc=0x7f4ae2
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr google.golang.org/grpc.(*Server).handleRawConn.func1()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:889 +0x46 fp=0xc000056fe0 sp=0xc000056f80 pc=0x7f4386
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr runtime.goexit()
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000056fe8 sp=0xc000056fe0 pc=0x487461
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr created by google.golang.org/grpc.(*Server).handleRawConn
1:49AM DBG GRPC(open-llama-7b-q4_0-127.0.0.1:40753): stderr     /home/l_williams_capone/go/pkg/mod/google.golang.org/grpc@v1.56.2/server.go:888 +0x185
[66.68.171.135]:50325  500  -  POST     /v1/completions