llama.cpp: Freeze after offloading layers to GPU

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

llama.cpp does not freeze and continues to run normally, without interfering with basic Windows operations.

Current Behavior

llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q5_K - Medium
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token      = 1 '<s>'
llm_load_print_meta: EOS token      = 2 '</s>'
llm_load_print_meta: UNK token      = 0 '<unk>'
llm_load_print_meta: LF token       = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 35995.03 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/83 layers to GPU
llm_load_tensors: VRAM used: 10500 MB
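
Back-of-the-envelope, from the numbers above: 10500 MB of VRAM for 18 repeating layers works out to roughly 10500 / 18 ≈ 583 MB per layer, so -ngl 18 is already close to the 11 GB on an RTX 2080 Ti, and the remaining ~36 GB of weights stay in system RAM.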

llama.cpp then freezes and stops responding. Task Manager shows 0% CPU and GPU load, and the process cannot be killed from Task Manager either, so I have to hard-reset the computer to end it. It also causes general system instability: I am writing this with my desktop blacked out and File Explorer frozen.

Environment and Context

  • OS: Windows 10
  • CPU: Threadripper 3970X
  • RAM: 128 GB
  • GPU: RTX 2080 Ti
  • CMake: 3.27.4
  • CUDA: 12.2
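
The versions above can be re-checked with the standard tools (nothing llama.cpp-specific):

    cmake --version
    nvcc --version
    nvidia-smi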

Failure Information (for bugs)


Steps to Reproduce


Run a model with a cuBLAS build. My exact command:

    main -ngl 18 -m E:\largefiles\LLAMA-2\70B\uni-tianyan-70b.Q5_K_M.gguf --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first
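
For completeness, the binary was built following the README's cuBLAS instructions; if I remember the flags right, it was roughly:

    mkdir build
    cd build
    cmake .. -DLLAMA_CUBLAS=ON
    cmake --build . --config Release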

Failure Logs

I’d love to attach them, but File Explorer stopped working. I’ll try to run it again tomorrow and upload the log before everything freezes.
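
In the meantime, one way to keep a log even if the desktop dies is plain shell redirection; assuming the loader output above goes to stderr (as it normally does for main), something like this should keep the interactive prompt usable while writing the load log to a file:

    main -ngl 18 -m E:\largefiles\LLAMA-2\70B\uni-tianyan-70b.Q5_K_M.gguf --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first 2> run.log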

About this issue

  • Original URL
  • State: closed
  • Created 10 months ago
  • Reactions: 2
  • Comments: 27 (6 by maintainers)

Most upvoted comments

Oh, it’s not frozen, just very slow. It just generated its first token (it has been running for around 2 hours now): “1”.