llama.cpp: Stuck loading VRAM with ROCm multi-GPU
Context
Once launched, it gets stuck while loading the model into VRAM.
My machine runs dual AMD GPUs (a 7900 XTX and a 7900 XT) on Ubuntu 22.04 with ROCm 5.7.
ROCM-SMI Output
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU Temp (DieEdge) AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 69.0c 26.0W 28Mhz 96Mhz 22.75% auto 291.0W 67% 0%
1 50.0c 30.0W 33Mhz 96Mhz 14.9% auto 282.0W 67% 0%
====================================================================================
=============================== End of ROCm SMI Log ================================
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
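For completeness, a quick way to confirm the ROCm 5.7 toolchain itself sees both cards (paths assume the default /opt/rocm install):
$ /opt/rocm/bin/rocminfo | grep -i gfx
$ /opt/rocm/bin/hipcc --version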
Steps to Reproduce
./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"
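For reference, -ngl 99 offloads all layers and -mg 0 makes device 0 (the XTX) the main GPU; the split between the two cards can also be forced explicitly with --tensor-split, e.g. (the 60,40 ratio below is only an illustration, not what was actually run):
./main -ngl 99 -mg 0 -ts 60,40 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "Write a function in TypeScript that sums numbers"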
Failure Logs
Log start
main: build = 1487 (c41ea36)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1699438381
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
Device 0: Radeon RX 7900 XTX, compute capability 11.0
Device 1: Radeon RX 7900 XT, compute capability 11.0
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 8192, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.ffn_down.weight q4_0 [ 28672, 8192, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
...
llama_model_loader: - tensor 718: blk.79.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 719: blk.79.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 720: blk.79.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 721: output_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 722: output.weight q6_K [ 8192, 32000, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 18: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 561 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 68.98 B
llm_load_print_meta: model size = 36.20 GiB (4.51 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.26 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required = 140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 36930.11 MB
...........................................................................
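Loading stalls at the dots above. A quick way to check whether VRAM is still slowly filling or is truly stuck is to watch rocm-smi from a second terminal (--showmeminfo is its per-type memory report):
$ watch -n 1 rocm-smi --showmeminfo vram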
About this issue
- State: closed
- Created 8 months ago
- Comments: 29 (2 by maintainers)
Must I build with
OR
for multi-GPU?
I had the same kind of problems too. You have to build it with make; cmake caused the CUDA errors for me. As for the stuck loading, try launching with --no-mmap; you will need enough RAM or swap for the full model. I had these problems with 2x MI25 cards.
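For what it's worth, a minimal sketch of the make-based hipBLAS build plus a --no-mmap launch, assuming the LLAMA_HIPBLAS make flag, the GPU_TARGETS variable, and the ROCm clang paths from the llama.cpp README of that era (gfx1100 matches the 7900 XTX/XT; adjust for your cards):
# build with make (not cmake), using ROCm's clang and hipBLAS
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  make LLAMA_HIPBLAS=1 GPU_TARGETS=gfx1100 -j

# load without mmap: the full ~36 GiB model is read into RAM (or swap) before the VRAM copy
./main -ngl 99 -mg 0 --no-mmap \
  -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf \
  -p "Write a function in TypeScript that sums numbers"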