llama.cpp: Stuck loading VRAM with ROCm multi-GPU

Context

Once launched, it gets stuck while loading the model into VRAM.

My computer is running dual AMD GPUs (a 7900 XTX and a 7900 XT) on Ubuntu 22.04 with ROCm 5.7.

ROCM-SMI Output

========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK   MCLK   Fan     Perf  PwrCap  VRAM%  GPU%  
0    69.0c           26.0W   28Mhz  96Mhz  22.75%  auto  291.0W   67%   0%    
1    50.0c           30.0W   33Mhz  96Mhz  14.9%   auto  282.0W   67%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================

$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Steps to Reproduce

./main -ngl 99 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -mg 0 -p "Write a function in TypeScript that sums numbers"

Failure Logs

Log start
main: build = 1487 (c41ea36)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1699438381
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0
  Device 1: Radeon RX 7900 XT, compute capability 11.0
llama_model_loader: loaded meta data with 19 key-value pairs and 723 tensors from ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_0     [  8192, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q4_0     [ 28672,  8192,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_0     [  8192, 28672,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_0     [  8192, 28672,     1,     1 ]
...
llama_model_loader: - tensor  718:        blk.79.attn_output.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  719:             blk.79.attn_q.weight q4_0     [  8192,  8192,     1,     1 ]
llama_model_loader: - tensor  720:             blk.79.attn_v.weight q4_0     [  8192,  1024,     1,     1 ]
llama_model_loader: - tensor  721:               output_norm.weight f32      [  8192,     1,     1,     1 ]
llama_model_loader: - tensor  722:                    output.weight q6_K     [  8192, 32000,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str     
llama_model_loader: - kv   1:                               general.name str     
llama_model_loader: - kv   2:                       llama.context_length u32     
llama_model_loader: - kv   3:                     llama.embedding_length u32     
llama_model_loader: - kv   4:                          llama.block_count u32     
llama_model_loader: - kv   5:                  llama.feed_forward_length u32     
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32     
llama_model_loader: - kv   7:                 llama.attention.head_count u32     
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32     
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32     
llama_model_loader: - kv  10:                          general.file_type u32     
llama_model_loader: - kv  11:                       tokenizer.ggml.model str     
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr     
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr     
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr     
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32     
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32     
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32     
llama_model_loader: - kv  18:               general.quantization_version u32     
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  561 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 36.20 GiB (4.51 BPW) 
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.26 MB
llm_load_tensors: using ROCm for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Radeon RX 7900 XTX) as main device
llm_load_tensors: mem required  =  140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 36930.11 MB
...........................................................................

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 29 (2 by maintainers)

Most upvoted comments

Must I build with

make clean && LLAMA_HIPBLAS=1 HIP_VISIBLE_DEVICES=1 make -j

OR

make clean && LLAMA_HIPBLAS=1 HIP_VISIBLE_DEVICES=0,1 make -j

for multi-GPU?
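
For what it's worth, a minimal sketch of how the two settings are usually separated (assuming the make-based llama.cpp build of that era, where LLAMA_HIPBLAS=1 is a build-time make variable and HIP_VISIBLE_DEVICES is a ROCm runtime environment variable, so device selection belongs on the launch command rather than the build):

# Build once with the ROCm/hipBLAS backend enabled (build-time flag).
make clean && make -j LLAMA_HIPBLAS=1

# Pick the visible devices at run time: 0,1 exposes both GPUs to llama.cpp.
HIP_VISIBLE_DEVICES=0,1 ./main -ngl 99 -mg 0 -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "Write a function in TypeScript that sums numbers"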

I had the same kind of problems. You have to build it with make; cmake caused the CUDA errors for me. As for the stuck loading, try launching it with --no-mmap; you will need enough RAM or swap for the full model. I had these problems with 2x MI25.
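
As a concrete form of that suggestion, a hedged rerun of the original reproduction command with memory-mapping disabled (--no-mmap is an existing llama.cpp option; nothing else changes from the report above):

./main -ngl 99 -mg 0 --no-mmap -m ../llama_cpp_models/llama-2-70b-chat.Q4_0.gguf -p "Write a function in TypeScript that sums numbers"

With mmap disabled the full ~36 GiB model is read into host memory before being uploaded to VRAM, which is why the comment notes that enough RAM or swap for the whole model is required.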