exllama: Splitting model on multiple GPUs is broken (ROCm)
Splitting a model between two AMD GPUs (RX 7900 XTX and Radeon VII) results in garbage output (gibberish).
Tested with Llama-2-13B-chat-GPTQ and Llama-2-70B-chat-GPTQ.
Running a model on either one of the two cards produces reasonable output, although I can't vouch for the correctness of the 70B model since it cannot fit on a single card.
No flags seem to affect the results, although if I split the model and use --fused_mlp_thd 0, the following error occurs:
Exception
Traceback (most recent call last):
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/channel.py", line 428, in service
task.service()
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 168, in service
self.execute()
File "/home/luigi/Documents/temp/LLAMAv2/exllama/venv/lib/python3.11/site-packages/waitress/task.py", line 456, in execute
for chunk in app_iter:
File "/usr/lib/python3.11/site-packages/werkzeug/wsgi.py", line 289, in __next__
return self._next()
^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/werkzeug/wrappers/response.py", line 31, in _iter_encoded
for item in iterable:
File "/usr/lib/python3.11/site-packages/flask/helpers.py", line 149, in generator
yield from gen
File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 694, in respond_multi
yield from self.respond(self.participants[1], stop_conditions, total_tokens, res_line, num_res_tokens)
File "/home/luigi/Documents/temp/LLAMAv2/exllama/webui/session.py", line 532, in respond
gen_token = generator.beam_search()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 487, in beam_search
if self.settings.beams == 1 and self.settings.beam_length == 1: return self.gen_single_token()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 341, in gen_single_token
token, _ = self.batched_sample(logits,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 64, in batched_sample
if logits.shape[0] == 1: return self.sample(logits, temperature, top_k, top_p, min_p, typical, num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/luigi/Documents/temp/LLAMAv2/exllama/generator.py", line 147, in sample
sampled_ind = torch.multinomial(top_probs, top_probs.shape[-1] if num == -1 else min(num, top_probs.shape[-1]))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
Building with #146 applied does not seem to change the outcome either.
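For context, that RuntimeError comes from torch.multinomial rejecting a probability tensor that already contains NaN/inf values, so the sampler is only where the problem surfaces; the logits produced by the split forward pass are already bad. A standalone snippet (mine, not exllama code) that triggers the same message:

```python
# Minimal illustration, unrelated to exllama itself: torch.multinomial refuses
# any probability tensor containing NaN/inf, which is exactly what the broken
# forward pass ends up feeding into the sampler.
import torch

probs = torch.tensor([[0.5, float("nan"), 0.3]])
torch.multinomial(probs, 1)
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```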
The system is running Arch Linux with python-pytorch-opt-rocm 2.0.1-7.
Output of rocminfo:
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 3950X 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 3950X 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 131809036(0x7db3f0c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Uuid: GPU-94c2e25f00000000
Marketing Name: AMD Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2304
BDFID: 3584
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: gfx906
Uuid: GPU-ed7030e172da5eba
Marketing Name: AMD Radeon VII
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 26287(0x66af)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1801
BDFID: 4352
Internal Node ID: 2
Compute Unit: 60
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16760832(0xffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx906:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
I am available to do any testing that may help isolate the issue, and I can also try a third card (RX 6800 XT).
I have figured out this is a bug in either rocBLAS or Tensile. I've reported it upstream: ROCmSoftwarePlatform/rocBLAS/issues/1346
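A hedged way to check whether the problem sits below exllama at all is to run a plain fp16 matmul on each ROCm device and look for NaN/inf in the result; the shapes below are arbitrary and may or may not hit the exact rocBLAS/Tensile code path that is broken, so treat it only as a sanity probe:

```python
# Sanity probe, not the upstream repro: run an fp16 matmul on every visible
# ROCm device (exposed as cuda:N by the HIP build of PyTorch) and report
# whether the result contains NaN/inf.
import torch

for i in range(torch.cuda.device_count()):
    dev = f"cuda:{i}"
    a = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
    b = torch.randn(4096, 4096, dtype=torch.float16, device=dev)
    c = a @ b
    bad = torch.isnan(c).any() or torch.isinf(c).any()
    print(dev, torch.cuda.get_device_name(i), "bad values:", bool(bad))
```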
https://github.com/ROCmSoftwarePlatform/flash-attention
I don't think that's it. If it were a problem with moving tensors between devices, you should see the output start out correct on cuda:0 and then go bad as it's moved to a different device. According to this, you get the NaN tensor on cuda:0 already.
Pretty clearly a bug in ROCm or PyTorch; definitely report it upstream.
I would say if the model works on either GPU but not when split across both, you'll want to start by focusing on where the hidden state is moved from one GPU to the next. I assume you've already tried --gpu-peer-fix, but otherwise the _move_tensor() function of model.py should wrap every such copy. If you don't have a debugger (I recommend trying PyCharm, which is free and pretty competent), you could add some debug output:
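For example, a minimal sketch of the kind of check meant here (the wrapper name and signature are illustrative, not exllama's actual _move_tensor()):

```python
# Illustrative only: the idea is to flag the first point at which NaN/inf
# values appear around a cross-device copy of the hidden state.
import torch

def move_tensor_debug(tensor: torch.Tensor, new_device: str, name: str = "") -> torch.Tensor:
    if torch.isnan(tensor).any() or torch.isinf(tensor).any():
        print(f"[{name}] bad values BEFORE copy, device={tensor.device}")
    moved = tensor.to(new_device)
    if torch.isnan(moved).any() or torch.isinf(moved).any():
        print(f"[{name}] bad values AFTER copy, device={new_device}")
    return moved
```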