ROCm: RuntimeError: HIP error: invalid device function - is there an existing solution for this issue?

Server: Inspur NF5280A6, 2 x EPYC 7453 (Milan), 16 DIMMs x 32 GB-3200
GPU: 83:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c1)
OS: CentOS Stream 9.2, kernel: 5.14.0-373.el9.x86_64
ROCm: 5.7
PyTorch installation command: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7

[root@6700xt ~]# dkms status
amdgpu/6.2.4-1652687.el9, 5.14.0-373.el9.x86_64, x86_64: installed (original_module exists)
[root@6700xt ~]#

IPython session:

In [1]: import torch

In [2]: torch.__version__
Out[2]: '2.2.0.dev20231010+rocm5.7'

In [3]: torch.cuda.is_available()
Out[3]: True

In [4]: torch.cuda.device_count()
Out[4]: 1

In [5]: torch.cuda.current_device()
Out[5]: 0

In [6]: torch.cuda.get_device_name(torch.cuda.current_device())
Out[6]: 'AMD Radeon RX 6700 XT'

In [7]: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [8]: device
Out[8]: device(type='cuda')

In [9]: torch.rand(3, 3).to(device)
Out[9]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
    701 stream = StringIO()
    702 printer = pretty.RepresentationPrinter(stream, self.verbose,
    703     self.max_width, self.newline,
    704     max_seq_length=self.max_seq_length,
    705     singleton_pprinters=self.singleton_printers,
    706     type_pprinters=self.type_printers,
    707     deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
    709 printer.flush()
    710 return stream.getvalue()

File /usr/local/lib/python3.9/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /usr/local/lib/python3.9/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File /usr/local/lib64/python3.9/site-packages/torch/_tensor.py:442, in Tensor.__repr__(self, tensor_contents)
    438     return handle_torch_function(
    439         Tensor.__repr__, (self,), self, tensor_contents=tensor_contents
    440     )
    441 # All strings are unicode in Python 3.
--> 442 return torch._tensor_str._str(self, tensor_contents=tensor_contents)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:664, in _str(self, tensor_contents)
    662 with torch.no_grad(), torch.utils._python_dispatch._disable_current_modes():
    663     guard = torch._C._DisableFuncTorch()
--> 664     return _str_intern(self, tensor_contents=tensor_contents)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:595, in _str_intern(inp, tensor_contents)
    593                     tensor_str = _tensor_str(self.to_dense(), indent)
    594                 else:
--> 595                     tensor_str = _tensor_str(self, indent)
    597 if self.layout != torch.strided:
    598     suffixes.append("layout=" + str(self.layout))

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:347, in _tensor_str(self, indent)
    343     return _tensor_str_with_formatter(
    344         self, indent, summarize, real_formatter, imag_formatter
    345     )
    346 else:
--> 347     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    348     return _tensor_str_with_formatter(self, indent, summarize, formatter)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:138, in _Formatter.__init__(self, tensor)
    134         self.max_width = max(self.max_width, len(value_str))
    136 else:
    137     nonzero_finite_vals = torch.masked_select(
--> 138         tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
    139     )
    141     if nonzero_finite_vals.numel() == 0:
    142         # no valid number, do nothing
    143         return

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.


In [10]:

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 22

Most upvoted comments

Setting HSA_OVERRIDE_GFX_VERSION=11.0.0 fixed this issue for me when running on gfx1100. I also have integrated graphics (reported as gfx1036 in rocminfo), which might be what triggers this bug.

Update: it fixed HIP error: invalid device function, and some of the PyTorch examples now run successfully, but others hang indefinitely until a hard reboot (killing the process leaves the GPU at 100% load).

It seems your devices are gfx1032 and gfx1035, but PyTorch is not compiled for them. When you set HSA_OVERRIDE_GFX_VERSION=10.3.0, you tell ROCm to report the devices as gfx1030, which is what PyTorch was compiled for.
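The mismatch described above can be sketched as a simple lookup. The wheel arch list and the family-to-override mapping below are assumptions for illustration — check `torch.cuda.get_arch_list()` on your own install for the real list:

```python
from typing import Optional

# Archs the official ROCm wheels are typically built for (an ASSUMPTION --
# verify with torch.cuda.get_arch_list() on your own install).
WHEEL_ARCHS = {"gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030"}

def needed_override(device_arch: str) -> Optional[str]:
    """Return an HSA_OVERRIDE_GFX_VERSION value if `device_arch` has no
    prebuilt kernels, e.g. 'gfx1032' -> '10.3.0' (nearest supported arch)."""
    base = device_arch.split(":")[0]  # strip feature suffixes like ':sramecc+'
    if base in WHEEL_ARCHS:
        return None  # kernels exist for this arch; no override needed
    if base.startswith("gfx103"):  # RDNA2 family (6700 XT, 6700S, ...)
        return "10.3.0"            # report as gfx1030
    if base.startswith("gfx110"):  # RDNA3 family
        return "11.0.0"            # report as gfx1100
    return None  # unknown family: no safe override to suggest

print(needed_override("gfx1032"))  # -> 10.3.0
print(needed_override("gfx1030"))  # -> None
```

This is only a diagnostic sketch of the logic, not something PyTorch or ROCm ships.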

Did you try?

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm

If not, try each one separately, then some combinations of them.

Did you try?

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm

If not, try each one separately, then some combinations of them.

Thanks, this worked for me. I modified the following:

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.0

My configuration is: 6750 GRE 12G.
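For completeness, the same override can be applied from Python to a child process only, leaving the parent shell untouched. The variable names match the exports above; the `-c` payload is a stand-in for a real training script:

```python
import os
import subprocess
import sys

# Copy the parent environment and add the override only for the child.
env = dict(os.environ)
env["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # report an RDNA2 card as gfx1030
env["HIP_VISIBLE_DEVICES"] = "0"            # expose only the first GPU

# Stand-in child that just echoes the variable back; replace the -c payload
# with your real script, e.g. [sys.executable, "train.py"].
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # -> 10.3.0
```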

I am running into the same "HIP error: invalid device function" while trying to train a model. This ROCm test script says that everything should be working, and the environment variables are set accordingly. Any ideas?

Same issue on Arch Linux with ROCm 5.7.1 and 5.7.0 (from AUR = Ubuntu package), PyTorch 2.0.1, Python 3.11.5, RX 6700S. I can reproduce it with torch.ones(2).to(torch.device(0)) or torch.ones(2).to(torch.device(1)).

Edit: it works with HSA_OVERRIDE_GFX_VERSION=10.3.0 python

here is my rocminfo output:

[trougnouf@l rocm570]$ /opt/rocm/bin/rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 6900HS with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 6900HS with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4935                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6700S                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29679(0x73ef)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2435                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 109                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1035                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2400                               
  BDFID:                   1792                               
  Internal Node ID:        2                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 113                                
  SDMA engine uCode::      37                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1035         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

System: Ubuntu 22.04, Python 3.11.5, ROCm 5.7.1, torch 2.2.0+rocm5.7-cp11-cp11 version10.25, torchvision version10.25

I downloaded the .whl from the official PyTorch website and created a conda environment. Because of a dependency error, I installed the torch .whl without dependencies and installed the needed dependencies separately.

torch.cuda.is_available() returns True,

but when I create a tensor, it crashes.

The error is the same as yours:

RuntimeError: HIP error: invalid device function HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing HIP_LAUNCH_BLOCKING=1. Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

Just put

import os
# Must run before `import torch`; assigning through os.environ also updates
# the C-level environment that the HIP runtime reads.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

at the top of your code.

It's dirty, but it works.


The reason might be that your interpreter runtime didn't pick up your system environment variables.

Force-inject HSA_OVERRIDE_GFX_VERSION at runtime, and it will work.
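A minimal sketch of why `os.environ` is the safer injection point than `os.putenv` (assuming CPython on Linux): `putenv` changes the C-level environment that the HIP runtime's `getenv` reads, but it does not update the `os.environ` mapping, so the setting is hard to verify from Python. `DEMO_VAR` below is a throwaway name for demonstration:

```python
import os
import subprocess
import sys

# putenv writes to the C-level environment only...
os.putenv("DEMO_VAR", "via_putenv")
print("DEMO_VAR" in os.environ)  # -> False: the Python-side mapping is unchanged

# ...yet child processes (and in-process getenv callers, such as the HIP
# runtime when torch loads it) do see the value:
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getenv('DEMO_VAR'))"],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # -> via_putenv

# os.environ assignment updates both sides, so prefer it before `import torch`:
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
```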