ROCm: RuntimeError: HIP error: invalid device function - is there an existing solution for this issue?

Server: Inspur NF5280A6, 2 x EPYC 7453 (Milan), 16 DIMMs x 32 GB-3200
GPU: 83:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M/6850M XT] (rev c1)
OS: CentOS Stream 9.2, kernel: 5.14.0-373.el9.x86_64
ROCm: 5.7
PyTorch installation command: pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.7

[root@6700xt ~]# dkms status
amdgpu/6.2.4-1652687.el9, 5.14.0-373.el9.x86_64, x86_64: installed (original_module exists)
[root@6700xt ~]#

IPython session:

In [1]: import torch

In [2]: torch.__version__
Out[2]: '2.2.0.dev20231010+rocm5.7'

In [3]: torch.cuda.is_available()
Out[3]: True

In [4]: torch.cuda.device_count()
Out[4]: 1

In [5]: torch.cuda.current_device()
Out[5]: 0

In [6]: torch.cuda.get_device_name(torch.cuda.current_device())
Out[6]: 'AMD Radeon RX 6700 XT'

In [7]: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [8]: device
Out[8]: device(type='cuda')

In [9]: torch.rand(3, 3).to(device)
Out[9]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File /usr/local/lib/python3.9/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
    701 stream = StringIO()
    702 printer = pretty.RepresentationPrinter(stream, self.verbose,
    703     self.max_width, self.newline,
    704     max_seq_length=self.max_seq_length,
    705     singleton_pprinters=self.singleton_printers,
    706     type_pprinters=self.type_printers,
    707     deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
    709 printer.flush()
    710 return stream.getvalue()

File /usr/local/lib/python3.9/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /usr/local/lib/python3.9/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File /usr/local/lib64/python3.9/site-packages/torch/_tensor.py:442, in Tensor.__repr__(self, tensor_contents)
    438     return handle_torch_function(
    439         Tensor.__repr__, (self,), self, tensor_contents=tensor_contents
    440     )
    441 # All strings are unicode in Python 3.
--> 442 return torch._tensor_str._str(self, tensor_contents=tensor_contents)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:664, in _str(self, tensor_contents)
    662 with torch.no_grad(), torch.utils._python_dispatch._disable_current_modes():
    663     guard = torch._C._DisableFuncTorch()
--> 664     return _str_intern(self, tensor_contents=tensor_contents)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:595, in _str_intern(inp, tensor_contents)
    593                     tensor_str = _tensor_str(self.to_dense(), indent)
    594                 else:
--> 595                     tensor_str = _tensor_str(self, indent)
    597 if self.layout != torch.strided:
    598     suffixes.append("layout=" + str(self.layout))

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:347, in _tensor_str(self, indent)
    343     return _tensor_str_with_formatter(
    344         self, indent, summarize, real_formatter, imag_formatter
    345     )
    346 else:
--> 347     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    348     return _tensor_str_with_formatter(self, indent, summarize, formatter)

File /usr/local/lib64/python3.9/site-packages/torch/_tensor_str.py:138, in _Formatter.__init__(self, tensor)
    134         self.max_width = max(self.max_width, len(value_str))
    136 else:
    137     nonzero_finite_vals = torch.masked_select(
--> 138         tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
    139     )
    141     if nonzero_finite_vals.numel() == 0:
    142         # no valid number, do nothing
    143         return

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.


In [10]:

About this issue

  • State: open
  • Created 9 months ago
  • Comments: 22

Most upvoted comments

Setting HSA_OVERRIDE_GFX_VERSION=11.0.0 fixed this issue for me when running on gfx1100. I also have integrated graphics (reported as gfx1036 in rocminfo), which might be what triggers this bug.

Update: it fixed HIP error: invalid device function, and some of the PyTorch examples now run successfully, but others hang indefinitely until a hard reboot (killing the process leaves the GPU at 100% load).

It seems your devices are gfx1032 and gfx1035, but PyTorch is not compiled for them. When you set HSA_OVERRIDE_GFX_VERSION=10.3.0, you tell ROCm to report the devices as gfx1030, which is what PyTorch was compiled for.
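The mismatch described above can be sketched as a simple lookup. The wheel arch list and the family-to-override mapping below are assumptions for illustration — check `torch.cuda.get_arch_list()` on your own install for the real list:

```python
from typing import Optional

# Archs the official ROCm wheels are typically built for (an ASSUMPTION --
# verify with torch.cuda.get_arch_list() on your own install).
WHEEL_ARCHS = {"gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030"}

def needed_override(device_arch: str) -> Optional[str]:
    """Return an HSA_OVERRIDE_GFX_VERSION value if `device_arch` has no
    prebuilt kernels, e.g. 'gfx1032' -> '10.3.0' (nearest supported arch)."""
    base = device_arch.split(":")[0]  # strip feature suffixes like ':sramecc+'
    if base in WHEEL_ARCHS:
        return None  # kernels exist for this arch; no override needed
    if base.startswith("gfx103"):  # RDNA2 family (6700 XT, 6700S, ...)
        return "10.3.0"            # report as gfx1030
    if base.startswith("gfx110"):  # RDNA3 family
        return "11.0.0"            # report as gfx1100
    return None  # unknown family: no safe override to suggest

print(needed_override("gfx1032"))  # -> 10.3.0
print(needed_override("gfx1030"))  # -> None
```

This is only a diagnostic sketch of the logic, not something PyTorch or ROCm ships.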

Did you try?

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm

If not, try each one separately, then some combinations of them.

Did you try?

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm

If not, try each one separately, then some combinations of them.

Thanks, this worked for me. I modified the following:

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.0

My configuration is: 6750 GRE 12G.
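For completeness, the same override can be applied from Python to a child process only, leaving the parent shell untouched. The variable names match the exports above; the `-c` payload is a stand-in for a real training script:

```python
import os
import subprocess
import sys

# Copy the parent environment and add the override only for the child.
env = dict(os.environ)
env["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"  # report an RDNA2 card as gfx1030
env["HIP_VISIBLE_DEVICES"] = "0"            # expose only the first GPU

# Stand-in child that just echoes the variable back; replace the -c payload
# with your real script, e.g. [sys.executable, "train.py"].
result = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # -> 10.3.0
```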

I am running into the same "HIP error: invalid device function" while trying to train a model. This ROCm test script says that everything should be working, and the environment variables are set accordingly. Any ideas?

Same issue on Arch Linux with ROCm 5.7.1 and 5.7.0 (from AUR = Ubuntu package), PyTorch 2.0.1, Python 3.11.5, RX 6700S. I can reproduce it with torch.ones(2).to(torch.device(0)) or torch.ones(2).to(torch.device(1)).

Edit: it works with HSA_OVERRIDE_GFX_VERSION=10.3.0 python

here is my rocminfo output:

[trougnouf@l rocm570]$ /opt/rocm/bin/rocminfo 
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 6900HS with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 6900HS with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4935                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    40277244(0x26694fc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1032                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6700S                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
    L3:                      32768(0x8000) KB                   
  Chip ID:                 29679(0x73ef)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2435                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            28                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 109                                
  SDMA engine uCode::      76                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1032         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1035                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5761(0x1681)                       
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2400                               
  BDFID:                   1792                               
  Internal Node ID:        2                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 113                                
  SDMA engine uCode::      37                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1035         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

System: Ubuntu 22.04, Python 3.11.5, ROCm 5.7.1, torch 2.2.0+rocm5.7-cp11-cp11 version10.25, torchvision version10.25

I downloaded the .whl from the official PyTorch website and created a conda environment. Because of a dependency error, I installed the torch .whl without dependencies and installed the needed dependencies separately.

torch.cuda.is_available() returns True,

but when I create a tensor, it crashes.

The error is the same as yours:

RuntimeError: HIP error: invalid device function HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing HIP_LAUNCH_BLOCKING=1. Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

Just put

import os
# Must run before `import torch`; assigning through os.environ also updates
# the C-level environment that the HIP runtime reads.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"

at the top of your code.

It's dirty, but it works.


The reason might be that your interpreter runtime didn't pick up your system environment variables.

Force-inject HSA_OVERRIDE_GFX_VERSION at runtime, and it will work.
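A minimal sketch of why `os.environ` is the safer injection point than `os.putenv` (assuming CPython on Linux): `putenv` changes the C-level environment that the HIP runtime's `getenv` reads, but it does not update the `os.environ` mapping, so the setting is hard to verify from Python. `DEMO_VAR` below is a throwaway name for demonstration:

```python
import os
import subprocess
import sys

# putenv writes to the C-level environment only...
os.putenv("DEMO_VAR", "via_putenv")
print("DEMO_VAR" in os.environ)  # -> False: the Python-side mapping is unchanged

# ...yet child processes (and in-process getenv callers, such as the HIP
# runtime when torch loads it) do see the value:
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getenv('DEMO_VAR'))"],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # -> via_putenv

# os.environ assignment updates both sides, so prefer it before `import torch`:
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.0"
```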