ROCm: `hipMemcpy` from device to host stuck on device 1

Description

Our workstation has two Radeon RX 7900 XTX cards. The following code hangs on the hipMemcpy call near the end.

#include <hip/hip_runtime.h>

#include <cstdlib>
#include <iostream>
#include <vector>

constexpr int error_exit_code = -1;

/// \brief Checks if the provided error code is \p hipSuccess and if not,
/// prints an error message to the standard error output and terminates the program
/// with an error code.
#define HIP_CHECK(condition)                                                                \
    {                                                                                       \
        const hipError_t error = condition;                                                 \
        if(error != hipSuccess)                                                             \
        {                                                                                   \
            std::cerr << "An error encountered: \"" << hipGetErrorString(error) << "\" at " \
                      << __FILE__ << ':' << __LINE__ << std::endl;                          \
            std::exit(error_exit_code);                                                     \
        }                                                                                   \
    }

int main(const int argc, const char** argv)
{
    int                dev_id = 1;      // <==== use device id 1
    float*             d_a{};
    std::vector<float> h_a(10);
    HIP_CHECK(hipSetDevice(dev_id));

    HIP_CHECK(hipMalloc(&d_a, 10 * sizeof(float)));

    std::cout << "start to copy from device to host\n";
    HIP_CHECK(hipMemcpy(h_a.data(), d_a, sizeof(float) * 10, hipMemcpyDeviceToHost));    //  <==== stuck here
    std::cout << "finish copying from device to host\n";
    return 0;
}

The code simply allocates memory on GPU 1 and copies its contents back to the host; it hangs on the hipMemcpy call.

NOTE: If we use GPU 0 by setting dev_id to 0, the code works and does not hang on hipMemcpy. The hang occurs only on GPU 1. Setting the environment variable ROCR_VISIBLE_DEVICES=1 and using dev_id = 0 does not resolve the issue.

We also tested the official matrix multiplication example at https://github.com/amd/rocm-examples/blob/develop/HIP-Basic/matrix_multiplication/main.hip. Unmodified, the program runs on GPU 0 and works smoothly. When we switch to GPU 1, using either hipSetDevice(1) or ROCR_VISIBLE_DEVICES=1, the program hangs. So we suspect an internal ROCm issue on systems with multiple GPUs.


Workstation Environment

About this issue

  • State: closed
  • Created 10 months ago
  • Reactions: 2
  • Comments: 15

Most upvoted comments

  1. It seems to be stuck while copying the code object from host to device. Try disabling the power features when the amdgpu driver loads, then rerun your test.
$ sudo vim /etc/default/grub
add "amdgpu.ppfeaturemask=0xffff3fff amdgpu.runpm=0x0" into GRUB_CMDLINE_LINUX_DEFAULT
$ sudo update-grub
$ reboot
$ cat /proc/cmdline
see if the modification takes effect
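
One way to confirm that the change took effect is to look for the two parameters in /proc/cmdline programmatically. The following is a minimal C++17 sketch, not part of the original thread. The helper names kernel_param_value, read_first_line and power_workaround_active are hypothetical, and the expected values are simply the ones suggested in the comment above.

```cpp
#include <cassert>
#include <exception>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Returns the numeric value of `key` (e.g. "amdgpu.runpm") from a kernel
// command line, or std::nullopt if the parameter is absent or not a number.
// Tokens are split on whitespace; quoted values with spaces are not handled.
std::optional<unsigned long long> kernel_param_value(const std::string& cmdline,
                                                     const std::string& key)
{
    assert(!key.empty());
    const std::string  prefix = key + "=";
    std::istringstream tokens(cmdline);
    std::string        token;
    while(tokens >> token)
    {
        if(token.compare(0, prefix.size(), prefix) != 0)
        {
            continue;
        }
        try
        {
            // Base 0 accepts both decimal ("0") and hexadecimal ("0x0") spellings.
            return std::stoull(token.substr(prefix.size()), nullptr, 0);
        }
        catch(const std::exception&)
        {
            return std::nullopt;
        }
    }
    return std::nullopt;
}

// Reads the first line of a file such as /proc/cmdline; returns "" on failure.
std::string read_first_line(const std::string& path)
{
    std::ifstream in(path);
    std::string   line;
    std::getline(in, line);
    return line;
}

// True if both parameters are present with the values suggested in the thread.
bool power_workaround_active(const std::string& cmdline)
{
    const auto mask  = kernel_param_value(cmdline, "amdgpu.ppfeaturemask");
    const auto runpm = kernel_param_value(cmdline, "amdgpu.runpm");
    return mask.has_value() && *mask == 0xffff3fffULL
        && runpm.has_value() && *runpm == 0;
}
```

For example, power_workaround_active(read_first_line("/proc/cmdline")) should return true after the reboot if the GRUB change was picked up.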

It works!!! Thank you @xfyucg, I appreciate it so much. Now neither GPU 0 nor GPU 1 hangs anymore.

I’m also curious how the “power feature” can affect this. Would you mind sharing a bit more about the reason behind it?

Simply speaking, to save power, the GPU enters a sleep state when it is idle and wakes up when there is work to do. This requires cooperation between the driver, firmware and hardware. In your case, GPU 1 doesn’t wake up as expected to process the copy. See https://lpc.events/event/9/contributions/633/ for more details.
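
One of the two suggested parameters, amdgpu.runpm, relates to the kernel's runtime power management, which exposes per-device state in sysfs. The sketch below (C++17, not from the original thread) reads that state. It assumes the GPU's device directory looks like /sys/class/drm/card1/device, where the card index may differ per machine, and the function names are hypothetical. The power/runtime_status and power/control attributes are generic kernel runtime PM files rather than ROCm APIs, so they may not reflect every power feature involved here.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Reads a single-line sysfs attribute and returns it without the newline.
// Returns "unknown" if the file cannot be opened or is empty.
std::string read_attr(const fs::path& file)
{
    std::ifstream in(file);
    std::string   value;
    if(!in || !std::getline(in, value))
    {
        return "unknown";
    }
    return value;
}

// Current runtime PM state of the device, e.g. "active" or "suspended".
// device_dir is assumed to be something like /sys/class/drm/card1/device.
std::string runtime_pm_status(const fs::path& device_dir)
{
    assert(!device_dir.empty());
    return read_attr(device_dir / "power" / "runtime_status");
}

// Runtime PM policy of the device: "auto" lets the kernel suspend it when
// idle, "on" keeps it powered.
std::string runtime_pm_control(const fs::path& device_dir)
{
    assert(!device_dir.empty());
    return read_attr(device_dir / "power" / "control");
}
```

Calling runtime_pm_status on the directory for the second card while the test program hangs would show whether that device is reported as suspended at that moment.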

  2. Actually, there are 3 GPUs in your system: gfx1100, gfx1100 and gfx1036. The following log shows that the executable doesn’t contain a code object for gfx1036. How did you build your test?
:1:hip_code_object.cpp      :505 : 99442499995 us: 350457: [tid:0x7f02e57efc00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices!
:1:hip_code_object.cpp      :507 : 99442499997 us: 350457: [tid:0x7f02e57efc00]   Devices:
:1:hip_code_object.cpp      :509 : 99442499998 us: 350457: [tid:0x7f02e57efc00]     amdgcn-amd-amdhsa--gfx1100 - [Found]
:1:hip_code_object.cpp      :509 : 99442499999 us: 350457: [tid:0x7f02e57efc00]     amdgcn-amd-amdhsa--gfx1100 - [Found]
:1:hip_code_object.cpp      :509 : 99442500001 us: 350457: [tid:0x7f02e57efc00]     amdgcn-amd-amdhsa--gfx1036 - [Not Found]
:1:hip_code_object.cpp      :514 : 99442500003 us: 350457: [tid:0x7f02e57efc00]   Bundled Code Objects:
:1:hip_code_object.cpp      :530 : 99442500004 us: 350457: [tid:0x7f02e57efc00]     host-x86_64-unknown-linux - [Unsupported]
:1:hip_code_object.cpp      :527 : 99442500006 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx1010 - [code object targetID is amdgcn-amd-amdhsa--gfx1010]
:1:hip_code_object.cpp      :527 : 99442500009 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx1030 - [code object targetID is amdgcn-amd-amdhsa--gfx1030]
:1:hip_code_object.cpp      :527 : 99442500011 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx1100 - [code object targetID is amdgcn-amd-amdhsa--gfx1100]
:1:hip_code_object.cpp      :527 : 99442500012 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx1101 - [code object targetID is amdgcn-amd-amdhsa--gfx1101]
:1:hip_code_object.cpp      :527 : 99442500015 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx1102 - [code object targetID is amdgcn-amd-amdhsa--gfx1102]
:1:hip_code_object.cpp      :527 : 99442500017 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx803 - [code object targetID is amdgcn-amd-amdhsa--gfx803]
:1:hip_code_object.cpp      :527 : 99442500019 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx900 - [code object targetID is amdgcn-amd-amdhsa--gfx900]
:1:hip_code_object.cpp      :527 : 99442500020 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx906:xnack- - [code object targetID is amdgcn-amd-amdhsa--gfx906:xnack-]
:1:hip_code_object.cpp      :527 : 99442500022 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx908:xnack- - [code object targetID is amdgcn-amd-amdhsa--gfx908:xnack-]
:1:hip_code_object.cpp      :527 : 99442500023 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx90a:xnack+ - [code object targetID is amdgcn-amd-amdhsa--gfx90a:xnack+]
:1:hip_code_object.cpp      :527 : 99442500026 us: 350457: [tid:0x7f02e57efc00]     hipv4-amdgcn-amd-amdhsa--gfx90a:xnack- - [code object targetID is amdgcn-amd-amdhsa--gfx90a:xnack-]
:1:hip_code_object.cpp      :534 : 99442500027 us: 350457: [tid:0x7f02e57efc00] hipErrorNoBinaryForGpu: Unable to find code object for all current devices! - 209
:1:hip_fatbin.cpp           :265 : 99442500029 us: 350457: [tid:0x7f02e57efc00] hipErrorNoBinaryForGpu: Couldn't find binary for current devices! - 209
:3:hip_platform.cpp         :670 : 99442500035 us: 350457: [tid:0x7f02e57efc00] init: Returned hipErrorNoBinaryForGpu :  
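
To pull the targets that lack a code object out of a log like the one above, a small parser is enough. This C++ sketch is not from the original thread. It assumes the line layout shown in the log (a target ID followed by " - [Not Found]"), and missing_targets is a hypothetical helper, not a HIP API.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Scans HIP log text line by line and returns every target ID that is
// immediately followed by " - [Not Found]".
std::vector<std::string> missing_targets(const std::string& log)
{
    const std::string        marker = " - [Not Found]";
    std::vector<std::string> result;
    std::istringstream       lines(log);
    std::string              line;
    while(std::getline(lines, line))
    {
        const std::size_t end = line.find(marker);
        if(end == std::string::npos || end == 0)
        {
            continue; // no marker, or nothing in front of it
        }
        // The target ID is the last whitespace-delimited token before the marker.
        const std::size_t sep   = line.find_last_of(" \t", end - 1);
        const std::size_t begin = (sep == std::string::npos) ? 0 : sep + 1;
        result.push_back(line.substr(begin, end - begin));
    }
    return result;
}
```

Applied to the log above, the only entry marked "[Not Found]" is amdgcn-amd-amdhsa--gfx1036, which matches the third GPU mentioned in the comment.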