tensorflow-upstream: Memory Access Faults during training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 19.1, kernel 4.15.0-47-generic
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): rocm2.3-tf1.13-python3 from docker
- TensorFlow version (use command below): 1.13
- Python version: 3.5.2
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- ROCm/MIOpen version: 2.3 from docker
- GPU model and memory: Radeon VII, 16GB
Describe the current behavior
I'm experiencing memory access faults like this:
Memory access fault by GPU node-1 (Agent handle: 0x14e4100) on address 0x7f8a068cb000. Reason: Page not present or supervisor privilege.
in various models. They usually happen during the first 20-45 minutes of training, and I see them across a variety of environments (upstream kernel 5.0 without rock-dkms, 4.15.0-47-generic with rock-dkms, the #316 docker, and the rocm2.3-tf1.13-python3 docker). Because of these errors I have not yet been able to finish training a network: runs either crash with a shape-related error (#325) or fail with the memory access fault.
Describe the expected behavior
No memory errors.
Code to reproduce the issue
The attached file contains the minimal code with which I could replicate the issue. In my experience it can take up to 45 minutes for the error to occur.
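For readers without the attachment, a rough sketch of the kind of workload involved is below; the model and data here are illustrative placeholders, not the attached reproducer.

```python
# Illustrative sketch only -- NOT the attached reproducer.
# A small TF 1.13 Keras model trained on random data; the faults appear
# during long-running training loops of roughly this shape.
import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 32).astype(np.float32)
y = np.random.randint(0, 10, size=(10000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Long-running training; in my case the fault typically shows up
# within the first 20-45 minutes.
model.fit(x, y, batch_size=64, epochs=1000)
```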
Other info / logs
I am attempting to reproduce the issue with
export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2
as per #302, but have not observed another instance yet ~as the flags seem to be slowing things down quite a lot~; I am currently at step 10k with no crash, whereas without the flags the crash happened around step 8k.
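If it is easier to keep these settings with the script itself, setting them from Python before importing TensorFlow should be equivalent, assuming the HIP/HCC runtime reads the variables at initialization time:

```python
# Equivalent to the shell exports above; must run before `import tensorflow`,
# since the HIP/HCC runtime reads these variables when it initializes.
import os
os.environ["HCC_SERIALIZE_KERNEL"] = "0x3"
os.environ["HCC_SERIALIZE_COPY"] = "0x3"
os.environ["HIP_TRACE_API"] = "0x2"

import tensorflow as tf  # imported after the environment is set, intentionally
```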
Following #394 I also ran the HIP tests, where all but the native fp16 test pass.
About this issue
- State: open
- Created 5 years ago
- Comments: 21
Please see my post here for more details about what may be happening with your fans. To wit, my first guess is: your GPU card vendor set the acoustic limit of your fans too low, and possibly set the thermal throttling limit of your GPU too high. As the GPU heats up, that heat bleeds over into the HBM sitting next to the GPU die. Once the HBM gets too hot, it starts to pick up corruptions before the periodic refresh cycle comes around, and you end up with corrupted memory causing a corrupted pointer and thus a crash.
This is only a hypothesis, however, and it will be a little difficult to test. Could you answer the following questions? They aren't meant to lay blame on you; I just want to be sure of the system setup before I start any deep dives to try to solve the problem. 😃
- The output of `rocm-smi`.
- Whether you have done any overclocking or undervolting, e.g. with `rocm-smi --setsclk`, or made any VBIOS modifications.
- When you were running with `rock-dkms`, which version was this? 2.2? 2.3?
- Your GPU's pp_table, dumped with `cat /sys/class/drm/card0/device/pp_table > ~/radeon_vii_pp_table.bin`. You may need to zip this (I don't remember GitHub's attachment policies).
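In case it helps with gathering the temperature/clock data over a full training run, here is a small logging sketch: it just polls `rocm-smi` at a fixed interval while training runs in another terminal. The log path and interval are arbitrary choices, and it assumes `rocm-smi` is on your PATH.

```python
# Minimal sketch: periodically log `rocm-smi` output (temps, clocks, fan)
# to a file while training runs elsewhere. Assumes `rocm-smi` is on PATH;
# the interval and log path are arbitrary and can be adjusted.
import subprocess
import time
from datetime import datetime

LOG_PATH = "rocm_smi_log.txt"   # assumed location
INTERVAL_SECONDS = 30           # assumed polling interval

with open(LOG_PATH, "a") as log:
    while True:
        stamp = datetime.now().isoformat()
        try:
            out = subprocess.run(["rocm-smi"], stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT,
                                 universal_newlines=True,
                                 timeout=20).stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            out = "rocm-smi failed: {}".format(exc)
        log.write("=== {} ===\n{}\n".format(stamp, out))
        log.flush()
        time.sleep(INTERVAL_SECONDS)
```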