tensorflow-upstream: Memory Access Faults during training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint 19.1, kernel 4.15.0-47-generic
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): rocm2.3-tf1.13-python3 from docker
- TensorFlow version (use command below): 1.13
- Python version: 3.5.2
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- ROCm/MIOpen version: 2.3 from docker
- GPU model and memory: Radeon VII, 16GB
Describe the current behavior
I'm experiencing memory access faults like this:
Memory access fault by GPU node-1 (Agent handle: 0x14e4100) on address 0x7f8a068cb000. Reason: Page not present or supervisor privilege.
in various models. They usually happen during the first 20-45 minutes of training, and I see them across a variety of environments (upstream kernel 5.0 without rock-dkms, 4.15.0-47-generic with rock-dkms, the #316 docker, and the rocm2.3-tf1.13-python3 docker). Because of these errors I have not yet been able to finish training a network: runs either crash with a shape-related error (#325) or fail with the memory access fault.
Describe the expected behavior
No memory errors.
Code to reproduce the issue
The attached file contains the minimal code with which I could replicate the issue. In my experience it can take up to 45 minutes for the error to occur.
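For readers without the attachment, a rough sketch of the kind of workload involved is below; the model and data here are illustrative placeholders, not the attached reproducer.

```python
# Illustrative sketch only -- NOT the attached reproducer.
# A small TF 1.13 Keras model trained on random data; the faults appear
# during long-running training loops of roughly this shape.
import numpy as np
import tensorflow as tf

x = np.random.rand(10000, 32).astype(np.float32)
y = np.random.randint(0, 10, size=(10000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Long-running training; in my case the fault typically shows up
# within the first 20-45 minutes.
model.fit(x, y, batch_size=64, epochs=1000)
```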
Other info / logs
I am attempting to reproduce the issue with
export HCC_SERIALIZE_KERNEL=0x3
export HCC_SERIALIZE_COPY=0x3
export HIP_TRACE_API=0x2
as per #302, but have not observed another instance yet ~as the flags seem to be slowing things down quite a lot~; I am currently at step 10k with no crash, whereas without the flags the crash happened around step 8k.
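If it is easier to keep these settings with the script itself, setting them from Python before importing TensorFlow should be equivalent, assuming the HIP/HCC runtime reads the variables at initialization time:

```python
# Equivalent to the shell exports above; must run before `import tensorflow`,
# since the HIP/HCC runtime reads these variables when it initializes.
import os
os.environ["HCC_SERIALIZE_KERNEL"] = "0x3"
os.environ["HCC_SERIALIZE_COPY"] = "0x3"
os.environ["HIP_TRACE_API"] = "0x2"

import tensorflow as tf  # imported after the environment is set, intentionally
```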
Following #394 I also ran the HIP tests, where all but the native fp16 test pass.
About this issue
- State: open
- Created 5 years ago
- Comments: 21
Please see my post here for more details about what may be happening with your fans. To wit, my first guess is: your GPU card vendor set the acoustic limit of your fans too low, and possibly set the thermal throttling limit of your GPU too high. As the GPU heats up, that heat bleeds over into the HBM sitting next to the GPU die. Once the HBM gets too hot, it starts to pick up corruptions before the periodic refresh cycle comes around, and you end up with corrupted memory causing a corrupted pointer and thus a crash.
This is only a hypothesis, however, and it will be a little difficult to test. Could you answer the following questions? They aren't meant to lay blame on you; I just want to be sure of the system setup before I start any deep dives to try to solve the problem. 😃
- The output of `rocm-smi`.
- Whether you have done any overclocking or undervolting, e.g. with `rocm-smi --setsclk`, or made any VBIOS modifications.
- When you were running with `rock-dkms`, which version was this? 2.2? 2.3?
- Your GPU's pp_table, dumped with `cat /sys/class/drm/card0/device/pp_table > ~/radeon_vii_pp_table.bin`. You may need to zip this (I don't remember GitHub's attachment policies).
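In case it helps with gathering the temperature/clock data over a full training run, here is a small logging sketch: it just polls `rocm-smi` at a fixed interval while training runs in another terminal. The log path and interval are arbitrary choices, and it assumes `rocm-smi` is on your PATH.

```python
# Minimal sketch: periodically log `rocm-smi` output (temps, clocks, fan)
# to a file while training runs elsewhere. Assumes `rocm-smi` is on PATH;
# the interval and log path are arbitrary and can be adjusted.
import subprocess
import time
from datetime import datetime

LOG_PATH = "rocm_smi_log.txt"   # assumed location
INTERVAL_SECONDS = 30           # assumed polling interval

with open(LOG_PATH, "a") as log:
    while True:
        stamp = datetime.now().isoformat()
        try:
            out = subprocess.run(["rocm-smi"], stdout=subprocess.PIPE,
                                 stderr=subprocess.STDOUT,
                                 universal_newlines=True,
                                 timeout=20).stdout
        except (OSError, subprocess.TimeoutExpired) as exc:
            out = "rocm-smi failed: {}".format(exc)
        log.write("=== {} ===\n{}\n".format(stamp, out))
        log.flush()
        time.sleep(INTERVAL_SECONDS)
```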