tensorflow-upstream: Radeon VII memory access fault training
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mint Linux, Kernel 4.15.0-52
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.13.1
- Python version: 3.65
- ROCm/MIOpen version: 2.5.27
- GPU model and memory: Radeon VII
Describe the current behavior
Fine-tuning BERT using this checkpoint or that checkpoint on a task with a large number of labels (2000) crashes with a memory access fault in 70% of runs:
Memory access fault by GPU node-1 (Agent handle: 0x55b34b19c5b0) on address 0x7f0cc7dff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
Memory access fault by GPU node-1 (Agent handle: 0x55fb4ce57a60) on address 0x7f6757dff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
In contrast to #414, temperature is not the issue here: edge and memory temperatures are between 40°C and 50°C, and the junction temperature is around 70-80°C.
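When comparing crashes, it can help to pull the node, agent handle and faulting address out of each fault line and look at the addresses side by side. The following is only a minimal sketch: the regular expression is written against the two lines quoted above, and the helper name parse_fault is made up for illustration. It shows that both reported addresses are 4 KiB page-aligned and share the same low 24 bits, which is an observation about these two samples and not a diagnosis.

```python
import re

# Pattern written against the fault lines quoted in this report
# (an assumption: other ROCm versions may word the message differently).
FAULT_RE = re.compile(
    r"Memory access fault by GPU node-(?P<node>\d+) "
    r"\(Agent handle: (?P<agent>0x[0-9a-fA-F]+)\) "
    r"on address (?P<addr>0x[0-9a-fA-F]+)\. "
    r"Reason: (?P<reason>.+?)\.?$"
)


def parse_fault(line):
    """Return (node, agent_handle, address, reason), or None if the line does not match."""
    m = FAULT_RE.search(line.strip())
    if m is None:
        return None
    return int(m["node"]), int(m["agent"], 16), int(m["addr"], 16), m["reason"]


# The two fault lines from the crashes above.
LOG_LINES = [
    "Memory access fault by GPU node-1 (Agent handle: 0x55b34b19c5b0) "
    "on address 0x7f0cc7dff000. Reason: Page not present or supervisor privilege.",
    "Memory access fault by GPU node-1 (Agent handle: 0x55fb4ce57a60) "
    "on address 0x7f6757dff000. Reason: Page not present or supervisor privilege.",
]

faults = [parse_fault(line) for line in LOG_LINES]

for node, agent, addr, reason in faults:
    # Offset within a 4 KiB page and the low 24 bits of the faulting address.
    print(hex(addr), "page offset:", addr % 4096, "low 24 bits:", hex(addr & 0xFFFFFF))
```

Collecting these fields over more crashes would show whether the pattern in the low address bits holds or is a coincidence of these two runs.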
In the other 30% of runs, it crashes with the error from #325:
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 800 values, but the requested shape has 1137887642217143831
[[node gradients/bert/encoder/layer_0/output/LayerNorm/moments/mean_grad/Reshape (defined at /home/tobi/bert-sticker/optimization.py:71) ]]
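The requested element count in this message is far too large to be a real shape, which is consistent with a shape tensor having been overwritten. A small sketch of how one might quantify that from the message text follows. The helper name check_reshape_error is hypothetical, and the 4 bytes per element figure assumes float32 tensors, which the report does not state.

```python
import re

# Pattern written against the error text quoted above.
RESHAPE_RE = re.compile(
    r"Input to reshape is a tensor with (?P<have>\d+) values, "
    r"but the requested shape has (?P<want>\d+)"
)


def check_reshape_error(message):
    """Extract the actual and requested element counts from a reshape error."""
    m = RESHAPE_RE.search(message)
    if m is None:
        return None
    have, want = int(m["have"]), int(m["want"])
    return {
        "have": have,
        "want": want,
        "want_hex": hex(want),               # raw bit pattern, useful when comparing crashes
        "want_bits": want.bit_length(),      # number of bits needed to store the count
        "float32_eib": want * 4 / 2 ** 60,   # ASSUMPTION: 4 bytes per element (float32)
    }


MESSAGE = (
    "InvalidArgumentError (see above for traceback): Input to reshape is a tensor "
    "with 800 values, but the requested shape has 1137887642217143831"
)

result = check_reshape_error(MESSAGE)
print(result)
```

Under the float32 assumption, a tensor with that many elements would need several exbibytes of memory, so the value cannot have come from a legitimate shape computation on a tensor with 800 values.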
The same code runs just fine on Nvidia M60, P100 and 1070.
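To track how often each failure mode occurs, the captured output of repeated runs can be classified by the two messages quoted above. This is a rough sketch: the function names are invented for illustration, the sample logs are abbreviated stand-ins for real captured output, and anything that matches neither message (including a successful run) is counted as "other".

```python
from collections import Counter


def classify_run(log_text):
    """Label one run's captured output by the failure message it contains."""
    if "Memory access fault by GPU" in log_text:
        return "memory_access_fault"
    if "Input to reshape is a tensor with" in log_text:
        return "reshape_error"
    return "other"  # no known failure message found (may be a clean run)


def tally(logs):
    """Return the fraction of runs in each category."""
    counts = Counter(classify_run(text) for text in logs)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Abbreviated stand-ins for captured output, not real logs.
sample_logs = [
    "Memory access fault by GPU node-1 (...) Aborted (core dumped)",
    "InvalidArgumentError: Input to reshape is a tensor with 800 values, ...",
    "Memory access fault by GPU node-1 (...) Aborted (core dumped)",
    "training finished",
]

fractions = tally(sample_logs)
print(fractions)
```

Running the same tally on the Nvidia machines would give a baseline for comparison.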
Describe the expected behavior
There should be no memory corruption issues.
Code to reproduce the issue
I’m looking into producing a simpler reproducer.
About this issue
- State: open
- Created 5 years ago
- Comments: 28
I modified run_classifier.py to work for sequence labeling. The label set is ~2k labels. The dataset is not public; I am putting something together that I can share.
@micmelesse would try to reproduce on Radeon VII; earlier we were running on MI parts.
dmesg of two crashes:
+@micmelesse
@micmelesse, could you try the checkpoints?