tensorflow-upstream: Radeon VII memory access fault during training

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Mint, kernel 4.15.0-52
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.13.1
  • Python version: 3.6.5
  • ROCm/MIOpen version: 2.5.27
  • GPU model and memory: Radeon VII

Describe the current behavior

Fine-tuning BERT using this checkpoint or that checkpoint on a task with a fairly large number of labels (2000) crashes with a memory access fault in about 70% of runs:

Memory access fault by GPU node-1 (Agent handle: 0x55b34b19c5b0) on address 0x7f0cc7dff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
Memory access fault by GPU node-1 (Agent handle: 0x55fb4ce57a60) on address 0x7f6757dff000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

In contrast to #414, temperature is not the issue here: edge and memory temperatures stay between 40°C and 50°C, and the junction temperature is around 70-80°C.


In the other 30% of runs, it crashes with the error from #325:

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 800 values, but the requested shape has 1137887642217143831
	 [[node gradients/bert/encoder/layer_0/output/LayerNorm/moments/mean_grad/Reshape (defined at /home/tobi/bert-sticker/optimization.py:71) ]]
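For reference, the node named there comes from the gradient of the mean inside BERT's layer normalization. The snippet below is a minimal sketch (hypothetical shapes, graph-mode TF 1.13) of how a `moments/mean_grad/Reshape` op ends up in the graph; since the requested shape of that reshape is normally derived from the tensor's own shape, a value like 1137887642217143831 looks more like corrupted shape data than a genuinely mis-specified reshape.

```python
import tensorflow as tf  # TF 1.13.x, graph mode

# Hypothetical stand-in shapes: [batch, seq_len, hidden] as in BERT-base.
x = tf.placeholder(tf.float32, shape=[8, 128, 768], name="layer_input")

# BERT's layer norm computes per-token moments over the hidden axis.
mean, variance = tf.nn.moments(x, axes=[-1], keep_dims=True)
normalized = (x - mean) * tf.rsqrt(variance + 1e-12)

loss = tf.reduce_sum(normalized)
# Differentiating the mean inserts a ".../moments/mean_grad/Reshape" op,
# which is the node named in the error above.
grads = tf.gradients(loss, x)

print([op.name for op in tf.get_default_graph().get_operations()
       if "mean_grad/Reshape" in op.name])
```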

The same code runs fine on NVIDIA M60, P100, and 1070 GPUs.

Describe the expected behavior

There should be no memory corruption issues.

Code to reproduce the issue

I’m looking into producing a simpler reproducer.
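One shape such a reproducer could take: random data through a layer norm followed by a large (~2000-way) per-token softmax, which isolates the pieces implicated above. This is a rough sketch with hypothetical dimensions, not the actual training code:

```python
import numpy as np
import tensorflow as tf  # 1.13.x

# Hypothetical placeholder dimensions; the real task is BERT with ~2000 labels.
BATCH, SEQ_LEN, HIDDEN, NUM_LABELS = 8, 128, 768, 2000

x = tf.placeholder(tf.float32, [BATCH, SEQ_LEN, HIDDEN])
y = tf.placeholder(tf.int32, [BATCH, SEQ_LEN])

# Layer norm (as in BERT's output sublayers) followed by a large per-token softmax.
normed = tf.contrib.layers.layer_norm(x, begin_norm_axis=-1, begin_params_axis=-1)
logits = tf.layers.dense(normed, NUM_LABELS)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  for step in range(1000):
    feed = {x: np.random.randn(BATCH, SEQ_LEN, HIDDEN).astype(np.float32),
            y: np.random.randint(0, NUM_LABELS, size=(BATCH, SEQ_LEN))}
    sess.run(train_op, feed_dict=feed)
```

If something like this faults on the Radeon VII in the same way, it would be a much easier test case to hand over.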

Most upvoted comments

I modified run_classifier.py to work for sequence labeling; the label set has ~2k labels. The dataset is not public, but I am putting together something I can share.
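To give an idea of the setup (the actual code differs in details), the modification is roughly the standard one for token-level tasks: build per-token logits from `get_sequence_output()` instead of the pooled output in `create_model`. A hedged sketch, with the label count as a stand-in:

```python
import tensorflow as tf
import modeling  # from the google-research/bert repo

def create_sequence_labeling_model(bert_config, is_training, input_ids,
                                   input_mask, segment_ids, label_ids,
                                   num_labels):
  """Per-token classification head; num_labels is ~2000 in my runs."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids)

  # [batch, seq_len, hidden] instead of the pooled [batch, hidden].
  sequence_output = model.get_sequence_output()

  logits = tf.layers.dense(sequence_output, num_labels)  # [batch, seq_len, num_labels]
  log_probs = tf.nn.log_softmax(logits, axis=-1)

  one_hot_labels = tf.one_hot(label_ids, depth=num_labels, dtype=tf.float32)
  per_token_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)

  # Mask out padding tokens before averaging the loss.
  mask = tf.cast(input_mask, tf.float32)
  loss = tf.reduce_sum(per_token_loss * mask) / (tf.reduce_sum(mask) + 1e-5)
  return loss, logits
```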

@micmelesse will try to reproduce on a Radeon VII; earlier we were running on MI parts.

dmesg output from two crashes:

[42145.304293] gmc_v9_0_process_interrupt: 11 callbacks suppressed
[42145.304298] amdgpu 0000:03:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32771, for process  pid 0 thread  pid 0)
[42145.304301] amdgpu 0000:03:00.0:   in page starting at address 0x00007f171c1ff000 from 27
[42145.304303] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00801030
[42145.304309] Evicting PASID 32771 queues
[42145.469445] Started evicting pasid 32771
[42145.469448] Finished evicting pasid 32771
[42252.230309] perf: interrupt took too long (4011 > 3912), lowering kernel.perf_event_max_sample_rate to 49750
[42349.843335] Signal event wasn't created because limit was reached
[42990.023970] Signal event wasn't created because limit was reached
[43168.584558] Signal event wasn't created because limit was reached
[43210.625043] amdgpu 0000:03:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32771, for process  pid 0 thread  pid 0)
[43210.625046] amdgpu 0000:03:00.0:   in page starting at address 0x00007fc4f29ff000 from 27
[43210.625048] amdgpu 0000:03:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00801030

+@micmelesse

@micmelesse, could you try the checkpoints?