tt-metal: Non-deterministic hangs on Grayskull when running on uplifted UMD branch

Running post commit on abhullar/umd

https://github.com/tenstorrent-metal/tt-metal/actions/runs/6102682510/job/16561643635

  • Failing test: tests/models/whisper/tests/test_whisper_model.py::test_WhipserModel_inference
  • Runner name: tt-metal-ci-vm-5
  • Driver : TTKMD 1.20.1
  • FW Date : 2023-06-28
  • Family : e150

https://github.com/tenstorrent-metal/tt-metal/actions/runs/6105393608/job/16568840780

  • Failing test: gtest SingleDeviceFixture.AllCoreSingleTileSfpuApproxCompute
  • Runner name: temp-f13cs03-large-bm
  • Driver : TTKMD 1.20.1
  • FW Date : 2023-06-28
  • Family : e150

Ran locally on a cloud machine without any hangs

  • Family: e150
  • Driver: TTKMD 1.20.1
  • FW Date: 2023-06-28

Immediately, today, we should do the following to help isolate the issues so we can make progress toward finishing this debug

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 44 (22 by maintainers)

Most upvoted comments

With static NOC VC we can do the following:

  • designate 4B addr in each DRAM and each L1
  • write the value before LaunchKernels
  • read the value in the loop from all DRAM / L1s to make sure the value has been propagated
  • LaunchKernels
  • reset the value using the same scheme (write and read back)

This would act as a barrier because with a static VC writes can’t be reordered. So if this special addr has been written, all previous writes must have finished as well.
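The scheme above can be sketched roughly as follows. This is a minimal simulation, not tt-metal or UMD code: `FakeDevice`, `BARRIER_ADDR`, the flag values, the core names and the `launch` callback are all hypothetical stand-ins, and the sketch simply assumes that writes to a given core land in issue order (the static VC property described above).

```python
# Hypothetical names throughout: none of these are tt-metal or UMD APIs.
BARRIER_ADDR = 0x0          # assumed reserved 4-byte location in every DRAM / L1
BARRIER_SET = 0xCAFEF00D    # arbitrary flag value written before launch
BARRIER_CLEAR = 0x0         # value used to reset the flag afterwards


class FakeDevice:
    """In-memory stand-in for a device; writes to one core land in issue order."""

    def __init__(self, cores):
        self.mem = {core: {} for core in cores}

    def write32(self, core, addr, value):
        self.mem[core][addr] = value & 0xFFFFFFFF

    def read32(self, core, addr):
        return self.mem[core].get(addr, 0)


def barrier(dev, cores, value, max_polls=1000):
    """Write `value` to the reserved word on every core, then poll until visible."""
    for core in cores:
        dev.write32(core, BARRIER_ADDR, value)
    for core in cores:
        for _ in range(max_polls):
            if dev.read32(core, BARRIER_ADDR) == value:
                break
        else:
            raise TimeoutError(f"barrier value never reached {core}")


def launch_with_barrier(dev, cores, launch):
    barrier(dev, cores, BARRIER_SET)    # earlier writes are assumed landed once this returns
    launch()                            # stands in for LaunchKernels
    barrier(dev, cores, BARRIER_CLEAR)  # reset the flag with the same write/read-back scheme
```

Splitting `cores` into a DRAM list and an L1 list would give the separate DRAM and L1 barriers mentioned in the thread.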

We can make dram_barrier() and l1_barrier().

@abhullar-tt @pgkeller and @DrJessop in slow dispatch mode:

  1. write binaries, write CB config, write run-time args – these can go in any order
  2. de-assert reset – this needs wait for all the previous writes in 1) to finish

If we have PCIe strict ordering + static VC for both 1) and 2), this should be enforced. If that’s not the case, we need a flush between 1) and 2), and as far as I know this is not available for a PCIe (MMIO) device.
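The ordering constraint for slow dispatch can be written down as a small sketch. The function and its callbacks are hypothetical placeholders, not UMD or tt-metal calls; `flush` stands for whatever mechanism ends up guaranteeing completion (strict ordering plus static VC, or a read-back barrier).

```python
def slow_dispatch(setup_writes, flush, deassert_reset):
    """Issue the setup writes in any order, flush, and only then de-assert reset."""
    for write in setup_writes:  # binaries, CB config, run-time args: mutual order is free
        write()
    flush()                     # placeholder: all step-1 writes must be complete after this
    deassert_reset()            # step 2: cores may start only once step 1 has landed
```

The only ordering the sketch enforces is that every step-1 write is issued before the flush and that the flush precedes the reset de-assert; the step-1 writes themselves are left unordered on purpose.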

At the moment UMD does not expose strict ordering mode; Aditya mentioned he could expose this.

Was it a specific test that was hanging deterministically in pytest?

It seems to be tests/tt_eager/python_api_testing/sweep_tests/pytests/tt_dnn/test_permute.py

Ok, so same that we’ve seen hang before. And after these changes, on the same BM, you weren’t able to reproduce a hang?

If I only run run_python_api_unit_tests.sh then I don’t see a hang, but when I run the full post commit it still hangs (on the 5th iteration as opposed to the 2nd), which is weird. See the comment above for the experiment I’m trying right now to isolate whether there is some corruption in the C++ tests.