tt-metal: Non-deterministic hangs on Grayskull when running on uplifted UMD branch
Running post commit on abhullar/umd
https://github.com/tenstorrent-metal/tt-metal/actions/runs/6102682510/job/16561643635
- Failing test: tests/models/whisper/tests/test_whisper_model.py::test_WhipserModel_inference
- Runner name: tt-metal-ci-vm-5
- Driver: TTKMD 1.20.1
- FW Date: 2023-06-28
- Family: e150
https://github.com/tenstorrent-metal/tt-metal/actions/runs/6105393608/job/16568840780
- Failing test: gtest SingleDeviceFixture.AllCoreSingleTileSfpuApproxCompute
- Runner name: temp-f13cs03-large-bm
- Driver: TTKMD 1.20.1
- FW Date: 2023-06-28
- Family: e150
Locally ran on cloud machine without any hangs
- Family: e150
- Driver: TTKMD 1.20.1
- FW Date: 2023-06-28
Immediately, today, we should do the following to help isolate issues so we can progress to the end of this debug
About this issue
- Original URL
- State: closed
- Created 10 months ago
- Comments: 44 (22 by maintainers)
we can make dram_barrier() l1_barrier() atm. UMD does not expose Strict ordering mode; Aditya mentioned he could expose this.
If I only run run_python_api_unit_tests.sh then I don't see a hang, but when I run the full post commit it still hangs (5th iteration as opposed to 2nd) … which is weird. See the above comment for the experiment I'm trying right now to isolate whether there is some corruption in the C++ tests.
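For anyone trying to reproduce the "hangs on the Nth iteration" behaviour, here is a rough sketch of a loop that reruns a test command and reports which iteration first hangs or fails. This is not part of tt-metal; `run_until_hang`, the timeout value and the iteration count are all hypothetical, and it treats "no exit within the timeout" as a hang, which is only an approximation of a real device hang.

```python
import subprocess


def run_until_hang(cmd, max_iterations=10, timeout_s=1800):
    """Run cmd repeatedly and report the first iteration that hangs or fails.

    Returns (iteration, "hang") if the command exceeds timeout_s,
    (iteration, "fail") if it exits with a non-zero code,
    or None if every iteration completes successfully.
    """
    for iteration in range(1, max_iterations + 1):
        try:
            # subprocess.run kills the child process when the timeout expires
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return iteration, "hang"
        if result.returncode != 0:
            return iteration, "fail"
    return None
```

It could be called as `run_until_hang(["./run_python_api_unit_tests.sh"])`, where the script location is an assumption. Comparing the reported iteration between the Python-only run and the full post commit run gives a cheap way to see whether the hang point moves. After a timeout the device may still need a reset before the next run.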