tt-metal: WH: slow dispatch post commit corrupts board FW (PCIe / ARC)
@muthutt observed this, so he can provide more details on the machine, branch, and commit hash (see below).
The test reports this error: "Read 0xffffffff from ARC scratch[6]: you should reset the board." However, the board can’t be reset (see below).
uthu@e04cs05:~/tt-metal$ tt-smi -wr wait all
Caught Exception Read 0xffffffff from ARC scratch[6]: you should reset the board. when trying to initialize device on pci:0; Continuing without device...
⠦ Detecting Tenstorrent devices...
⠋ Failed to initialize device on pci:0
No chips detected, exiting
machine: ssh muthu@172.27.28.130
branch: main at commit SHA 632910be197253ac8f48d47b042b4e6a22b1ea0b (https://github.com/tenstorrent-metal/tt-metal/commit/632910be197253ac8f48d47b042b4e6a22b1ea0b)
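For triage, the failure signature quoted above can be detected mechanically from captured tt-smi output. The following is a minimal sketch, not part of tt-smi or tt-metal: the helper name `classify_tt_smi_output` is hypothetical, and it only matches the message strings quoted in this issue. An all-ones value (0xffffffff) is commonly what a PCIe read returns when the device is no longer responding, which is why it is flagged separately.

```python
import re

# Failure markers copied from the tt-smi output quoted in this issue.
# Assumption: the message wording is stable across tt-smi versions.
ARC_SCRATCH_RE = re.compile(r"Read (0x[0-9a-fA-F]+) from ARC scratch\[(\d+)\]")
NO_CHIPS_MARKER = "No chips detected"


def classify_tt_smi_output(output: str) -> dict:
    """Summarize the failure markers found in captured tt-smi output.

    Hypothetical helper for illustration; it parses text only and does
    not talk to the device.
    """
    match = ARC_SCRATCH_RE.search(output)
    value = int(match.group(1), 16) if match else None
    return {
        # Index of the ARC scratch register named in the error, if any.
        "arc_scratch_index": int(match.group(2)) if match else None,
        # Raw value reported as read from that register, if any.
        "arc_scratch_value": value,
        # True when the read came back as all ones.
        "all_ones_read": value == 0xFFFFFFFF,
        # True when tt-smi gave up because it found no devices.
        "no_chips_detected": NO_CHIPS_MARKER in output,
    }
```

Feeding it the output captured above would flag both the all-ones read on scratch register 6 and the "No chips detected" exit, which is the combination reported here.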
About this issue
- State: closed
- Created 10 months ago
- Comments: 28 (13 by maintainers)
I believe there are some wormhole tests in UMD, could we try running those on this machine?
I was not able to repro this error.
However, I don’t believe I ever was able to properly run post commit for WH on this machine.
@tt-rkim that’s a great idea, we should run UMD tests as a first job in the post commit workflow.
I will try a force flash and reboot to see if that helps, then re-run the tests.
Can confirm that tests hang on this machine, as Mo said.
Hm, then your configuration matches the working machine configuration I see…
I wasn’t able to repro this.
@muthutt, @kkwong10, @abhullar-tt, @DrJessop and @tt-rkim: have any of you repro’d this with the new UMD + reset at init in the last couple of days?
If not, we should close the case.