tt-metal: WH: slow dispatch post commit corrupts board FW (PCIe / ARC)

@muthutt observed this, so he can provide more details on the machine, branch, and commit hash (see below).

The test reports this error: "Read 0xffffffff from ARC scratch[6]: you should reset the board." However, the board can’t be reset (see below).


muthu@e04cs05:~/tt-metal$ tt-smi -wr wait all
Caught Exception Read 0xffffffff from ARC scratch[6]: you should reset the board. when trying to initialize device on pci:0; Continuing without device...
⠦ Detecting Tenstorrent devices...
⠋ Failed to initialize device on pci:0
No chips detected, exiting
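For context on the error text: on PCIe, a register read that the device never completes is commonly returned to software as all ones, which is presumably why a value of 0xffffffff from ARC scratch[6] is treated as "board not responding" rather than as real data. The sketch below only illustrates that check. The function names are hypothetical and this is not the tt-smi or UMD implementation.

```python
# Illustrative only: hypothetical helpers, not taken from tt-smi or UMD.
ALL_ONES_32 = 0xFFFFFFFF  # value commonly seen when a PCIe read is not completed


def looks_like_dead_pcie_read(value: int) -> bool:
    """Return True if a 32-bit register read came back as all ones."""
    return (value & 0xFFFFFFFF) == ALL_ONES_32


def classify_scratch_read(value: int) -> str:
    """Map a raw scratch-register value to a coarse health label."""
    if looks_like_dead_pcie_read(value):
        return "device not responding"
    return "device responded"


if __name__ == "__main__":
    # The value reported in this issue.
    print(classify_scratch_read(0xFFFFFFFF))
```

A check like this cannot tell a hung ARC apart from a dropped PCIe link, since both can surface as the same all-ones value; it only flags that the read did not return usable data.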

machine: ssh muthu@172.27.28.130
branch: main, at commit SHA 632910be197253ac8f48d47b042b4e6a22b1ea0b (https://github.com/tenstorrent-metal/tt-metal/commit/632910be197253ac8f48d47b042b4e6a22b1ea0b)

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 28 (13 by maintainers)

Most upvoted comments

I believe there are some wormhole tests in UMD, could we try running those on this machine?

I was not able to repro this error.

However, I don’t believe I ever was able to properly run post commit for WH on this machine.

I believe there are some wormhole tests in UMD, could we try running those on this machine?

@tt-rkim that’s a great idea, we should run UMD tests as a first job in the post commit workflow.

I will try a force flash and reboot to see if that helps, then re-run the tests.

Can confirm hanging tests as Mo has said on this machine.

Hm then your configuration matches the working machine configuration I see…

I wasn’t able to repro this.

@muthutt, @kkwong10, @abhullar-tt, @DrJessop and @tt-rkim: have any of you repro’d this with the new UMD + reset at init in the last couple of days?

If not, we should close the case.