tt-metal: WH PCC error for `test_moreh_linear.py::test_moreh_linear_backward` w/ Watcher Enabled

TT_METAL_WATCHER=30 pytest tests/tt_eager/python_api_testing/unit_testing/test_moreh_linear.py::test_moreh_linear_backward

Failure is a PCC error:

            passing, output_pcc = comp_allclose_and_pcc(torch_bias.grad, ttcpu_bias_grad, pcc=0.999, rtol=rtol, atol=atol)
            logger.info(f"bias_grad passing={passing} pcc={output_pcc}")
>           assert passing
E           assert False

tests/tt_eager/python_api_testing/unit_testing/test_moreh_linear.py:164: AssertionError

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 25 (13 by maintainers)

Most upvoted comments

Our engineers found that using the tile_regs_acquire, tile_regs_wait, tile_regs_commit, and tile_regs_release functions instead of the acquire_dst and release_dst functions resolves the issue. This is odd, because acquire_dst and release_dst are just simple wrappers around those tile_regs_* functions.

The same thing happens in issue #7521. I think we have to decide between two options:

  1. Change all acquire_dst and release_dst calls to use the tile_regs_* functions.
  2. Investigate further what the real problem in the acquire_dst and release_dst functions is (I think this is beyond the scope of what Moreh can do).
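
For illustration, option 1 amounts to a change along the following lines. This is only a minimal sketch: the functions below are host-side stand-ins that record call order so the example compiles and runs without a device. The stub bodies, the `trace` log, the helper names `compute_tile`, `pack_result`, `kernel_with_acquire_dst`, and `kernel_with_tile_regs`, and the exact way the wrappers pair up the tile_regs_* calls are all assumptions, not the actual tt-metal implementation (the real calls may also take arguments that are omitted here).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins for the compute-kernel calls named in this
// thread. They only record the order in which they are called.
static std::vector<std::string> trace;

void tile_regs_acquire() { trace.push_back("tile_regs_acquire"); }
void tile_regs_commit()  { trace.push_back("tile_regs_commit"); }
void tile_regs_wait()    { trace.push_back("tile_regs_wait"); }
void tile_regs_release() { trace.push_back("tile_regs_release"); }

// Modeled as thin wrappers, as described above. The pairing of calls inside
// each wrapper is an assumption made for this sketch.
void acquire_dst() { tile_regs_acquire(); tile_regs_wait(); }
void release_dst() { tile_regs_commit(); tile_regs_release(); }

// Placeholders for the math work and for packing the result out of DST.
void compute_tile() { trace.push_back("compute"); }
void pack_result()  { trace.push_back("pack"); }

// Pattern used by the kernels that currently fail with the watcher enabled.
void kernel_with_acquire_dst() {
    acquire_dst();
    compute_tile();
    pack_result();
    release_dst();
}

// Pattern reported to resolve the PCC error (option 1).
void kernel_with_tile_regs() {
    tile_regs_acquire();   // math side: wait until DST is available
    compute_tile();
    tile_regs_commit();    // math side: signal that the results are ready
    tile_regs_wait();      // pack side: wait for the math results
    pack_result();
    tile_regs_release();   // pack side: hand DST back
}
```

Under these assumptions, the two patterns issue the same four tile_regs_* calls but in a different order relative to the compute and pack steps. Whether that difference is related to the failure is not established here and would be part of option 2.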

@jliangTT, what do you think about this?

Hey @dongjin-na, can you please go to this page https://tenstorrent.github.io/tt-metal/latest/tt-metalium/tools/watcher.html#enabling and enable the watcher features piece-wise to see whether you can still reproduce the failure?

Thanks for the investigation, @dongjin-na @razorback3. @jliangTT @jvasilje, can you find someone who can help with debugging?