tt-metal: ttl.tensor.reduce_max_w operation breaks with low PCC [Wormhole]

The ttl.tensor.reduce_max_w operation fails with a low PCC error in some test cases on Wormhole.

To Reproduce

Steps to reproduce the behavior:

  1. Check out the main branch
  2. Run the unit test test_reduce_max_w.py with this command: pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_reduce_max_w.py

Expected behavior

The unit test test_reduce_max_w.py contains 6 test cases, and all of them are expected to fail with a low PCC error. For example, one of the tests is expected to fail with this result: Max ATOL Delta: 184.0, Max RTOL Delta: 2.234375, PCC: 0.05683090068563086, Equal check failed
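
For readers unfamiliar with the metrics in that failure message, here is a minimal sketch of how a Pearson correlation coefficient (PCC) and a maximum absolute delta can be computed for two flattened outputs. The function names pcc and max_atol_delta are hypothetical and chosen for illustration; the comparison helpers in the repo may be implemented differently.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length number sequences."""
    n = len(expected)
    if n == 0 or n != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # PCC is undefined for constant inputs. Assumption for this sketch:
        # report 1.0 only when the two sequences are identical.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)


def max_atol_delta(expected, actual):
    """Largest absolute element-wise difference."""
    return max(abs(e - a) for e, a in zip(expected, actual))
```

A PCC close to 1.0 means the device output tracks the reference closely. A value near 0, such as the 0.0568 quoted above, means the two outputs are almost uncorrelated.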

Getting additional info for the operation under test and its behavior

To get additional information and results for the different combinations of input shapes, types, layouts and memory configs this operation was tested with, you can also run the sweeps for ttl.tensor.reduce_max_w locally and check the results. To do this:

  1. Follow the Getting Started page to set up the repo, environment variables and python-env
  2. Activate the Python environment: source build/python_env/bin/activate
  3. Run the sweeps with: python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/pytorch_reduce_max_w_test.yaml -o ./result-sweeps
  4. After the run completes, all sweep results are available in the specified output directory (in this case ./result-sweeps). There you will find reduce_max_w_sweep.csv, which holds every executed sweep. The sweeps that failed and were recreated by the unit test are among them, and you can locate each one by searching for its unique data_seed field.
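
The lookup in step 4 can be sketched in Python as follows. This is a minimal example that assumes the CSV has a header row with a column literally named data_seed (the field name comes from the step above, the exact header is an assumption). The helper name find_rows_by_seed is hypothetical.

```python
import csv


def find_rows_by_seed(csv_file, seed, column="data_seed"):
    """Return all rows of an open CSV file whose seed column equals `seed`.

    `csv_file` is any iterable of text lines, such as an open file object.
    The default column name is an assumption about the CSV header.
    """
    reader = csv.DictReader(csv_file)
    wanted = str(seed)
    # `row.get(column) or ""` also covers short rows, where the value is None.
    return [row for row in reader if (row.get(column) or "").strip() == wanted]


# Example usage, with a placeholder seed value:
# with open("./result-sweeps/reduce_max_w_sweep.csv", newline="") as f:
#     for row in find_rows_by_seed(f, 1234567):
#         print(row)
```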

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 17 (9 by maintainers)

Most upvoted comments

Hi @muthutt @davorchap , I got a bug fix here: 043e8c5eebb522915ed0cb25bfa5ef9615b11f68

I tested it using:

pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_reduce_max_w.py

and all tests pass. The issue was that for REDUCE_ROW mode, Grayskull has the SrcA register transpose on the math thread, but Wormhole B0 has it on the unpack thread, where it is configurable using a flag. I set those flags for Wormhole B0.

Please let me know if you have any other issues. I can push the fix once you confirm it works on all the other max reduce w tests.
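
As a rough illustration of why a skipped SrcA transpose could show up as low PCC, the toy model below assumes a reduction engine that can only take the max over columns, so a row-wise (width) reduction needs the operand transposed first. This is a simplified pure-Python sketch under that assumption, not the actual kernel or LLK code, and the transpose_srca parameter is a hypothetical stand-in for the flag mentioned in the comment above.

```python
def transpose(tile):
    """Swap rows and columns of a 2D list."""
    return [list(col) for col in zip(*tile)]


def column_max(tile):
    """Max over each column: the only reduction this toy engine supports."""
    return [max(col) for col in zip(*tile)]


def reduce_max_w(tile, transpose_srca=True):
    """Max along the width, producing one value per row.

    With transpose_srca=False the operand reaches the engine untransposed,
    so the max is taken along the wrong axis.
    """
    operand = transpose(tile) if transpose_srca else tile
    return column_max(operand)


tile = [
    [1, 9, 3],
    [4, 2, 8],
    [7, 5, 6],
]
correct = reduce_max_w(tile, transpose_srca=True)   # per-row max: [9, 8, 7]
wrong = reduce_max_w(tile, transpose_srca=False)    # per-column max: [7, 9, 8]
```

In this model the untransposed result still contains plausible values, just reduced along the wrong axis, which is consistent with outputs that correlate poorly with the reference rather than failing outright.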