tt-metal: ttl.tensor.reduce_max_w operation breaks with low PCC [Wormhole]

ttl.tensor.reduce_max_w operation breaks with low PCC error in some test cases.

To Reproduce

Steps to reproduce the behavior:

Check out the main branch
Run unit test test_reduce_max_w.py using this command: pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_reduce_max_w.py

Expected behavior

The unit test test_reduce_max_w.py contains 6 test cases, and all of them are expected to fail with a low PCC error when the bug is reproduced. For example, one of the tests is expected to fail with this result: Max ATOL Delta: 184.0, Max RTOL Delta: 2.234375, PCC: 0.05683090068563086, Equal check failed
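For readers unfamiliar with the metrics in that failure message, here is a minimal sketch of how a PCC (Pearson correlation coefficient) and a maximum absolute delta can be computed between a reference result and a device result. It is plain Python and does not use the repository's own comparison helpers; the names `pcc`, `max_atol_delta` and `golden_reduce_max_w` are hypothetical and chosen only for illustration.

```python
import math


def pcc(expected, actual):
    """Pearson correlation coefficient of two equal-length flat sequences."""
    n = len(expected)
    mean_e = sum(expected) / n
    mean_a = sum(actual) / n
    cov = sum((e - mean_e) * (a - mean_a) for e, a in zip(expected, actual))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_e == 0 or var_a == 0:
        # Correlation is undefined for a constant sequence; fall back to equality.
        return 1.0 if list(expected) == list(actual) else 0.0
    return cov / math.sqrt(var_e * var_a)


def max_atol_delta(expected, actual):
    """Largest absolute element-wise difference."""
    return max(abs(e - a) for e, a in zip(expected, actual))


def golden_reduce_max_w(rows):
    """Hypothetical reference: maximum along the last (W) dimension of a 2D list."""
    return [max(row) for row in rows]
```

A PCC close to 1.0 means the device output tracks the reference; a value near 0, as in the failure above, means the two are essentially uncorrelated.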

Getting Additional info for the operation under test and its behavior

To get additional information and results for the different combinations of input shapes, types, layouts and memory configs for which this operation was tested, you can also run the sweeps for ttl.tensor.reduce_max_w locally and check the results. To do this:

Follow the Getting Started page to set up the repo, environment variables and python-env
Activate the Python environment: source build/python_env/bin/activate
Run sweeps by using python tests/tt_eager/python_api_testing/sweep_tests/run_pytorch_test.py -i tests/tt_eager/python_api_testing/sweep_tests/test_configs/ci_sweep_tests_broken/wormhole/pytorch_reduce_max_w_test.yaml -o ./result-sweeps
After the run completes, all sweep results should be available inside the specified output directory (in this case ./result-sweeps). There you will find reduce_max_w_sweep.csv, which holds all executed sweeps. The failing sweeps that were recreated by the unit test are among them, and you can locate each one by searching for its unique data_seed field.
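Searching the results file by seed can be sketched as below. This is an illustration under assumptions: the CSV is taken to be comma-delimited with a header row containing a `data_seed` column (the field name comes from the text above), and the function name `find_sweeps_by_seed` is hypothetical. Any other columns in the real file are not known here.

```python
import csv


def find_sweeps_by_seed(lines, data_seed, seed_column="data_seed"):
    """Return every CSV row whose seed column equals data_seed.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumes a comma-delimited file with a header row (an assumption, not
    verified against the real sweep output).
    """
    reader = csv.DictReader(lines)
    wanted = str(data_seed)
    return [row for row in reader if row.get(seed_column) == wanted]


# Example usage (the seed value is a placeholder, not taken from the issue):
# with open("./result-sweeps/reduce_max_w_sweep.csv", newline="") as f:
#     for row in find_sweeps_by_seed(f, 1234567):
#         print(row)
```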

About this issue

State: closed
Created 8 months ago
Comments: 17 (9 by maintainers)

Commits related to this issue

#3178: ttl.tensor.reduce_max_w operation breaks with low PCC [Wormhole] - Wormhole reduce on last dim needs a transpose work-around - Real bug fix is tracked elsewhere #3262 — committed to tenstorrent/tt-metal by muthutt 7 months ago
#3178: ttl.tensor.reduce_max_w operation breaks with low PCC [Wormhole] - Wormhole reduce on last dim needs a transpose work-around - Real bug fix is tracked elsewhere #3262 #3605: ttl.tensor.std_hw ... — committed to tenstorrent/tt-metal by muthutt 7 months ago
#3178: Fix for wormhole b0 reduce w — committed to tenstorrent/tt-metal by rtawfik01 5 months ago
#3178: remove transpose after fix : fix tests — committed to tenstorrent/tt-metal by deleted user 5 months ago

Most upvoted comments

Hi @muthutt @davorchap , I got a bug fix here: 043e8c5eebb522915ed0cb25bfa5ef9615b11f68

I tested it using:

pytest tests/tt_eager/python_api_testing/non_working_unit_tests/wormhole/test_reduce_max_w.py

and all of the tests pass. The issue was that, for REDUCE_ROW mode, Grayskull performs the transpose of the SrcA register on the math thread, whereas Wormhole B0 performs it on the unpack thread, where it is configurable through a flag. I set those flags for Wormhole B0.

Please let me know if you have any other issues, I can push that fix once you confirm it works on all other max reduce w tests.

rtawfik01 on Jan 23, 2024
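Both the REDUCE_ROW explanation in the comment above and the earlier transpose work-around rest on a simple identity: reducing along the last dimension (W) gives the same result as transposing the last two dimensions and reducing along H. The sketch below is conceptual only. It uses plain Python lists, does not model the SrcA register or the unpack and math threads, and its function names are hypothetical.

```python
def transpose(matrix):
    """Swap the two dimensions of a 2D list."""
    return [list(col) for col in zip(*matrix)]


def reduce_max_w(matrix):
    """Maximum along the last (W) dimension: one value per row."""
    return [max(row) for row in matrix]


def reduce_max_h(matrix):
    """Maximum along the H dimension: one value per column."""
    return [max(col) for col in zip(*matrix)]


x = [[1, 5, 2],
     [7, 3, 4]]

# Reducing over W equals transposing and then reducing over H.
assert reduce_max_w(x) == reduce_max_h(transpose(x))

# Without the transpose, the reduction runs along the other axis
# and returns a different set of maxima.
assert reduce_max_h(x) != reduce_max_w(x)
```

If a transpose that the reduction relies on is skipped, the maxima come from the wrong axis. Whether that is exactly what happened on the device is not established by this sketch.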