tt-metal: ND hang of SD unit tests on N300 device
Running the Stable Diffusion (SD) unit tests with WH_ARCH_YAML set on N300 devices hangs non-deterministically (ND).
To reproduce the issue, switch to the main branch and run the following on an N300 device:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/ttnn/integration_tests/stable_diffusion
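Because the hang is non-deterministic, a single run may pass. A minimal sketch of a rerun loop follows, assuming a wall-clock timeout is an acceptable proxy for "hung". The helper name `first_hang`, the attempt count, and the timeout value are hypothetical and not part of tt-metal. A hung device may also need a reset before the next attempt, which this sketch does not handle.

```python
import os
import subprocess


def first_hang(cmd, extra_env, attempts=20, timeout_s=1800.0):
    """Rerun cmd until one run exceeds timeout_s.

    Returns the 1-based attempt number that timed out, or None if every
    attempt finished in time (whether it passed or failed).
    """
    # Layer the test-specific variables on top of the current environment.
    env = {**os.environ, **extra_env}
    for attempt in range(1, attempts + 1):
        try:
            # check=False: a failing test is not a hang, so keep looping.
            subprocess.run(cmd, env=env, timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before re-raising the timeout.
            return attempt
    return None
```

For instance, `first_hang(["pytest", "tests/ttnn/integration_tests/stable_diffusion"], {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"})` would rerun the command above and report the first attempt that exceeded the timeout.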
EDIT:
Running the same test with the watcher enabled in the fast-dispatch CI raises the following std::runtime_error in tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py:
terminate called after throwing an instance of 'std::runtime_error'
what(): Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
Fatal Python error: Aborted
Thread 0x00007f3744ff9700 (most recent call first):
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 558 in wait
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007f38db2c1740 (most recent call first):
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 616 in call_wrapper
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 693 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 471 in get_attention_scores_opt
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 706 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_basic_transformer_block.py", line 90 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_transformer_2d.py", line 298 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attn_upblock.py", line 153 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py", line 321 in test_cross_attn_up_block_2d_512x512
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>
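Most frames in the dump above belong to pytest, pluggy, or the standard library. A small sketch for pulling out only the repository frames is shown below. It assumes the dump uses the `File "<path>", line <n> in <func>` layout seen here and that the checkout path contains `/tt-metal/tt-metal/`, as on this CI runner. The function name `project_frames` and the `repo_marker` default are illustrative, not part of any tool.

```python
import re

# Matches one frame line of a Python faulthandler dump.
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>\S+)')


def project_frames(dump, repo_marker="/tt-metal/tt-metal/"):
    """Return (relative_path, line, function) for frames inside the repo,
    skipping the virtualenv (python_env) and anything outside the checkout."""
    frames = []
    for match in FRAME_RE.finditer(dump):
        path = match.group("path")
        if repo_marker not in path or "/python_env/" in path:
            continue
        relative = path.split(repo_marker, 1)[1]
        frames.append((relative, int(match.group("line")), match.group("func")))
    return frames


# Four frame lines copied from the dump above.
sample = '''
File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
'''

for relative, line, func in project_frames(sample):
    print(f"{relative}:{line} in {func}")
```

Applied to the full dump, the filter would leave the ttnn decorator frames, the Stable Diffusion model frames ending in time_sharded_attention, and the test function itself, since the dump lists the most recent call first.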
About this issue
- State: open
- Created 2 months ago
- Comments: 17 (10 by maintainers)
Commits related to this issue
- #7502: skipping 'integration_tests/stable_diffusion' bc of hangs: issue #7560 — committed to tenstorrent/tt-metal by vtangTT 2 months ago
- #7560: Add SD tests to FD nightly with legacy pass — committed to tenstorrent/tt-metal by mtatsumiTT 2 months ago
- #7560: Slow down MMs when running non-perf tests to avoid ND hang. No hang observed on BM machine where perf tests are run. — committed to tenstorrent/tt-metal by AleksKnezevic 2 months ago
- #7560 Revert temp chane. — committed to tenstorrent/tt-metal by AleksKnezevic 2 months ago
- #7560: Slow down MMs when running non-perf tests to avoid ND hang. No hang observed on BM machine where perf tests are run. — committed to tenstorrent/tt-metal by AleksKnezevic 2 months ago
Hang is still present on both tests. I rebased and pushed aknezevic/repro_MM_hang.
The watcher log of the previous test is different from the one we saw previously. Perhaps a different style of hang? That one can be reproduced using a 5-6 op unit test on the same branch:
WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest tests/ttnn/integration_tests/stable_diffusion/test_sharded_attention.py::test_time_sharded_attnention -k 4096
@jliangTT, can you please coordinate?
I have found a way to reliably repro on the submodule and am trying to isolate it further.