tt-metal: ND hang of SD unit tests on N300 device

Running the Stable Diffusion (SD) unit tests with WH_ARCH_YAML set on N300 devices hangs non-deterministically.

To repro the issue, switch to the main branch and run the following on an N300 device:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml pytest tests/ttnn/integration_tests/stable_diffusion
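
Because the hang is non-deterministic, a single run may pass. A minimal sketch of a wrapper that repeats the command above under a wall-clock timeout and reports the first attempt that does not pass is shown below. The helper names (`run_once`, `repro_loop`), the attempt count and the timeout value are assumptions for illustration and are not part of tt-metal. The loop stops at the first hang or failure on the assumption that the device may need a reset before it can be used again.

```python
import os
import subprocess
import sys

# Command and environment taken from the repro line above.
REPRO_CMD = ["pytest", "tests/ttnn/integration_tests/stable_diffusion"]
REPRO_ENV = {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"}


def run_once(cmd, timeout_s, extra_env=None):
    """Run cmd once and return 'pass', 'fail', or 'hang' (timeout exceeded)."""
    env = dict(os.environ)
    env.update(extra_env or {})
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(cmd, env=env, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if result.returncode == 0 else "fail"


def repro_loop(cmd, attempts, timeout_s, extra_env=None):
    """Repeat the run; stop at the first attempt that does not pass."""
    for attempt in range(1, attempts + 1):
        outcome = run_once(cmd, timeout_s, extra_env)
        print(f"attempt {attempt}: {outcome}")
        if outcome != "pass":
            return attempt, outcome
    return None, "pass"
```

For example, `repro_loop(REPRO_CMD, attempts=10, timeout_s=1800, extra_env=REPRO_ENV)` would try up to ten runs with a 30-minute limit each; both numbers are placeholders to adjust to the observed runtime of a passing run.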

EDIT: Running the same test with the watcher enabled in the fast-dispatch CI raises the std::runtime_error below in tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py (full log):

terminate called after throwing an instance of 'std::runtime_error'
  what():  Read 0xffffffff from ARC scratch[6]: auto-reset succeeded.
Fatal Python error: Aborted
Thread 0x00007f3744ff9700 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 306 in wait
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 558 in wait
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/tqdm/_monitor.py", line 60 in run
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 932 in _bootstrap_inner
  File "/home/ubuntu/actions-runner/_work/_tool/Python/3.8.18/x64/lib/python3.8/threading.py", line 890 in _bootstrap
Thread 0x00007f38db2c1740 (most recent call first):
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 410 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 616 in call_wrapper
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/ttnn/ttnn/decorators.py", line 693 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 306 in time_sharded_attention
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 471 in get_attention_scores_opt
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attention.py", line 706 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_basic_transformer_block.py", line 90 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_transformer_2d.py", line 298 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/models/experimental/functional_stable_diffusion/tt2/ttnn_functional_cross_attn_upblock.py", line 153 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/tests/ttnn/integration_tests/stable_diffusion/test_cross_attn_up_block_2d.py", line 321 in test_cross_attn_up_block_2d_512x512
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 195 in pytest_pyfunc_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/python.py", line 1789 in runtest
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 167 in pytest_runtest_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 260 in <lambda>
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 339 in from_call
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 220 in call_and_report
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 131 in runtestprotocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/runner.py", line 112 in pytest_runtest_protocol
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 324 in _main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/lib/python3.8/site-packages/_pytest/config/__init__.py", line 190 in console_main
  File "/home/ubuntu/actions-runner/_work/tt-metal/tt-metal/python_env/bin/pytest", line 8 in <module>
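
Most of the frames in the dump above belong to pytest, pluggy and the standard library. A small sketch for reducing a dump of this shape to the frames that do not come from installed packages or the standard library is shown below. The function name `project_frames` and the path markers used for filtering are assumptions chosen to match the paths in this log, not a general rule.

```python
import re

# Matches lines of the form:   File "<path>", line <N> in <function>
FRAME_RE = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+) in (?P<func>.+)$')

# Assumption: frames under these path fragments are third-party or stdlib code.
SKIP_MARKERS = ("site-packages", "/lib/python")


def project_frames(dump_text):
    """Return (file name, line number, function) for frames outside SKIP_MARKERS."""
    frames = []
    for line in dump_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # thread headers, error messages, blank lines
        path = match.group("path")
        if any(marker in path for marker in SKIP_MARKERS):
            continue
        file_name = path.rsplit("/", 1)[-1]
        frames.append((file_name, int(match.group("line")), match.group("func")))
    return frames
```

Applied to the dump above, this keeps the ttnn decorator, functional_stable_diffusion model and test frames, with `call_wrapper` in ttnn/ttnn/decorators.py (line 410) as the most recent call on the main thread.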

fyi @AleksKnezevic @vtangTT @TT-billteng

About this issue

  • State: open
  • Created 2 months ago
  • Comments: 17 (10 by maintainers)

Most upvoted comments

The hang is still present in both tests. I rebased and pushed aknezevic/repro_MM_hang.

The watcher log from the previous test differs from the one we saw earlier. Perhaps a different style of hang? That one can be reproed using a 5-6 op unit test on the same branch:

WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml TT_METAL_WATCHER=1 TT_METAL_WATCHER_DISABLE_NOC_SANITIZE=1 pytest tests/ttnn/integration_tests/stable_diffusion/test_sharded_attention.py::test_time_sharded_attnention -k 4096

@jliangTT, can you please coordinate?

I have found a way to reliably repro on the submodule and am trying to isolate it further.