tt-metal: Falcon7b (tt-lib) non-deterministic demo hang on nebula x1

The Falcon7b demo randomly hangs during different invocations of the model forward pass (compile and inference, prefill and decode, but most often decode inference). The model also usually produces non-deterministic, incorrect output before hanging. Hangs and incorrect outputs become more likely as the number of output tokens (i.e. forward passes) increases. The hang frequency is machine-dependent, but it can occur as often as once every 1-4 runs of the demo.

Additional information:

800 MHz clock is being used
This is not a newly introduced bug (it was first observed in late February / early March)
The hang / non-deterministic (ND) outputs have never been observed on nebula x2 in 100 runs of the demo (note: the experiment was run on a single device of a t3000), except when forcing an 8x8 core grid using WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml. This makes the 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues)
The hang still occurs with slow dispatch (using TT_METAL_SLOW_DISPATCH_MODE=1)
The hang still occurs after forcing all ops to be blocking (by hacking HWCommandQueue::enqueue_command)
The last op running before the hang is inconsistent, but has been observed to be (from most to least frequent): the lm-head Matmul op, the RotaryEmbedding op, the EltwiseBinary Add op. All of these have DRAM-interleaved inputs and outputs under the default model config in the demo, and there are no sharded ops in the model, making DRAM-interleaved ops potential culprits
The hang / ND outputs have not yet been observed with TT_METAL_WATCHER=1, making timing a potential culprit
The hang / ND outputs occur more often with TT_METAL_LOGGER_TYPES=Op TT_METAL_LOGGER_LEVEL=DEBUG
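
The environment settings mentioned in the list above can be collected into a small table so each experiment is launched with exactly one set of overrides. This is only a sketch: the environment variable names and values come from the observations above, while the variant labels and the build_env helper are hypothetical names chosen for illustration.

```python
import os

# Environment overrides taken from the observations above. The dict keys
# are hypothetical labels for this sketch, not names used by tt-metal.
VARIANTS = {
    "baseline": {},
    "slow_dispatch": {"TT_METAL_SLOW_DISPATCH_MODE": "1"},
    "watcher": {"TT_METAL_WATCHER": "1"},
    "op_debug_logging": {
        "TT_METAL_LOGGER_TYPES": "Op",
        "TT_METAL_LOGGER_LEVEL": "DEBUG",
    },
    "forced_8x8_grid": {
        "WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml",
    },
}


def build_env(variant, base_env=None):
    """Return a copy of the base environment with one variant's overrides applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(VARIANTS[variant])
    return env
```

The returned dict can be passed as the env argument of subprocess.Popen or subprocess.run, so that a single demo invocation sees only the overrides of the variant under test.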

Instructions to stress-test demo:

Commit: b5fe44ddf7631e3d59cb953c238666113a76913d
bash models/demos/falcon7b/tests/run_demo_test.sh
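
Because the failure shows up as a hang rather than an error, estimating how often it occurs takes repeated runs with a timeout. The following is a minimal sketch of such a loop, not part of the repository: run_once, stress, and the timeout value are assumptions made for illustration, and the timeout must be chosen well above the duration of a healthy run on the machine in question. It assumes a POSIX system, since it kills the whole process group of a timed-out run.

```python
import os
import signal
import subprocess
from collections import Counter


def run_once(cmd, timeout_s, env=None):
    """Run cmd once and classify the result as 'pass', 'fail', or 'hang'."""
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # own process group, so children can be killed too
    )
    try:
        returncode = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Treat a timeout as a hang and kill the script plus anything it spawned.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just after the timeout fired
        proc.wait()
        return "hang"
    return "pass" if returncode == 0 else "fail"


def stress(cmd, runs, timeout_s, env=None):
    """Repeat run_once and count how many runs passed, failed, or hung."""
    return Counter(run_once(cmd, timeout_s, env) for _ in range(runs))


# Example (hypothetical run count and timeout):
# counts = stress(["bash", "models/demos/falcon7b/tests/run_demo_test.sh"],
#                 runs=20, timeout_s=1800)
# print(dict(counts))
```

Output is discarded here to keep the sketch short. Redirecting it to a per-run log file instead would make it possible to see which op was running last when a run hangs.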

About this issue

State: open
Created 3 months ago
Comments: 15 (10 by maintainers)

Commits related to this issue

#6795: Add grid size option to get_mcast_1d_config, set falcon7b matmul grid size on wh to 8x7 until hang issue is resolved Signed-off-by: Salar <skhorasgani@tenstorrent.com> — committed to tenstorrent/tt-metal by skhorasganiTT 3 months ago
Revert "#6795: Add grid size option to get_mcast_1d_config, set falcon7b matmul grid size on wh to 8x7 until hang issue is resolved" This reverts commit e766e2dbb3e59b79cb856bee1a0481e0fb016c0a. — committed to tenstorrent/tt-metal by rtawfik01 2 months ago

Most upvoted comments

Probably a timing issue. L1 accum is supposed to be faster?

TT-BrianLiu on Apr 2, 2024

I think we should have @TT-BrianLiu start taking a look at the matmul behavior. Triaging to the op_cat: mm queue

jliangTT on Apr 2, 2024