tt-metal: Falcon7b (tt-lib) non-deterministic demo hang on nebula x1

The Falcon7b demo randomly hangs during different invocations of the model forward pass (both compile and inference, and both prefill and decode, but usually decode inference). Additionally, the model usually produces non-deterministic (ND) and incorrect output before hanging. The hangs and incorrect outputs become more likely as the number of output tokens increases (i.e. with more forward passes). The frequency of the hang is machine-dependent, but it can occur as often as once every 1-4 runs of the demo.

Additional information:

  • 800 MHz clock is being used
  • This is not a newly introduced bug (first observed in late February/early March)
  • The hang/ND outputs have never been observed on nebula x2 in 100 runs of the demo (note: the experiment was run on a single device of a t3000), except when forcing an 8x8 core grid using WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml. This makes the 8x8 grid size a potential culprit (unless running fast dispatch on idle ethernet cores is causing other issues)
  • The hang still occurs with slow dispatch (using TT_METAL_SLOW_DISPATCH_MODE=1)
  • The hang still occurs after forcing all ops to be blocking (by hacking HWCommandQueue::enqueue_command)
  • The last op running before the hang is inconsistent, but has been observed to be (from most frequent to least): the lm-head Matmul op, the RotaryEmbedding op, the EltwiseBinary Add op. All of these have DRAM-interleaved inputs and outputs under the default model config in the demo, and there are no sharded ops in the model, making DRAM-interleaved ops potential culprits
  • The hang/ND outputs have not yet been observed using TT_METAL_WATCHER=1, making timing a potential culprit
  • The hang/ND outputs occur more often when using TT_METAL_LOGGER_TYPES=Op TT_METAL_LOGGER_LEVEL=DEBUG
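
The environment-variable experiments listed above can be collected into a small table so that each one is applied the same way. The following is a minimal Python sketch, not part of the repository: the dictionary keys and the build_env helper are hypothetical names introduced here, while the variable names and values are the ones quoted in the list.

```python
import os

# Environment overrides quoted in the list above. The keys are hypothetical
# labels for this sketch; the comments restate what the report observed.
CONFIGS = {
    "default": {},
    # Hang still occurs.
    "slow_dispatch": {"TT_METAL_SLOW_DISPATCH_MODE": "1"},
    # Hang/ND outputs not yet observed.
    "watcher": {"TT_METAL_WATCHER": "1"},
    # Hang/ND outputs occur more often.
    "op_debug_logging": {
        "TT_METAL_LOGGER_TYPES": "Op",
        "TT_METAL_LOGGER_LEVEL": "DEBUG",
    },
    # Forces the 8x8 core grid; reproduced the problem on nebula x2.
    "forced_8x8_grid": {"WH_ARCH_YAML": "wormhole_b0_80_arch_eth_dispatch.yaml"},
}


def build_env(config_name, base_env=None):
    """Return a copy of base_env (default: os.environ) with one config applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(CONFIGS[config_name])
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the demo, so that only one variable set changes between experiments.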

Instructions to stress-test demo:

Commit: b5fe44ddf7631e3d59cb953c238666113a76913d

bash models/demos/falcon7b/tests/run_demo_test.sh
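
Because the hang is intermittent, it can help to repeat the command under a timeout and tally the outcomes. The following Python sketch is one way to do that and is not part of the repository: run_once, stress, the run count, and the timeout value are all assumptions made for illustration. It hashes raw stdout as a rough non-determinism check; if the demo prints timings or other run-specific text, the output would need filtering down to the generated tokens first.

```python
import hashlib
import subprocess
from collections import Counter


def run_once(cmd, timeout_s, env=None):
    """Run cmd once and return (status, stdout_digest).

    status is "ok" (exit code 0), "fail" (non-zero exit code), or "hang"
    (no exit within timeout_s seconds). The digest is None for a hang.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s, env=env)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; grandchildren
        # started by a shell script may still need manual cleanup.
        return "hang", None
    status = "ok" if proc.returncode == 0 else "fail"
    return status, hashlib.sha256(proc.stdout).hexdigest()


def stress(cmd, runs, timeout_s, env=None):
    """Repeat cmd and count statuses and distinct stdout digests."""
    statuses = Counter()
    digests = Counter()
    for _ in range(runs):
        status, digest = run_once(cmd, timeout_s, env)
        statuses[status] += 1
        if digest is not None:
            digests[digest] += 1
    return statuses, digests


# Example call (assumed values; a hung device may also need a reset between runs):
# statuses, digests = stress(
#     ["bash", "models/demos/falcon7b/tests/run_demo_test.sh"],
#     runs=20, timeout_s=1800)
# More than one key in digests suggests the output differed between runs.
```

Counting distinct digests separately from exit statuses keeps the two symptoms apart: a run can finish with exit code 0 and still produce different output from the previous run.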

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 15 (10 by maintainers)

Most upvoted comments

Probably a timing issue. L1 accum is supposed to be faster?

I think we should have @TT-BrianLiu start taking a look at the matmul behavior. Triaging to the op_cat: mm queue.