tt-metal: Llama2-70B Prefill on 2K Sequence Length Non-Deterministic Hang

Description

We observe a non-deterministic (ND) hang with the Llama2-70B model during prefill at 2K sequence length on 80 layers; it occurs after prefilling roughly 6 to 10 users. We notice the following behavior:

  • The hang does not occur with the watcher enabled (see the sketch after this list), which is similar to what #6795 reported.
  • The hang does not occur with smaller sequence lengths (e.g., 128).
  • The hang occurs regardless of whether the program cache is enabled or disabled (with the program cache enabled, the hang usually occurs earlier). Our work is on model-team/demo-perf, which was rebased onto main yesterday.
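
For reference, a minimal sketch of how we toggle the watcher between runs. TT_METAL_WATCHER is the standard tt-metal watcher control (its value is the polling interval in seconds); the 10-second interval here is an arbitrary choice. The program cache, by contrast, is toggled through the test's use_program_cache pytest fixture, assuming the usual tt-metal conftest setup.

# Enable the watcher, polling device state every 10 seconds.
# In this configuration the hang does not reproduce.
export TT_METAL_WATCHER=10
pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L]

# Disable the watcher to return to the hanging configuration.
unset TT_METAL_WATCHER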

To Reproduce

Steps to reproduce the behavior:

  • Device: T3000
  • Check out the model-team/demo-perf branch
  • Build tt-metal
  • Run the test (see the consolidated sketch below):

pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log
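
End to end, the repro is roughly the following; note that the ./build_metal.sh step is an assumption based on the standard tt-metal build flow and may differ for your checkout:

git checkout model-team/demo-perf
./build_metal.sh   # assumed standard tt-metal build script
pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log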

Most upvoted comments

Great, fingers crossed. Would love for FD2 to resolve a P0 (or two or…)

@cglagovich let me know if you want to try 500 MHz; I can set that up.

This is a hanging test which hung at 800 MHz and which we just tried to repro at 1 GHz. It is not in CI and is not run privately by Taps, since it's known to hang.