tt-metal: Llama2-70B Prefill on 2K Sequence Length Non-Deterministic Hang

Description

We observe a non-deterministic (ND) hang with the Llama2-70B model during prefill at 2K sequence length on 80 layers; it occurs after prefilling roughly 6 to 10 users. We notice the following behavior:

  • The hang does not occur with the watcher enabled (see the sketch after this list), which is similar to what #6795 reported.
  • The hang does not occur with smaller sequence lengths (e.g., 128).
  • The hang occurs regardless of whether the program cache is enabled or disabled (with the program cache enabled, the hang usually occurs earlier). Our work is on model-team/demo-perf, which was rebased onto main yesterday.
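
For reference, a minimal sketch of how we toggle the watcher between runs. TT_METAL_WATCHER is the standard tt-metal watcher control (its value is the polling interval in seconds); the 10-second interval here is an arbitrary choice. The program cache, by contrast, is toggled through the test's use_program_cache pytest fixture, assuming the usual tt-metal conftest setup.

# Enable the watcher, polling device state every 10 seconds.
# In this configuration the hang does not reproduce.
export TT_METAL_WATCHER=10
pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L]

# Disable the watcher to return to the hanging configuration.
unset TT_METAL_WATCHER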

To Reproduce

Steps to reproduce the behavior:

  • Device: T3000
  • Check out the model-team/demo-perf branch
  • Build tt-metal
  • Run the test (see the consolidated sketch below):

pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log
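
End to end, the repro is roughly the following; note that the ./build_metal.sh step is an assumption based on the standard tt-metal build flow and may differ for your checkout:

git checkout model-team/demo-perf
./build_metal.sh   # assumed standard tt-metal build script
pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log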

Most upvoted comments

Great, fingers crossed. Would love for FD2 to resolve a P0 (or two or…)

@cglagovich let me know if you want to try 500 MHz; I can set that up.

This is a hanging test which hung at 800 MHz and which we just tried to repro at 1 GHz. It is not in CI and is not run privately by Taps, since it's known to hang.