tt-metal: Non-deterministic Hang in Llama2-70B Prefill at 2K Sequence Length
Description
We observe a non-deterministic (ND) hang with the Llama2-70B model during prefill at 2K sequence length on 80 layers; it occurs after refilling roughly 6 to 10 users. We notice the following behavior:
- The hang does not occur with the watcher enabled (see the sketch below this list), which is similar to what was reported in #6795.
- The hang does not occur with smaller sequence lengths (e.g., 128).
- The hang occurs regardless of whether the program cache is enabled or disabled (with the program cache enabled, the hang usually occurs earlier).
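For reference, a minimal sketch of how the watcher is toggled between runs; the `TT_METAL_WATCHER` environment variable is an assumption based on the standard tt-metal watcher mechanism, not something stated in this issue:

```bash
# Assumption: tt-metal's watcher is controlled via the TT_METAL_WATCHER
# environment variable, whose value is the polling interval in seconds.
export TT_METAL_WATCHER=10   # watcher enabled: the hang does not reproduce
unset TT_METAL_WATCHER       # watcher disabled: the hang reproduces
```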
Our work is on the model-team/demo-perf branch, which was rebased onto main yesterday.
To Reproduce
Steps to reproduce the behavior:
Device: T3000
1. Check out the model-team/demo-perf branch.
2. Build tt-metal.
3. Run:
```bash
pytest -svv models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L] 2>&1 | tee perplexity_80L_watcher.log
```
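A consolidated shell sketch of the steps above, for convenience; the `build_metal.sh` script name is an assumption based on the usual tt-metal build flow and may differ on this branch:

```bash
# Assumes a T3000 system with a tt-metal checkout and its build
# prerequisites already in place.
git fetch origin
git checkout model-team/demo-perf

# Build tt-metal (build_metal.sh is assumed; use your branch's build flow).
./build_metal.sh

# Leave TT_METAL_WATCHER unset: the hang does not reproduce with the
# watcher enabled (see above). The test id is quoted so the shell does
# not treat the brackets as a glob pattern.
pytest -svv "models/demos/llama2_70b/demo/eval.py::test_LlamaModel_demo[wikitext-2k-greedy-tt-70b-T3000-80L]" \
  2>&1 | tee perplexity_80L_watcher.log
```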
About this issue
- State: open
- Created 3 months ago
- Comments: 47 (40 by maintainers)
Comments
- Great, fingers crossed. Would love for FD2 to resolve a P0 (or two or…)
- @cglagovich let me know if you want to try 500 MHz, I can set that up.
- This is a hanging test that hung at 800 MHz, which we just tried to repro at 1 GHz. It is not in CI and is not run privately by Taps, since it is known to hang.