tt-metal: Llama2 demo hangs while creating layers
Describe the bug
When running the Llama2 demo on T3000 (8 chips), my process stops making progress while instantiating layers. All weights have been cached to disk, so creating a layer should consist only of loading dumped tensors and pushing them to the device.
The strange behavior I need help with is that each time I run the demo, more layers get loaded before the hang: for example 15 layers, then 33, then 40, then 51, and so on.
Repros are difficult because this seems to happen only when the machine hasn’t been used in a while, and a large number of cached layers must be loaded before the hang appears.
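Since the hang shows up at a different layer on each run, it may help to record how long each layer takes and to capture a stack dump when one stalls. Below is a minimal sketch using Python's standard `faulthandler` module. The `load_layer` callable, the function name, and the timeout value are hypothetical placeholders, not part of tt-metal; `load_layer` stands in for whatever the demo does to read a cached tensor and push it to the device.

```python
import faulthandler
import sys
import time


def load_layers_with_watchdog(load_layer, num_layers, timeout_s=120.0, log=sys.stderr):
    """Call load_layer(i) for each layer and dump thread stacks if one stalls.

    load_layer is a hypothetical stand-in for the demo's per-layer loading
    step (read cached tensor from disk, push to device). `log` must be a
    real file object (it needs a file descriptor).
    """
    durations = []
    for i in range(num_layers):
        # Arm the watchdog: if this layer takes longer than timeout_s,
        # the Python tracebacks of all threads are written to `log`.
        # exit=False keeps the process alive for further inspection.
        faulthandler.dump_traceback_later(timeout_s, exit=False, file=log)
        start = time.monotonic()
        try:
            load_layer(i)
        finally:
            # Disarm the watchdog once the layer has finished (or raised).
            faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        durations.append(elapsed)
        print(f"layer {i} loaded in {elapsed:.2f}s", file=log, flush=True)
    return durations
```

The dump only shows Python frames, so if the stall is inside native code it will point at the last Python call that entered it rather than the exact location. The per-layer timings may still show whether loads slow down gradually or stop abruptly.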
To Reproduce
We’ll have to discuss an easier way to reproduce this. Currently it can only be reproduced on sjc-snva-t3002.
Please complete the following environment information:
- Commit: 9ab58637b73f520082d1b4d36011774f1befeb25
- Machine: sjc-snva-t3002
About this issue
- State: open
- Created 3 months ago
- Comments: 30 (22 by maintainers)
Jim’s preference: don’t de-escalate until we know what is happening. We don’t want landmines like this even if there are workarounds.