tt-metal: Llama2 demo hangs while creating layers

Describe the bug

When running the Llama2 demo on a T3000 (8 chips), my process stops making progress while instantiating layers. All weights have already been cached to disk, so creating a layer should consist only of loading the dumped tensors and pushing them to the device.

The strange behavior I need help with is that each time I run the demo, more layers get loaded before the hang: for example 15 layers, then 33, then 40, then 51, and so on.

This is difficult to reproduce: it seems to happen only when the machine has not been used for a while, and a large number of cached layers must be loaded before the hang appears.
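Because the hang point moves from run to run, it may help to record exactly which layer stalls and where the Python side is blocked at that moment. The following is only a sketch using the standard-library `faulthandler` module; `load_layers_with_watchdog`, `load_layer`, `stall_timeout_s`, and `trace_file` are hypothetical names for illustration and are not part of the demo or of tt-metal. Note that `faulthandler` reports Python frames only, so a stall inside native code shows up as the Python call that entered it.

```python
import faulthandler
import sys
import time


def load_layers_with_watchdog(num_layers, load_layer, stall_timeout_s=120.0,
                              trace_file=sys.stderr):
    """Call load_layer(i) for each layer and dump thread stacks on a stall.

    load_layer is a placeholder for whatever the demo does per layer
    (load the cached tensors, push them to device).
    Returns a list of (layer_index, seconds) for the layers that finished.
    """
    timings = []
    for i in range(num_layers):
        # Re-arm the watchdog for this layer. If load_layer(i) has not
        # returned after stall_timeout_s seconds, faulthandler writes the
        # traceback of every Python thread to trace_file. exit=False keeps
        # the process alive so it can still be inspected.
        faulthandler.dump_traceback_later(stall_timeout_s, exit=False,
                                          file=trace_file)
        start = time.monotonic()
        try:
            load_layer(i)
        finally:
            # Disarm so a fast layer never triggers a late dump.
            faulthandler.cancel_dump_traceback_later()
        elapsed = time.monotonic() - start
        timings.append((i, elapsed))
        print(f"layer {i} loaded in {elapsed:.2f}s", file=trace_file, flush=True)
    return timings
```

The last "layer N loaded" line gives the index reached before the hang, and the traceback dump shows which call the stalled layer was in. Comparing both across runs would show whether the stall always occurs in the same operation even though the layer count differs.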

To Reproduce

We’ll have to discuss an easier way to reproduce this. Currently it can only be reproduced on sjc-snva-t3002.

Please complete the following environment information:

  • Commit 9ab58637b73f520082d1b4d36011774f1befeb25
  • Machine sjc-snva-t3002

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 30 (22 by maintainers)

Most upvoted comments

Jim’s preference: don’t de-escalate until we know what is happening. We don’t want landmines like this even if there are workarounds.