tt-metal: All Gather test hangs after many runs in loop

This hang was discovered when trying a potential workaround (disabling program caching) for #6363.

Repro branch: snijjar/issue-6363, with program caching disabled.

The following test configuration hangs reliably after about 120 loop iterations:

tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[500-mem_config1-input_dtype0-8-1-input_shape2-3-layout2] (you may need to manually update the num_iters parametrization to include 500; see the sketch below)
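For reference, a minimal sketch of what that parametrization change might look like. This is hypothetical: the real decorator stack in test_all_gather.py covers many more parameters (mem_config, input_dtype, number of devices, input_shape, layout, ...), and the test body below is a placeholder, not the actual test.

```python
import pytest

# Hypothetical sketch only: add 500 to the num_iters parametrization so the
# looping post-commit test can run enough iterations to hit the hang.
@pytest.mark.parametrize("num_iters", [1, 500])
def test_all_gather_on_t3000_post_commit_looping(num_iters):
    for _ in range(num_iters):
        pass  # placeholder for the actual all-gather call and output check
```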

This hang was seen during a local run of the post-commit test cases: pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py -k "post_commit"

It may also be reproducible with the standalone test command listed above.

Enabling the watcher avoids the hang.
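For completeness, a hedged sketch of how the watcher-enabled run could be launched, assuming the TT_METAL_WATCHER environment variable is the switch that turns the watcher on and that its value is the polling interval in seconds:

```python
import os
import subprocess

# Sketch only: run the post-commit all-gather tests with the watcher enabled.
# TT_METAL_WATCHER is assumed to be read at device initialization; the value
# is treated as the watcher polling interval in seconds.
env = dict(os.environ, TT_METAL_WATCHER="10")
subprocess.run(
    [
        "pytest",
        "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py",
        "-k", "post_commit",
    ],
    env=env,
    check=True,
)
```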

Note on priority. There are three concerns with this:

  • This isn’t strictly a blocker for merging this development branch into main, since I can artificially reduce the number of loop iterations we run for the all-gather test.
  • This has a reasonable probability of hitting in real workloads, given back-to-back commands being run in sequence.
  • There is a “workaround,” but I don’t think it’s one we can reliably apply: the test doesn’t hang with the watcher enabled, but we would then want to enable the watcher on all models (or perhaps only conditionally when a hang is suspected) until this is fixed.

UPDATE:

Bidirectional all-gather support was temporarily disabled on main. Before retesting with fast-dispatch 2, please re-enable this feature: revert the line change to tt_eager/tt_dnn/op_library/all_gather/all_gather_op.hpp from commit b989c269a64c6d40a7523a94f3c4ba16b1eefd20, or ask me for assistance.
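A hedged sketch of one way to restore the pre-change version of that header before retesting, assuming b989c269… is the commit that disabled the feature (so its parent still has bidirectional support enabled):

```python
import subprocess

# Sketch only: check out all_gather_op.hpp as it was immediately before the
# commit that disabled bidirectional all-gather, leaving the rest of the
# working tree untouched.
subprocess.run(
    [
        "git", "checkout",
        "b989c269a64c6d40a7523a94f3c4ba16b1eefd20~1", "--",
        "tt_eager/tt_dnn/op_library/all_gather/all_gather_op.hpp",
    ],
    check=True,
)
```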

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 31 (22 by maintainers)

Most upvoted comments

Small update here. Even after this gets triaged, I think it will still be a P0. It looks like it’s affecting a branch of mine (PR about to be opened) that has sharded all-gather optimizations. That branch is needed because the current sharded all-gather support is functional only and extremely non-performant.

@jliangTT Can we escalate this to P0? Any hangs we are seeing with models can’t be debugged unless all_gather is stable.