tt-metal: All Gather test hangs after many runs in loop

This hang was discovered when trying a potential workaround (disabling program caching) for #6363.

Repro branch: snijjar/issue-6363, with program caching disabled.

The following test configuration hangs reliably after about 120 loop iterations:

tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py::test_all_gather_on_t3000_post_commit_looping[500-mem_config1-input_dtype0-8-1-input_shape2-3-layout2] (you may need to manually update the num_iters parametrization to include 500; see the sketch below)
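For reference, a minimal sketch of what that parametrization change might look like. This is hypothetical: the real decorator stack in test_all_gather.py covers many more parameters (mem_config, input_dtype, number of devices, input_shape, layout, ...), and the test body below is a placeholder, not the actual test.

```python
import pytest

# Hypothetical sketch only: add 500 to the num_iters parametrization so the
# looping post-commit test can run enough iterations to hit the hang.
@pytest.mark.parametrize("num_iters", [1, 500])
def test_all_gather_on_t3000_post_commit_looping(num_iters):
    for _ in range(num_iters):
        pass  # placeholder for the actual all-gather call and output check
```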

This hang was seen during a local run of the post-commit test cases: pytest tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py -k "post_commit"

It may also be reproducible with the standalone test command listed above.

Enabling the watcher avoids the hang.
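For completeness, a hedged sketch of how the watcher-enabled run could be launched, assuming the TT_METAL_WATCHER environment variable is the switch that turns the watcher on and that its value is the polling interval in seconds:

```python
import os
import subprocess

# Sketch only: run the post-commit all-gather tests with the watcher enabled.
# TT_METAL_WATCHER is assumed to be read at device initialization; the value
# is treated as the watcher polling interval in seconds.
env = dict(os.environ, TT_METAL_WATCHER="10")
subprocess.run(
    [
        "pytest",
        "tests/tt_eager/python_api_testing/unit_testing/misc/test_all_gather.py",
        "-k", "post_commit",
    ],
    env=env,
    check=True,
)
```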

Note on priority. There are three concerns with this:

  • This isn’t strictly a blocker for merging this development branch into main, since I can artificially reduce the number of loop iterations we run for the all-gather test.
  • This has a reasonable probability of hitting in real workloads, given back-to-back commands being run in sequence.
  • There is a “workaround,” but I don’t think it’s one we can reliably apply: the test doesn’t hang with the watcher enabled, but we would then want to enable the watcher on all models (or perhaps only conditionally when a hang is suspected) until this is fixed.

UPDATE:

Bidirectional all-gather support was temporarily disabled on main. Before retesting with fast-dispatch 2, please re-enable this feature: revert the line change to tt_eager/tt_dnn/op_library/all_gather/all_gather_op.hpp from commit b989c269a64c6d40a7523a94f3c4ba16b1eefd20, or ask me for assistance.
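A hedged sketch of one way to restore the pre-change version of that header before retesting, assuming b989c269… is the commit that disabled the feature (so its parent still has bidirectional support enabled):

```python
import subprocess

# Sketch only: check out all_gather_op.hpp as it was immediately before the
# commit that disabled bidirectional all-gather, leaving the rest of the
# working tree untouched.
subprocess.run(
    [
        "git", "checkout",
        "b989c269a64c6d40a7523a94f3c4ba16b1eefd20~1", "--",
        "tt_eager/tt_dnn/op_library/all_gather/all_gather_op.hpp",
    ],
    check=True,
)
```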

About this issue

  • State: open
  • Created 4 months ago
  • Comments: 31 (22 by maintainers)

Most upvoted comments

Small update here. Even after this gets triaged, I think it will still be a P0. It looks like it’s affecting a branch of mine (PR about to be opened) that has sharded all-gather optimizations. That branch is needed because the current sharded all-gather support is functional only and extremely non-performant.

@jliangTT Can we escalate this to P0? Any hangs we are seeing with models can’t be debugged unless all_gather is stable.