tt-metal: ND Hang on several different ops on T3000 (not on N150/N300)
We are seeing a hang for a specific matmul compute kernel config.
Hanging config:
"COMPUTE_KERNEL_HIFI4_CONFIG": ttl.tensor.WormholeComputeKernelConfig(
math_fidelity=ttl.tensor.MathFidelity.HiFi4,
math_approx_mode=True,
fp32_dest_acc_en=True,
packer_l1_acc=True,
)
Passing configs:
"COMPUTE_KERNEL_CONFIG": ttl.tensor.WormholeComputeKernelConfig(
math_fidelity=ttl.tensor.MathFidelity.LoFi,
math_approx_mode=True,
fp32_dest_acc_en=True,
packer_l1_acc=True,
)
"COMPUTE_KERNEL_HIFI4_CONFIG_FP16_DEST": ttl.tensor.WormholeComputeKernelConfig(
math_fidelity=ttl.tensor.MathFidelity.HiFi4,
math_approx_mode=True,
fp32_dest_acc_en=False,
packer_l1_acc=True,
)
Thus, switching from LoFi (COMPUTE_KERNEL_CONFIG) to HiFi4 (COMPUTE_KERNEL_HIFI4_CONFIG) results in a hang; likewise, when already using HiFi4, switching from fp32_dest_acc_en=False (COMPUTE_KERNEL_HIFI4_CONFIG_FP16_DEST) to fp32_dest_acc_en=True (COMPUTE_KERNEL_HIFI4_CONFIG) results in a hang.
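The three configs differ only in math_fidelity and fp32_dest_acc_en; math_approx_mode and packer_l1_acc are identical. A minimal sketch that makes the delta explicit (assuming ttl is imported as tt_lib; make_compute_config is just an illustrative helper, not part of the test, and it only uses the WormholeComputeKernelConfig constructor shown above):

import tt_lib as ttl

def make_compute_config(math_fidelity, fp32_dest_acc_en):
    # Only these two fields vary between the hanging and passing configs.
    return ttl.tensor.WormholeComputeKernelConfig(
        math_fidelity=math_fidelity,
        math_approx_mode=True,
        fp32_dest_acc_en=fp32_dest_acc_en,
        packer_l1_acc=True,
    )

hanging      = make_compute_config(ttl.tensor.MathFidelity.HiFi4, True)   # hangs
passing_lofi = make_compute_config(ttl.tensor.MathFidelity.LoFi,  True)   # passes
passing_fp16 = make_compute_config(ttl.tensor.MathFidelity.HiFi4, False)  # passes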
To Reproduce
(1)
Check out the branch jrock/prefill-partial-attn-pr-hang-repro
and run pytest models/demos/falcon40b/tests/test_falcon_end_to_end.py::test_FalconCausalLM_end_to_end_with_program_cache[BFLOAT8_B-DRAM-falcon_40b-layers_12-prefill_seq2048-8chips-disable_program_cache]
The test should pass.
(2)
Revert the last commit 203ea038db6beaa4e8b0f872a961bc28bcd626b3
and run the test again: pytest models/demos/falcon40b/tests/test_falcon_end_to_end.py::test_FalconCausalLM_end_to_end_with_program_cache[BFLOAT8_B-DRAM-falcon_40b-layers_12-prefill_seq2048-8chips-disable_program_cache]
This will produce the hang. The only difference is whether we use COMPUTE_KERNEL_HIFI4_CONFIG or COMPUTE_KERNEL_CONFIG for the first attention matmul (Q*K^T).
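For orientation, a hedged sketch of where the two branches diverge; the tensor names, the matmul entry point (ttl.operations.primary.matmul) and its keyword arguments are assumptions rather than the actual falcon40b code, and the only thing that changes between the passing and hanging runs is the compute_kernel_config argument:

def first_attention_matmul(query_layer, key_layer_transposed, compute_kernel_config):
    # compute_kernel_config is swapped between COMPUTE_KERNEL_HIFI4_CONFIG (hangs)
    # and COMPUTE_KERNEL_CONFIG (passes); everything else stays the same.
    return ttl.operations.primary.matmul(
        query_layer,
        key_layer_transposed,
        compute_kernel_config=compute_kernel_config,
    )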
Impact
This is not blocking us since we can use the passing compute configs. However, producing a hang by switching to HiFi4 and/or enabling fp32_dest_acc_en should be investigated and fixed.
About this issue
- Original URL
- State: closed
- Created 2 months ago
- Comments: 27 (15 by maintainers)
@johanna-rock-tt for repro, does this need to be run on T3000? It would be a lot better if we could factor out this mm into a single UT that hangs on a single device (the chances of someone picking it up would be much higher).
@jliangTT that’s correct - it doesn’t seem connected to a specific op
I’ve tried running this today with both softmax and all_gather commented out (leaving only the I2SPartial -> MM1 -> MM2 -> S2IPartial sequence), and the hang was still happening. Following that, I’ve updated the test to be reproducible on a single chip on T3000 and introduced explicit sync points after each op via tt_lib.device.Synchronize(). It turns out the hang can occur after any op in the sequence (I2SP, MM1, MM2 or S2IP), completely non-deterministically. I’ve also run the same unit test on N150, and @pavlejosipovic has run it on N300 (with an 8x8 grid); on those chips the hang isn’t observable even after 100k loops. I’ve also run this at 500 MHz on T3000, and the hang didn’t go away.
Here’s the updated unit test: test_hang.txt (please rename it to .py when running, as GitHub doesn’t allow *.py attachments).
Run command: pytest test_hang.py -k "test_hanging_attn and 8000_loops and 1chips"
(Make sure to set WH_ARCH_YAML="wormhole_b0_80_arch_eth_dispatch.yaml" before running the test.)
Does anybody have any idea where to go next with the investigation?
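The sync-point pattern described above looks roughly like the sketch below; the i2sp/mm1/mm2/s2ip callables are hypothetical placeholders for the real ops in test_hang.py, and passing the device handle to tt_lib.device.Synchronize() is assumed.

import tt_lib

def run_with_sync_points(device, x, i2sp, mm1, mm2, s2ip, num_loops):
    # Placeholders stand in for the real I2SP -> MM1 -> MM2 -> S2IP sequence.
    for _ in range(num_loops):
        y = i2sp(x)
        tt_lib.device.Synchronize(device)  # hang observed here on some iterations...
        y = mm1(y)
        tt_lib.device.Synchronize(device)  # ...or here...
        y = mm2(y)
        tt_lib.device.Synchronize(device)  # ...or here...
        y = s2ip(y)
        tt_lib.device.Synchronize(device)  # ...or here, non-deterministically
    return y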
Using watcher revealed that the hang is in either all_gather or fast dispatch; the MM just exposed it. Will update more on Monday.
Assigning to @jliangTT to reassign / prioritize.