tt-metal: [Bug Report] Matmul 1d/2d non-deterministic output and hangs when run in a loop
Describe the bug
We observe non-deterministic output of falcon_lm_head_matmul for 7b when we run it in a loop with random inputs.
To Reproduce
from loguru import logger
import tt_lib as ttl
from models.utility_functions import comp_pcc, tt2torch_tensor
import torch


def test_reproduce_lm_head_nd_32(
    device,
):
    in0_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.L1)
    in1_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.DRAM)
    out_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.L1)

    in0_dtype = ttl.tensor.DataType.BFLOAT16
    in1_dtype = ttl.tensor.DataType.BFLOAT8_B
    out_dtype = ttl.tensor.DataType.BFLOAT16

    torch.manual_seed(1234)

    seq_len = 32
    a_shape = [1, 1, seq_len, 4544]
    b_shape = [1, 1, 4544, 65024]

    A = torch.randn(a_shape)
    B = torch.randn(b_shape) - 0.95

    a_t = ttl.tensor.Tensor(A, in0_dtype).to(ttl.tensor.Layout.TILE).to(device, in0_mem_config)
    b_t = ttl.tensor.Tensor(B, in1_dtype).to(ttl.tensor.Layout.TILE).to(device, in1_mem_config)
    bias_t = None

    # First run serves as the reference output
    out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, bias_t, output_mem_config=out_mem_config, output_dtype=out_dtype)
    ref_out = tt2torch_tensor(out)

    nd_output_count = 0

    for _ in range(100):
        out.deallocate(True)
        out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, bias_t, output_mem_config=out_mem_config, output_dtype=out_dtype)
        pt_out = tt2torch_tensor(out)

        # Same inputs, so any PCC other than 1 against the reference counts as non-deterministic
        _, output_pcc = comp_pcc(ref_out, pt_out, 1)
        if output_pcc != 1:
            nd_output_count += 1

        logger.debug(f"Output pcc={output_pcc}")

    print(f"Iterations with nd output: {nd_output_count}")
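The test above flags an iteration whenever the PCC against the first run is not exactly 1. As a rough illustration of what that metric measures, here is a minimal pure-Python sketch of a Pearson correlation coefficient. `pearson_cc` is a hypothetical helper written for this note; it is not the `comp_pcc` implementation from `models.utility_functions`, which may differ in details such as its return type.

```python
import math


def pearson_cc(ref, out):
    """Pearson correlation coefficient of two equal-length sequences of floats.

    Hypothetical helper for illustration only; not the comp_pcc implementation.
    """
    if len(ref) != len(out) or len(ref) < 2:
        raise ValueError("need two equal-length sequences with at least 2 elements")
    mean_ref = sum(ref) / len(ref)
    mean_out = sum(out) / len(out)
    # Covariance and variances, left unnormalized because the factors cancel
    cov = sum((r - mean_ref) * (o - mean_out) for r, o in zip(ref, out))
    var_ref = sum((r - mean_ref) ** 2 for r in ref)
    var_out = sum((o - mean_out) ** 2 for o in out)
    if var_ref == 0 or var_out == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_ref * var_out)


ref = [1.0, 2.0, 3.0, 4.0]
same = list(ref)                   # identical rerun
perturbed = [1.0, 2.0, 3.5, 4.0]   # one element differs from the reference
scaled = [2.0 * x for x in ref]    # differs everywhere, but perfectly correlated

print(pearson_cc(ref, same))
print(pearson_cc(ref, perturbed))
print(pearson_cc(ref, scaled))
```

Note that a correlation of 1 does not by itself imply bit-identical outputs: the `scaled` case is perfectly correlated with `ref` even though every element differs. An element-wise equality check is therefore a stricter determinism criterion than PCC.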
Additional context
- We get non-deterministic (nd) output ~13 times in a loop of 100 runs
- If we disable L1 accumulation, it drops to a couple of times in 1000 runs
- Slow dispatch doesn’t help
- Running at 500 MHz helps (idea from this issue, which seems to be related)
- Other seq lens have deterministic output
- @skhorasganiTT noticed the same behavior with the mlp h_t_4h matmul, which is also matmul 1D (issue). This is probably affecting the nd 7b demo hang, so putting P1 critical
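The counts quoted above come from rerunning the same op on the same inputs and comparing each result to the first one. That pattern can be sketched without any device dependency as below. All names here (`count_nondeterministic_runs`, `make_flaky_op`) are hypothetical, and the flaky op is a stand-in that corrupts every 8th call on purpose; its failure rate is arbitrary and does not model the hardware behavior described in this issue.

```python
def count_nondeterministic_runs(op, num_runs, equal=lambda a, b: a == b):
    """Call `op` once for a reference, then `num_runs` more times.

    Returns how many of the later calls produced a result that differs
    from the reference according to `equal`.
    """
    reference = op()
    mismatches = 0
    for _ in range(num_runs):
        if not equal(reference, op()):
            mismatches += 1
    return mismatches


def deterministic_op():
    # Always returns the same values
    return [x * 0.5 for x in range(8)]


def make_flaky_op(period=8):
    """Build a stand-in op that corrupts one element on every `period`-th call."""
    calls = {"n": 0}

    def flaky_op():
        calls["n"] += 1
        out = [x * 0.5 for x in range(8)]
        if calls["n"] % period == 0:
            out[3] += 1.0  # simulated corruption, purely illustrative
        return out

    return flaky_op


print(count_nondeterministic_runs(deterministic_op, 100))
print(count_nondeterministic_runs(make_flaky_op(), 100))
```

One caveat of this design is that the reference itself comes from a single run, so a corrupted first run would make every later correct run look like a mismatch. Comparing against a host-computed golden result avoids that, at the cost of needing a tolerance for reduced-precision formats.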
About this issue
- Original URL
- State: closed
- Created 3 months ago
- Comments: 24 (15 by maintainers)
Commits related to this issue
- #5249: Resolve nd output and hangs with 2048 seq len prefill - Observed nd behavior and hangs with 1d/2d matmuls (issue #7066); - Added determinism test to validate the issue is mitigated — committed to tenstorrent/tt-metal by s-jovic 3 months ago
- #7066: Add test to reproduce matmul with nd behavior — committed to tenstorrent/tt-metal by s-jovic 2 months ago
Hey guys, here are my findings so far, for both the falcon lm head and the matmul 1d non-determinism.
The hang/non-determinism stops occurring when the following stall is added in the matmul block loop:
TTI_STALLWAIT(p_stall::STALL_UNPACK, p_stall::UNPACK0);
If the stalls are added elsewhere in the kernel, the issue still persists. Adding a stallwait on the other unpack instance, UNPACK1, still causes non-deterministic issues.
In summary, since the issue occurs for an AICLK that is lower than what was required to fix the di/dt, this might be a real issue occurring from a thread not completing while executing new configs. I’ll keep investigating further and update if there is a fix.
@uaydonat, both have been tried. The controlled experiment on Reem’s machine was done just to increase the margin.
@pavlejosipovic tested with both the margin and the 1GHz clock (basically, the new FW) and got the same results, so it seems the new FW fixes these problems as well. It’s not super intuitive why, though, as PCC randomness and a very low passing point (620MHz) are unusual symptoms for a di/dt event (at least to me). I am taking this conversation further with the syseng team, which is more knowledgeable about this, and will update when there is more info.
Milos
Spoke to @uaydonat, who mentioned that this issue is non-blocking for the falcon7B demo at the moment and that a workaround exists. We will downgrade this to p1 for now as we continue to investigate.