tt-metal: [Bug Report] Matmul 1d/2d non-deterministic output and hangs when run in a loop

Describe the bug

We observe non-deterministic output of falcon_lm_head_matmul for Falcon 7B when it is run in a loop with random inputs.

To Reproduce

from loguru import logger

import tt_lib as ttl
from models.utility_functions import comp_pcc, tt2torch_tensor
import torch

def test_reproduce_lm_head_nd_32(
    device,
):
    in0_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.L1)
    in1_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.DRAM)
    out_mem_config = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.L1)

    in0_dtype = ttl.tensor.DataType.BFLOAT16
    in1_dtype = ttl.tensor.DataType.BFLOAT8_B
    out_dtype = ttl.tensor.DataType.BFLOAT16

    torch.manual_seed(1234)

    seq_len = 32
    a_shape = [1, 1, seq_len, 4544]
    b_shape = [1, 1, 4544, 65024]

    A = torch.randn(a_shape)
    B = torch.randn(b_shape) - 0.95

    a_t = ttl.tensor.Tensor(A, in0_dtype).to(ttl.tensor.Layout.TILE).to(device, in0_mem_config)
    b_t = ttl.tensor.Tensor(B, in1_dtype).to(ttl.tensor.Layout.TILE).to(device, in1_mem_config)
    bias_t = None

    out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, bias_t, output_mem_config=out_mem_config, output_dtype=out_dtype)

    # The first run's output is used as the reference for all subsequent runs
    ref_out = tt2torch_tensor(out)

    nd_output_count = 0

    # Re-run the same matmul with identical inputs and count iterations whose output differs from the reference
    for _ in range(100):
        # Free the previous output before re-running the op
        out.deallocate(True)
        out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, bias_t, output_mem_config=out_mem_config, output_dtype=out_dtype)

        pt_out = tt2torch_tensor(out)

        _, output_pcc = comp_pcc(ref_out, pt_out, 1)

        if output_pcc != 1:
            nd_output_count += 1

        logger.debug(f"Output pcc={output_pcc}")

    print(f"Iterations with nd output: {nd_output_count}")

Additional context

  • We get nd output ~13 times out of 100 runs
  • If we disable L1 accumulation, it drops to a couple of times in 1000 runs
  • Slow dispatch doesn't help
  • Running at 500 MHz helps (idea from this issue, which seems to be related)
  • Other seq lens have deterministic output (see the sweep sketch after this list)
  • @skhorasganiTT noticed the same behavior with the MLP h_t_4h matmul, which is also matmul 1D (issue); this is probably affecting the nd 7b demo hang, so marking this P1 critical
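
The seq_len observation could be double-checked with a parametrized sweep. Below is a hedged sketch that reuses the exact shapes, dtypes, memory configs, and device fixture from the repro above; the test name and the seq_len values in the parametrize list are illustrative assumptions, not the exact set that was tested.

import pytest
import torch

import tt_lib as ttl
from models.utility_functions import comp_pcc, tt2torch_tensor

@pytest.mark.parametrize("seq_len", [32, 64, 128, 256, 512, 1024])
def test_lm_head_nd_seq_len_sweep(device, seq_len):
    # Same configs as the repro above; only seq_len varies.
    l1_cfg = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.L1)
    dram_cfg = ttl.tensor.MemoryConfig(ttl.tensor.TensorMemoryLayout.INTERLEAVED, ttl.tensor.BufferType.DRAM)

    torch.manual_seed(1234)
    A = torch.randn([1, 1, seq_len, 4544])
    B = torch.randn([1, 1, 4544, 65024]) - 0.95

    a_t = ttl.tensor.Tensor(A, ttl.tensor.DataType.BFLOAT16).to(ttl.tensor.Layout.TILE).to(device, l1_cfg)
    b_t = ttl.tensor.Tensor(B, ttl.tensor.DataType.BFLOAT8_B).to(ttl.tensor.Layout.TILE).to(device, dram_cfg)

    out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, None, output_mem_config=l1_cfg, output_dtype=ttl.tensor.DataType.BFLOAT16)
    ref_out = tt2torch_tensor(out)

    nd_output_count = 0
    for _ in range(100):
        out.deallocate(True)
        out = ttl.tensor.falcon_lm_head_matmul(a_t, b_t, None, output_mem_config=l1_cfg, output_dtype=ttl.tensor.DataType.BFLOAT16)
        _, pcc = comp_pcc(ref_out, tt2torch_tensor(out), 1)
        if pcc != 1:
            nd_output_count += 1

    # Per the observation above, only seq_len == 32 is expected to fail this assert.
    assert nd_output_count == 0, f"seq_len={seq_len}: {nd_output_count}/100 iterations produced nd output"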

Most upvoted comments

Hey guys, here are my findings so far for both the falcon lm head and the matmul 1d non-determinism.

  1. Non-deterministic PCCs only stop occurring when AICLK is in the 500-620 MHz range at 0.95 V. Above 620 MHz, non-deterministic PCCs start occurring, and around 900-1000 MHz, hangs start occurring.
  2. Once stalls/cycles are added on the unpacker thread in the matmul loop:
    for (uint32_t inner_dim_idx = 0; inner_dim_idx < in0_block_w; ++inner_dim_idx) {
        // matmul outer product of (out_subblock_h x out_subblock_w) tiles that fill dst
        // accumulation is done by iterating matmul_block across the inner dim
        // in0_block_w is passed as the inner dim (kt) to matmul_block, internally used to stride in0
        matmul_block(in0_cb_id, in1_cb_id, in0_index, in1_index, dst_index, false, out_subblock_w, out_subblock_h, in0_block_w);
        in0_index++;                  // stride right by 1
        in1_index += in1_per_core_w;  // to stride down by 1, stride by in1_per_core_w (which should be called in1_block_w)
    }

The hang/non-determinism stops occurring. If the stalls are added elsewhere in the kernel, the issue still persists.

  3. For the matmul 1d hang, the issue stops occurring specifically when stalling with TTI_STALLWAIT(p_stall::STALL_UNPACK, p_stall::UNPACK0) in the matmul block loop; adding a stallwait on the other unpack instance, UNPACK1, still causes non-deterministic issues.

In summary, since the issue occurs at an AICLK lower than what was required to fix the di/dt problem, this might be a real issue caused by a thread not completing before new configs are executed. I'll keep investigating and will update if there is a fix.

@uaydonat, both have been tried. The controlled experiment on Reem's machine was done just to increase the margin.

@pavlejosipovic tested with both the margin and the 1 GHz clock (basically, the new FW) and got the same results, so it seems like the new FW fixes these problems as well, although it's not super intuitive why, since PCC randomness and a very low passing point (620 MHz) are unusual symptoms for a di/dt event (at least to me). I am taking this conversation further with the syseng team, which is more knowledgeable about this, and will update when there is more info.

Milos

Spoke to @uaydonat and he mentioned that this issue is non-blocking for the Falcon 7B demo at the moment and that a workaround exists. We will downgrade this to P1 for now as we continue to investigate.