tt-metal: ttnn.repeat throws runtime memory error

Describe the bug When trying to expand a tensor of shape [1, 1, 32, 16] to [1, 1, 32, 16*2048] using ttnn.repeat, the kernel fails with the following error:

FATAL | 32KB runtime args targeting kernel reader_concat_interleaved_start_id on (x=0,y=0) are too large. Cannot be written as they will run into memory region reserved for result. Max allowable size is 1 KB.

However, the same output tensor can be created successfully using the ttnn.repeat_interleave API.

To Reproduce Run the test code below in your tt-metal environment:

import ttnn
import torch
from tests.ttnn.utils_for_testing import assert_with_pcc


device = ttnn.open_device(device_id=0)
torch_input_tensor = torch.randn((1, 1, 32, 16), dtype=torch.bfloat16)
repeat_shape = (1, 1, 1, 2048)

torch_result_repeat_interleave = torch_input_tensor.repeat_interleave(2048, dim=3)
torch_result_repeat = torch_input_tensor.repeat(repeat_shape)

# Dummy tensor used only to provide the ttnn shape passed to ttnn.repeat below.
repeat_tensor = torch.randn(repeat_shape, dtype=torch.bfloat16)
repeat_tensor = ttnn.from_torch(repeat_tensor, layout=ttnn.TILE_LAYOUT, device=device)

input_tensor = ttnn.from_torch(torch_input_tensor, layout=ttnn.TILE_LAYOUT, device=device)

repeat_interleaved_output = ttnn.repeat_interleave(input_tensor, 2048, dim=3)
repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
print(repeat_interleaved_output.shape)
assert_with_pcc(torch_result_repeat_interleave, repeat_interleaved_output, 0.9999)
print("Repeat Interleave Passed")

repeat_output = ttnn.repeat(input_tensor, repeat_tensor.shape)
repeat_output = ttnn.to_torch(repeat_output)
print(repeat_output.shape)
# ttnn.repeat tiles the tensor, so its output should be checked against
# torch.Tensor.repeat, not against the repeat_interleave reference.
assert_with_pcc(torch_result_repeat, repeat_output, 0.9999)
print("Repeat Passed")

ttnn.close_device(device)

Expected behavior If ttnn.repeat_interleave passes, then ttnn.repeat is also expected to work.
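
For reference, here is a minimal pure-torch sketch (a toy example added for clarity, not part of the original report) of the two semantics the asserts above rely on: repeat tiles the whole tensor, while repeat_interleave repeats each element in place.

import torch

x = torch.tensor([[1, 2, 3]])                 # shape [1, 3]

tiled = x.repeat(1, 2)                        # tiles the whole row: [[1, 2, 3, 1, 2, 3]]
interleaved = x.repeat_interleave(2, dim=1)   # repeats each element: [[1, 1, 2, 2, 3, 3]]

# Same output shape, different element order, so ttnn.repeat should be
# compared against torch_result_repeat and ttnn.repeat_interleave against
# torch_result_repeat_interleave.
print(tiled.shape, interleaved.shape)         # torch.Size([1, 6]) torch.Size([1, 6])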

Screenshots A screenshot of the failure (dated 2024-03-13) was attached to the original issue.

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 26 (23 by maintainers)


Most upvoted comments

@tarafdarTT Yes, I think we can target early or mid next week and lower it to P1.

I think the P0 status made sense so far because we didn't have an owner and the triage chain isn't complete. Jim is pretty adamant about identifying the next owner for this set of important tasks so we don't lose time.

I see. I'm not entirely familiar with why ttnn does additional reshaping under the hood and need to look into it, but my test with the tt_lib version passes for your shape case, so it might just be something in ttnn that needs updating.

If there is a non-performant workaround, can this be bumped to P1 instead of P0?

Okay, I think this makes sense. We can try to move forward with my suggestions above on implementing interleaved repeat to unblock functionality/basic perf, and once you have the shard specs we can review the requirements for sharded repeat; otherwise, we can look into more perf optimizations for interleaved repeat.
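
As a stopgap until the device op is fixed, one possible non-performant workaround (a sketch of my own, not confirmed in this thread; repeat_on_host is a hypothetical helper) is to round-trip through host memory and let torch do the repeat:

import torch
import ttnn

def repeat_on_host(tensor, repeat_shape, device):
    # Hypothetical helper: round-trips through host memory, so it is slow,
    # but it sidesteps the runtime-args limit hit by the device kernel.
    host = ttnn.to_torch(tensor)
    repeated = host.repeat(repeat_shape)  # torch's tiling repeat
    return ttnn.from_torch(repeated, layout=ttnn.TILE_LAYOUT, device=device)

# Example usage with the shapes from this issue:
# repeat_output = repeat_on_host(input_tensor, (1, 1, 1, 2048), device)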

@tt-aho Both issues are similar, but they differ in how the broadcasting happens. In this issue the last dim of the tensor (1, 1, 32, 16) is 16 and is being broadcast, whereas in issue #5769 the weights have batch_dim = 1 and are being broadcast to the input batch_size of 32. In other words, this repeat is many-to-many while #5769 is one-to-many. We don't have shard specs right now as we are operating in L1 interleaved.
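
To illustrate the distinction in plain torch (a toy example; the exact shapes for #5769 are assumed here, not taken from that issue):

import torch

# This issue: "many-to-many" repeat. The last dim is already 16 and all of
# those 16 values are tiled out to reach 16 * 2048.
x = torch.randn(1, 1, 32, 16)
many_to_many = x.repeat(1, 1, 1, 2048)   # [1, 1, 32, 32768]

# Issue #5769: "one-to-many" broadcast. A weight with batch dim 1 is
# expanded to match an input batch size of 32.
w = torch.randn(1, 1, 32, 16)
one_to_many = w.expand(32, 1, 32, 16)    # [32, 1, 32, 16], no data copy

print(many_to_many.shape, one_to_many.shape)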

@kpaigwar Hello, just to be sure, did you mean to tag @KalaivaniMCW?