tt-metal: ttnn.repeat throws runtime memory error

Describe the bug When trying to expand a tensor of shape [1, 1, 32, 16] to [1, 1, 32, 16*2048] using ttnn.repeat, the kernel fails with the following error:

FATAL | 32KB runtime args targeting kernel reader_concat_interleaved_start_id on (x=0,y=0) are too large. Cannot be written as they will run into memory region reserved for result. Max allowable size is 1 KB.

However, the same output tensor can be created successfully using the ttnn.repeat_interleave API.

To Reproduce Run the test code below in your tt-metal environment:

import ttnn
import torch
from tests.ttnn.utils_for_testing import assert_with_pcc


device = ttnn.open_device(device_id=0)
torch_input_tensor = torch.randn((1, 1, 32, 16), dtype=torch.bfloat16)
repeat_shape = (1, 1, 1, 2048)

torch_result_repeat_interleave = torch_input_tensor.repeat_interleave(2048, dim=3)
torch_result_repeat = torch_input_tensor.repeat(repeat_shape)

# Dummy tensor used only to provide the ttnn shape passed to ttnn.repeat below.
repeat_tensor = torch.randn(repeat_shape, dtype=torch.bfloat16)
repeat_tensor = ttnn.from_torch(repeat_tensor, layout=ttnn.TILE_LAYOUT, device=device)

input_tensor = ttnn.from_torch(torch_input_tensor, layout=ttnn.TILE_LAYOUT, device=device)

repeat_interleaved_output = ttnn.repeat_interleave(input_tensor, 2048, dim=3)
repeat_interleaved_output = ttnn.to_torch(repeat_interleaved_output)
print(repeat_interleaved_output.shape)
assert_with_pcc(torch_result_repeat_interleave, repeat_interleaved_output, 0.9999)
print("Repeat Interleave Passed")

repeat_output = ttnn.repeat(input_tensor, repeat_tensor.shape)
repeat_output = ttnn.to_torch(repeat_output)
print(repeat_output.shape)
# ttnn.repeat tiles the tensor, so its output should be checked against
# torch.Tensor.repeat, not against the repeat_interleave reference.
assert_with_pcc(torch_result_repeat, repeat_output, 0.9999)
print("Repeat Passed")

ttnn.close_device(device)

Expected behavior If ttnn.repeat_interleave passes, then ttnn.repeat is also expected to work.
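
For reference, here is a minimal pure-torch sketch (a toy example added for clarity, not part of the original report) of the two semantics the asserts above rely on: repeat tiles the whole tensor, while repeat_interleave repeats each element in place.

import torch

x = torch.tensor([[1, 2, 3]])                 # shape [1, 3]

tiled = x.repeat(1, 2)                        # tiles the whole row: [[1, 2, 3, 1, 2, 3]]
interleaved = x.repeat_interleave(2, dim=1)   # repeats each element: [[1, 1, 2, 2, 3, 3]]

# Same output shape, different element order, so ttnn.repeat should be
# compared against torch_result_repeat and ttnn.repeat_interleave against
# torch_result_repeat_interleave.
print(tiled.shape, interleaved.shape)         # torch.Size([1, 6]) torch.Size([1, 6])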

Screenshots A screenshot of the failure (dated 2024-03-13) was attached to the original issue.

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 26 (23 by maintainers)


Most upvoted comments

@tarafdarTT Yes, I think we can target early or mid next week and lower it to P1.

I think the P0 status made sense so far because we didn't have an owner and the triage chain isn't complete. Jim is pretty adamant about identifying the next owner for this set of important tasks so we don't lose time.

I see. I'm not entirely familiar with why ttnn does additional reshaping under the hood and need to look into it, but my test with the tt_lib version passes for your shape case, so it might just be something in ttnn that needs updating.

If there is a non-performant workaround, can this be bumped to P1 instead of P0?

Okay, I think this makes sense. We can try to move forward with my suggestions above on implementing interleaved repeat to unblock functionality/basic perf, and once you have the shard specs we can review the requirements for sharded repeat; otherwise, we can look into more perf optimizations for interleaved repeat.
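
As a stopgap until the device op is fixed, one possible non-performant workaround (a sketch of my own, not confirmed in this thread; repeat_on_host is a hypothetical helper) is to round-trip through host memory and let torch do the repeat:

import torch
import ttnn

def repeat_on_host(tensor, repeat_shape, device):
    # Hypothetical helper: round-trips through host memory, so it is slow,
    # but it sidesteps the runtime-args limit hit by the device kernel.
    host = ttnn.to_torch(tensor)
    repeated = host.repeat(repeat_shape)  # torch's tiling repeat
    return ttnn.from_torch(repeated, layout=ttnn.TILE_LAYOUT, device=device)

# Example usage with the shapes from this issue:
# repeat_output = repeat_on_host(input_tensor, (1, 1, 1, 2048), device)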

@tt-aho Both issues are similar, but they differ in how the broadcasting happens. In this issue the last dim of the tensor (1, 1, 32, 16) is 16 and is being broadcast, whereas in issue #5769 the weights have batch_dim = 1 and are being broadcast to the input batch_size of 32. In other words, this repeat is many-to-many while #5769 is one-to-many. We don't have shard specs right now as we are operating in L1 interleaved.
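
To illustrate the distinction in plain torch (a toy example; the exact shapes for #5769 are assumed here, not taken from that issue):

import torch

# This issue: "many-to-many" repeat. The last dim is already 16 and all of
# those 16 values are tiled out to reach 16 * 2048.
x = torch.randn(1, 1, 32, 16)
many_to_many = x.repeat(1, 1, 1, 2048)   # [1, 1, 32, 32768]

# Issue #5769: "one-to-many" broadcast. A weight with batch dim 1 is
# expanded to match an input batch size of 32.
w = torch.randn(1, 1, 32, 16)
one_to_many = w.expand(32, 1, 32, 16)    # [32, 1, 32, 16], no data copy

print(many_to_many.shape, one_to_many.shape)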

@kpaigwar Hello, just to be sure, did you mean to tag @KalaivaniMCW?