iree: Llama-3-8B f16 fails to compile to vmfb

batch_llama_3_8B.zip

What happened?

When trying to compile this MLIR file, I get the shared-memory and set_encoding errors below:

failed to translate executables
failed to translate executables
failed to translate executables
result_llama_3_v4.mlir:352:7: error: 'func.func' op uses -46137344 bytes of shared memory; exceeded the limit of 65536 bytes
      func.func @prefill_bs4$async_dispatch_1_generic_4xDx4096_i64xf32(%arg0: !flow.dispatch.tensor<readonly:tensor<128256x4096xf16>>, %arg1: !flow.dispatch.tensor<readonly:tensor<4x?xi64>>, %arg2: index, %arg3: !flow.dispatch.tensor<writeonly:tensor<4x?x4096xf32>>) {
      ^
result_llama_3_v4.mlir:346:3: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx940", ukernels = "none"}>
  flow.executable private @prefill_bs4$async_dispatch_1 {
  ^
result_llama_3_v4.mlir:449:14: error: 'iree_linalg_ext.set_encoding' op unhandled tensor operation
        %7 = linalg.batch_matmul_transpose_b ins(%3, %4 : tensor<4x?x4096xf32>, tensor<4x4096x4096xf16>) outs(%6 : tensor<4x?x4096xf32>) -> tensor<4x?x4096xf32>
             ^
result_llama_3_v4.mlir:440:7: error: 'func.func' op failed to create tensor equivalance classes
      func.func @prefill_bs4$async_dispatch_4_batch_matmul_transpose_b_4xDx4096x4096_f32xf16xf32(%arg0: !flow.dispatch.tensor<readonly:tensor<4x?x4096xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<4x4096x4096xf16>>, %arg2: index, %arg3: !flow.dispatch.tensor<writeonly:tensor<4x?x4096xf32>>) {
      ^
result_llama_3_v4.mlir:434:3: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx940", ukernels = "none"}>
  flow.executable private @prefill_bs4$async_dispatch_4 {
  ^
result_llama_3_v4.mlir:504:14: error: 'iree_linalg_ext.set_encoding' op unhandled tensor operation
        %7 = linalg.batch_matmul_transpose_b ins(%3, %4 : tensor<4x?x4096xf32>, tensor<4x1024x4096xf16>) outs(%6 : tensor<4x?x1024xf32>) -> tensor<4x?x1024xf32>
             ^
result_llama_3_v4.mlir:495:7: error: 'func.func' op failed to create tensor equivalance classes
      func.func @prefill_bs4$async_dispatch_7_batch_matmul_transpose_b_4xDx1024x4096_f32xf16xf32(%arg0: !flow.dispatch.tensor<readonly:tensor<4x?x4096xf32>>, %arg1: !flow.dispatch.tensor<readonly:tensor<4x1024x4096xf16>>, %arg2: index, %arg3: !flow.dispatch.tensor<writeonly:tensor<4x?x1024xf32>>) {
      ^
result_llama_3_v4.mlir:489:3: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mma_layout<MFMA_F16_16x16x16_F32>, #iree_gpu.mma_layout<MFMA_F16_32x32x8_F32>], target_arch = "gfx940", ukernels = "none"}>
  flow.executable private @prefill_bs4$async_dispatch_7 {
  ^

Steps to reproduce your issue

  1. Cherry pick iree#17182
  2. Cherry pick llvm-project#90141
  3. …/iree-build/tools/iree-compile --mlir-disable-threading --iree-opt-const-eval=false --compile-to=flow …/batch_llama_3_8B.mlir -o result_llama_3.mlir
  4. …/iree-build/tools/iree-compile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=rocm --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-rocm-target-chip=gfx940 --iree-global-opt-propagate-transposes=true --iree-opt-const-eval=false --iree-rocm-bc-dir=/opt/rocm/amdgcn/bitcode result_llama_3.mlir -o llama_3.vmfb
  5. Compilation fails with the errors shown above.

What component(s) does this issue relate to?

No response

Version information

f2746b464fb056ddadef4315654d59f727e4c9b0

Additional context

No response

About this issue

  • Original URL
  • State: open
  • Created 2 months ago
  • Comments: 32 (25 by maintainers)

Commits related to this issue

Most upvoted comments

With that said, moving the cast across the embedding lookup is a common optimization.

I’m a bit worried that the default path on this generates basically unusable code, though.

The more I think about this, the more it might be worth just doing the fusion of

 %15 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11 : tensor<128256x4096xf16>) outs(%14 : tensor<128256x4096xf32>) {
  ^bb0(%in: f16, %out: f32):
    %17 = arith.extf %in : f16 to f32
    linalg.yield %17 : f32
  } -> tensor<128256x4096xf32>
  %16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12 : tensor<4x?xi64>) outs(%13 : tensor<4x?x4096xf32>) {
  ^bb0(%in: i64, %out: f32):
    %17 = arith.index_cast %in : i64 to index
    %18 = linalg.index 2 : index
    %extracted = tensor.extract %15[%17, %18] : tensor<128256x4096xf32>
    linalg.yield %extracted : f32
  } -> tensor<4x?x4096xf32>

to

%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2 : tensor<4x?xi64>) outs(%3 : tensor<4x?x4096xf32>) {
    ^bb0(%in: i64, %out: f32):
      %9 = arith.index_cast %in : i64 to index
      %10 = linalg.index 2 : index
      %extracted = tensor.extract %5[%9, %10] : tensor<128256x4096xf16>
      %extracted_f32 = arith.extf %extracted : f16 to f32
      linalg.yield %extracted_f32 : f32
    } -> tensor<4x?x4096xf32>

as a one-off canonicalization for now so we don't fall off a cliff. It might be hard to make it future-proof, but more examples will help. @IanWood1 just FYI for something for us to discuss (and for you to pick up as a simple task). Please make sure we chat about this next time we sync.
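A minimal sketch of the proposed fusion in plain Python (toy shapes, hypothetical helper names, not IREE code): the unfused pair of generics widens the whole table before gathering, while the fused generic widens only the elements it actually reads; the results are identical.

```python
# Toy model (plain Python) of the two MLIR forms above: `unfused`
# mirrors the extf-generic followed by the gather-generic, `fused`
# mirrors the single generic that extracts from the f16 table and
# extends just the extracted element.

def unfused(table, ids):
    # First generic: elementwise extf over the full table
    # (materializes a widened copy of every row).
    wide = [[float(x) for x in row] for row in table]
    # Second generic: gather rows from the widened table.
    return [wide[i] for i in ids]

def fused(table, ids):
    # Single generic: tensor.extract from the narrow table,
    # then extf on only the extracted element.
    return [[float(x) for x in table[i]] for i in ids]

table = [[1, 2], [3, 4], [5, 6]]  # stand-in for tensor<128256x4096xf16>
ids = [2, 0, 2]                   # stand-in for tensor<4x?xi64>
assert unfused(table, ids) == fused(table, ids)
```

The observable values match either way; the difference is that the fused form never materializes the widened table as an intermediate tensor.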

Not sure why this keeps closing

@pashu123 put a “fixes” command in a commit message and now anyone who has write access to the repo will close it when they merge in that commit to their forks of whatever 😛 aartbik/torch-mlir@8c48135

Why is GitHub unable to prevent actions on forks from spamming main repos… Seems like a big anti-feature.

Agreed on handling this even if not generalized, as it's pretty catastrophic to clone embeddings.
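For scale, a back-of-the-envelope calculation (plain Python, using the 128256x4096 table shape from the error log above) of what cloning the embedding table costs:

```python
# Rough size of the Llama-3-8B embedding table and of the f32 clone
# that the unfused lowering would materialize before the lookup.
vocab, d_model = 128256, 4096
f16_gib = vocab * d_model * 2 / 2**30  # original f16 table, ~0.98 GiB
f32_gib = vocab * d_model * 4 / 2**30  # extra f32 clone, ~1.96 GiB
print(f"f16 table: {f16_gib:.2f} GiB, f32 clone: {f32_gib:.2f} GiB")
```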

I think the more durable fix may be proper propagation: we should sink any exts down / hoist truncs up across memcpy-like ops (such as this gather or a scatter). With the current logic we may be in a better situation, but we still want to ensure we don't materialize ext/trunc dispatches unless absolutely required.
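The legality of that propagation can be sketched in plain Python (hypothetical helper names, not IREE code): a memcpy-like op only moves elements without computing on them, so an elementwise cast commutes with it.

```python
# A memcpy-like op (scatter here) is pure data movement, so an
# elementwise extension applied after it equals applying the
# extension to the operands first. This commutativity is what makes
# sinking exts / hoisting truncs across such ops sound.

def scatter(dest, indices, updates):
    # Pure data movement: write updates[k] at position indices[k].
    out = list(dest)
    for i, u in zip(indices, updates):
        out[i] = u
    return out

def ext(xs):
    # Stand-in for an elementwise widening cast (e.g. arith.extf).
    return [float(x) for x in xs]

dest, idx, upd = [1, 2, 3, 4], [0, 2], [9, 7]
# Casting after the scatter equals scattering already-cast operands,
# so no ext dispatch needs to materialize on the large tensor.
assert ext(scatter(dest, idx, upd)) == scatter(ext(dest), idx, ext(upd))
```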

Confirmed that the fusion is not expected. @MaheshRavishankar will fix it.

For the gather codegen issue, @pashu123 could you create an input case for the generic op and see what's happening? I'm expecting that some dimensions would be collapsed, and the next issue could be tile size selection. https://github.com/iree-org/iree/pull/17227 could help, but there could be other issues remaining on the table.

  %16 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%12 : tensor<4x?xi64>) outs(%13 : tensor<4x?x4096xf32>) {
  ^bb0(%in: i64, %out: f32):
    %17 = arith.index_cast %in : i64 to index
    %18 = linalg.index 2 : index
    %extracted = tensor.extract %15[%17, %18] : tensor<128256x4096xf32>
    linalg.yield %extracted : f32
  } -> tensor<4x?x4096xf32>

I think there are still action items in this issue; the look-up table fusion is scaring me. We should fix that at least. The tile sizes for vector.gather are also problematic: they will be fully unrolled, which looks really bad.

I never intended to close the issue; I don’t know if it got closed automatically. Yes, for the mixed precision case in which we have activations represented as f32, we still have action items to do.

It looks like it failed in SetEncoding (or related passes). @pashu123 given that you want to get more involved in these tasks, would you like to triage the issue when you’re available?