iree: [llm perf] Slow kernel turbine_llm_mmtfp_3d_8640_3200_f32f16

// iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir
// iree-benchmark-module --module=turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb --function=turbine_llm_mmtfp_3d_8640_3200_f32f16 --input=4x128x3200xf32 --input=8640x3200xf16

#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
module {
  util.func public @turbine_llm_mmtfp_3d_8640_3200_f32f16(%arg0: tensor<?x?x3200xf32>, %arg1: tensor<8640x3200xf16>) -> tensor<?x?x8640xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?x3200xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?x3200xf32>
    %0 = tensor.empty(%dim) : tensor<?x8640x3200xf16>
    %1 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg1 : tensor<8640x3200xf16>) outs(%0 : tensor<?x8640x3200xf16>) {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    } -> tensor<?x8640x3200xf16>
    %2 = tensor.empty(%dim, %dim_0) : tensor<?x?x8640xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    %4 = linalg.batch_matmul_transpose_b ins(%arg0, %1 : tensor<?x?x3200xf32>, tensor<?x8640x3200xf16>) outs(%3 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    util.return %4 : tensor<?x?x8640xf32>
  }
}

Tested on CPU. Performance is at least an order of magnitude below expectations. Needs to be fast on all supported backends.

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 19 (14 by maintainers)

Most upvoted comments

@pashu123 and I looked at the IR dump together today, and we found that the vector-level tile sizes are all set to 1 in the lowering_config. My intuition is that the logic in the elementwise op strategy selection is outdated. We used to tile dims with size=1 when there were dynamic shapes, because we did not have a vectorization strategy; it only worked with static shapes. Today we have tricks like peeling and masking, so we need to revisit it. Here are two action items from the discussion:

  1. Try different lowering_configs and look at the IR dumps and final code (maybe use [1, 1, 16] or [1, 2, 16] as vector-level tile sizes).
  2. Teach KernelDispatch.cpp to produce such a config.
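
As a rough starting point for item 1, attaching the config to the broadcast op could look something like the sketch below. This is a hypothetical illustration only: the exact attribute syntax and the number of tiling levels should be copied from the pipeline_tests.mlir example linked further down, and the tile sizes shown are just one of the candidates above.

```mlir
// Hypothetical sketch -- take the exact translation_info/lowering_config
// syntax from the linked pipeline_tests.mlir test. First level is the
// distribution tiling (sizes here are placeholders); second level is the
// vector-level tiling we want to experiment with ([1, 2, 16] instead of 1s).
%1 = linalg.generic {
    indexing_maps = [#map, #map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } ins(%arg1 : tensor<8640x3200xf16>)
    outs(%0 : tensor<?x8640x3200xf16>)
    attrs = {lowering_config = #iree_codegen.lowering_config<
      tile_sizes = [[1, 64, 64], [1, 2, 16], [0, 0, 0]]>} {
^bb0(%in: f16, %out: f16):
  linalg.yield %in : f16
} -> tensor<?x8640x3200xf16>
```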

To quickly iterate on item 1, we can preset translation_info and lowering_config on the op. E.g., see the example below and run iree-opt --pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target))' repro.mlir

https://github.com/openxla/iree/blob/bd1b10626cb02d3d6c05f67977d1800020203b40/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir#L300-L322

side note: please remember to update hal.executable.target in your experiments.

What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU) I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops as usual).
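
For reference, that 75 Gflop/s figure follows from the benchmark shapes (the --input flags above give batch=4, M=128, K=3200, N=8640) and the reported items_per_second; a quick sanity check:

```python
# Sanity-check the Gflop/s figure from the reported items_per_second.
# Shapes come from the benchmark inputs: 4x128x3200 (f32) x 8640x3200 (f16).
batch, m, k, n = 4, 128, 3200, 8640

# Each output element needs k multiply-adds; count each as 2 ops, as usual.
flops_per_item = 2 * batch * m * n * k

items_per_second = 2.66772  # reported by iree-benchmark-module
gflops = flops_per_item * items_per_second / 1e9
print(f"{gflops:.1f} Gflop/s")  # ~75.5, matching the reported figure
```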

+1, I wonder about the target CPU as well.

Thanks @bjacob for the great analysis! A potential performance bug could be in packing on f16 types. I have been working on pack codegen on and off, but the work scoped in #16314 is not finished yet. So +1 on what Benoit suggested. We need to profile this with Tracy.

@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.

Yes. Already spoke to @pashu123 about this. He is going to start looking into it.
