iree: [llm perf] Slow kernel turbine_llm_mmtfp_3d_8640_3200_f32f16

// iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir
// iree-benchmark-module --module=turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb --function=turbine_llm_mmtfp_3d_8640_3200_f32f16 --input=4x128x3200xf32 --input=8640x3200xf16

#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
module {
  util.func public @turbine_llm_mmtfp_3d_8640_3200_f32f16(%arg0: tensor<?x?x3200xf32>, %arg1: tensor<8640x3200xf16>) -> tensor<?x?x8640xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?x3200xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?x3200xf32>
    %0 = tensor.empty(%dim) : tensor<?x8640x3200xf16>
    %1 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg1 : tensor<8640x3200xf16>) outs(%0 : tensor<?x8640x3200xf16>) {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    } -> tensor<?x8640x3200xf16>
    %2 = tensor.empty(%dim, %dim_0) : tensor<?x?x8640xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    %4 = linalg.batch_matmul_transpose_b ins(%arg0, %1 : tensor<?x?x3200xf32>, tensor<?x8640x3200xf16>) outs(%3 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    util.return %4 : tensor<?x?x8640xf32>
  }
}

Tested on CPU. Performance is at least an order of magnitude below expectations. Needs to be fast on all supported backends.

About this issue

  • State: open
  • Created 3 months ago
  • Comments: 19 (14 by maintainers)

Most upvoted comments

@pashu123 and I looked at the IR dump together today, and we found that the vector-level tile sizes are all set to 1 in the lowering_config. My intuition is that the logic in the elementwise op strategy selection is outdated. We used to tile dims with size=1 when there were dynamic shapes, because we did not have a vectorization strategy; it only worked with static shapes. Today we have tricks like peeling and masking, so we need to revisit it. Here are two action items from the discussion:

  1. Try different lowering_configs and look at the IR dumps and final code (maybe use [1, 1, 16] or [1, 2, 16] as vector-level tile sizes).
  2. Teach KernelDispatch.cpp to produce such a config.
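
As a rough starting point for item 1, attaching the config to the broadcast op could look something like the sketch below. This is a hypothetical illustration only: the exact attribute syntax and the number of tiling levels should be copied from the pipeline_tests.mlir example linked further down, and the tile sizes shown are just one of the candidates above.

```mlir
// Hypothetical sketch -- take the exact translation_info/lowering_config
// syntax from the linked pipeline_tests.mlir test. First level is the
// distribution tiling (sizes here are placeholders); second level is the
// vector-level tiling we want to experiment with ([1, 2, 16] instead of 1s).
%1 = linalg.generic {
    indexing_maps = [#map, #map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } ins(%arg1 : tensor<8640x3200xf16>)
    outs(%0 : tensor<?x8640x3200xf16>)
    attrs = {lowering_config = #iree_codegen.lowering_config<
      tile_sizes = [[1, 64, 64], [1, 2, 16], [0, 0, 0]]>} {
^bb0(%in: f16, %out: f16):
  linalg.yield %in : f16
} -> tensor<?x8640x3200xf16>
```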

To quickly iterate on item 1, we can preset translation_info and lowering_config on the op. E.g., see the example below and run iree-opt --pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target))' repro.mlir

https://github.com/openxla/iree/blob/bd1b10626cb02d3d6c05f67977d1800020203b40/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir#L300-L322

side note: please remember to update hal.executable.target in your experiments.

What CPU are you measuring on? Here on AMD 7950X3D, setting 1 thread (to be able to make sense of single-thread performance on this CPU) I get items_per_second=2.66772/s, which amounts to 75 Gflop/s (counting each multiply-add as two ops as usual).
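
For reference, that 75 Gflop/s figure follows from the benchmark shapes (the --input flags above give batch=4, M=128, K=3200, N=8640) and the reported items_per_second; a quick sanity check:

```python
# Sanity-check the Gflop/s figure from the reported items_per_second.
# Shapes come from the benchmark inputs: 4x128x3200 (f32) x 8640x3200 (f16).
batch, m, k, n = 4, 128, 3200, 8640

# Each output element needs k multiply-adds; count each as 2 ops, as usual.
flops_per_item = 2 * batch * m * n * k

items_per_second = 2.66772  # reported by iree-benchmark-module
gflops = flops_per_item * items_per_second / 1e9
print(f"{gflops:.1f} Gflop/s")  # ~75.5, matching the reported figure
```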

+1, I wonder about the target CPU as well.

Thanks @bjacob for the great analysis! A potential performance bug could be in packing on f16 types. I have been working on pack codegen on and off, but the work scoped in #16314 is not finished yet. So +1 on what Benoit suggested. We need to profile this with Tracy.

@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.

Yes. Already spoke to @pashu123 about this. He is going to start looking into it.
