iree: [llm perf] Slow kernel turbine_llm_mmtfp_3d_8640_3200_f32f16
// iree-compile --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host -o turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb turbine_llm_mmtfp_3d_8640_3200_f32f16.mlir
// iree-benchmark-module --module=turbine_llm_mmtfp_3d_8640_3200_f32f16_cpu.vmfb --function=turbine_llm_mmtfp_3d_8640_3200_f32f16 --input=4x128x3200xf32 --input=8640x3200xf16
#map = affine_map<(d0, d1, d2) -> (d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
module {
  util.func public @turbine_llm_mmtfp_3d_8640_3200_f32f16(%arg0: tensor<?x?x3200xf32>, %arg1: tensor<8640x3200xf16>) -> tensor<?x?x8640xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %dim = tensor.dim %arg0, %c0 : tensor<?x?x3200xf32>
    %dim_0 = tensor.dim %arg0, %c1 : tensor<?x?x3200xf32>
    %0 = tensor.empty(%dim) : tensor<?x8640x3200xf16>
    %1 = linalg.generic {indexing_maps = [#map, #map1], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg1 : tensor<8640x3200xf16>) outs(%0 : tensor<?x8640x3200xf16>) {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    } -> tensor<?x8640x3200xf16>
    %2 = tensor.empty(%dim, %dim_0) : tensor<?x?x8640xf32>
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    %4 = linalg.batch_matmul_transpose_b ins(%arg0, %1 : tensor<?x?x3200xf32>, tensor<?x8640x3200xf16>) outs(%3 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
    util.return %4 : tensor<?x?x8640xf32>
  }
}
Tested on CPU. Performance is at least an order of magnitude below expectations. Needs to be fast on all supported backends.
About this issue
- State: open
- Created 3 months ago
- Comments: 19 (14 by maintainers)
Commits related to this issue
- [CPU][Codegen] Update vector tile size for pack op This selects required tile sizes in case the operand is of f16 dtype. We select the tile size to be 16, so it hits the 16x16 vector.transpose implem... — committed to pashu123/iree by pashu123 2 months ago
- [CPU][Codegen] Update vector tile size for pack op (#17169) -- This selects required tile sizes if the operand is of f16 type. We choose the tile size to be 16, so it hits the 16x16 vector.tran... — committed to iree-org/iree by pashu123 2 months ago
@pashu123 and I looked at the IR dump together today, and we found that the vector-level tile sizes are all set to 1 in the lowering_config. My intuition is that the logic in the elementwise-op strategy selection is outdated: we used to tile dims with size=1 when there were dynamic shapes, because we did not have a vectorization strategy for them, so it only worked with static shapes. Today we have peeling, masking, etc., so we need to revisit it. Here are two action items after the discussion:
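For context, a lowering_config with unit vector tiles looks roughly like the sketch below. Note this is an illustration only: the distribution-level sizes and the exact number of tiling levels are made up, not copied from the dump. The point is the last bracket, which is the vector-level tiling.

```mlir
// Hypothetical illustration: outer (distribution) tile sizes are made up.
// A vector-level tile list of all 1s (last bracket) means each vector
// "tile" holds a single element, so no SIMD lanes get used.
#config_scalar = #iree_codegen.lowering_config<tile_sizes = [[64, 64, 0], [1, 1, 1]]>

// What we would want instead: non-unit inner tiles (e.g. 16 along the
// vectorized dims), with peeling/masking handling the dynamic remainder.
#config_vector = #iree_codegen.lowering_config<tile_sizes = [[64, 64, 0], [8, 16, 16]]>
```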
To quickly iterate on item 1, we can preset translation_info and lowering_config on the op. E.g., see the example below and run iree-opt --pass-pipeline='builtin.module(iree-llvmcpu-select-lowering-strategy, func.func(iree-llvmcpu-lower-executable-target))' repro.mlir
https://github.com/openxla/iree/blob/bd1b10626cb02d3d6c05f67977d1800020203b40/compiler/src/iree/compiler/Codegen/LLVMCPU/test/pipeline_tests.mlir#L300-L322
Side note: please remember to update hal.executable.target in your experiments.

Yes, I already spoke to @pashu123 about this. He is going to start looking into it.
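As a rough sketch of what such a preset could look like on this op (the pipeline name and tile sizes below are illustrative guesses, not a tuned configuration; the linked pipeline_tests.mlir is the authoritative example), a compilation_info attribute can carry both pieces:

```mlir
// Illustrative only: attribute values are assumptions to show the
// mechanism, not tuned for this kernel.
#config = #iree_codegen.lowering_config<tile_sizes = [[1, 64, 64, 0], [1, 8, 16, 0], [0, 0, 0, 16]]>
#translation = #iree_codegen.translation_info<CPUDoubleTilingExpert>
#compilation = #iree_codegen.compilation_info<
    lowering_config = #config, translation_info = #translation>

// Attach the preset to the op so strategy selection honors it instead
// of picking its own (unit-tile) configuration.
%4 = linalg.batch_matmul_transpose_b {compilation_info = #compilation}
       ins(%arg0, %1 : tensor<?x?x3200xf32>, tensor<?x8640x3200xf16>)
       outs(%3 : tensor<?x?x8640xf32>) -> tensor<?x?x8640xf32>
```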
+1, I wonder what the target CPU is as well.
Thanks @bjacob for the great analysis! A potential performance bug could be in the packing of f16 types. I have been working on pack codegen on and off, but the work scoped in https://github.com/openxla/iree/issues/16314 is not finished yet. So +1 on what Benoit suggested; we need to profile this with Tracy.
@MaheshRavishankar is this one of the tasks that you mentioned @pashu123 could pick up? If so, he can start with what Benoit suggested.