iree: [CPU] i8mm i8*i8=i32 ukernel is slower than dotprod
What happened?
On a simple matmul:
```mlir
func.func @matmul(%lhs: tensor<197x768xi8>, %rhs: tensor<768x768xi8>, %acc: tensor<197x768xi32>) -> tensor<197x768xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<197x768xi8>, tensor<768x768xi8>) outs(%acc: tensor<197x768xi32>) -> tensor<197x768xi32>
  return %result: tensor<197x768xi32>
}
```
The +dotprod ukernel performs faster than the +i8mm ukernel:
| | i8mm latency (ms) | dotprod latency (ms) |
|---|---|---|
| 1 thread, no distribution | 7.9 | 6.3 |
| 1 thread, distribution | 10.3 | 6.1 |

| | i8mm latency (ms) | dotprod latency (ms) |
|---|---|---|
| 5 threads, no distribution | 21.8 | 19.7 |
| 5 threads, distribution | 6.3 | 3.5 |
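To quantify the gap, here is a quick sketch (not part of the original report) that computes the i8mm-vs-dotprod latency ratios from the tables above:

```python
# Latencies (ms) transcribed from the tables above: (i8mm, dotprod).
results = {
    "1 thread, no distribution":  (7.9, 6.3),
    "1 thread, distribution":     (10.3, 6.1),
    "5 threads, no distribution": (21.8, 19.7),
    "5 threads, distribution":    (6.3, 3.5),
}
for config, (i8mm_ms, dotprod_ms) in results.items():
    # Ratio > 1 means the i8mm ukernel is slower.
    print(f"{config}: i8mm is {i8mm_ms / dotprod_ms:.2f}x slower than dotprod")
```

The regression ranges from ~1.11x to ~1.80x, and is worst in the distributed configurations.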
This is unexpected, since microbenchmarks show that i8mm outperforms dotprod:
```
-----------------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_mmt4d_s8s8s32_tile_1x8x4_dotprod/real_time 0.129 us 0.128 us 8388607 items_per_second=127.338G/s
BM_mmt4d_s8s8s32_tile_2x8x4_dotprod/real_time 0.164 us 0.163 us 8388607 items_per_second=200.269G/s
BM_mmt4d_s8s8s32_tile_4x8x4_dotprod/real_time 0.253 us 0.253 us 4194303 items_per_second=258.707G/s
BM_mmt4d_s8s8s32_tile_8x8x4_dotprod/real_time 0.363 us 0.362 us 2097151 items_per_second=361.478G/s
BM_mmt4d_s8s8s32_tile_1x8x8_i8mm/real_time    0.191 us 0.191 us 4194303 items_per_second=171.266G/s
BM_mmt4d_s8s8s32_tile_2x8x8_i8mm/real_time    0.182 us 0.182 us 4194303 items_per_second=359.516G/s
BM_mmt4d_s8s8s32_tile_4x8x8_i8mm/real_time    0.274 us 0.273 us 4194303 items_per_second=479.116G/s
BM_mmt4d_s8s8s32_tile_8x8x8_i8mm/real_time    0.452 us 0.451 us 2097151 items_per_second=579.566G/s
```
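As a rough cross-check (my own back-of-the-envelope calculation, not from the report, and assuming the microbenchmark's "items" are comparable to multiply-add ops), the throughput implied by the single-thread end-to-end latencies is far below either ukernel's peak tile throughput, which suggests the gap comes from somewhere other than the inner-kernel arithmetic:

```python
# 197x768x768 matmul: one multiply and one add per MAC.
ops = 2 * 197 * 768 * 768  # == 232,390,656

for name, latency_ms, peak_gops in [
    ("i8mm", 7.9, 579.566),      # 8x8x8 tile peak from the microbenchmark
    ("dotprod", 6.3, 361.478),   # 8x8x4 tile peak from the microbenchmark
]:
    achieved = ops / (latency_ms * 1e-3) / 1e9
    print(f"{name}: ~{achieved:.1f} G ops/s achieved vs ~{peak_gops:.0f} G/s tile peak")
```

Both configurations run at well under 15% of their respective tile peaks on one thread.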
Steps to reproduce your issue
- Create a `matmul.mlir` file with the contents below:
```mlir
func.func @matmul(%lhs: tensor<197x768xi8>, %rhs: tensor<768x768xi8>, %acc: tensor<197x768xi32>) -> tensor<197x768xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<197x768xi8>, tensor<768x768xi8>) outs(%acc: tensor<197x768xi32>) -> tensor<197x768xi32>
  return %result: tensor<197x768xi32>
}
```
- Compile
```shell
INPUT_FILE=/tmp/matmul.mlir
ARM_CPU_FEATURE=dotprod
#ARM_CPU_FEATURE=i8mm
iree-compile "${INPUT_FILE}" \
  --iree-hal-target-backends="llvm-cpu" \
  --iree-input-type="tosa" \
  --iree-llvmcpu-target-cpu-features="+${ARM_CPU_FEATURE}" \
  --iree-llvmcpu-target-triple="aarch64-none-linux-android34" \
  --iree-opt-data-tiling=true \
  --iree-llvmcpu-enable-ukernels=all \
  -o /tmp/matmul_${ARM_CPU_FEATURE}.vmfb
```
Add `--iree-llvmcpu-disable-distribution` to disable distribution.
- Benchmark
```shell
ARM_CPU_FEATURE=dotprod
#ARM_CPU_FEATURE=i8mm
# On mobile device.
iree-benchmark-module \
  --module=matmul_${ARM_CPU_FEATURE}.vmfb \
  --device=local-task \
  --function=matmul \
  --input=197x768xi8=0 \
  --input=768x768xi8=0 \
  --input=197x768xi32=0 \
  --task_topology_cpu_ids=0
```
What component(s) does this issue relate to?
Compiler
Version information
d7de68a33f
Additional context
Related issue: https://github.com/openxla/iree/issues/15399
About this issue
- Original URL
- State: closed
- Created 4 months ago
- Comments: 16 (15 by maintainers)
(EDIT - this is fine actually; see the bottom of this comment.) There's still something (else) that's wonky about that asm.
It's subtracting 0x1800 == 6K from the pointer registers, only to then apply an immediate offset of 0x1800 in the load instructions to compensate for it. This seems silly; it's almost benign, but not quite: the two `add` instructions are part of the inner loop and collectively make it take ~1 extra cycle per loop iteration. That is probably a ~10% performance loss depending on the CPU, particularly on a Cortex-X2 / Cortex-X3 that does 4 smmla instructions per cycle.

EDIT - this is actually fine. This `x16` is actually the loop counter: see how it is the operand to the `adds` instruction; the `s` suffix means it sets the condition flags driving the loop's conditional jump. It's just counting the bytes left to be traversed until the end is reached. The 6K value comes from the dimensions of this dispatch: everything is so perfectly inlined that the static shape of this matmul propagates all the way here. The M0/N0 width of this kernel is 8 and the K dimension of this matmul is 768, so 8 * 768 == 6K of LHS/RHS data is traversed by this ukernel's inner loop.

And the reason why this probably isn't a 10% performance loss is that there are more than enough dual-issue slots: assuming an out-of-order core able to reorder this code a bit, these general-register additions will dual-issue fine with the NEON-register arithmetic. This code is perhaps not as friendly to an in-order Cortex-A510, but there the much higher cost of the smmla arithmetic (only 1 per cycle, vs 2 or 4 on A710/X2) dominates much more anyway.
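The 6K arithmetic above can be verified directly (a sanity-check sketch of my own, assuming the i8mm 8x8x8 tile and K == 768 as stated):

```python
K = 768          # reduction dimension of the 197x768x768 matmul
tile_width = 8   # M0/N0 width of the i8mm ukernel tile
# int8 elements are one byte each, so bytes traversed == elements traversed.
bytes_per_operand = tile_width * K
assert bytes_per_operand == 6 * 1024 == 0x1800
print(hex(bytes_per_operand))  # 0x1800
```

So the 0x1800 immediate in the asm is exactly the statically known per-operand byte count of one inner-loop traversal.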