iree: [CPU] i8mm i8*i8=i32 ukernel is slower than dotprod

What happened?

On a simple matmul:

func.func @matmul(%lhs: tensor<197x768xi8>, %rhs: tensor<768x768xi8>, %acc: tensor<197x768xi32>) -> tensor<197x768xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<197x768xi8>, tensor<768x768xi8>) outs(%acc: tensor<197x768xi32>) -> tensor<197x768xi32>
  return %result: tensor<197x768xi32>
}

The +dotprod ukernel is faster than the +i8mm ukernel:

                             i8mm latency (ms)   dotprod latency (ms)
1 thread, no distribution          7.9                  6.3
1 thread, distribution            10.3                  6.1
5 threads, no distribution        21.8                 19.7
5 threads, distribution            6.3                  3.5

This is unexpected, since microbenchmarks show i8mm performing better than dotprod:

-----------------------------------------------------------------------------------------------------------
Benchmark                                                 Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------
BM_mmt4d_s8s8s32_tile_1x8x4_dotprod/real_time         0.129 us        0.128 us      8388607 items_per_second=127.338G/s
BM_mmt4d_s8s8s32_tile_2x8x4_dotprod/real_time         0.164 us        0.163 us      8388607 items_per_second=200.269G/s
BM_mmt4d_s8s8s32_tile_4x8x4_dotprod/real_time         0.253 us        0.253 us      4194303 items_per_second=258.707G/s
BM_mmt4d_s8s8s32_tile_8x8x4_dotprod/real_time         0.363 us        0.362 us      2097151 items_per_second=361.478G/s
BM_mmt4d_s8s8s32_tile_1x8x8_i8mm/real_time            0.191 us        0.191 us      4194303 items_per_second=171.266G/s
BM_mmt4d_s8s8s32_tile_2x8x8_i8mm/real_time            0.182 us        0.182 us      4194303 items_per_second=359.516G/s
BM_mmt4d_s8s8s32_tile_4x8x8_i8mm/real_time            0.274 us        0.273 us      4194303 items_per_second=479.116G/s
BM_mmt4d_s8s8s32_tile_8x8x8_i8mm/real_time            0.452 us        0.451 us      2097151 items_per_second=579.566G/s

Steps to reproduce your issue

  1. Create matmul.mlir file with the contents below:
func.func @matmul(%lhs: tensor<197x768xi8>, %rhs: tensor<768x768xi8>, %acc: tensor<197x768xi32>) -> tensor<197x768xi32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<197x768xi8>, tensor<768x768xi8>) outs(%acc: tensor<197x768xi32>) -> tensor<197x768xi32>
  return %result: tensor<197x768xi32>
}
  2. Compile
INPUT_FILE=/tmp/matmul.mlir
ARM_CPU_FEATURE=dotprod
#ARM_CPU_FEATURE=i8mm

iree-compile "${INPUT_FILE}" --iree-hal-target-backends="llvm-cpu" --iree-input-type="tosa" --iree-llvmcpu-target-cpu-features="+${ARM_CPU_FEATURE}" --iree-llvmcpu-target-triple="aarch64-none-linux-android34" --iree-opt-data-tiling=true --iree-llvmcpu-enable-ukernels=all -o /tmp/matmul_${ARM_CPU_FEATURE}.vmfb

Add --iree-llvmcpu-disable-distribution to disable distribution.

  3. Benchmark
ARM_CPU_FEATURE=dotprod
#ARM_CPU_FEATURE=i8mm

# On mobile device.
iree-benchmark-module --module=matmul_${ARM_CPU_FEATURE}.vmfb --device=local-task --function=matmul --input=197x768xi8=0 --input=768x768xi8=0 --input=197x768xi32=0 --task_topology_cpu_ids=0

What component(s) does this issue relate to?

Compiler

Version information

d7de68a33f

Additional context

Related issue https://github.com/openxla/iree/issues/15399

About this issue

  • State: closed
  • Created 4 months ago
  • Comments: 16 (15 by maintainers)

Most upvoted comments

(EDIT: this is fine actually, see the bottom of this comment.) There's still something (else) that's wonky about that asm.

It’s subtracting 0x1800 == 6K from the pointer registers:

mov x16, #-0x1800
[...]
// within the loop
add x17, x10, x16
add x0, x14, x16

only to then apply an immediate offset of 0x1800 in the load-instructions to compensate that:

ldr q9, [x0, #0x1800]
// (and likewise the following 7 load instructions)

This seems silly; it’s almost benign, but not quite: the two add instructions are part of the inner loop and collectively cost ~1 extra cycle per loop iteration. That is probably a ~10% performance loss depending on the CPU, and particularly on a Cortex-X2 / Cortex-X3, which can execute 4 smmla instructions per cycle.

EDIT - this is actually fine. This x16 is actually the loop counter: see how it is the operand to the adds instruction; the s suffix means it sets the condition flags driving the loop’s conditional branch. It is just counting the bytes left to traverse until the end is reached. The 6K value comes from the dimensions of this dispatch: everything is so perfectly inlined that the static shape of this matmul propagates all the way here. With this kernel’s M0/N0 width of 8 and this matmul’s K dimension of 768, 8 * 768 == 6K of LHS/RHS data is traversed by this ukernel inner loop.
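The 6K arithmetic checks out (plain shell, nothing IREE-specific assumed):

```shell
# 8 (tile M0/N0) * 768 (K) bytes of i8 data per ukernel inner-loop traversal:
echo $((8 * 768))              # 6144
printf '0x%x\n' $((8 * 768))   # 0x1800
```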

And the reason why this probably isn’t a 10% performance loss is that there are more than enough dual-issue slots: assuming an out-of-order core able to reorder this code a bit, these general-register additions will dual-issue fine with the NEON-register arithmetic. This code is perhaps less friendly to an in-order Cortex-A510, but there the much higher cost of the smmla arithmetic (only 1 per cycle, vs 2 or 4 on A710/X2) dominates anyway.