iree: [CPU][ARM] DT+UK 50x slower than Default flags on BF16 on Android

What happened?

name                              threads  Default flags latency (ms)  DT+UK latency (ms)
BERT_BASE_BF16_JAX_I32_SEQLEN8    1        69.1                        3963.0
BERT_BASE_BF16_JAX_I32_SEQLEN32   1        172.0                       15964.0
BERT_BASE_BF16_JAX_I32_SEQLEN64   1        326.0                       32015.0
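
For reference, the slowdown implied by each row can be worked out directly from the numbers above. This is only a small sketch: the latencies are copied from the table, and the `slowdown_table` function name is made up for illustration.

```shell
# Slowdown factor (DT+UK latency / Default latency) for each benchmark.
# Input columns: name, Default latency (ms), DT+UK latency (ms).
# slowdown_table is a hypothetical helper name, not part of IREE.
slowdown_table() {
  awk '{ printf "%s %.1fx\n", $1, $3 / $2 }' <<'EOF'
BERT_BASE_BF16_JAX_I32_SEQLEN8 69.1 3963.0
BERT_BASE_BF16_JAX_I32_SEQLEN32 172.0 15964.0
BERT_BASE_BF16_JAX_I32_SEQLEN64 326.0 32015.0
EOF
}

slowdown_table
```

That works out to roughly 57x, 93x, and 98x, so the gap widens as the sequence length grows.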

Steps to reproduce your issue

  1. Download MLIR: https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.19_1698302455/BERT_BASE_BF16_JAX_I32_SEQLEN8/stablehlo.mlirbc

  2. Compile:

ARM_CPU_FEATURES="+v9a,+fullfp16,+fp-armv8,+neon,+aes,+sha2,+crc,+lse,+rdm,+complxnum,+rcpc,+sha3,+sm4,+dotprod,+fp16fml,+dit,+flagm,+ssbs,+sb,+altnzcv,+fptoint,+bf16,+i8mm,+bti"

# The two flags just before -o (--iree-opt-data-tiling and
# --iree-llvmcpu-enable-microkernels) are the ones for DT+UK.
iree-compile stablehlo.mlirbc \
    --iree-hal-target-backends="llvm-cpu" \
    --iree-input-type="stablehlo" \
    --iree-llvmcpu-link-embedded=false \
    --iree-input-demote-f64-to-f32=false \
    --iree-input-demote-i64-to-i32=false \
    --iree-llvmcpu-target-cpu-features="${ARM_CPU_FEATURES}" \
    --iree-llvmcpu-target-triple="aarch64-none-linux-android34" \
    --iree-opt-data-tiling \
    --iree-llvmcpu-enable-microkernels \
    -o module.vmfb
  3. Run on device:

iree-benchmark-module --module=module.vmfb --task_topology_cpu_ids=0 --device=local-task --function=main --input=1x8xi32=0 --input=1x8xi32=0
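
To compare the two configurations side by side, something like the following sketch could build both modules from the same base flags. The `variant_flags` and `compile_variant` helpers and the `module_<variant>.vmfb` output names are made up for illustration; the compiler flags are the ones from step 2. It assumes the Default configuration is the same command without the two DT+UK flags, which the report does not state explicitly.

```shell
# Hypothetical helper: prints the extra flags for a variant, one per line.
variant_flags() {
  case "$1" in
    default) ;;  # assumption: Default flags = no extra flags
    dtuk) printf '%s\n' --iree-opt-data-tiling --iree-llvmcpu-enable-microkernels ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: compiles stablehlo.mlirbc into module_<variant>.vmfb.
# Expects ARM_CPU_FEATURES to be set as in step 2.
compile_variant() {
  extra="$(variant_flags "$1")" || return 1
  # $extra is left unquoted on purpose so each flag becomes its own argument.
  iree-compile stablehlo.mlirbc \
      --iree-hal-target-backends="llvm-cpu" \
      --iree-input-type="stablehlo" \
      --iree-llvmcpu-link-embedded=false \
      --iree-input-demote-f64-to-f32=false \
      --iree-input-demote-i64-to-i32=false \
      --iree-llvmcpu-target-cpu-features="${ARM_CPU_FEATURES}" \
      --iree-llvmcpu-target-triple="aarch64-none-linux-android34" \
      $extra \
      -o "module_$1.vmfb"
}

# Usage (requires iree-compile on PATH):
#   compile_variant default
#   compile_variant dtuk
```

Keeping the variant-specific flags in a separate function avoids placing a comment inside the backslash-continued command, where the shell would treat everything after it as a new command.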

What component(s) does this issue relate to?

Compiler

Version information

675aafb

Additional context

No response

About this issue

  • State: closed
  • Created 8 months ago
  • Comments: 19 (16 by maintainers)

Most upvoted comments

Note @mariecwhite @dcaballe : earlier draft versions of #15543 had debug logic left in that was preventing the optimized tile function from being selected. That’s fixed now, but if you tested earlier and got no performance impact, that would have been the reason.