iree: [CPU][ARM] DT+UK 50x slower than Default flags on BF16 on Android
What happened?
name | threads | Default flags latency (ms) | DT+UK latency (ms) |
---|---|---|---|
BERT_BASE_BF16_JAX_I32_SEQLEN8 | 1 | 69.1 | 3963.0 |
BERT_BASE_BF16_JAX_I32_SEQLEN32 | 1 | 172.0 | 15964.0 |
BERT_BASE_BF16_JAX_I32_SEQLEN64 | 1 | 326.0 | 32015.0 |
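For reference, the slowdown per row can be computed directly from the two latency columns. This is a small sketch; `slowdown` is an illustrative helper name, and the numbers are copied from the table above. With these values the ratios work out to roughly 57x, 93x, and 98x.

```shell
# Print DT+UK latency divided by default-flags latency, to one decimal place.
# Arguments: $1 = default-flags latency (ms), $2 = DT+UK latency (ms).
slowdown() {
  awk -v base="$1" -v dtuk="$2" 'BEGIN { printf "%.1f\n", dtuk / base }'
}

slowdown 69.1 3963.0     # BERT_BASE_BF16_JAX_I32_SEQLEN8
slowdown 172.0 15964.0   # BERT_BASE_BF16_JAX_I32_SEQLEN32
slowdown 326.0 32015.0   # BERT_BASE_BF16_JAX_I32_SEQLEN64
```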
Steps to reproduce your issue
- Download MLIR: https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.19_1698302455/BERT_BASE_BF16_JAX_I32_SEQLEN8/stablehlo.mlirbc
- Compile:
ARM_CPU_FEATURES="+v9a,+fullfp16,+fp-armv8,+neon,+aes,+sha2,+crc,+lse,+rdm,+complxnum,+rcpc,+sha3,+sm4,+dotprod,+fp16fml,+dit,+flagm,+ssbs,+sb,+altnzcv,+fptoint,+bf16,+i8mm,+bti"
# The two flags just before -o enable DT+UK; omit them for the default-flags build.
iree-compile stablehlo.mlirbc \
  --iree-hal-target-backends="llvm-cpu" \
  --iree-input-type="stablehlo" \
  --iree-llvmcpu-link-embedded=false \
  --iree-input-demote-f64-to-f32=false \
  --iree-input-demote-i64-to-i32=false \
  --iree-llvmcpu-target-cpu-features="${ARM_CPU_FEATURES}" \
  --iree-llvmcpu-target-triple="aarch64-none-linux-android34" \
  --iree-opt-data-tiling \
  --iree-llvmcpu-enable-microkernels \
  -o module.vmfb
- Run on device:
iree-benchmark-module --module=module.vmfb --task_topology_cpu_ids=0 --device=local-task --function=main --input=1x8xi32=0 --input=1x8xi32=0
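To compare the two configurations from the same input, the compile step can be wrapped so that only the DT+UK flags differ between builds. The following is a minimal sketch, not the reporter's actual script: `compile_flags` and the `module_default.vmfb` / `module_dtuk.vmfb` file names are hypothetical, and it assumes "Default flags" means the same command without the two DT+UK flags. It expects `ARM_CPU_FEATURES` to be set as in the compile step.

```shell
# Print one iree-compile flag per line for the requested variant.
# $1 is "default" or "dtuk" (illustrative names, not from the issue).
compile_flags() {
  echo "--iree-hal-target-backends=llvm-cpu"
  echo "--iree-input-type=stablehlo"
  echo "--iree-llvmcpu-link-embedded=false"
  echo "--iree-input-demote-f64-to-f32=false"
  echo "--iree-input-demote-i64-to-i32=false"
  # Fails with a message if ARM_CPU_FEATURES has not been set.
  echo "--iree-llvmcpu-target-cpu-features=${ARM_CPU_FEATURES:?set ARM_CPU_FEATURES first}"
  echo "--iree-llvmcpu-target-triple=aarch64-none-linux-android34"
  if [ "$1" = "dtuk" ]; then
    # The only difference between the two builds.
    echo "--iree-opt-data-tiling"
    echo "--iree-llvmcpu-enable-microkernels"
  fi
}

# Usage (requires iree-compile on PATH; no flag contains whitespace,
# so unquoted expansion splits them correctly):
#   iree-compile stablehlo.mlirbc $(compile_flags default) -o module_default.vmfb
#   iree-compile stablehlo.mlirbc $(compile_flags dtuk)    -o module_dtuk.vmfb
```

Keeping the shared flags in one place makes it harder for the two builds to drift apart, so any latency difference can be attributed to data tiling and microkernels alone.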
What component(s) does this issue relate to?
Compiler
Version information
675aafb
Additional context
No response
About this issue
- State: closed
- Created 8 months ago
- Comments: 19 (16 by maintainers)
Commits related to this issue
- ukernels: add `bf16 * bf16 -> bf16` optimized tile functions for x86 and arm64. (#15543) This should fix the "50x slowdown" performance issue #15504. — committed to iree-org/iree by bjacob 8 months ago
- ukernels: add `bf16 * bf16 -> bf16` optimized tile functions for x86 and arm64. (#15543) This should fix the "50x slowdown" performance issue #15504. — committed to ramiro050/iree by bjacob 8 months ago
Note @mariecwhite @dcaballe: earlier draft versions of #15543 still had debug logic left in that prevented the optimized tile function from being selected. That's fixed now, but if you tested earlier and saw no performance impact, that would have been the reason.