iree: [CPU] Understand why IREE is 2x slower than TFLite on ViT INT8 on ARM64
What happened?
On Pixel 8 Pro CPU, IREE latency on ViT is 236ms whereas TFLite is 118ms. Let’s understand why.
Steps to reproduce your issue
Build a version of IREE with https://github.com/openxla/iree/pull/15387 patched.
Compile for Android
iree-compile tosa.mlirbc \
--iree-hal-target-backends=llvm-cpu \
--iree-input-type="tosa" \
--iree-input-demote-f64-to-f32=false \
--iree-input-demote-i64-to-i32=false \
--iree-input-promote-bf16-to-f32=false \
--iree-llvmcpu-debug-symbols=true \
--iree-vm-bytecode-module-strip-source-map=true \
--iree-vm-emit-polyglot-zip=false \
--iree-llvmcpu-target-cpu="cortex-a715" \
--iree-llvmcpu-target-triple="aarch64-none-linux-android33" \
--iree-opt-data-tiling \
--iree-llvmcpu-enable-microkernels \
-o vit.vmfb
Run on device:
taskset 1F0 iree-benchmark-module --module=vit.vmfb --task_topology_group_count=5 --task_topology_cpu_ids=0,1,2,3,4 --device=local-task --function=main --input=1x3x224x224xi8=0
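As a sanity check on the core pinning, the `taskset` hex mask can be decoded into the CPU ids it allows with a small script (a sketch, not from the issue; it assumes bit i of the mask corresponds to CPU i). `0x1F0` pins to CPUs 4-8, which matches the five threads requested if the mid/big cores are numbered 4-8 on this device:

```shell
# Decode a taskset hex affinity mask into the CPU ids it allows.
# Assumption (not from the issue): bit i of the mask corresponds to CPU i.
mask=0x1F0
cpus=""
i=0
while [ "$i" -le 15 ]; do
  if [ $(( (mask >> i) & 1 )) -eq 1 ]; then
    cpus="$cpus $i"
  fi
  i=$((i + 1))
done
echo "mask $mask allows CPUs:$cpus"
# prints: mask 0x1F0 allows CPUs: 4 5 6 7 8
```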
What component(s) does this issue relate to?
Compiler
Version information
d32d8ce6c
Additional context
No response
About this issue
- State: open
- Created 8 months ago
- Comments: 56 (54 by maintainers)
If we're looking for a 2x, can we ignore the 4% inefficiency for now?
@bjacob knows the xnnpack and ukernel story well. Unless something has changed, I don't think this is doing anything particularly advanced and probably just needs some catch-up. Any advice to keep the analysis on the most profitable path, Benoit?
You can also quickly get a profile by adding
--enable_op_profiling=true
to the benchmark run.
Here is a nice article on performance measurement of TFLite models, including ways to profile them: https://www.tensorflow.org/lite/performance/measurement
In that article is a link to the prebuilt benchmark tool for Android ARM: https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model
Once downloaded,
adb push
the tool to the device. Also download and push the TFLite flatbuffer: https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.20_1699872537/VIT_CLASSIFICATION_JAX_3X224X224XF32/model_int8.tflite
Then on the device run:
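For completeness, the push-and-run sequence might look like the following (a sketch, not from the issue: the `/data/local/tmp` paths and the thread count are assumptions; `--graph`, `--num_threads`, and `--enable_op_profiling` are documented flags of the TFLite benchmark tool):

```shell
# Push the prebuilt benchmark tool and the INT8 ViT model to the device.
# The /data/local/tmp destination is an assumption, not from the issue.
adb push android_aarch64_benchmark_model /data/local/tmp/
adb push model_int8.tflite /data/local/tmp/
adb shell chmod +x /data/local/tmp/android_aarch64_benchmark_model

# Run the benchmark with per-op profiling enabled, pinned like the IREE run.
adb shell taskset 1F0 /data/local/tmp/android_aarch64_benchmark_model \
  --graph=/data/local/tmp/model_int8.tflite \
  --num_threads=5 \
  --enable_op_profiling=true
```

The op-profiling summary breaks total latency down per operator, which makes it easy to compare against IREE's per-dispatch timings.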