iree: [CPU] Understand why IREE is 2x slower than TFLite on ViT INT8 on ARM64

What happened?

On the Pixel 8 Pro CPU, IREE latency on ViT INT8 is 236 ms whereas TFLite's is 118 ms. Let's understand why.

Steps to reproduce your issue

Download https://storage.googleapis.com/iree-model-artifacts/tflite/tflite_models_1698315913/VIT_CLASSIFICATION_INT8_TFLITE_3X224X224XINT8/tosa.mlirbc

Build a version of IREE with https://github.com/openxla/iree/pull/15387 patched.

Compile for Android

iree-compile tosa.mlirbc \
    --iree-hal-target-backends=llvm-cpu \
    --iree-input-type="tosa" \
    --iree-input-demote-f64-to-f32=false \
    --iree-input-demote-i64-to-i32=false \
    --iree-input-promote-bf16-to-f32=false \
    --iree-llvmcpu-debug-symbols=true \
    --iree-vm-bytecode-module-strip-source-map=true \
    --iree-vm-emit-polyglot-zip=false \
    --iree-llvmcpu-target-cpu="cortex-a715" \
    --iree-llvmcpu-target-triple="aarch64-none-linux-android33" \
    --iree-opt-data-tiling \
    --iree-llvmcpu-enable-microkernels \
    -o vit.vmfb

Run on device:

taskset 1F0 iree-benchmark-module \
    --module=vit.vmfb \
    --task_topology_group_count=5 \
    --task_topology_cpu_ids=0,1,2,3,4 \
    --device=local-task \
    --function=main \
    --input=1x3x224x224xi8=0
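As a side note, the `taskset` argument is a hex bitmask of allowed CPUs. A quick sketch of how `1F0` is derived, assuming the five bigger cores (4x Cortex-A715 + 1x Cortex-X3 on the Pixel 8 Pro's Tensor G3) sit at CPU ids 4-8 (the exact id layout is an assumption; check /proc/cpuinfo on your device):

```shell
# Build an affinity mask by setting one bit per allowed CPU id.
mask=0
for cpu in 4 5 6 7 8; do
  mask=$(( mask | (1 << cpu) ))
done
printf '%X\n' "$mask"   # prints 1F0
```

This makes it easy to adjust the mask if you want to pin to a different core subset.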

What component(s) does this issue relate to?

Compiler

Version information

d32d8ce6c

Additional context

No response

About this issue

  • State: open
  • Created 8 months ago
  • Comments: 56 (54 by maintainers)

Most upvoted comments

If we're looking for a 2x, ignore the inefficiency that accounts for only 4% for now?

@bjacob knows the xnnpack and ukernel story well. Unless something has changed, I don't think this is doing anything particularly advanced and probably just needs some catch-up. Any advice to keep the analysis on the most profitable path, Benoit?

You can also quickly get a profile by adding --enable_op_profiling=true to the benchmark run.
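For reference, a sketch of such a profiled run, assuming the tool and model have already been pushed to /data/local/tmp (both paths are assumptions; adjust to wherever you staged them):

```shell
# Compose the profiled benchmark invocation; op profiling prints a
# per-operator time breakdown after the run completes.
TOOL=/data/local/tmp/android_aarch64_benchmark_model
MODEL=/data/local/tmp/model_int8.tflite
CMD="$TOOL --graph=$MODEL --num_threads=1 --enable_op_profiling=true"
echo "$CMD"          # inspect the command before running it
# adb shell "$CMD"   # uncomment to run on the attached device
```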

Here is a nice article on performance measurement of TFLite models, including ways to profile them: https://www.tensorflow.org/lite/performance/measurement

In that article is a link to the prebuilt benchmark tool for Android ARM: https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model

Once downloaded, adb push the tool to the device. Also download and push the TFLite flatbuffer: https://storage.googleapis.com/iree-model-artifacts/jax/jax_models_0.4.20_1699872537/VIT_CLASSIFICATION_JAX_3X224X224XF32/model_int8.tflite
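The staging steps above can be sketched as follows, assuming both files were downloaded to the current directory (the on-device directory is an assumption; /data/local/tmp is just the usual writable location):

```shell
# Stage the benchmark tool and the TFLite flatbuffer on the device.
DEVICE_DIR=/data/local/tmp
if command -v adb >/dev/null; then
  adb push android_aarch64_benchmark_model "$DEVICE_DIR/"
  adb push model_int8.tflite "$DEVICE_DIR/"
  adb shell chmod +x "$DEVICE_DIR/android_aarch64_benchmark_model"
else
  echo "adb not found; install Android platform-tools and connect a device"
fi
```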

Then on the device run:

./android_aarch64_benchmark_model --graph=<path to tflite model> --num_threads=1