iree: RV32 code size regression from #11576 LLVM integrate on 2022-12-16

Suspected LLVM integration: #11576

https://perf.iree.dev/serie?IREE?PersonDetect [int8] (TFLite) CPU-RV32-Generic full-inference%2Cdefault-flags [compilation%3Amodule%3Acomponent-size%3Atotal-dispatch-size]

A direct local repro is nontrivial because the MLIR IR format has changed since that time; it requires an iree-import-tflite from that timeframe. To make it easier, I'm attaching the already-imported file here: person_detect.zip

cmake --build . --target iree-compile && \
tools/iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-input-type=tosa \
  --iree-llvm-target-abi=ilp32 \
  --iree-llvm-target-cpu-features=+m,+a,+f,+zvl512b,+zve32x \
  --iree-llvm-target-cpu=generic-rv32 \
  --iree-llvm-target-triple=riscv32-pc-linux-elf \
  --riscv-v-fixed-length-vector-lmul-max=8 \
  --riscv-v-vector-bits-min=512 \
  benchmark_suites/TFLite/person_detect.tflite.mlir \
  -o /tmp/a.vmfb \
  --iree-llvm-keep-linker-artifacts 2>&1 \
  | grep -o '/.*\.so' | xargs size -A | grep '^\.text' | awk '{print $2}'

This prints the .text section size in bytes:

IREE commit (after git submodule update)      .text size (bytes)
79b90d32d1723b0650b33bc5584dccb4828e5421      871912
7b4688272e40e939dc02053c7178b111a21eadd5      180752

About this issue

  • State: closed
  • Created a year ago
  • Comments: 28 (25 by maintainers)

Most upvoted comments

Bisect results coming soon (~ 5 bisection steps remaining)

I confirm that #12241 fixes the test case here. Results using the repro from this issue's description (with the current iree-import-tflite to generate person_detect.tflite.mlir):

commit          .text size (bytes)
Current main    905036
With #12241     185720

Thanks a lot @kuhar for the effective fix!

If possible, I would really appreciate having this included in main, as our project pulls IREE's release candidates for iree-compile instead of building from source. I'm okay with hiding this behind a compile flag if we don't want the behavior to be the default.

The original IR generates two mul ops: from the first, only the high part of the product is used; from the second, only the low part. The backend is able to match these patterns and generate the corresponding hi/lo multiplies available in zve32x. If we instead generate a single widening mul and extract both the high and low parts from it, the backend will try to legalize that single 64-bit mul and will end up scalarizing it.
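As an illustrative sketch of the two IR shapes described above (written in C rather than the actual MLIR; the function names are made up for this example), the difference looks like this:

```c
#include <stdint.h>

/* Shape the backend can match: two separate multiplies, one used only
 * for the high 32 bits of the product and one only for the low 32
 * bits. On RISC-V these map to the dedicated high/low multiply
 * instructions (mulh/mul, or vmulh.vv/vmul.vv when vectorized under
 * zve32x), so no 64-bit value ever materializes. */
void mul_two_ops(const int32_t *a, const int32_t *b,
                 int32_t *hi, uint32_t *lo, int n) {
  for (int i = 0; i < n; ++i) {
    hi[i] = (int32_t)(((int64_t)a[i] * b[i]) >> 32); /* high half only */
    lo[i] = (uint32_t)((int64_t)a[i] * b[i]);        /* low half only  */
  }
}

/* Problematic shape: a single widening multiply whose 64-bit result
 * feeds both extracts. Under zve32x (ELEN = 32) there are no 64-bit
 * vector elements, so the single i64 mul cannot be vectorized and
 * gets scalarized. Both functions compute the same values. */
void mul_one_op(const int32_t *a, const int32_t *b,
                int32_t *hi, uint32_t *lo, int n) {
  for (int i = 0; i < n; ++i) {
    int64_t wide = (int64_t)a[i] * b[i];
    hi[i] = (int32_t)(wide >> 32);
    lo[i] = (uint32_t)wide;
  }
}
```

At the C source level a compiler may canonicalize both functions to the same IR; the point is the shape of the IR that reaches the RISC-V backend, where the two-op form is the one that pattern-matches to the hi/lo multiply instructions.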