tensorflow: XNNPACK delegate performs much slower than the default TFLite backend if multi-threading is configured according to the documentation
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04, Android 10
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Any Android smartphone
- TensorFlow installed from (source or binary): Source
- TensorFlow version (use command below): 2.3.0
- Python version: -
- Bazel version (if compiling from source): 3.1.0
- GCC/Compiler version (if compiling from source): GCC 5.4.0 / Clang shipped with Android NDK 21
- CUDA/cuDNN version: -
- GPU model and memory: -
When TFLite is built with XNNPACK, a performance improvement is expected. However, the code provided in /tensorflow/lite/examples/minimal, with only a minor change to the interpreter settings, leads to performance degradation compared to the default build.
Here is the code taken from the minimal example with my changes:
...
// Load model
std::unique_ptr<tflite::FlatBufferModel> model =
tflite::FlatBufferModel::BuildFromFile(filename);
TFLITE_MINIMAL_CHECK(model != nullptr);
// Build the interpreter
tflite::ops::builtin::BuiltinOpResolver resolver;
InterpreterBuilder builder(*model, resolver);
std::unique_ptr<Interpreter> interpreter;
builder(&interpreter);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);
// Allocate tensor buffers.
TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
printf("=== Pre-invoke Interpreter State ===\n");
tflite::PrintInterpreterState(interpreter.get());
// Set number of threads (added by me)
interpreter->SetNumThreads(8);
// Run inference
TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
printf("\n\n=== Post-invoke Interpreter State ===\n");
tflite::PrintInterpreterState(interpreter.get());
...
This code performs slower when executed with the TFLite + XNNPACK build. I’ve tested it on both an x64 desktop and arm64 Android using a ResNet-34 FP32 TFLite model and observed the exact same performance degradation.
I was able to fix the behavior and achieve a 30% performance improvement only after spending a few hours in the TFLite code, where I found out that tflite::Interpreter::SetNumThreads is not applied to the XNNPACK delegate (and possibly not to other delegates either). The XNNPACK delegate is initialized only inside builder(&interpreter), with the number of threads passed to that invocation, and it is not updated by a later interpreter->SetNumThreads(8) call. In the case illustrated by the code above, XNNPACK effectively runs in single-threaded mode. So the fix is to initialize the interpreter as follows:
builder(&interpreter, 8);
With that change, XNNPACK does deliver a significant performance improvement.
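For reference, here is the full corrected flow as a minimal sketch (based on the minimal example above; the thread count of 8 matches my test setup):
...
// Load model
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(filename);
TFLITE_MINIMAL_CHECK(model != nullptr);
// Build the interpreter, passing the thread count to the builder invocation
// so that the default XNNPACK delegate is created with 8 threads.
tflite::ops::builtin::BuiltinOpResolver resolver;
InterpreterBuilder builder(*model, resolver);
std::unique_ptr<Interpreter> interpreter;
builder(&interpreter, /*num_threads=*/8);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);
// Allocate tensor buffers and run inference as before.
TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
...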
I’m OK with the solution that I found, but I was really confused by this issue and spent almost a day figuring out why I could not achieve the claimed performance, because neither the official documentation nor the TFLite code comments mention InterpreterBuilder’s num_threads argument as necessary, or mention it at all. Thus, following the “Tweak the number of threads” documentation section in combination with XNNPACK will lead anyone into this pitfall and result in very poor performance.
If needed, I can provide a more standalone example, point to the parts of the TFLite code responsible for this behavior, and share detailed measurements obtained on different devices.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 22 (11 by maintainers)
Commits related to this issue
- Updated XNNPACK delegate readme Added some info on XNNPACK delegate to avoid confusion caused by XNNPACK engine being single-threaded by default. Further details are available in the description of ... — committed to dev0x13/tensorflow by dev0x13 4 years ago
- Patched C API implementation to correctly use interpreter num threads Patched C API implementation to correctly use interpreter number of threads setting while invoking InterpreterBuilder since it is... — committed to dev0x13/tensorflow by dev0x13 3 years ago
@andreydung Yes, I am building the library this way. Here are the steps for clarity’s sake:
1. Run ./configure with all parameters left at their defaults, except the Android NDK setup (I am using NDK 21, by the way).
2. Run bazel build -c opt --define tflite_with_xnnpack=true --config=android_arm64 //tensorflow/lite:libtensorflowlite.so
Works like a charm for me.
Hi @dev0x13,
The issue should have been fixed with commit https://github.com/tensorflow/tensorflow/commit/3d3c6db1ca2d50f6f07722cd800144f8f736167c.
The updated documentation gives info about setting num_threads while initializing the interpreter.
Thanks.
Yes, you are right about this. I was confused with an earlier implementation of this feature, where the number of threads was passed when creating the XNNPACK delegate.
I think such a delicate situation is mainly caused by us trying to apply the XNNPACK delegate by default while honoring users’ intention to explicitly use another TfLite delegate with the C++ APIs. When it comes to the C APIs, I think this pitfall is avoided, as one has to provide the number of threads when creating the interpreter.
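For reference, the C API flow looks roughly like this (a minimal sketch based on tensorflow/lite/c/c_api.h; the "model.tflite" path and the thread count of 8 are placeholders):
#include "tensorflow/lite/c/c_api.h"
...
// With the C API the thread count is part of the interpreter options, so it
// is already known when the interpreter (and thus the default XNNPACK
// delegate) is created.
TfLiteModel* model = TfLiteModelCreateFromFile("model.tflite");
TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
TfLiteInterpreterOptionsSetNumThreads(options, 8);
TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
TfLiteInterpreterAllocateTensors(interpreter);
TfLiteInterpreterInvoke(interpreter);
// Clean up.
TfLiteInterpreterDelete(interpreter);
TfLiteInterpreterOptionsDelete(options);
TfLiteModelDelete(model);
...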
@dev0x13 Thanks for your response, it’s much appreciated.