tensorflow: XNNPACK delegate performs much slower than the default TFLite backend if multi-threading is configured according to the documentation
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04, Android 10
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Any Android smartphone
- TensorFlow installed from (source or binary): Source
- TensorFlow version (use command below): 2.3.0
- Python version: -
- Bazel version (if compiling from source): 3.1.0
- GCC/Compiler version (if compiling from source): GCC 5.4.0 / Clang shipped with Android NDK 21
- CUDA/cuDNN version: -
- GPU model and memory: -
When TFLite is built with XNNPACK, a performance improvement is expected. However, the code provided in /tensorflow/lite/examples/minimal, with only a minor change to the interpreter settings, leads to performance degradation compared to the default build.
Here is the code taken from the minimal example with my changes:
...
// Load model
std::unique_ptr<tflite::FlatBufferModel> model =
tflite::FlatBufferModel::BuildFromFile(filename);
TFLITE_MINIMAL_CHECK(model != nullptr);
// Build the interpreter
tflite::ops::builtin::BuiltinOpResolver resolver;
InterpreterBuilder builder(*model, resolver);
std::unique_ptr<Interpreter> interpreter;
builder(&interpreter);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);
// Allocate tensor buffers.
TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
printf("=== Pre-invoke Interpreter State ===\n");
tflite::PrintInterpreterState(interpreter.get());
// Set number of threads (added by me)
interpreter->SetNumThreads(8);
// Run inference
TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
printf("\n\n=== Post-invoke Interpreter State ===\n");
tflite::PrintInterpreterState(interpreter.get());
...
This code performs slower when executed with the TFLite + XNNPACK build. I’ve tested it on both an x64 desktop and arm64 Android using a ResNet-34 FP32 TFLite model and observed the exact same performance degradation.
I was able to fix the behavior and achieve a 30% performance improvement only after spending a few hours in the TFLite code, where I found out that tflite::Interpreter::SetNumThreads is not applied to the XNNPACK delegate (and possibly not to other delegates either). The XNNPACK delegate is initialized only inside builder(&interpreter), with the number of threads passed to that invocation, and it is not updated by a later interpreter->SetNumThreads(8) call. In the case illustrated by the code above, XNNPACK effectively runs in single-threaded mode. So the fix is to initialize the interpreter as follows:
builder(&interpreter, 8);
With that change, XNNPACK does deliver a significant performance improvement.
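For reference, here is the full corrected flow as a minimal sketch (based on the minimal example above; the thread count of 8 matches my test setup):
...
// Load model
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(filename);
TFLITE_MINIMAL_CHECK(model != nullptr);
// Build the interpreter, passing the thread count to the builder invocation
// so that the default XNNPACK delegate is created with 8 threads.
tflite::ops::builtin::BuiltinOpResolver resolver;
InterpreterBuilder builder(*model, resolver);
std::unique_ptr<Interpreter> interpreter;
builder(&interpreter, /*num_threads=*/8);
TFLITE_MINIMAL_CHECK(interpreter != nullptr);
// Allocate tensor buffers and run inference as before.
TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
...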
I’m OK with the solution that I found, but I was really confused by this issue and spent almost a day figuring out why I could not achieve the claimed performance, because neither the official documentation nor the TFLite code comments mention InterpreterBuilder’s num_threads argument as necessary, or mention it at all. Thus, following the “Tweak the number of threads” documentation section in combination with XNNPACK will lead anyone into this pitfall and result in very poor performance.
If needed, I can provide a more standalone example, point to the parts of the TFLite code responsible for this behavior, and share detailed measurements obtained on different devices.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Reactions: 5
- Comments: 22 (11 by maintainers)
Commits related to this issue
- Updated XNNPACK delegate readme Added some info on XNNPACK delegate to avoid confusion caused by XNNPACK engine being single-threaded by default. Further details are available in the description of ... — committed to dev0x13/tensorflow by dev0x13 4 years ago
- Patched C API implementation to correctly use interpreter num threads Patched C API implementation to correctly use interpreter number of threads setting while invoking InterpreterBuilder since it is... — committed to dev0x13/tensorflow by dev0x13 3 years ago
@andreydung Yes, I am building the library this way. Here are the steps for clarity’s sake:
1. Run ./configure with all parameters left at their defaults, except the Android NDK setup (I am using NDK 21, by the way).
2. Run bazel build -c opt --define tflite_with_xnnpack=true --config=android_arm64 //tensorflow/lite:libtensorflowlite.so
Works like a charm for me.
Hi @dev0x13,
The issue should have been fixed with commit https://github.com/tensorflow/tensorflow/commit/3d3c6db1ca2d50f6f07722cd800144f8f736167c.
The updated documentation gives info about setting num_threads while initializing the interpreter.
Thanks.
Yes, you are right about this. I was confused with an earlier implementation of this feature, where the number of threads was passed when creating the XNNPACK delegate.
I think such a delicate situation is mainly caused by us trying to apply the XNNPACK delegate by default while honoring users’ intention to explicitly use another TfLite delegate with the C++ APIs. When it comes to the C APIs, I think this pitfall is avoided, as one has to provide the number of threads when creating the interpreter.
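For reference, the C API flow looks roughly like this (a minimal sketch based on tensorflow/lite/c/c_api.h; the "model.tflite" path and the thread count of 8 are placeholders):
#include "tensorflow/lite/c/c_api.h"
...
// With the C API the thread count is part of the interpreter options, so it
// is already known when the interpreter (and thus the default XNNPACK
// delegate) is created.
TfLiteModel* model = TfLiteModelCreateFromFile("model.tflite");
TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
TfLiteInterpreterOptionsSetNumThreads(options, 8);
TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
TfLiteInterpreterAllocateTensors(interpreter);
TfLiteInterpreterInvoke(interpreter);
// Clean up.
TfLiteInterpreterDelete(interpreter);
TfLiteInterpreterOptionsDelete(options);
TfLiteModelDelete(model);
...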
@dev0x13 Thanks for your response, it’s much appreciated.