tensorflow: XNNPACK delegate performs much slower than the default TFLite backend when multi-threading is configured according to the documentation

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04, Android 10
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Any Android smartphone
  • TensorFlow installed from (source or binary): Source
  • TensorFlow version (use command below): 2.3.0
  • Python version: -
  • Bazel version (if compiling from source): 3.1.0
  • GCC/Compiler version (if compiling from source): GCC 5.4.0 / Clang shipped with Android NDK 21
  • CUDA/cuDNN version: -
  • GPU model and memory: -

When TFLite is built with XNNPACK, a performance improvement is expected. However, it seems that the code provided in /tensorflow/lite/examples/minimal, with a minor change to the interpreter settings, leads to performance degradation compared to the default build.

So here is the code taken from the minimal example with my changes:

  ...

  // Load model
  std::unique_ptr<tflite::FlatBufferModel> model =
      tflite::FlatBufferModel::BuildFromFile(filename);
  TFLITE_MINIMAL_CHECK(model != nullptr);

  // Build the interpreter
  tflite::ops::builtin::BuiltinOpResolver resolver;
  InterpreterBuilder builder(*model, resolver);
  std::unique_ptr<Interpreter> interpreter;
  builder(&interpreter);
  TFLITE_MINIMAL_CHECK(interpreter != nullptr);

  // Allocate tensor buffers.
  TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
  printf("=== Pre-invoke Interpreter State ===\n");
  tflite::PrintInterpreterState(interpreter.get());

  // Set number of threads (added by me)
  interpreter->SetNumThreads(8);

  // Run inference
  TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);
  printf("\n\n=== Post-invoke Interpreter State ===\n");
  tflite::PrintInterpreterState(interpreter.get());

  ...

This code performs slower when executed with the TFLite + XNNPACK build. I’ve tested it both on an x64 desktop and on arm64 Android using a ResNet-34 FP32 TFLite model and observed the exact same performance degradation.

I was able to fix the behavior and achieve a 30% performance improvement only after I spent a few hours in the TFLite code and found out that tflite::Interpreter::SetNumThreads is not applied to the XNNPACK delegate (and maybe not to other delegates as well): the XNNPACK delegate is only initialized in builder(&interpreter) with the number of threads passed to that invocation, and it is not updated by the later interpreter->SetNumThreads(8) call. In the case illustrated by the code above, XNNPACK effectively runs in single-threaded mode. So the fix is to initialize the interpreter as follows:

  builder(&interpreter, 8);

Then XNNPACK really introduces significant performance improvement.
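
For clarity, here is a minimal sketch of the working initialization order (same setup as the code above; the thread count of 8 is just an illustrative value):

  // Build the interpreter, passing the thread count to the builder invocation
  // so that the XNNPACK delegate's thread pool is sized at creation time.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  InterpreterBuilder builder(*model, resolver);
  std::unique_ptr<Interpreter> interpreter;
  builder(&interpreter, /*num_threads=*/8);
  TFLITE_MINIMAL_CHECK(interpreter != nullptr);

  // Allocate tensors and run inference as before; no separate SetNumThreads
  // call is needed for the XNNPACK path.
  TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
  TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);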

I’m OK with the solution that I found, but I was really confused by this issue and spent almost a day figuring out why I couldn’t achieve the claimed performance, because neither the official documentation nor the TFLite code comments mention the InterpreterBuilder’s num_threads argument as necessary, or even mention it at all. Thus, following the “Tweak the number of threads” documentation section in combination with XNNPACK will lead anyone into this pitfall and result in very poor performance.

If needed, I can provide a more standalone example, point to the parts of the TFLite code responsible for this behavior, and share detailed measurements obtained on different devices.

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 22 (11 by maintainers)

Most upvoted comments

@andreydung Yes, I am building the library this way. Here are the steps for clarity’s sake:

  1. Get sources from https://github.com/tensorflow/tensorflow/releases/tag/v2.3.0.
  2. Run ./configure with all parameters left at their defaults, except the Android NDK setup (I am using NDK 21, btw).
  3. Run bazel build -c opt --define tflite_with_xnnpack=true --config=android_arm64 //tensorflow/lite:libtensorflowlite.so.

Works like a charm for me.

Hi @dev0x13

The issue must have been fixed with commit https://github.com/tensorflow/tensorflow/commit/3d3c6db1ca2d50f6f07722cd800144f8f736167c.

The updated documentation gives info about setting num_threads while initializing the interpreter.

As the TfLite interpreter may internally apply a TfLite delegate by default (i.e. XNNPACK), the number of threads available to the default delegate should be set via the InterpreterBuilder APIs as follows:


  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder builder(tflite_model, op_resolver);
  builder.SetNumThreads(...);
  ASSERT_EQ(builder(&interpreter), kTfLiteOk);

Thanks.

@multiverse-tf Thank you for the clarification! However, I am concerned by this line. Although the XNNPACK delegate is applied to the graph in AllocateTensors, it is still created at the interpreter creation stage with no respect to the SetNumThreads option. Taking into consideration the fact that the XNNPACK delegate only initializes its thread pool once, at construction time, the issue is still present from my perspective. Please correct me if I am wrong.

Yes, you are right about this. I was confused with an earlier implementation of this feature, where the number of threads was passed when creating the XNNPACK delegate.

I think such a delicate situation is mainly caused by us trying to apply the XNNPACK delegate by default while honoring users’ intention to explicitly use another TfLite delegate with the C++ APIs. When it comes to the C APIs, I think this pitfall is avoided, as one has to provide the number of threads when creating the interpreter.
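
For reference, a rough sketch of that C API flow (assuming the C API from tensorflow/lite/c/c_api.h; the model path and thread count are just placeholders):

  // With the C API, the thread count is part of the options passed at
  // interpreter creation, so the default XNNPACK delegate picks it up
  // from the start.
  TfLiteModel* model = TfLiteModelCreateFromFile("model.tflite");
  TfLiteInterpreterOptions* options = TfLiteInterpreterOptionsCreate();
  TfLiteInterpreterOptionsSetNumThreads(options, 8);
  TfLiteInterpreter* interpreter = TfLiteInterpreterCreate(model, options);
  TfLiteInterpreterAllocateTensors(interpreter);
  TfLiteInterpreterInvoke(interpreter);
  TfLiteInterpreterDelete(interpreter);
  TfLiteInterpreterOptionsDelete(options);
  TfLiteModelDelete(model);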

@dev0x13 Thanks for your response, it’s much appreciated.