tensorflow: Very slow quantized tflite model

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (or github SHA if from source): 2.2.0

Command used to run the converter or code if you’re using the Python API. If possible, please share a link to Colab/Jupyter/any notebook.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Enable default optimizations (quantization)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Restrict the converter to integer-only builtin ops
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Calibration data for full-integer quantization
converter.representative_dataset = representative_dataset_gen
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
tf_lite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tf_lite_model)
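For context, representative_dataset_gen is the calibration generator the converter calls during quantization; it isn’t shown above. A minimal sketch of its expected shape (the input size, sample count, and random data are placeholders, not the actual preprocessing):

import numpy as np

def representative_dataset_gen():
    # Yield a list of input arrays per sample; shapes must match the
    # model's input. Real calibration data should be used instead of
    # random values.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]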

The output from the converter invocation

2020-06-05 10:53:29.063149: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:29.063233: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:29.080730: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:29.080748: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0.006ms.
2020-06-05 10:53:29.080752: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   function_optimizer: function_optimizer did nothing. time = 0ms.
2020-06-05 10:53:32.284115: I tensorflow/core/grappler/devices.cc:55] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2020-06-05 10:53:32.284242: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2020-06-05 10:53:33.407982: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:797] Optimization results for grappler item: graph_to_optimize
2020-06-05 10:53:33.408011: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (-568), 1139 edges (-568), time = 474.12ms.
2020-06-05 10:53:33.408016: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:799]   constant_folding: Graph size after: 1092 nodes (0), 1139 edges (0), time = 213.886ms.

Also, please include a link to the saved model or GraphDef

https://drive.google.com/file/d/1imjVvw8IqQ6tvQRYaKJi_ynxQUHBBSH_/view?usp=sharing

Failure details

Before conversion, running the standard Keras model on CPU took ~300 ms per frame. After conversion it takes ~55 s per frame. Eventually I want to deploy the model on a Coral Dev Board; currently, after compiling the model for the Edge TPU, inference takes ~4 s on Coral.

Is it normal that it’s so slow? I would expect it to be at least no slower than before conversion.
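(For anyone reproducing these timings, a minimal sketch of how per-frame latency can be measured with the TFLite Python interpreter; the dummy uint8 input stands in for a real preprocessed frame:)

import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_detail = interpreter.get_input_details()[0]

# Dummy frame matching the quantized uint8 input; replace with real data.
frame = np.zeros(input_detail["shape"], dtype=np.uint8)

start = time.monotonic()
interpreter.set_tensor(input_detail["index"], frame)
interpreter.invoke()
print("inference took %.1f ms" % ((time.monotonic() - start) * 1000))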

Any other info / logs

Logs from the Edge TPU compiler:

Edge TPU Compiler version 2.1.302470888
Input: model.tflite
Output: model_edgetpu.tflite

Operator                       Count      Status

ADD                            1          More than one subgraph is not supported
ADD                            71         Mapped to Edge TPU
MAX_POOL_2D                    1          Mapped to Edge TPU
PAD                            35         Mapped to Edge TPU
MUL                            35         Mapped to Edge TPU
CONCATENATION                  1          More than one subgraph is not supported
QUANTIZE                       1          Operation is otherwise supported, but not mapped due to some unspecified limitation
QUANTIZE                       3          Mapped to Edge TPU
CONV_2D                        115        Mapped to Edge TPU
CONV_2D                        4          More than one subgraph is not supported
DEQUANTIZE                     1          Operation is working on an unsupported data type
RESIZE_BILINEAR                2          Operation is otherwise supported, but not mapped due to some unspecified limitation
RESIZE_BILINEAR                6          Mapped to Edge TPU
SOFTMAX                        1          Max 16000 elements supported

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 9
  • Comments: 23 (10 by maintainers)

Most upvoted comments

Hi,

Apologies that this issue has gone stale. Some additional x86 optimizations have landed (for AVX, AVX2, and AVX512), and they will soon be the default on x86, but they aren’t yet. For this issue, it would be good to know whether the poor performance persists for you on an x86 CPU. Can you please do as follows:

  1. Please build with:
bazel build -c opt --define=tflite_with_ruy=true --copt=-DRUY_PROFILER
  2. Please run the benchmark_model tool with --enable_op_profiling=true

and then post the output to this issue. Also, please provide your exact build line for any executable you are running. Thanks!
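For reference, the full sequence might look like the following (the benchmark target path is the standard one in the TensorFlow tree, and model.tflite stands in for your model; both are assumptions, not part of the request above):

bazel build -c opt --define=tflite_with_ruy=true --copt=-DRUY_PROFILER \
  //tensorflow/lite/tools/benchmark:benchmark_model
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model \
  --graph=model.tflite --enable_op_profiling=true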

Any plan to make this the default?

Hi all, the model speeds up after deploying on the Edge TPU (compiling from a CPU tflite model to an Edge TPU tflite model) in both cases, so that is the expected behavior. Since the compiler can only delegate from a fully quantized CPU tflite model, it can’t do much about the original graph. It does seem very odd to me that the tflite model performs much worse than the original graph model, though.

It’s also worth mentioning that I’ve observed similar behavior (with a tiny difference) when testing a yolov4 model (note that, unfortunately, only 1/962 ops were mapped to the Edge TPU, so we don’t see much speedup there):

On my x86_64 debian 10:
Original model: ~55 seconds on CPU
(non quantized) tflite model: ~5 seconds
(fully quantized) tflite model: ~56 seconds
(edgetpu) tflite model: ~55 seconds

Taking a quick look at the model in Netron, I can see many quantize/dequantize ops, which I suspect are what’s causing the slowdown. Again, tflite models aren’t optimized for x86_64, so I suspect that is the issue.
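(As a quick sanity check without Netron, the tensor dtypes in a converted model can be tallied from Python; a minimal sketch using the public interpreter API, with the model path as an assumption:)

from collections import Counter
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

# Tally tensor dtypes: a fully quantized model should be dominated by
# int8/uint8 tensors, with float32 appearing only around the edges.
dtypes = Counter(t["dtype"].__name__ for t in interpreter.get_tensor_details())
print(dtypes)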

Now let’s check this again on my dev board; everything is as expected:

On my dev board:
Original model: Unfortunately cannot run this on the dev board.
(non quantized) tflite model: ~27 seconds
(fully quantized) tflite model: ~13 seconds
(edgetpu) tflite model: ~12 seconds

My suggestion for everyone is to run tflite models on an ARM platform, since that’s what they are optimized for. Benchmarking a tflite model against a CPU graph model is not ideal.

Hope these findings are helpful!

It would be interesting to hear whether performance is significantly different if you build with this flag:

bazel build -c opt --define=tflite_with_ruy=true

ruy is not as heavily optimized for x86 as it is for ARM, which is part of why it isn’t the default yet, but it might already perform better than the default.

However, ruy is only an implementation of matrix multiplication. If your model spends most of its time in other nodes, it will run into the fact that tflite’s operators are implemented with NEON intrinsics, which compile on x86 thanks to a NEON->SSE intrinsics translation header. In other words, the compromise here has been minimal x86 implementation effort at the expense of x86 performance. It is to be expected that another inference engine with a more first-class x86 implementation would outperform it, as mentioned in the previous comment.