tensorflow: Slow quantized graph

  1. On Ubuntu 15.10 with CUDA 7.5, cuDNN 7.0, and tensorflow-0.9.0rc0, I ran the “tensorflow/examples/label_image/” application with the inception-v3 graph and roughly measured the elapsed time.

  2. I then used “tensorflow/contrib/quantization/tools:quantize_graph” to quantize inception-v3 and rebuilt the application after adding

    "//tensorflow/contrib/quantization:cc_ops",
    "//tensorflow/contrib/quantization/kernels:quantized_ops",
    

to the deps in “tensorflow/examples/label_image/BUILD”, then redid the same classification and measured the time (the commands are sketched below).
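
For anyone reproducing this, the steps above amount to roughly the following commands. This is only a sketch: the graph and label file names are the ones from the 2015 inception-v3 release that label_image used by default, “softmax” is that graph’s output node, and the exact paths should be adjusted for your setup.

    # 1. Build the label_image example and time the float inception-v3 graph.
    bazel build -c opt //tensorflow/examples/label_image:label_image
    time bazel-bin/tensorflow/examples/label_image/label_image \
      --graph=tensorflow/examples/label_image/data/tensorflow_inception_graph.pb \
      --labels=tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt \
      --image=tensorflow/examples/label_image/data/grace_hopper.jpg

    # 2. Rewrite the graph with eight-bit quantized ops, then classify and time again.
    bazel build tensorflow/contrib/quantization/tools:quantize_graph
    bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
      --input=tensorflow/examples/label_image/data/tensorflow_inception_graph.pb \
      --output_node_names="softmax" \
      --output=/tmp/quantized_inception_graph.pb \
      --mode=eightbit
    time bazel-bin/tensorflow/examples/label_image/label_image \
      --graph=/tmp/quantized_inception_graph.pb \
      --labels=tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt \
      --image=tensorflow/examples/label_image/data/grace_hopper.jpg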

Before and after quantization the elapsed times were 6 seconds vs. 17 seconds, i.e. quantization nearly tripled the inference time?

The results look OK (shown below), so I think I was running it correctly.

Before

  • military uniform (866): 0.647299
  • suit (794): 0.0477195
  • academic gown (896): 0.0232407
  • bow tie (817): 0.0157355
  • bolo tie (940): 0.0145023

After

  • military uniform (866): 0.703474
  • suit (794): 0.0248454
  • bow tie (817): 0.0171362
  • bolo tie (940): 0.0171362
  • academic gown (896): 0.0164432

My TensorFlow was built CPU-only. I have also tried enabling the GPU, but the timing didn’t change. Do we know what the expected performance would be?
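
For what it’s worth, whether the example binary can use the GPU at all is decided at build time; the only difference between the CPU-only build and the CUDA build is the --config=cuda flag (assuming ./configure was run with CUDA enabled). Since the quantized kernels run only on the CPU (see the maintainer comments below), a GPU build is not expected to change the quantized timing:

    # CPU-only build of the example binary.
    bazel build -c opt //tensorflow/examples/label_image:label_image

    # GPU-enabled build; requires ./configure to have been run with CUDA support.
    bazel build -c opt --config=cuda //tensorflow/examples/label_image:label_image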

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 13
  • Comments: 39 (12 by maintainers)

Most upvoted comments

Quantized ops currently only work on the CPU, because most GPUs don’t support eight-bit matrix multiplications natively. I have just seen that the latest TitanX Pascal cards offer eight-bit support though, so I’m hoping we will be able to use that in the future.

We are focusing our eight-bit efforts on TF Lite (visible at tensorflow/contrib/lite), so we aren’t expecting TensorFlow’s quantized performance to improve in cases where it isn’t currently fast. Those cases tend to be x86 platforms (we’re concentrating on ARM performance for mobile), and models that use ops we don’t have quantized implementations for (which is most models outside the few vision-related ones we’ve optimized).

Since we’re not likely to see changes in this area soon, I’m closing this as infeasible. Pull requests or other help in this area would be very welcome of course!

The quantization is aimed at mobile performance, so most of the optimizations are for ARM not x86. We’re hoping to get good quantization on Intel eventually, but we don’t have anyone actively working on it yet.