tensorflow: Slow quantized graph

  1. On Ubuntu 15.10 with CUDA 7.5, cuDNN 7.0, and tensorflow-0.9.0rc0, I ran the “tensorflow/examples/label_image/” application with the inception-v3 graph and roughly measured the elapsed time.

  2. I then used “tensorflow/contrib/quantization/tools:quantize_graph” to quantize inception-v3 and rebuilt the application after adding

    "//tensorflow/contrib/quantization:cc_ops",
    "//tensorflow/contrib/quantization/kernels:quantized_ops",
    

to the deps in “tensorflow/examples/label_image/BUILD”, then redid the same classification and measured the time (the commands are sketched below).
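
For anyone reproducing this, the steps above amount to roughly the following commands. This is only a sketch: the graph and label file names are the ones from the 2015 inception-v3 release that label_image used by default, “softmax” is that graph’s output node, and the exact paths should be adjusted for your setup.

    # 1. Build the label_image example and time the float inception-v3 graph.
    bazel build -c opt //tensorflow/examples/label_image:label_image
    time bazel-bin/tensorflow/examples/label_image/label_image \
      --graph=tensorflow/examples/label_image/data/tensorflow_inception_graph.pb \
      --labels=tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt \
      --image=tensorflow/examples/label_image/data/grace_hopper.jpg

    # 2. Rewrite the graph with eight-bit quantized ops, then classify and time again.
    bazel build tensorflow/contrib/quantization/tools:quantize_graph
    bazel-bin/tensorflow/contrib/quantization/tools/quantize_graph \
      --input=tensorflow/examples/label_image/data/tensorflow_inception_graph.pb \
      --output_node_names="softmax" \
      --output=/tmp/quantized_inception_graph.pb \
      --mode=eightbit
    time bazel-bin/tensorflow/examples/label_image/label_image \
      --graph=/tmp/quantized_inception_graph.pb \
      --labels=tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt \
      --image=tensorflow/examples/label_image/data/grace_hopper.jpg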

Before and after quantization the elapsed times were 6 seconds vs. 17 seconds, i.e. quantization nearly tripled the inference time?

The results look OK (shown below), so I think I was running it correctly.

Before

  • military uniform (866): 0.647299
  • suit (794): 0.0477195
  • academic gown (896): 0.0232407
  • bow tie (817): 0.0157355
  • bolo tie (940): 0.0145023

After

  • military uniform (866): 0.703474
  • suit (794): 0.0248454
  • bow tie (817): 0.0171362
  • bolo tie (940): 0.0171362
  • academic gown (896): 0.0164432

My TensorFlow was built CPU-only. I have also tried enabling the GPU, but the timing didn’t change. Do we know what the expected performance would be?
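
For what it’s worth, whether the example binary can use the GPU at all is decided at build time; the only difference between the CPU-only build and the CUDA build is the --config=cuda flag (assuming ./configure was run with CUDA enabled). Since the quantized kernels run only on the CPU (see the maintainer comments below), a GPU build is not expected to change the quantized timing:

    # CPU-only build of the example binary.
    bazel build -c opt //tensorflow/examples/label_image:label_image

    # GPU-enabled build; requires ./configure to have been run with CUDA support.
    bazel build -c opt --config=cuda //tensorflow/examples/label_image:label_image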

About this issue

  • Original URL
  • State: closed
  • Created 8 years ago
  • Reactions: 13
  • Comments: 39 (12 by maintainers)

Most upvoted comments

Quantized ops currently only work on the CPU, because most GPUs don’t support eight-bit matrix multiplications natively. I have just seen that the latest TitanX Pascal cards offer eight-bit support though, so I’m hoping we will be able to use that in the future.

We are focusing our eight-bit efforts on TF Lite (visible at tensorflow/contrib/lite), so we aren’t expecting TensorFlow’s quantized performance to improve in cases where it isn’t currently fast. Those cases tend to be x86 platforms (we’re concentrating on ARM performance for mobile), and models that use ops we don’t have quantized implementations for (which is most models outside the few vision-related ones we’ve optimized).

Since we’re not likely to see changes in this area soon, I’m closing this as infeasible. Pull requests or other help in this area would be very welcome of course!

The quantization is aimed at mobile performance, so most of the optimizations are for ARM not x86. We’re hoping to get good quantization on Intel eventually, but we don’t have anyone actively working on it yet.