tensorflow: TFLite TransposeConvV2 Operator Slow on x86 CPU Ubuntu
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- TensorFlow installed from (source or binary): Binary
- TensorFlow version (use command below): tf-nightly 2.4.0.dev20200902
- Python version: 3.6.9
- Bazel version (if compiling from source): 3.5.0
- GCC/Compiler version (if compiling from source): gcc 7.5.0
- CUDA/cuDNN version: CUDA 10.1
- GPU model and memory: Using CPU (Intel® Core™ i7-8086K CPU @ 4.00GHz)
Describe the current behavior
The TRANSPOSE_CONV operator takes up >80% of total computation time when using the TFLite benchmarking tool.
============================ Top by Computation Time ==============================
[node type] [start] [first] [avg ms] [%] [cdf%] [mem KB] [times called] [Name]
TRANSPOSE_CONV 235.882 487.948 482.097 26.927% 26.927% 0.000 1 [device_0/g_ae/dec_0/conv2d_transpose1]:102
TRANSPOSE_CONV 719.253 163.323 162.187 9.059% 35.986% 0.000 1 [device_0/g_ae/dec_1/conv2d_transpose1]:113
TRANSPOSE_CONV 1634.919 108.162 111.988 6.255% 42.241% 0.000 1 [device_0/g_ae/dec_9/conv2d_transpose1]:201
TRANSPOSE_CONV 1501.813 116.089 109.471 6.114% 48.355% 0.000 1 [device_0/g_ae/dec_8/conv2d_transpose1]:190
TRANSPOSE_CONV 882.714 111.952 108.459 6.058% 54.413% 0.000 1 [device_0/g_ae/dec_2/conv2d_transpose1]:124
TRANSPOSE_CONV 993.583 103.796 97.807 5.463% 59.876% 0.000 1 [device_0/g_ae/dec_3/conv2d_transpose1]:135
TRANSPOSE_CONV 1287.885 92.329 95.829 5.352% 65.229% 0.000 1 [device_0/g_ae/dec_6/conv2d_transpose1]:168
TRANSPOSE_CONV 1394.631 109.527 95.786 5.350% 70.579% 0.000 1 [device_0/g_ae/dec_7/conv2d_transpose1]:179
TRANSPOSE_CONV 1093.908 92.043 93.959 5.248% 75.827% 0.000 1 [device_0/g_ae/dec_4/conv2d_transpose1]:146
TRANSPOSE_CONV 1193.164 88.003 89.509 4.999% 80.826% 0.000 1 [device_0/g_ae/dec_5/conv2d_transpose1]:157
Number of nodes executed: 216
============================== Summary by node type ==============================
[Node type] [count] [avg ms] [avg %] [cdf %] [mem KB] [times called]
TRANSPOSE_CONV 11 1466.204 81.898% 81.898% 0.000 11
CONV_2D 11 159.086 8.886% 90.785% 0.000 11
ABS 21 111.310 6.217% 97.002% 0.000 21
ADD 32 11.551 0.645% 97.647% 0.000 32
MUL 42 10.792 0.603% 98.250% 0.000 42
RESHAPE 44 9.645 0.539% 98.789% 0.000 44
SUB 21 9.366 0.523% 99.312% 0.000 21
RELU 21 6.514 0.364% 99.676% 0.000 21
CONCATENATION 11 5.129 0.286% 99.962% 0.000 11
TfLiteFlexDelegate 1 0.430 0.024% 99.986% 0.000 1
TANH 1 0.245 0.014% 100.000% 0.000 1
Timings (microseconds): count=50 first=1811076 curr=1775110 min=1738229 max=1895058 avg=1.79038e+06 std=32080
Describe the expected behavior
Faster execution of this operator. I expected my model to run inference faster once converted to TFLite, but it currently runs more slowly than regular TensorFlow on the same hardware.
Standalone code to reproduce the issue
bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
--graph=.../converted_model_float32.tflite --num_threads=4 --enable_op_profiling=true > .../float32_benchmark.txt
(The benchmarking tool was built from tf master source commit 86db5756535f70f1b1fab61c6f3f0483141510e8)
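(For reference, a minimal sketch of the build line, assuming a standard optimized build of the same target:)
# Build the TFLite benchmark tool with flex (TF ops) support.
bazel build -c opt \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex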
Other info / logs
Full TFLite benchmark output
I’d appreciate any tips on how to profile this operator. How can I find out why it takes so much time in my network? Can I use C++ profiling tools to find the computation time sinks in transpose_conv.h?
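For example, I could imagine something like Linux perf (a sketch; assuming perf is available and the benchmark binary is built with symbols):
# Sample the benchmark run and attribute time to C++ functions such as those in transpose_conv.h.
perf record -g bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=.../converted_model_float32.tflite --num_threads=4
perf report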
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (15 by maintainers)
It is worth pointing out that this model is quite large (peak memory footprint > 800 MB), and inference takes 1.8 seconds on ARM even when it does get optimized code. ~90% of the execution time is spent in GEMM in the ARM case, so I think if we are able to get the AVX2 path going on your machine, that's probably the low-hanging fruit here.
Hi,
Unfortunately I don’t have a full answer, but I can provide some info. First, can you say what the run time is for TF vs. TFL on x86 with this model built/run in a standard way (basically just -c opt on the build line)? If TFL is 2-3x slower than TF, it is likely something bad is happening. If TFL is on the order of 10-20% slower than TF for this float model, it might just mean that deeper work is required on the TFL side to optimize transpose conv (so no easy fix is available). For certain kinds of TF ops, or TF ops with certain options (like certain stride values), the TF -> TFL transformation can produce a correct but unoptimized graph. So the main thing to say is that there might be no magic bullet on the performance side.
Regarding the runs with Ruy, I am able to see something that is going wrong. When you add --define=ruy_profiler=true to the build, the resulting output is really helpful for figuring out performance issues; please use it by default as we keep debugging this issue. In this case, the output shows that you are getting the “standard Cpp” path, which is unoptimized naive code used only for debugging purposes or when no optimized code path can be found. On x86, Ruy has optimized kernels for AVX, AVX2, and AVX512. From the CPU you indicated (i7-8086K), it appears that you do have AVX2 on your machine, at least according to the published CPU specs.
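A minimal sketch of the corresponding build line (assuming the same benchmark target the reporter used):
# Rebuild the benchmark tool with Ruy's built-in profiler enabled.
bazel build -c opt --define=ruy_profiler=true \
  tensorflow/lite/tools/benchmark:benchmark_model_plus_flex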
Ruy should autodetect the platform and choose the most performant path (AVX2 in this case), but for some reason it is not doing that. You can manually set the path with the env variable RUY_PATHS, so as an experiment you can try forcing AVX2 (see the sketch below). Note that if the forced path uses instructions your CPU does not support, you will hit an illegal instruction and the code will crash. Please give it a try and let me know the result.
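A minimal sketch of that experiment (assumption: the value syntax for RUY_PATHS, e.g. avx2, is a guess; check the Ruy sources for the accepted names):
# Hypothetical: force Ruy onto the AVX2 path for a single benchmark run.
RUY_PATHS=avx2 bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=.../converted_model_float32.tflite --num_threads=4 --enable_op_profiling=true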