tensorflow: TFLite TransposeConvV2 Operator Slow on x86 CPU Ubuntu

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): Binary
  • TensorFlow version (use command below): tf-nightly 2.4.0.dev20200902
  • Python version: 3.6.9
  • Bazel version (if compiling from source): 3.5.0
  • GCC/Compiler version (if compiling from source): gcc 7.5.0
  • CUDA/cuDNN version: CUDA 10.1
  • GPU model and memory: Using CPU (Intel® Core™ i7-8086K CPU @ 4.00GHz)

Describe the current behavior

The TRANSPOSE_CONV operator takes up >80% of total computation time when using the TFLite benchmarking tool.

============================ Top by Computation Time ==============================
	             [node type]	          [start]	  [first]	 [avg ms]	     [%]	  [cdf%]	  [mem KB]	[times called]	[Name]
	          TRANSPOSE_CONV	          235.882	  487.948	  482.097	 26.927%	 26.927%	     0.000	        1	[device_0/g_ae/dec_0/conv2d_transpose1]:102
	          TRANSPOSE_CONV	          719.253	  163.323	  162.187	  9.059%	 35.986%	     0.000	        1	[device_0/g_ae/dec_1/conv2d_transpose1]:113
	          TRANSPOSE_CONV	         1634.919	  108.162	  111.988	  6.255%	 42.241%	     0.000	        1	[device_0/g_ae/dec_9/conv2d_transpose1]:201
	          TRANSPOSE_CONV	         1501.813	  116.089	  109.471	  6.114%	 48.355%	     0.000	        1	[device_0/g_ae/dec_8/conv2d_transpose1]:190
	          TRANSPOSE_CONV	          882.714	  111.952	  108.459	  6.058%	 54.413%	     0.000	        1	[device_0/g_ae/dec_2/conv2d_transpose1]:124
	          TRANSPOSE_CONV	          993.583	  103.796	   97.807	  5.463%	 59.876%	     0.000	        1	[device_0/g_ae/dec_3/conv2d_transpose1]:135
	          TRANSPOSE_CONV	         1287.885	   92.329	   95.829	  5.352%	 65.229%	     0.000	        1	[device_0/g_ae/dec_6/conv2d_transpose1]:168
	          TRANSPOSE_CONV	         1394.631	  109.527	   95.786	  5.350%	 70.579%	     0.000	        1	[device_0/g_ae/dec_7/conv2d_transpose1]:179
	          TRANSPOSE_CONV	         1093.908	   92.043	   93.959	  5.248%	 75.827%	     0.000	        1	[device_0/g_ae/dec_4/conv2d_transpose1]:146
	          TRANSPOSE_CONV	         1193.164	   88.003	   89.509	  4.999%	 80.826%	     0.000	        1	[device_0/g_ae/dec_5/conv2d_transpose1]:157

Number of nodes executed: 216
============================== Summary by node type ==============================
	             [Node type]	  [count]	  [avg ms]	    [avg %]	    [cdf %]	  [mem KB]	[times called]
	          TRANSPOSE_CONV	       11	  1466.204	    81.898%	    81.898%	     0.000	       11
	                 CONV_2D	       11	   159.086	     8.886%	    90.785%	     0.000	       11
	                     ABS	       21	   111.310	     6.217%	    97.002%	     0.000	       21
	                     ADD	       32	    11.551	     0.645%	    97.647%	     0.000	       32
	                     MUL	       42	    10.792	     0.603%	    98.250%	     0.000	       42
	                 RESHAPE	       44	     9.645	     0.539%	    98.789%	     0.000	       44
	                     SUB	       21	     9.366	     0.523%	    99.312%	     0.000	       21
	                    RELU	       21	     6.514	     0.364%	    99.676%	     0.000	       21
	           CONCATENATION	       11	     5.129	     0.286%	    99.962%	     0.000	       11
	      TfLiteFlexDelegate	        1	     0.430	     0.024%	    99.986%	     0.000	        1
	                    TANH	        1	     0.245	     0.014%	   100.000%	     0.000	        1

Timings (microseconds): count=50 first=1811076 curr=1775110 min=1738229 max=1895058 avg=1.79038e+06 std=32080

Describe the expected behavior

Faster execution of this operator. I expected my model to run inference faster once converted to TFLite, but it currently runs more slowly than regular TensorFlow on the same hardware.

Standalone code to reproduce the issue

bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=.../converted_model_float32.tflite --num_threads=4 --enable_op_profiling=true > .../float32_benchmark.txt

(The benchmarking tool was built from tf master source commit 86db5756535f70f1b1fab61c6f3f0483141510e8)

Other info / logs

Full TFLite benchmark output

I’d appreciate any tips on how to profile this operator. How can I find out why it is taking so much time in my network? Can I use C++ profiling tools to find the computation-time sinks in transpose_conv.h?
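For example, would a generic tool like Linux perf on the benchmark binary be a sensible starting point? A rough sketch of what I have in mind (assuming the build keeps enough symbol information):

# Record a profile of the benchmark run, then look for hot symbols:
perf record -g -- bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=.../converted_model_float32.tflite --num_threads=4
perf report --sort=symbol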

Most upvoted comments

It is worth pointing out that this model is quite large (peak memory footprint > 800 MB) and takes ~1.8 seconds per inference on ARM even when it is actually getting optimized code. ~90% of the execution time is spent in GEMM in the ARM case, so I think that if we are able to get the AVX2 path going on your machine, that is probably the low-hanging fruit here.

Hi,

Unfortunately I don’t have a full answer, but I can provide some info. First, can you say what the run time is for TF vs. TFL on x86 with this model built/run in a standard way (basically just -c opt on the build line)? If TFL is 2-3x slower than TF, it is likely that something bad is happening. If TFL is on the order of 10-20% slower than TF for this float model, it may just mean that deeper work is required on the TFL side to optimize transpose conv (so no easy fix is available). For certain kinds of TF ops, or TF ops with certain options (like particular stride values), the TF -> TFL transformation can produce a correct but unoptimized graph. So the main thing to say is that there may be no magic bullet on the performance side.
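For concreteness, the build lines I have in mind would look something like the following, assuming the same benchmark target you used above (the ruy_profiler variant is explained next):

# Standard optimized build of the TFLite benchmark tool:
bazel build -c opt //tensorflow/lite/tools/benchmark:benchmark_model_plus_flex

# Same build with the Ruy profiler compiled in (see below):
bazel build -c opt --define=ruy_profiler=true \
  //tensorflow/lite/tools/benchmark:benchmark_model_plus_flex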

Regarding the runs with Ruy, I am able to see something that is going wrong. When you add --define=ruy_profiler=true to the build (as in the sketch above), the resulting output is really helpful for figuring out performance issues; please use it by default as we keep debugging this issue. In this case, the output:

       * 12.38% TrMul (Path=0x1, max_num_threads=4, is_prepacked=(0,0))
          * 12.38% TrMulImpl, general case
            * 11.42% Kernel (Standard Cpp)

shows that you are getting the “standard Cpp” path, which is unoptimized naive code used only for debugging purposes or when no optimized code path could be found. On x86, Ruy has optimized kernels for AVX, AVX2, and AVX512. From the CPU you indicated, it appears that you do have AVX2 on your machine, at least according to this.
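One quick way to double-check that locally (assuming a Linux box) is to inspect the CPU flags directly:

grep -m1 -o avx2 /proc/cpuinfo   # prints "avx2" if the CPU advertises AVX2 support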

Ruy should autodetect the platform and choose the most performant path (AVX2 in this case), but for some reason it is not doing that. You can manually set the path with the env variable RUY_PATHS, so as an experiment you can do:

RUY_PATHS=0x20 <rest of execution line here>
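Concretely, with your benchmark invocation from above, the experiment would look something like this (0x20 here selecting the AVX2 path):

RUY_PATHS=0x20 bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model_plus_flex \
  --graph=.../converted_model_float32.tflite --num_threads=4 --enable_op_profiling=true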

Note that if the forced path relies on instructions your CPU does not support, the run will crash with an illegal instruction. Please give it a try and let me know the result.