tensorflow: AOT compiled graph is 2-7x slower than Python
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2
- Python version: 3.6.8
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: -
- GPU model and memory: -
Describe the current behavior
For tf.matmul and tf.linalg.triangular_solve, the AOT-compiled graph is much slower (2-7x) than the equivalent TF-Python version. For triangular_solve, which is easy to hand-code, the compiled graph is also slower than a bespoke C++ implementation.
Describe the expected behavior
I expect the AOT-compiled graph to run at least as fast as its Python counterpart.
Standalone code to reproduce the issue
The graph is the following:
def trisolve(A, b):
    """Builds a graph. A is a lower-triangular MxM matrix, b is an Mx1 column vector."""
    res = tf.linalg.triangular_solve(A, b, lower=True)
    return res
M = 2048
predict_fn = tf.function(
    trisolve,
    input_signature=[tf.TensorSpec(shape=[M, M], dtype=tf.float64, name='A'),
                     tf.TensorSpec(shape=[M, 1], dtype=tf.float64, name='b')],
    experimental_compile=False)
module_to_save = tf.Module()
module_to_save.predict = predict_fn
tf.saved_model.save(module_to_save, 'saved_model',
                    signatures={'serving_default': module_to_save.predict})
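For reference, the ~3ms Python-side figure quoted in the benchmarks below can be measured with a loop like the following. This is a sketch assuming TensorFlow 2.x: it rebuilds the same tf.function instead of loading the SavedModel, and the warm-up call keeps tracing out of the measurement.

```python
import time

import numpy as np
import tensorflow as tf

M = 2048

@tf.function(input_signature=[tf.TensorSpec([M, M], tf.float64, name='A'),
                              tf.TensorSpec([M, 1], tf.float64, name='b')])
def trisolve(A, b):
    return tf.linalg.triangular_solve(A, b, lower=True)

# Well-conditioned lower-triangular system (arbitrary test data).
A = tf.constant(np.tril(np.random.rand(M, M)) + M * np.eye(M))
b = tf.constant(np.random.rand(M, 1))

trisolve(A, b)  # warm-up: triggers tracing so it is excluded from the timing

N = 100
t0 = time.perf_counter()
for _ in range(N):
    trisolve(A, b)
t1 = time.perf_counter()
print(f'Time TF Python: {(t1 - t0) / N * 1e3:.2f} ms')
```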
The graph is then compiled with:
$ cd saved_model
$ saved_model_cli aot_compile_cpu --checkpoint_path .\variables\variables --dir . --signature_def_key serving_default --target_triple x86_64-none-windows --cpp_class trisolve --output_prefix libs64/libtrisolve --tag_set serve
The generated code is then linked against a simple C++ file that just runs the model as a benchmark:
#include <algorithm>  // std::copy
#include <chrono>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "libtrisolve.h"  // generated by saved_model_cli

using namespace std::chrono;

#define M 2048

int main(int argc, char** argv) {
  trisolve model;
  double* test_A;  // Initialize with an MxM lower-triangular matrix
  double* test_b;  // Initialize with an Mx1 vector
  std::copy(test_A, test_A + M * M, model.arg0_data());
  std::copy(test_b, test_b + M, model.arg1_data());

  const int N = 1000;
  auto tStart = high_resolution_clock::now();
  for (int i = 0; i < N; i++)
    model.Run();
  auto tEnd = high_resolution_clock::now();

  auto duration = duration_cast<microseconds>((tEnd - tStart) / N);
  std::cout << "Time TF Compiled: " << duration.count() << "us" << std::endl;
  return 0;
}
Performance Issue:
Triangular Solve (as above):
- A very simple single-threaded C++ implementation takes just 2ms;
- Running the triangular solve from the Python code takes ~3ms;
- Running the AOT-compiled executable, instead, takes 17ms.
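For reference, the "very simple single-threaded C++ implementation" above is essentially forward substitution. A NumPy sketch of the same algorithm (for illustration only, not the original benchmark code) is:

```python
import numpy as np

def forward_substitution(A, b):
    """Solve A x = b for lower-triangular A by plain forward substitution."""
    M = A.shape[0]
    x = np.zeros(M)
    for i in range(M):
        # x[i] = (b[i] - sum_{j < i} A[i, j] * x[j]) / A[i, i]
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

# Sanity check against NumPy's general solver on a well-conditioned system.
A = np.tril(np.random.rand(5, 5)) + 5 * np.eye(5)
b = np.random.rand(5)
print(np.allclose(forward_substitution(A, b), np.linalg.solve(A, b)))
```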
MatMul (multiplication of two square 2048x2048 matrices):
- The Python version takes ~150ms;
- The compiled version takes ~370ms.
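For scale, a 2048x2048 double-precision matmul performs 2 * 2048^3 ≈ 17.2 GFLOP, so the two timings above imply the following rough throughputs (back-of-the-envelope arithmetic, not figures from the original report):

```python
M = 2048
flops = 2 * M ** 3  # one multiply and one add per inner-product term

print(f'Work per matmul: {flops / 1e9:.1f} GFLOP')
for label, seconds in [('Python (~150ms)', 0.150), ('AOT compiled (~370ms)', 0.370)]:
    # Effective throughput implied by the reported timing.
    print(f'{label}: {flops / seconds / 1e9:.0f} GFLOPS')
```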
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 1
- Comments: 16 (7 by maintainers)
@sushreebarsa Thank you for your reply, I’ll have a look as soon as possible!