tensorflow: AOT compiled graph is 2-7x slower than Python
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.2
- Python version: 3.6.8
- Bazel version (if compiling from source):
- GCC/Compiler version (if compiling from source):
- CUDA/cuDNN version: -
- GPU model and memory: -
Describe the current behavior
For tf.matmul and tf.linalg.triangular_solve, the AOT-compiled graph is much slower (2-7x) than the equivalent TF-Python version. For triangular_solve, which is easy to hand-code, the compiled graph is also slower than a bespoke C++ implementation.
Describe the expected behavior
I expect the AOT-compiled graph to run at least as fast as its Python counterpart.
Standalone code to reproduce the issue
The graph is the following:
def trisolve(A, b):
    """Builds a graph. A is a lower-triangular MxM matrix, b is an Mx1 column vector."""
    res = tf.linalg.triangular_solve(A, b, lower=True)
    return res
M = 2048
predict_fn = tf.function(
    trisolve,
    input_signature=[tf.TensorSpec(shape=[M, M], dtype=tf.float64, name='A'),
                     tf.TensorSpec(shape=[M, 1], dtype=tf.float64, name='b')],
    experimental_compile=False)
module_to_save = tf.Module()
module_to_save.predict = predict_fn
tf.saved_model.save(module_to_save, 'saved_model',
                    signatures={'serving_default': module_to_save.predict})
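For reference, the ~3ms Python-side figure quoted in the benchmarks below can be measured with a loop like the following. This is a sketch assuming TensorFlow 2.x: it rebuilds the same tf.function instead of loading the SavedModel, and the warm-up call keeps tracing out of the measurement.

```python
import time

import numpy as np
import tensorflow as tf

M = 2048

@tf.function(input_signature=[tf.TensorSpec([M, M], tf.float64, name='A'),
                              tf.TensorSpec([M, 1], tf.float64, name='b')])
def trisolve(A, b):
    return tf.linalg.triangular_solve(A, b, lower=True)

# Well-conditioned lower-triangular system (arbitrary test data).
A = tf.constant(np.tril(np.random.rand(M, M)) + M * np.eye(M))
b = tf.constant(np.random.rand(M, 1))

trisolve(A, b)  # warm-up: triggers tracing so it is excluded from the timing

N = 100
t0 = time.perf_counter()
for _ in range(N):
    trisolve(A, b)
t1 = time.perf_counter()
print(f'Time TF Python: {(t1 - t0) / N * 1e3:.2f} ms')
```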
The graph is then compiled with:
$ cd saved_model
$ saved_model_cli aot_compile_cpu --checkpoint_path .\variables\variables --dir . --signature_def_key serving_default --target_triple x86_64-none-windows --cpp_class trisolve --output_prefix libs64/libtrisolve --tag_set serve
The generated code is then linked against a simple C++ file that just runs the model as a benchmark:
#include <algorithm>  // std::copy
#include <chrono>
#include <iostream>
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"
#include "libtrisolve.h"  // generated by saved_model_cli

using namespace std::chrono;

#define M 2048

int main(int argc, char** argv) {
  trisolve model;
  double* test_A;  // Initialize with an MxM lower-triangular matrix
  double* test_b;  // Initialize with an Mx1 vector
  std::copy(test_A, test_A + M * M, model.arg0_data());
  std::copy(test_b, test_b + M, model.arg1_data());

  const int N = 1000;
  auto tStart = high_resolution_clock::now();
  for (int i = 0; i < N; i++)
    model.Run();
  auto tEnd = high_resolution_clock::now();

  auto duration = duration_cast<microseconds>((tEnd - tStart) / N);
  std::cout << "Time TF Compiled: " << duration.count() << "us" << std::endl;
  return 0;
}
Performance Issue:
Triangular Solve (as above):
- A very simple single-threaded C++ implementation takes just 2ms;
- Running the triangular solve from the Python code takes ~3ms;
- Running the AOT-compiled executable, instead, takes 17ms.
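For reference, the "very simple single-threaded C++ implementation" above is essentially forward substitution. A NumPy sketch of the same algorithm (for illustration only, not the original benchmark code) is:

```python
import numpy as np

def forward_substitution(A, b):
    """Solve A x = b for lower-triangular A by plain forward substitution."""
    M = A.shape[0]
    x = np.zeros(M)
    for i in range(M):
        # x[i] = (b[i] - sum_{j < i} A[i, j] * x[j]) / A[i, i]
        x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
    return x

# Sanity check against NumPy's general solver on a well-conditioned system.
A = np.tril(np.random.rand(5, 5)) + 5 * np.eye(5)
b = np.random.rand(5)
print(np.allclose(forward_substitution(A, b), np.linalg.solve(A, b)))
```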
MatMul (multiplication of two square 2048x2048 matrices):
- The Python version takes ~150ms;
- The compiled version takes ~370ms.
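For scale, a 2048x2048 double-precision matmul performs 2 * 2048^3 ≈ 17.2 GFLOP, so the two timings above imply the following rough throughputs (back-of-the-envelope arithmetic, not figures from the original report):

```python
M = 2048
flops = 2 * M ** 3  # one multiply and one add per inner-product term

print(f'Work per matmul: {flops / 1e9:.1f} GFLOP')
for label, seconds in [('Python (~150ms)', 0.150), ('AOT compiled (~370ms)', 0.370)]:
    # Effective throughput implied by the reported timing.
    print(f'{label}: {flops / seconds / 1e9:.0f} GFLOPS')
```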
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 1
- Comments: 16 (7 by maintainers)
@sushreebarsa Thank you for your reply, I’ll have a look as soon as possible!