tensorflow: Why is TensorFlow 2 much slower than TensorFlow 1?
It’s been cited by many users as the reason for switching to Pytorch, but I’ve yet to find a justification / explanation for sacrificing the most important practical quality, speed, for eager execution.
Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.
My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?
Looking for a detailed answer - am already familiar with broad concepts. Relevant SO
Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070
Benchmark results:

Benchmark code:
# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time
batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)
model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)
K.clear_session() # in my testing, kernel was restarted instead
model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)
Functions used:
def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x = LSTM(512, activation='relu', return_sequences=True)(x)
    x = Conv1D(128, 400, strides=4, padding='same')(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))
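(As the first comment in the script notes, the same benchmark can be pointed at tf.keras instead of standalone Keras by swapping the imports; a minimal sketch of the equivalent import block:)

# Equivalent imports for benchmarking tf.keras instead of standalone Keras
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K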
Sorry for the very late reply. Let me comment on the performance aspect for Keras between v1 and v2, and on eager vs. graph execution.
In TF2, as we all know, eager mode becomes the default context, and you can disable it with tf.compat.v1.disable_eager_execution(). In eager mode, the runtime needs to execute the ops and return the numerical value for every line of Python code. The nature of single-step execution makes it slow, and it is mainly good for debugging. To overcome the slowness of eager mode, we have @tf.function, which turns a Python function into a graph. When fed numerical values like NumPy arrays, the body of the tf.function is converted into a static graph, optimized, and executed to return the final value, which is fast and should have similar performance to TF1 graph mode.
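(A minimal toy sketch of that difference - not from this issue - running the same computation op-by-op in eager mode and once wrapped in tf.function so it is traced into a reusable graph:)

import tensorflow as tf

def eager_step(x):
    # Every op here is dispatched and returns a concrete value immediately.
    return tf.reduce_sum(tf.square(x))

@tf.function
def graph_step(x):
    # On the first call this body is traced into a static graph, optimized,
    # and reused for subsequent calls with matching shapes/dtypes.
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((1024, 1024))
eager_step(x)   # runs op-by-op
graph_step(x)   # first call: trace + run; later calls: just run the graph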
Also in TF2, Keras made some changes to leverage tf.function for building its graphs for training, evaluation and prediction. We call them the "execution function" for the model. In TF1, the "execution function" was a FuncGraph, which shared some common components with tf.function but had a different implementation. We made the decision to make this change since we observed a performance gain when testing some large models before cutting the 2.0 release. During the process of the change, we correctly updated the model.fit/eval/predict functions, but somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure Python function, rather than a tf.function-wrapped Python function. This causes the slowness you observed above, for the reason stated in the first paragraph.
You can find more details about the change history for the Keras change in https://github.com/tensorflow/tensorflow/commit/b389d0b8f3dc15907c5cea908c4cbbbdb75fc862.
I am taking the performance part of this issue from here. For now, if you want to do some performance analysis, please use model.fit/eval/predict() until this issue is fixed. The first batch of the first epoch will have some overhead for execution function initialization and function tracing, which you should ignore; the following batches and epochs should show the correct performance.
In general, we would suggest users stay with the eager runtime, since this is our current focus. More updates will come in the future to make the eager runtime faster.
Hope this resolves your question, and thanks again for reporting this issue in such detail.
Hi,
Thank you for a very interesting performance report. I replicated the small model example and tried to see what happened when enabling or disabling Eager execution, and found the following results (note that I am always using tensorflow.keras):
- TF 2.0 with Eager on: 0.0361 s/iter
- TF 2.0 without Eager: 0.0177 s/iter
- TF 1.14 without Eager: 0.0167 s/iter
- TF 1.14 with Eager on: 0.0169 s/iter
It would therefore appear that disabling Eager is beneficial in 2.0, but not so much in 1.14. I am therefore inclined to believe that the issue at stake is not so much related to Eager itself (since using Keras should result in building the model in graph mode anyway), but to side modifications to the training loop in 2.0; specifically, I am wondering whether this is related to the data handling changes, which if I am not mistaken have everything be reformatted as a tensorflow.data.Dataset in 2.0. Edit: I ran additional tests pre-wrapping the data as a Dataset; it does not solve the issue. The source code clearly indicates that the functions called as a backend to train_on_batch are different depending on whether Eager is enabled or not, but I was not able to figure out what makes the eager-enabled v2 function slower than its alternatives.

THANK YOU for that very detailed answer @OverLordGoldDragon! I've skimmed through it (there's a lot there) and while it's good to know TensorFlow 2.0 CAN be as fast as TF 1, it seems very finicky to get right. Reinforcement learning already is sensitive to hyperparameters and can be difficult to debug, so adding another layer of having to set up the code correctly to get it to train quickly is a huge turn-off. I guess this is a question for TF devs - is there a near-term future where TF 2.0 can ALWAYS run fast? Until then I don't see a reason not to use PyTorch instead, which seems to have all the benefits without this huge cost / added complexity.
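(For reference - not from either commenter - a minimal sketch of how eager execution is typically toggled in each version for comparisons like the above; the disable call must run before any models are built:)

import tensorflow as tf

# TF 2.x: eager is the default; disable it (right after import) to mimic TF1 graph mode.
tf.compat.v1.disable_eager_execution()

# TF 1.14: graph mode is the default; eager can instead be opted into:
# tf.compat.v1.enable_eager_execution()

print("Eager enabled:", tf.executing_eagerly())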
A CPU user reported a phenomenon of periodic (and increasing) inference-time spikes in TF2, not seen in TF1; from the SO: using model(x) as opposed to .predict() largely solved the problem. Regardless, something to note.
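(A toy illustration of that distinction - not from the SO post - assuming a small tf.keras model and a NumPy batch x:)

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
x = np.random.randn(32, 16).astype('float32')

y1 = model.predict(x)                    # goes through Keras' full prediction loop
y2 = model(x, training=False).numpy()    # direct call; less per-call overhead for small inputs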
I ran some additional tests, investigating runtimes of tensorflow.keras.Model.fit rather than the train_on_batch method. To do so, I slightly altered the code submitted by @OverLordGoldDragon to generate 10 data batches and wrap them with a tensorflow.data.Dataset; I then measured the mean time to run 10 fit calls, after having run an initial train_on_batch to exclude graph building time from the reported measures.
In this setting, it appears that Eager execution speeds things up in both 2.0 and 1.14, and that 2.0 yields lower runtimes for both the small and medium models. Note that for the latter, some optimization on LSTM kernel handling is at work, in addition to the training loop modifications.
Results:
small model:
- TF 2.0 with Eager on: 0.1586 sec/fit
- TF 2.0 without Eager: 0.3201 sec/fit
- TF 1.14 without Eager: 0.3198 sec/fit
- TF 1.14 with Eager on: 0.1855 sec/fit
medium model:
- TF 2.0 with Eager on: 18.6217 sec/fit
- TF 2.0 without Eager: 19.1296 sec/fit
- TF 1.14 without Eager: 41.8126 sec/fit
- TF 1.14 with Eager on: failed (GPU memory exhausted / tensors initialization issue)
setup:
Linux Mint 19.2, Python 3.6.8, Tensorflow 1.14 or 2.0.0 both with GPU enabled, CUDA 10.0, cuDNN 7.4, NVidia Quadro P1000 (4GB of dedicated RAM)
Code:
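(The original code block was not preserved here; below is a rough, hypothetical reconstruction of the setup described above - 10 batches wrapped in a tf.data.Dataset and timed over repeated fit() calls - not the commenter's exact script:)

import numpy as np
import tensorflow as tf
from time import time

batch_size, timesteps, channels = 32, 400, 16
n_batches = 10

X = np.random.randn(n_batches * batch_size, timesteps, channels).astype('float32')
y = np.random.randint(0, 2, (n_batches * batch_size, 1))
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size)

model = make_small_model((batch_size, timesteps, channels))  # tf.keras variant of the OP's function
model.train_on_batch(X[:batch_size], y[:batch_size])         # exclude graph-building time

n_fits = 10
t0 = time()
for _ in range(n_fits):
    model.fit(dataset, epochs=1, verbose=0)
print("Time/fit: %.4f sec" % ((time() - t0) / n_fits))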
I have a similar problem when I use keras-bert. tf.function is hard to use too.
@OverLordGoldDragon That is because:
Yup, the nightly build is much, much faster than 2.0.
I won’t bother with performance tests if the behaviour is expected. Just wanted to chip in in case anyone - like me - came from a freshly installed 2.0 and saw a massive performance drop from an earlier alpha or nightly. Looking forward to 2.1 as well!
ANSWERED in detail. I verified the information to the best of my ability, but would still appreciate an expert review, @qlzh727. If anything looks off or could be improved, please let me know.
Included are some rather ‘interesting’ results, some of which contradict a few of your statements, @qlzh727.
That took a while, but was interesting - going to take a break.
New info: fit() is ~~faster, but is still slower than TF2's~~ approx. as fast as train_on_batch() after accounting for the "warmup".
Before this image, the average was 4.65 secs - these are the 50 fit() calls after restarting the kernel and running again. I noticed in the first run that iterations were as fast as 3 secs. It doesn't seem to be a memory leak, as far as Task Manager's reporting of dedicated GPU memory or RAM usage goes. Also, here's the plot after the one above: it averages 6.03 secs, even worse. Now I did notice that GPU % usage (Task Manager) spikes to 100% at each iteration, so maybe this is GPU throttling - but that didn't happen with TF2 no matter how many times I re-ran the Large-Large test (timing done with a modified timeit).
Update: 300 iters, avg. 5.75.
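(The modified timeit used for those plots isn't included above; a hypothetical version that records per-iteration times, so spikes and drift like the ones described can be plotted, might look like this:)

from time import time
import numpy as np

def timeit_per_iter(func, iterations, *args):
    # Like the original timeit, but keeps each iteration's duration for plotting.
    times = []
    for _ in range(iterations):
        t0 = time()
        func(*args)
        times.append(time() - t0)
    times = np.array(times)
    print("Time/iter: %.4f sec (min %.4f, max %.4f)"
          % (times.mean(), times.min(), times.max()))
    return times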
@tomonodes To clarify, you’re saying that 2.0 is much slower than tf-nightly? If so, that is expected behavior; there are a number of key performance improvements that were made to TensorFlow after the 2.0 release that are present in HEAD (where tf-nightly is built from) and will be in 2.1. On the other hand, if tf-nightly is slower than 2.0 then that’s a problem.
That’s a correct understanding of tf-nightly.
@robieta So basically, tf-nightly is the current master branch, installable via !pip? If so, the concept of “nightly” finally makes sense to me: “all the stuff since the last stable release and before the next”.
My env is fine, but yes, a simple copy can easily break - the problem is in installing w/o a package manager which correctly unpacks (and ignores) the repository files. If nightly is what I described, I’ll be testing it soon – thanks.
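(For anyone following along, installing the nightly build goes through pip rather than copying the repository; for example, in a notebook cell:)

# Drop the leading '!' in a regular shell.
!pip install tf-nightly          # nightly build of current master
# at the time of this thread, GPU builds were published separately:
# !pip install tf-nightly-gpu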
@OverLordGoldDragon Thank you a lot for your very thorough and detailed testing and analysis!
Fair enough on the code scope, though it’s already <200 lines, and the error logging was fairly opaque, so I don’t know which code snippets to paste specifically. And I don’t have any keras imports at all, so those are consistent already, I believe.
“disabling eager in TF2 can be tricky” <-- Yeah, I think that’s the core issue. Even more core: that it needs to be disabled in the first place to get TensorFlow v1.x performance. I used the TensorFlow 2.0 tutorials to build that fairly simple code, and I think we need to get to a state where the tutorials can be followed as-is WITH good performance.
@lukemadera Partly from this PR, try replacing all K.get_value() and K.eval() in your code with the below, and try both import keras.backend as K and import tensorflow.keras.backend as K:
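(The referenced replacement snippet isn't reproduced here; a rough, hypothetical helper in the same spirit - fetch a tensor's value whether eager execution is on or off - could look like this:)

import tensorflow as tf
import tensorflow.keras.backend as K  # or: import keras.backend as K

def K_eval(x):
    # Hypothetical helper, not the snippet from the referenced PR.
    if tf.executing_eagerly():
        try:
            return x.numpy()              # EagerTensor / tf.Variable in eager mode
        except AttributeError:
            pass
    return K.function([], [x])([])[0]     # graph mode: evaluate via a backend function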