tensorflow: Why is TensorFlow 2 much slower than TensorFlow 1?
It’s been cited by many users as the reason for switching to Pytorch, but I’ve yet to find a justification / explanation for sacrificing the most important practical quality, speed, for eager execution.
Below is code benchmarking performance, TF1 vs. TF2 - with TF1 running anywhere from 47% to 276% faster.
My question is: what is it, at the graph or hardware level, that yields such a significant slowdown?
Looking for a detailed answer - am already familiar with broad concepts. Relevant SO
Specs: CUDA 10.0.130, cuDNN 7.4.2, Python 3.7.4, Windows 10, GTX 1070
Benchmark results:

Benchmark code:
# use tensorflow.keras... to benchmark tf.keras; used GPU for all above benchmarks
from keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from keras.layers import Flatten, Dropout
from keras.models import Model
from keras.optimizers import Adam
import keras.backend as K
import numpy as np
from time import time
batch_shape = (32, 400, 16)
X, y = make_data(batch_shape)
model_small = make_small_model(batch_shape)
model_small.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_small.train_on_batch, 200, X, y)
K.clear_session() # in my testing, kernel was restarted instead
model_medium = make_medium_model(batch_shape)
model_medium.train_on_batch(X, y) # skip first iteration which builds graph
timeit(model_medium.train_on_batch, 10, X, y)
Functions used:
def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_small_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Conv1D(128, 400, strides=4, padding='same')(ipt)
    x = Flatten()(x)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_medium_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x = Bidirectional(LSTM(512, activation='relu', return_sequences=True))(ipt)
    x = LSTM(512, activation='relu', return_sequences=True)(x)
    x = Conv1D(128, 400, strides=4, padding='same')(x)
    x = Flatten()(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(128, activation='relu')(x)
    x = Dense(64, activation='relu')(x)
    out = Dense(1, activation='sigmoid')(x)
    model = Model(ipt, out)
    model.compile(Adam(lr=1e-4), 'binary_crossentropy')
    return model

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 2, (batch_shape[0], 1))
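(As the first comment in the script notes, the same benchmark can be pointed at tf.keras instead of standalone Keras by swapping the imports; a minimal sketch of the equivalent import block:)

# Equivalent imports for benchmarking tf.keras instead of standalone Keras
from tensorflow.keras.layers import Input, Dense, LSTM, Bidirectional, Conv1D
from tensorflow.keras.layers import Flatten, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import tensorflow.keras.backend as K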
Sorry for the very late reply. Let me comment on the performance aspect for Keras between v1 and v2, and on eager vs. graph execution.
In TF2, as we all know, eager mode becomes the default context, and you can disable it with tf.compat.v1.disable_eager_execution(). In eager mode, the runtime needs to execute the ops and return the numerical value for every line of Python code. The nature of single-step execution makes it slow, and it is mainly good for debugging. To overcome the slowness of eager mode, we have @tf.function, which turns a Python function into a graph. When fed numerical values like NumPy arrays, the body of the tf.function is converted into a static graph, optimized, and executed to return the final value, which is fast and should have similar performance to TF1 graph mode.
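(A minimal toy sketch of that difference - not from this issue - running the same computation op-by-op in eager mode and once wrapped in tf.function so it is traced into a reusable graph:)

import tensorflow as tf

def eager_step(x):
    # Every op here is dispatched and returns a concrete value immediately.
    return tf.reduce_sum(tf.square(x))

@tf.function
def graph_step(x):
    # On the first call this body is traced into a static graph, optimized,
    # and reused for subsequent calls with matching shapes/dtypes.
    return tf.reduce_sum(tf.square(x))

x = tf.random.normal((1024, 1024))
eager_step(x)   # runs op-by-op
graph_step(x)   # first call: trace + run; later calls: just run the graph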
Also in TF2, Keras made some changes to leverage tf.function for building its graphs for training, evaluation and prediction. We call them the "execution function" for the model. In TF1, the "execution function" was a FuncGraph, which shared some common components with tf.function but had a different implementation. We made the decision to make this change since we observed a performance gain when testing some large models before cutting the 2.0 release. During the process of the change, we correctly updated the model.fit/eval/predict functions, but somehow left an incorrect implementation for train_on_batch(), test_on_batch() and predict_on_batch(). They are still numerically correct, but the execution function for x_on_batch is a pure Python function, rather than a tf.function-wrapped Python function. This causes the slowness you observed above, for the reason stated in the first paragraph.
You can find more details about the change history for the Keras change in https://github.com/tensorflow/tensorflow/commit/b389d0b8f3dc15907c5cea908c4cbbbdb75fc862.
I am taking the performance part of this issue from here. For now, if you want to do some performance analysis, please use model.fit/eval/predict() until this issue is fixed. The first batch of the first epoch will have some overhead for execution function initialization and function tracing, which you should ignore; the following batches and epochs should show the correct performance.
In general, we would suggest users stay with the eager runtime, since this is our current focus. More updates will come in the future to make the eager runtime faster.
Hope this resolves your question, and thanks again for reporting this issue in such detail.
Hi,
Thank you for a very interesting performance report. I replicated the small model example and tried to see what happened when enabling or disabling Eager execution, and found the following results (note that I am always using tensorflow.keras):
- TF 2.0 with Eager on: 0.0361 s/iter
- TF 2.0 without Eager: 0.0177 s/iter
- TF 1.14 without Eager: 0.0167 s/iter
- TF 1.14 with Eager on: 0.0169 s/iter
It would therefore appear that disabling Eager is beneficial in 2.0, but not so much in 1.14. I am therefore inclined to believe that the issue at stake is not so much related to Eager itself (since using Keras should result in building the model in graph mode anyway), but to side modifications to the training loop in 2.0; specifically, I am wondering whether this is related to the data handling changes, which if I am not mistaken have everything be reformatted as a tensorflow.data.Dataset in 2.0. Edit: I ran additional tests pre-wrapping the data as a Dataset; it does not solve the issue. The source code clearly indicates that the functions called as a backend to train_on_batch are different depending on whether Eager is enabled or not, but I was not able to figure out what makes the eager-enabled v2 function slower than its alternatives.

THANK YOU for that very detailed answer @OverLordGoldDragon! I've skimmed through it (there's a lot there) and while it's good to know TensorFlow 2.0 CAN be as fast as TF 1, it seems very finicky to get right. Reinforcement learning already is sensitive to hyperparameters and can be difficult to debug, so adding another layer of having to set up the code correctly to get it to train quickly is a huge turn-off. I guess this is a question for TF devs - is there a near-term future where TF 2.0 can ALWAYS run fast? Until then I don't see a reason not to use PyTorch instead, which seems to have all the benefits without this huge cost / added complexity.
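(For reference - not from either commenter - a minimal sketch of how eager execution is typically toggled in each version for comparisons like the above; the disable call must run before any models are built:)

import tensorflow as tf

# TF 2.x: eager is the default; disable it (right after import) to mimic TF1 graph mode.
tf.compat.v1.disable_eager_execution()

# TF 1.14: graph mode is the default; eager can instead be opted into:
# tf.compat.v1.enable_eager_execution()

print("Eager enabled:", tf.executing_eagerly())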
A CPU user reported a phenomenon of periodic (and increasing) inference-time spikes in TF2, not seen in TF1; from the SO: using model(x) as opposed to .predict() largely solved the problem. Regardless, something to note.
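(A toy illustration of that distinction - not from the SO post - assuming a small tf.keras model and a NumPy batch x:)

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(16,))])
x = np.random.randn(32, 16).astype('float32')

y1 = model.predict(x)                    # goes through Keras' full prediction loop
y2 = model(x, training=False).numpy()    # direct call; less per-call overhead for small inputs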
I ran some additional tests, investigating runtimes of tensorflow.keras.Model.fit rather than the train_on_batch method. To do so, I slightly altered the code submitted by @OverLordGoldDragon to generate 10 data batches and wrap them with a tensorflow.data.Dataset; I then measured the mean time to run 10 fit calls, after having run an initial train_on_batch to exclude graph building time from the reported measures.
In this setting, it appears that Eager execution speeds things up in both 2.0 and 1.14, and that 2.0 yields lower runtimes for both the small and medium models. Note that for the latter, some optimization on LSTM kernel handling is at work, in addition to the training loop modifications.
Results:
small model:
- TF 2.0 with Eager on: 0.1586 sec/fit
- TF 2.0 without Eager: 0.3201 sec/fit
- TF 1.14 without Eager: 0.3198 sec/fit
- TF 1.14 with Eager on: 0.1855 sec/fit
medium model:
- TF 2.0 with Eager on: 18.6217 sec/fit
- TF 2.0 without Eager: 19.1296 sec/fit
- TF 1.14 without Eager: 41.8126 sec/fit
- TF 1.14 with Eager on: failed (GPU memory exhausted / tensors initialization issue)
setup:
Linux Mint 19.2, Python 3.6.8, Tensorflow 1.14 or 2.0.0 both with GPU enabled, CUDA 10.0, cuDNN 7.4, NVidia Quadro P1000 (4GB of dedicated RAM)
Code:
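(The original code block was not preserved here; below is a rough, hypothetical reconstruction of the setup described above - 10 batches wrapped in a tf.data.Dataset and timed over repeated fit() calls - not the commenter's exact script:)

import numpy as np
import tensorflow as tf
from time import time

batch_size, timesteps, channels = 32, 400, 16
n_batches = 10

X = np.random.randn(n_batches * batch_size, timesteps, channels).astype('float32')
y = np.random.randint(0, 2, (n_batches * batch_size, 1))
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size)

model = make_small_model((batch_size, timesteps, channels))  # tf.keras variant of the OP's function
model.train_on_batch(X[:batch_size], y[:batch_size])         # exclude graph-building time

n_fits = 10
t0 = time()
for _ in range(n_fits):
    model.fit(dataset, epochs=1, verbose=0)
print("Time/fit: %.4f sec" % ((time() - t0) / n_fits))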
I have a similar problem when I use keras-bert. tf.function is hard to use too.
@OverLordGoldDragon That is because:
Yup, the nightly build is much, much faster than 2.0.
I won’t bother with performance tests if the behaviour is expected. Just wanted to chip in in case anyone - like me - came from a freshly installed 2.0 and saw a massive performance drop from an earlier alpha or nightly. Looking forward to 2.1 as well!
ANSWERED in detail. I verified the information to the best of my ability, but would still appreciate an expert review, @qlzh727. If anything looks off or could be improved, please let me know.
Included are some rather ‘interesting’ results, some of which contradict a few of your statements, @qlzh727.
That took a while, but was interesting - going to take a break.
New info: fit() is ~~faster, but is still slower than TF2's~~ approx. as fast as train_on_batch() after accounting for the "warmup".
Before this image, the average was 4.65 secs - these are the 50 fit() calls after restarting the kernel and running again. I noticed in the first run that iterations were as fast as 3 secs. It doesn't seem to be a memory leak, as far as Task Manager's reporting of dedicated GPU memory or RAM usage goes. Also, here's the plot after the one above: it averages 6.03 secs, even worse. Now I did notice that GPU % usage (Task Manager) spikes to 100% at each iteration, so maybe this is GPU throttling - but that didn't happen with TF2 no matter how many times I re-ran the Large-Large test (timing done with a modified timeit).
Update: 300 iters, avg. 5.75.
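(The modified timeit used for those plots isn't included above; a hypothetical version that records per-iteration times, so spikes and drift like the ones described can be plotted, might look like this:)

from time import time
import numpy as np

def timeit_per_iter(func, iterations, *args):
    # Like the original timeit, but keeps each iteration's duration for plotting.
    times = []
    for _ in range(iterations):
        t0 = time()
        func(*args)
        times.append(time() - t0)
    times = np.array(times)
    print("Time/iter: %.4f sec (min %.4f, max %.4f)"
          % (times.mean(), times.min(), times.max()))
    return times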
@tomonodes To clarify, you’re saying that 2.0 is much slower than tf-nightly? If so, that is expected behavior; there are a number of key performance improvements that were made to TensorFlow after the 2.0 release that are present in HEAD (where tf-nightly is built from) and will be in 2.1. On the other hand, if tf-nightly is slower than 2.0 then that’s a problem.
That’s a correct understanding of tf-nightly.
@robieta So basically, tf-nightly is the current master branch, installable via !pip? If so, the concept of “nightly” finally makes sense to me: “all the stuff since the last stable release and before the next”.
My env is fine, but yes, a simple copy can easily break - the problem is in installing w/o a package manager which correctly unpacks (and ignores) the repository files. If nightly is what I described, I’ll be testing it soon – thanks.
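(For anyone following along, installing the nightly build goes through pip rather than copying the repository; for example, in a notebook cell:)

# Drop the leading '!' in a regular shell.
!pip install tf-nightly          # nightly build of current master
# at the time of this thread, GPU builds were published separately:
# !pip install tf-nightly-gpu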
@OverLordGoldDragon Thank you a lot for your very thorough and detailed testing and analysis!
Fair enough on the code scope, though it’s already <200 lines, and the error logging was fairly opaque, so I don’t know which code snippets to paste specifically. And I don’t have any keras imports at all, so those are consistent already, I believe.
“disabling eager in TF2 can be tricky” <-- Yeah, I think that’s the core issue. Even more core: that it needs to be disabled in the first place to get TensorFlow v1.x performance. I used the TensorFlow 2.0 tutorials to build that fairly simple code, and I think we need to get to a state where the tutorials can be followed as-is WITH good performance.
@lukemadera Partly from this PR, try replacing all K.get_value() and K.eval() in your code with the below, and try both import keras.backend as K and import tensorflow.keras.backend as K:
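(The referenced replacement snippet isn't reproduced here; a rough, hypothetical helper in the same spirit - fetch a tensor's value whether eager execution is on or off - could look like this:)

import tensorflow as tf
import tensorflow.keras.backend as K  # or: import keras.backend as K

def K_eval(x):
    # Hypothetical helper, not the snippet from the referenced PR.
    if tf.executing_eagerly():
        try:
            return x.numpy()              # EagerTensor / tf.Variable in eager mode
        except AttributeError:
            pass
    return K.function([], [x])([])[0]     # graph mode: evaluate via a backend function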