tensorflow: Significant prediction slowdown after model.compile()

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): pip install tensorflow
  • TensorFlow version: 2.0.0
  • Python version: 3.7
  • CUDA/cuDNN version: CUDA=10.0, cuDNN=7.6.4
  • GPU model and memory: GTX 1060 6GB

Describe the current behavior Prediction speed slows down significantly after calling model.compile().

Describe the expected behavior Speed should not be affected. Users call the predict function assuming it will be fast, because it is used all the time in production; it should not surprise them.

Code to reproduce the issue https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true


About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 9
  • Comments: 25 (16 by maintainers)


Most upvoted comments

Relevant SO, and another minimal reproducible example:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
import numpy as np
from time import time

def timeit(func, arg, iterations):
    # Time `iterations` sequential calls of func(arg).
    t0 = time()
    for _ in range(iterations):
        func(arg)
    print("%.4f sec" % (time() - t0))

# Tiny model: 4 inputs -> 2 ReLU units -> 1 sigmoid output.
ipt   = Input(shape=(4,))
x     = Dense(2, activation='relu')(ipt)
out   = Dense(1, activation='sigmoid')(x)
model = Model(ipt, out)

X = np.random.randn(32, 4)

timeit(model.predict, X, 1000)                      # before compile
model.compile('adam', loss='binary_crossentropy')
timeit(model.predict, X, 1000)                      # after compile
model._make_train_function()                        # build optimizer (private API)
timeit(model.predict, X, 1000)

Outputs:

0.9891 sec
29.785 sec
29.521 sec

That’s a 30-fold slowdown. Worse yet, building the optimizer does not elicit any further slowdowns - so “graph size” may not be the main explanation here.

It does not seem to me that this is resolved. It’s more like we know how the issue occurs but we don’t have a solution, just a workaround. I would need to compare timings between the compiled and non-compiled versions and see which is faster, but I don’t think users will generally be aware of this, so we should come up with a better solution. In this case, should I close the issue or keep it open?

@ttbrunner Ah yeah that was the commit, thanks.

OK, so we follow the adapter pattern: NumPy arrays and DataFrames are converted to a Dataset first, and there is a single path for execution. Apparently the slowdown comes mainly from two things: 1) the construction of the Dataset, and 2) creating the tf.function for predict. (Check TensorLikeDataAdapter under /python/keras/engine/data_adapter.py if you’re interested.)
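To make those two costs concrete, here is a rough conceptual sketch (this is not the actual Keras internals, just an illustration of the Dataset wrapping and tf.function tracing named above):

import numpy as np
import tensorflow as tf

X = np.random.randn(32, 4).astype("float32")

# 1) Dataset construction: predict() wraps the input in a tf.data pipeline,
#    whose setup cost is not amortized by a single small batch.
ds = tf.data.Dataset.from_tensor_slices(X).batch(32)

# 2) tf.function creation: the first call traces the Python function into a
#    graph; subsequent calls reuse the trace.
@tf.function
def forward(batch):
    return batch * 2.0  # stand-in for the model's forward pass

for batch in ds:
    _ = forward(batch)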

@off99555 @OverLordGoldDragon @ttbrunner So here’s what I would recommend going forward:

  1. You can predict the output using the model call, not model.predict; i.e., calling model(x) will be much faster because there is no “conversion to dataset” step and it directly calls a cached tf.function. However, be aware that if you have batch-norm layers or any other layers that behave differently between training and inference, make sure to call it with model(x, training=False). (See the sketch after this list.)

  2. I will update the docstring to recommend the model call and to explain that predict is intended for large datasets.

SG?
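As a minimal sketch of recommendation 1, reusing model and X from the reproduction snippet above (the explicit dtype cast is an assumption, added to keep the float64 NumPy input compatible with the model’s float32 weights):

import tensorflow as tf

x = tf.convert_to_tensor(X, dtype=tf.float32)

# Fast path for small inputs: call the model directly; training=False makes
# layers such as BatchNormalization and Dropout run in inference mode.
preds = model(x, training=False)

# The result is an EagerTensor; convert it if a NumPy array is needed.
preds = preds.numpy()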

Hi. Let me try to address some of the questions here and see if that helps.

Has anyone made a docstring PR on this yet?

experimental_run_tf_function is an implementation detail, and that flag is mostly there as a debugging aid during the transition. We don’t plan to document it because it will be removed at some point in the future, and the True behavior will become the only behavior.

Now I expect that you may be surprised (or aghast) that it’s going to be always on given the discussion in this thread. What experimental_run_tf_function does is funnel all calls to fit, evaluate, and predict through a central adapter which creates a Dataset and performs a variety of checks and input validation. This is generally desirable because it makes everything more robust, but there is some overhead to spinning up this machinery which is not amortized by small models with little data.

Code to profile the step:

import cProfile
import pstats

# Assumes `model` and `x` are already defined.
profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    model.predict(x)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumtime").print_stats(20)

In this case most of the extra time is spent creating the dataset; there is machinery in there which makes sense, but it’s surplus to requirements for the degenerate case of a single batch. (@jsimsa in case you want to look into the init time, but it’s not obvious that it’s unreasonable given the pipeline.) Really, this is not what model.predict is for. That endpoint is for predictions on lots of data where the batching and aggregation machinery in that endpoint is necessary. For single batch prediction there is model.predict_on_batch, which doesn’t invoke all of that machinery and just directly calls into the model function. I tested it, and it is identical in v1 and v2. (And faster than even v1 model.predict)
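A short sketch of the distinction described above, reusing model and X from the earlier snippet (the large-array shape and batch_size below are made-up illustration values):

import numpy as np

# Single batch: predict_on_batch skips the Dataset construction and
# aggregation machinery and calls straight into the model function.
out_small = model.predict_on_batch(X)

# Lots of data: model.predict is the right endpoint; it handles batching
# and aggregation of the per-batch outputs for you.
big_X = np.random.randn(100000, 4)
out_big = model.predict(big_X, batch_size=256)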

However this seems to be a common pitfall; I see lots of issues around single batch model.predict. (@martinwicke increment your counter…) From a documentation standpoint, I think the most valuable contribution would be to document model.predict_on_batch in the model.predict docstring, and probably also warn in model.predict when the batch cardinality is one. @ymodak @jvishnuvardhan Can you remind me to bring this up at the next triage? And @OverLordGoldDragon if you want to take a crack at a PR that would be great; feel free to tag me and I’ll try to provide some assistance.

I have updated the docs and also tested the performance of model(x) in the nightly build. Closing this for now. Thanks, everyone, for reporting and for the collaborative work!

@off99555 I’d agree with requesting a documentation improvement from TensorFlow to notify users of this, but I doubt any code-level changes will be implemented to address it, as that would require revamping a massive portion of the TF graph machinery. It’s up to the user to be aware of the functionality differences and adjust accordingly - but admittedly, while this isn’t the only issue where a workaround is required, other cases are at least documented.

@off99555, can you confirm whether the issue is resolved? Thanks!

Thanks for the docstring update, also for the explanation. I’m always interested!

Can confirm that model(x) has the same runtime as predict_on_batch(x), i.e. the v2 path is still slightly slower. It’s OK for my use case though, so thanks again.

Another note for users: it’s possible to specify model.run_eagerly = False before compiling. With this and the model(x) call, I am getting almost the same performance as in v1, without globally disabling eager execution.

P.S.: Sorry for the many edits of this post.

You can also use compile(…, run_eagerly=False).
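A minimal sketch of both forms mentioned above, reusing model and X from the first snippet (run_eagerly=False is also the default):

import tensorflow as tf

# a) via the attribute, as described above
model.run_eagerly = False
model.compile('adam', loss='binary_crossentropy')

# b) via the compile argument
model.compile('adam', loss='binary_crossentropy', run_eagerly=False)

# Then predict small inputs with a direct call in inference mode.
preds = model(tf.convert_to_tensor(X, dtype=tf.float32), training=False)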

Thanks for the clarification, and for the quick help! model.predict_on_batch speeds things up, but it is still significantly slower on the v2 path. Here are the cumulative times using your profiling snippet (on tf2.0.0, calling predict 100 times instead of 5) for a small DQN model on a small batch of data:

experimental_run_tf_function    predict()    predict_on_batch()
False / v1                      0.209s       0.078s
True / v2 (default)             3.720s       0.246s
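For anyone who wants to reproduce a comparison along these lines, here is a rough sketch, reusing model and X from the first snippet (experimental_run_tf_function is the transitional compile flag discussed earlier and may disappear in later releases; absolute timings will differ by model and hardware):

from time import time

def bench(fn, arg, n=100):
    # Time n sequential calls of fn(arg).
    t0 = time()
    for _ in range(n):
        fn(arg)
    return time() - t0

# v2 path (the TF 2.0 default)
model.compile('adam', loss='binary_crossentropy', experimental_run_tf_function=True)
print("v2 predict:          %.3fs" % bench(model.predict, X))
print("v2 predict_on_batch: %.3fs" % bench(model.predict_on_batch, X))

# v1 path
model.compile('adam', loss='binary_crossentropy', experimental_run_tf_function=False)
print("v1 predict:          %.3fs" % bench(model.predict, X))
print("v1 predict_on_batch: %.3fs" % bench(model.predict_on_batch, X))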

So, anyone with a single batch should switch to predict_on_batch, but an even faster option exists (v1 predict_on_batch), which is going to be deprecated if I understood correctly.

Is this use case truly so exotic that we simply should not use the Keras API for it? I understand that we should of course look into batching, but for anyone who just writes quick prototypes it would be nice to have a fast light-weight way of evaluating things. Anyway, warning single-batch users of predict() will surely help most users, so that sounds great.

Also, here’s a detail that may be important to people who migrate their code: when running in v2 mode, predict_on_batch will not return a NumPy array (contrary to the docstring) but an EagerTensor instead. The caller may want to wrap the result – as in np.array(model.predict_on_batch(...)) – to guarantee the same behavior for both v1 and v2. predict, however, returns a NumPy array in both cases. If you like, I can make a PR for the docstring.
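A one-line sketch of that wrapping, reusing model and X from the earlier snippets:

import numpy as np

# np.array() converts the v2 EagerTensor and also accepts the v1 NumPy
# result, so the return type is a NumPy array in both modes.
preds = np.array(model.predict_on_batch(X))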