tensorflow: Significant prediction slowdown after model.compile()
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- TensorFlow installed from (source or binary): pip install tensorflow
- TensorFlow version: 2.0.0
- Python version: 3.7
- CUDA/cuDNN version: CUDA=10.0, cuDNN=7.6.4
- GPU model and memory: GTX 1060 6GB
Describe the current behavior
Prediction is significantly slower after the model.compile() call.
Describe the expected behavior
Speed should not be affected. Users call predict() assuming it is fast because it is what runs in production; compiling the model should not cause a surprising slowdown.
Code to reproduce the issue
https://nbviewer.jupyter.org/github/off99555/TensorFlowExperiments/blob/master/test-prediction-speed-after-compile.ipynb?flush_cache=true

About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 9
- Comments: 25 (16 by maintainers)
Links to this issue
Commits related to this issue
- Update docstring for model.predict, to advertise users to model call if performance is a concern when input is small. Detailed info see #33340. PiperOrigin-RevId: 289782248 Change-Id: Ibec02ae1126896... — committed to tensorflow/tensorflow by tanzhenyu 4 years ago
- Merge branch 'master' of github.com:tensorflow/tensorflow * 'master' of github.com:tensorflow/tensorflow: (139 commits) [TF:XLA] Enable depthwise convs with depthwise multiplier to use batch_group... — committed to andrewxcav/tensorflow by andrewxcav 4 years ago
Relevant SO, and another minimal reproducible example:
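A minimal sketch of such a benchmark, assuming a small Dense model and timeit (the model, shapes, and repeat counts are assumptions rather than the original notebook code):

```python
# Hedged reconstruction of a minimal benchmark; model size and repeat count are assumptions.
import timeit
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
x = np.random.rand(1, 10).astype(np.float32)

# Predict before compiling the model.
t_uncompiled = timeit.timeit(lambda: model.predict(x), number=10)

# Predict after compiling: the same call now goes through the training/eval machinery.
model.compile(optimizer="adam", loss="mse")
t_compiled = timeit.timeit(lambda: model.predict(x), number=10)

print(f"uncompiled: {t_uncompiled:.3f}s, compiled: {t_compiled:.3f}s")
```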
The outputs show a 30-fold slowdown. Worse yet, building the optimizer does not elicit any further slowdowns, so "graph size" may not be the main explanation here.
It does not seem to me that this is resolved. It's more that we know how the issue occurs but we don't have a solution, just a workaround. I would need to compare timing between the compiled and non-compiled versions to see which is faster, and I don't think a typical user will be aware of this. So we should find a better solution. In that case, should I close the issue or keep it open?
@ttbrunner Ah yeah that was the commit, thanks.
OK, so we follow the adapter pattern: numpy arrays and DataFrames are first converted to a Dataset, so that there is a single path for execution. The slowdown is mainly two things: 1) the construction of the Dataset, and 2) creating the tf.function for predict. (Check TensorLikeDataAdapter under /python/keras/engine/data_adapter.py if you're interested.)
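A rough illustration of those two costs using public APIs (the shapes and the stand-in tf.function are assumptions, not the internal TensorLikeDataAdapter code path):

```python
# Rough illustration of the two overheads described above; not the internal Keras code path.
import time
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 10).astype(np.float32)

# 1) Constructing a tf.data.Dataset from in-memory arrays has a fixed setup cost.
t0 = time.perf_counter()
ds = tf.data.Dataset.from_tensor_slices(x).batch(32)
print("dataset construction:", time.perf_counter() - t0)

# 2) The first call to a tf.function traces and builds a graph; later calls reuse it.
dense = tf.keras.layers.Dense(1)
dense.build((None, 10))  # create the layer's variables up front

@tf.function
def predict_fn(batch):
    return dense(batch)

t0 = time.perf_counter()
predict_fn(tf.constant(x))  # first call: tracing + graph construction
print("first (traced) call:", time.perf_counter() - t0)

t0 = time.perf_counter()
predict_fn(tf.constant(x))  # subsequent calls: cached concrete function
print("cached call:", time.perf_counter() - t0)
```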
@off99555 @OverLordGoldDragon @ttbrunner So here’s what I would recommend going forward:
You can predict the output using the model call, not model.predict; i.e., calling model(x) is much faster because there is no "conversion to dataset" step, and it directly calls a cached tf.function. However, be aware that if you have batch norm layers, or any other layers that behave differently between training and inference, you should call model(x, training=False); see the sketch below.
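A minimal sketch of this recommendation; the model architecture and input are assumptions:

```python
# Sketch of model.predict vs. the direct model call; model and input shapes are assumptions.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.rand(1, 10).astype(np.float32)

# Slower for small inputs: wraps x in a Dataset and goes through the predict machinery.
y_predict = model.predict(x)

# Faster for small inputs: direct call into the cached tf.function.
# training=False matters here because of the BatchNormalization layer.
y_call = model(x, training=False).numpy()
```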
I will update the docstring to recommend the model call and to explain that predict is meant for large datasets.
SG?
Hi. Let me try to address some of the questions here and see if that helps.
experimental_run_tf_function is an implementation detail, and that flag is mostly there as a debugging aid during the transition. We don't plan to document it because it will be removed at some point in the future and the True behavior will be the only behavior.
Now I expect that you may be surprised (or aghast) that it's going to be always on given the discussion in this thread. What experimental_run_tf_function does is funnel all calls to fit, evaluate, and predict through a central adapter which creates a Dataset and performs a variety of checks and input validation. This is generally desirable because it makes everything more robust, but there is some overhead to spinning up this machinery which is not amortized by small models with little data.
Code to profile the step:
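The original profiling snippet is not preserved above; a stand-in sketch using cProfile, with an assumed tiny model, that surfaces the same setup cost:

```python
# Stand-in for the missing profiling snippet; model, data, and call count are assumptions.
import cProfile
import pstats
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(1, 10).astype(np.float32)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(5):
    model.predict(x)
profiler.disable()

# Sort by cumulative time; dataset/adapter setup shows up near the top for tiny inputs.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```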
In this case most of the extra time is spent creating the dataset; there is machinery in there which makes sense, but it's surplus to requirements for the degenerate case of a single batch. (@jsimsa in case you want to look into the init time, but it's not obvious that it's unreasonable given the pipeline.) Really, this is not what model.predict is for. That endpoint is for predictions on lots of data where the batching and aggregation machinery in that endpoint is necessary. For single batch prediction there is model.predict_on_batch, which doesn't invoke all of that machinery and just directly calls into the model function (a short sketch follows below). I tested it, and it is identical in v1 and v2. (And faster than even v1 model.predict.)
However, this seems to be a common pitfall; I see lots of issues around single batch model.predict. (@martinwicke increment your counter…) From a documentation standpoint, I think the most valuable contribution would be to document model.predict_on_batch in the model.predict docstring, and probably also warn in model.predict when the batch cardinality is one. @ymodak @jvishnuvardhan Can you remind me to bring this up at the next triage? And @OverLordGoldDragon if you want to take a crack at a PR that would be great; feel free to tag me and I'll try to provide some assistance.
I have updated the doc, and also tested the performance of model(x) in nightly. Closing this for now. Thanks all for reporting and the collaborative work!
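A minimal sketch of the single-batch path described above, assuming a tiny Dense model and random data (none of this code comes from the original thread):

```python
# Sketch of predict vs. predict_on_batch for a single batch; model and data are assumptions.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse")
batch = np.random.rand(32, 10).astype(np.float32)

y_full = model.predict(batch)            # spins up Dataset + batching/aggregation machinery
y_batch = model.predict_on_batch(batch)  # calls the model function directly on one batch
```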
@off99555 I’d agree to request a documentation improvement from TensorFlow to notify users of this, but I doubt any code-level changes will be implemented to address this as it’d require revamping a massive portion of TF graph. It’s up to the user to be aware of functionality differences and adjust accordingly - but admittedly, while this isn’t the only issue where a workaround is required, other cases are at least documented.
@off99555, can you confirm whether the issue is resolved? Thanks!
You can also compile(…, run_eagerly=False)
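For example, a minimal sketch of that call; the model, optimizer, and loss are placeholder assumptions, and only the run_eagerly=False flag comes from the comment above:

```python
# Placeholder model/optimizer/loss; run_eagerly=False keeps execution on the tf.function path.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse", run_eagerly=False)
```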
Thanks for the clarification, and for the quick help!
model.predict_on_batch speeds things up, but is still significantly slower on the v2 path. Here is the cumsum using your profile snippet (using tf 2.0.0, calling predict 100 times instead of 5) for a small DQN model on a small batch of data:
(Timing table: predict() vs. predict_on_batch(), with experimental_run_tf_function on and off.)
So, anyone with a single batch should switch to predict_on_batch, but an even faster option exists (v1 predict_on_batch), which is going to be deprecated if I understood correctly.
Is this use case truly so exotic that we simply should not use the Keras API for it? I understand that we should of course look into batching, but for anyone who just writes quick prototypes it would be nice to have a fast, light-weight way of evaluating things. Anyway, warning single-batch users of predict() will surely help most users, so that sounds great.
Also, here's a detail that may be important to people who migrate their code: when running in v2 mode, predict_on_batch will not return a numpy array (contrary to the docstring), but an EagerTensor instead. The caller may want to wrap the result – as in np.array(model.predict_on_batch(...)) – to guarantee the same behavior for both v1 and v2. predict, however, returns a numpy array in both cases. If you like, I can make a PR for the docstring.
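A minimal sketch of the wrapping described above, assuming a small model and random input (the model and shapes are illustrative assumptions):

```python
# Sketch of normalizing predict_on_batch output across v1 and v2; model and input are assumptions.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(optimizer="adam", loss="mse")
x = np.random.rand(4, 10).astype(np.float32)

out = model.predict_on_batch(x)  # EagerTensor in v2 mode, np.ndarray in v1 mode
out = np.array(out)              # uniform np.ndarray behavior in both modes
```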