tensorflow: Understanding warning "5 out of the last 5 calls to triggered tf.function retracing"
System information
- Have I written custom code: Yes, below.
- Linux Ubuntu 16.04
- TensorFlow 2.1.0-dev20191103, binary install.
- Python 3.6
- CUDA 10.0 / cuDNN 7.6.4
- 4 * NVIDIA TITAN X (12GB)
I defined a very simple training script with a custom loss function and .fit(), as shown below. The loss_fn is very simple and, as far as I can tell, it receives tensors of the same shape and dtype on every call. Still, I'm getting the warning message below, and interestingly only when training with multiple GPUs. Is it a bug? Is it harmful? Does it actually affect the computational cost?
Warning message:
WARNING:tensorflow:5 out of the last 5 calls to <function loss_fn at 0x7f0070ef0e18> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Code:
from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import (Conv2D, Conv3D, Dense)


@tf.function
def loss_fn(y_pred, y_true):
    return tf.reduce_mean(tf.math.square(y_pred - y_true))


if __name__ == "__main__":
    BATCH_SIZE_PER_SYNC = 4
    strategy = tf.distribute.MirroredStrategy()
    num_gpus = strategy.num_replicas_in_sync
    global_batch_size = BATCH_SIZE_PER_SYNC * num_gpus
    print('num GPUs: {}, global batch size: {}'.format(num_gpus, global_batch_size))

    # fake data ------------------------------------------------------
    fakea = np.random.rand(global_batch_size, 10, 200, 200, 128).astype(np.float32)
    targets = np.random.rand(global_batch_size, 200, 200, 14).astype(np.float32)
    fakea = tf.constant(fakea)
    targets = tf.constant(targets)

    # tf.Dataset ------------------------------------------------------
    def gen():
        while True:
            yield (fakea, targets)

    dataset = tf.data.Dataset.from_generator(
        gen,
        (tf.float32, tf.float32),
        (tf.TensorShape(fakea.shape), tf.TensorShape(targets.shape)))
    dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

    # training ------------------------------------------------------
    callbacks = [tf.keras.callbacks.TensorBoard(log_dir='./logs')]
    training = True

    with strategy.scope():
        va = keras.Input(shape=(10, 200, 200, 128), dtype=tf.float32, name='va')
        x = Conv3D(64, kernel_size=3, strides=1, padding='same')(va)
        x = Conv3D(64, kernel_size=3, strides=1, padding='same')(x)
        x = Conv3D(64, kernel_size=3, strides=1, padding='same')(x)
        x = tf.reduce_max(x, axis=1, name='maxpool')
        b = Conv2D(14, kernel_size=3, padding='same')(x)
        model = keras.Model(inputs=va, outputs=b, name='net')
        optimizer = keras.optimizers.RMSprop()
        model.compile(optimizer=optimizer, loss=loss_fn)

    model.fit(x=dataset, epochs=10, steps_per_epoch=100, callbacks=callbacks)
When calling a simple RNN with input sequences of various lengths, it seems that the model gets traced once for each sequence length, as you can see in this Colab. Here’s the code:
I get the dreaded warning:
Perhaps there’s a way to force the model to use a single graph instead of tracing a new one for each input shape?
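The Colab itself isn't reproduced above, but judging from the SimpleRNN layer quoted later in this thread, the pattern being described looks roughly like the following minimal sketch (the layer sizes, the Dense head, and the loop over sequence lengths are illustrative assumptions, not the original notebook):

import numpy as np
import tensorflow as tf
from tensorflow import keras

# A small RNN whose time dimension is left unspecified, so it accepts
# sequences of any length.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 4]),
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="sgd")

# Each new sequence length is a new input shape, so predict() traces a
# fresh graph; after a few distinct lengths the retracing warning appears.
for seq_len in (3, 5, 7, 11, 13):
    x = np.random.rand(1, seq_len, 4).astype(np.float32)
    model.predict(x)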
I used model(x) instead of model.predict(x) and it worked for me.

Me too… how do I solve this problem?
A tf.function can only handle a predefined input shape; if the shape changes, or if different Python objects get passed, TensorFlow automagically rebuilds (retraces) the function.
So I would suggest you check and print all the input parameters that get passed to
Model.make_predict_function.<locals>.predict_function
If an input is not a tensor, or if its shape changes between calls, you have found the problem.
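To see that behaviour in isolation, here is a minimal sketch (the function and the values passed to it are purely illustrative):

import tensorflow as tf

@tf.function
def square(x):
    # This print runs at trace time, not at call time, so it shows exactly
    # when a new concrete function is built.
    print("tracing with", x)
    return x * x

square(tf.constant(2.0))       # traced once for scalar float32 tensors
square(tf.constant(3.0))       # same signature -> reuses the existing graph
square(tf.constant([1., 2.]))  # new shape -> retraced
square(2)                      # Python int instead of a tensor -> retraced
square(3)                      # a different Python value -> retraced again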
I was getting this same warning message, using an LSTM model with variable sequence lengths. Upgrading from TF 2.2 to TF 2.3 seems to have fixed it.
I get the same error when trying to predict in a for loop (due to variable-length inputs). I have to call model.predict(np.expand_dims(xi, axis=0)) on each sample individually, or TensorFlow will attempt to concatenate the predictions and fail. This is probably calling something several times when it shouldn't.
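As a minimal sketch of that pattern (the model and data here are made up just to make it runnable), the per-sample loop looks like this; calling the model directly, as suggested earlier in the thread, is one way to avoid predict()'s per-call wrapping:

import numpy as np
import tensorflow as tf
from tensorflow import keras

# Hypothetical model that accepts variable-length sequences of 4 features.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 4]),
    keras.layers.Dense(1),
])

samples = [np.random.rand(n, 4).astype(np.float32) for n in (3, 7, 5)]

# Per-sample predict: every new sequence length is a new input shape, so the
# predict function may be retraced for each of them.
preds = [model.predict(np.expand_dims(xi, axis=0)) for xi in samples]

# Calling the model directly skips the predict() machinery and is often
# enough to silence the warning for small, eager-friendly workloads.
preds = [model(np.expand_dims(xi, axis=0), training=False).numpy() for xi in samples]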
I am also getting this error on a classical functional model without tf.function and with variable-length inputs in a for loop using predict. This happens only in tf 2.2.0rc2, and as soon as I switch back to tf 2.1 the problem disappears.

UPDATE: is it possible it's only a problem with keras.layers.SimpleRNN(10, return_sequences=True, input_shape=[None, 4])? I'm not sure, but if I'm correct, return_sequences=True will return the whole RNN sequence, not just the last element, so for different-sized inputs you will get different-sized outputs. Dense needs a fixed-size input, so it will be retraced every time. And if you want to use the same dense layer for every single output, you can't do this with Sequential, because it would be a parallel operation. If this is the case, I may have a solution for you…

OLD: yes, I see @ageron, I had the same problem with the Keras functional API. It seems the functional Keras API has problems with variable-length inputs other than the batch dimension.
A workaround is to use the layer class API and subclass Keras Layer. There you can decorate the call method with a tf.function annotation with relaxed shapes:
@tf.function(experimental_relax_shapes=True)
or explicitly:
self.call = tf.function(self.call, input_signature=[(tf.TensorSpec([None, None, None, 3], dtype=tf.float32), (tf.TensorSpec([None], dtype=tf.float32), tf.TensorSpec([None, 15, 3], dtype=tf.float32)), tf.TensorSpec([], dtype=tf.float32), tf.TensorSpec([], dtype=tf.float32))])
in __init__ or build.
But I agree the functional API should also support this.
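A minimal sketch of that subclassing workaround might look like the following (the layer itself is hypothetical; only the decorator and the input_signature idea come from the comment above):

import tensorflow as tf
from tensorflow import keras

class RelaxedDense(keras.layers.Layer):
    """Toy layer whose call() is traced with relaxed shapes, so new input
    shapes generalise the existing signature instead of forcing a retrace."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        self.kernel = self.add_weight(
            name="kernel", shape=(int(input_shape[-1]), self.units))

    @tf.function(experimental_relax_shapes=True)
    def call(self, inputs):
        return tf.matmul(inputs, self.kernel)

    # Alternatively, as in the comment above, wrap call explicitly with an
    # input signature in __init__ or build, e.g.:
    #   self.call = tf.function(
    #       self.call,
    #       input_signature=[tf.TensorSpec([None, None], tf.float32)])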
@zaccharieramzi - yes please file a separate ticket as this is a Keras issue without distribution strategy. The error message might be the same but the root causes are different.
@zubaidah93 , Python 2 is officially dead, it is not supported anymore, there are no updates anymore, including security updates, so it is even unsafe to use it. You should really upgrade to Python 3. And TensorFlow 1.1.0 is several years old. If you want to use TensorFlow 1, you should install TensorFlow 1.15 instead. But TensorFlow 2 is much better.
I'm using only 1 GPU and still facing this retracing issue. After trying out various fixes found by searching the internet, nothing worked, including tf-nightly-gpu, which in the first place isn't detecting my GPU. So I have just downgraded from TF 2.3.0 to TF 2.2.1 and that has fixed the issue for now.
I am also facing the same issue with TensorFlow 2.2.0rc2 while training with variable-length inputs in a loop.