tensorflow: TensorFlow 2.0 does not iterate through the entire dataset when tf.keras.Model.fit is called
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS 10.14.6
- TensorFlow installed from (source or binary): source
- TensorFlow version (use command below): 2.0
- Python version: 3.6.0
Describe the current behavior I am training a model in tf.keras with TensorFlow 2.0. The model appears to train successfully, but it does not iterate through the entire dataset. A few things indicate this:
- the model trains suspiciously fast (an epoch should take an estimated ~45 minutes, but it always finishes in under a minute),
- the validation metrics are always reported as 0, and
- the progress bar that prints out never fills up; it stops at ~1/1000 of the data.
The progress bar during training looks something like this:
Epoch 1/300 192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300 186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300 192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
I restructured the code for TensorFlow 1.15, and I do not have this issue there. When I call tf.compat.v1.enable_v2_behavior(), I see the behavior again. No errors, warnings, or info messages are reported; it just stops iterating through my dataset early. I am following this tutorial for Multiple Input Series.
Describe the expected behavior An epoch should go through the entire dataset, report a reasonable validation loss, and take a correspondingly long time to finish. I see the correct behavior in TF 1.15: epochs take ~45 minutes to complete (as expected), the validation metrics are calculated, and the progress bar looks something like this (which is nothing special 😃):
Epoch 16/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2883e-05
162636/162636 [==============================] - 2946s 1ms/sample - loss: 1.2883e-05 - val_loss: 1.5680e-05
Epoch 17/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2631e-05
162636/162636 [==============================] - 2688s 5ms/sample - loss: 1.2633e-05 - val_loss: 2.1342e-05
Code to reproduce the issue I have a time-series dataset. It is small enough to load into memory, so I do not need the Dataset API. I window the time series to produce two arrays, X and Y, which look something like this (a minimal windowing sketch follows the arrays):
X=[
[[1,2,3],[4,5,6], [7,8,9]],
[[4,5,6],[7,8,9], [10,11,12]],
[[7,8,9],[10,11,12],[13,14,15]],
...
]
Y = [
[4],
[7],
[10],
...
]
(Yes, I realize that I could just as easily include only one of the features. I've tried that, i.e. `X=[[[1,2,3]], [[4,5,6]], [[7,8,9]], …]`, and it still doesn't work.)
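For reference, here is a minimal windowing sketch that produces arrays shaped like the ones above. The helper name window_series and the example series are my own illustration; the reporter's actual windowing code is not shown in the issue:
import numpy as np

# Illustrative helper mirroring the arrays above: each X entry is a window of
# n_steps time steps, and each Y entry is the first feature of the window's
# second time step (4 for the [[1,2,3],[4,5,6],[7,8,9]] window).
def window_series(series, n_steps):
    X, Y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        Y.append(series[i + 1][0])
    return np.array(X), np.array(Y).reshape(-1, 1)

series = np.arange(1, 31).reshape(10, 3)   # rows [1,2,3], [4,5,6], ...
X, Y = window_series(series, n_steps=3)
print(X.shape, Y.shape)                    # (7, 3, 3) (7, 1)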
Then, I build my model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# n_steps and n_features come from the windowed data above (window length and features per step)
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
and then I train it:
model.fit([X], [Y], epochs=300, validation_split=0.2)
It correctly reports the number of training and validation samples, and then the progress bar pops up… but that's where the success stops. The val_loss and val_mean_squared_error are always 0 for every epoch, and it never appears to train on more than a fraction (~1/1000) of my dataset, although that fraction varies slightly between epochs. This is the printout:
Epoch 1/300 192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300 186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300 192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
When I call tf.compat.v1.enable_v2_behavior() in TF 1.15, the behavior is the same as TF 2.0.
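For what it's worth, a small counting callback can confirm how many batches fit actually processes per epoch. This is a hedged debugging sketch added for illustration (the class name BatchCounter is not from the original report), assuming training on in-memory NumPy arrays:
import tensorflow as tf

# Illustrative helper: counts the batches processed in each training epoch so
# the amount of data fit() actually consumes can be compared against len(X).
class BatchCounter(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.batches = 0

    def on_train_batch_end(self, batch, logs=None):
        self.batches += 1

    def on_epoch_end(self, epoch, logs=None):
        print('epoch %d: %d batches processed' % (epoch, self.batches))

# model.fit([X], [Y], epochs=300, validation_split=0.2, callbacks=[BatchCounter()])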
Other info / logs Here is a link to the StackOverflow question that I posted before I had confirmed that this is a TensorFlow bug.
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 2
- Comments: 16 (5 by maintainers)
@michaelarfreed
Looks like the code is incomplete. Please help us with simple, standalone code to reproduce the issue in our environment. It helps us localize the issue faster. Thanks!
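In case it helps, here is a self-contained sketch along the lines of the report above. This is a reconstruction, not the reporter's actual code: the synthetic series, its size, and the reduced epoch count are placeholders.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 3, 3

# Synthetic stand-in for the windowed time series described in the report.
series = np.arange(1, 3001, dtype='float32').reshape(-1, n_features)
X = np.array([series[i:i + n_steps] for i in range(len(series) - n_steps)])
Y = np.array([series[i + 1][0] for i in range(len(series) - n_steps)]).reshape(-1, 1)

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse', metrics=['mean_squared_error'])

# Arrays wrapped in lists, exactly as in the fit call quoted in the report.
model.fit([X], [Y], epochs=3, validation_split=0.2)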