tensorflow: Tensorflow 2.0 does not iterate through entire dataset when tf.keras.model.fit is called

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS 10.14.6
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 2.0
  • Python version: 3.6.0

Describe the current behavior I am training a model in tf.keras with tensorflow 2.0. I am having an issue where my model appears to train successfully, but it is not iterating through the entire dataset. A few things indicate that it’s not going through the entire dataset:

  1. the model trains far too quickly (an epoch is estimated at ~45 minutes, but it always finishes in under a minute),
  2. the validation metrics are always reported as 0, and
  3. the progress bar that prints out never fills up; it stops at roughly 1/1000 of the data.

The progress bar during training looks something like this:

Epoch 1/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300   186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00

I restructured the code for TensorFlow 1.15, and I do not have this issue there. When I call tf.compat.v1.enable_v2_behavior(), I see this behavior again. No errors, warnings, or info messages are reported; it just stops iterating through my dataset early. I am following this tutorial for Multiple Input Series.

Describe the expected behavior The expected behavior is that an epoch goes through the entire dataset, reports a reasonable validation loss, and takes a corresponding amount of time. I see the correct behavior in TF 1.15: epochs take ~45 minutes to complete (as expected), the validation metrics are calculated, and the progress bar looks something like this (which is nothing special 😃 )

Epoch 16/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2883e-05
162636/162636 [==============================] - 2946s 1ms/sample - loss: 1.2883e-05 - val_loss: 1.5680e-05
Epoch 17/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2631e-05
162636/162636 [==============================] - 2688s 5ms/sample - loss: 1.2633e-05 - val_loss: 2.1342e-05

Code to reproduce the issue I have a time-series dataset. It is small enough to load into memory, so I do not need the Dataset API. I window the time series to produce two arrays, X and Y, which look something like this:

X=[
   [[1,2,3],[4,5,6],   [7,8,9]],
   [[4,5,6],[7,8,9],   [10,11,12]],
   [[7,8,9],[10,11,12],[13,14,15]],
   ...
  ] 
Y = [
     [4],
     [7],
     [10],
     ...
    ]

(yes, I realize that I could just as easily include only one of the features. I’ve tried that, i.e. `X=[[[1,2,3]], [[4,5,6]], [[7,8,9]], …]`, and it still doesn’t work)
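For reference, the windowing above can be sketched as follows (a hypothetical split_sequences helper; the target choice, the first feature of the row one step after each window's start, is my reading of the arrays shown):

```python
import numpy as np

def split_sequences(series, n_steps):
    """Slide a window of n_steps rows over a (timesteps, features) series.

    Returns X with shape (samples, n_steps, features) and Y with shape
    (samples, 1), where each target is the first feature of the row one
    step after the window's start.
    """
    X, Y = [], []
    for i in range(len(series) - n_steps + 1):
        X.append(series[i:i + n_steps])
        Y.append([series[i + 1][0]])
    return np.array(X), np.array(Y)

series = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                   [10, 11, 12], [13, 14, 15]])
X, Y = split_sequences(series, n_steps=3)
print(X.shape, Y.shape)  # (3, 3, 3) (3, 1)
print(Y.tolist())        # [[4], [7], [10]]
```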

Then, I build my model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

and then I train it:

model.fit([X], [Y], epochs=300, validation_split=0.2)
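Putting the pieces together, a minimal self-contained sketch of the setup (random data stands in for my real series; the sizes, seed, and reduced epoch count are placeholders for brevity):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 3, 3

# Random stand-in for the windowed time series described above.
rng = np.random.default_rng(0)
X = rng.random((1000, n_steps, n_features)).astype("float32")
Y = rng.random((1000, 1)).astype("float32")

model = Sequential()
model.add(LSTM(50, activation="relu", input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")

# Same call shape as above: arrays wrapped in one-element lists.
history = model.fit([X], [Y], epochs=2, validation_split=0.2)
```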

It correctly reports the number of train and validation samples, and then the progress bar pops up… but that’s where the success stops. The val_loss and val_mean_squared_error are always 0 for every epoch, and it appears to never train on more than a fraction (~1/1000) of my dataset, although that fraction varies slightly between epochs. This is the print out:

Epoch 1/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300   186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00

When I call tf.compat.v1.enable_v2_behavior() in TF 1.15, the behavior is the same as TF 2.0.

Other info / logs Here is a link to the StackOverflow question that I posted before I had confirmed that this is a TensorFlow bug.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Most upvoted comments

@michaelarfreed

Looks like the code is incomplete. Please help us with a simple, standalone example to reproduce the issue in our environment. It helps us localize the issue faster. Thanks!