tensorflow: Tensorflow 2.0 does not iterate through entire dataset when tf.keras.model.fit is called

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Mac OS 10.14.6
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 2.0
  • Python version: 3.6.0

Describe the current behavior I am training a model in tf.keras with tensorflow 2.0. I am having an issue where my model appears to train successfully, but it is not iterating through the entire dataset. A few things indicate that it’s not going through the entire dataset:

  1. the model trains far too quickly (an epoch is estimated at ~45 minutes, but it always finishes in under a minute),
  2. the validation metrics are always reported as 0, and
  3. the progress bar that prints out never fills up; it stops at roughly 1/1000 of the data.

The progress bar during training looks something like this:

Epoch 1/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300   186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00

I restructured the code for TensorFlow 1.15, and I do not have this issue there. When I call tf.compat.v1.enable_v2_behavior(), I see this behavior again. No errors, warnings, or info messages are reported; it just stops iterating through my dataset early. I am following this tutorial for Multiple Input Series.

Describe the expected behavior The expected behavior is that an epoch goes through the entire dataset, reports a reasonable validation loss, and takes a corresponding amount of time. I see the correct behavior in TF 1.15: epochs take ~45 minutes to complete (as expected), the validation metrics are calculated, and the progress bar looks something like this (which is nothing special 😃 )

Epoch 16/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2883e-05
162636/162636 [==============================] - 2946s 1ms/sample - loss: 1.2883e-05 - val_loss: 1.5680e-05
Epoch 17/300
162605/162636 [============================>.] - ETA: 0s - loss: 1.2631e-05
162636/162636 [==============================] - 2688s 5ms/sample - loss: 1.2633e-05 - val_loss: 2.1342e-05

Code to reproduce the issue I have a time-series dataset. It is small enough to load into memory, so I do not need the Dataset API. I window the time series to produce two arrays, X and Y, which look something like this:

X=[
   [[1,2,3],[4,5,6],   [7,8,9]],
   [[4,5,6],[7,8,9],   [10,11,12]],
   [[7,8,9],[10,11,12],[13,14,15]],
   ...
  ] 
Y = [
     [4],
     [7],
     [10],
     ...
    ]

(yes, I realize that I could just as easily include only one of the features. I’ve tried that, i.e. `X=[[[1,2,3]], [[4,5,6]], [[7,8,9]], …]`, and it still doesn’t work)
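For reference, the windowing above can be sketched as follows (a hypothetical split_sequences helper; the target choice, the first feature of the row one step after each window's start, is my reading of the arrays shown):

```python
import numpy as np

def split_sequences(series, n_steps):
    """Slide a window of n_steps rows over a (timesteps, features) series.

    Returns X with shape (samples, n_steps, features) and Y with shape
    (samples, 1), where each target is the first feature of the row one
    step after the window's start.
    """
    X, Y = [], []
    for i in range(len(series) - n_steps + 1):
        X.append(series[i:i + n_steps])
        Y.append([series[i + 1][0]])
    return np.array(X), np.array(Y)

series = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                   [10, 11, 12], [13, 14, 15]])
X, Y = split_sequences(series, n_steps=3)
print(X.shape, Y.shape)  # (3, 3, 3) (3, 1)
print(Y.tolist())        # [[4], [7], [10]]
```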

Then, I build my model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

and then I train it:

model.fit([X], [Y], epochs=300, validation_split=0.2)
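Putting the pieces together, a minimal self-contained sketch of the setup (random data stands in for my real series; the sizes, seed, and reduced epoch count are placeholders for brevity):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

n_steps, n_features = 3, 3

# Random stand-in for the windowed time series described above.
rng = np.random.default_rng(0)
X = rng.random((1000, n_steps, n_features)).astype("float32")
Y = rng.random((1000, 1)).astype("float32")

model = Sequential()
model.add(LSTM(50, activation="relu", input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer="adam", loss="mse")

# Same call shape as above: arrays wrapped in one-element lists.
history = model.fit([X], [Y], epochs=2, validation_split=0.2)
```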

It correctly reports the number of train and validation samples, and then the progress bar pops up… but that’s where the success stops. The val_loss and val_mean_squared_error are always 0 for every epoch, and it appears to never train on more than a fraction (~1/1000) of my dataset, although that fraction varies slightly between epochs. This is the print out:

Epoch 1/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
Epoch 2/300   186/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00
...
Epoch X/300   192/162636 [..............................] - ETA: 45:42 - loss: 0.4783 - mean_squared_error: 0.4783 - val_loss: 0.0000e+00 - val_mean_squared_error: 0.0000e+00

When I call tf.compat.v1.enable_v2_behavior() in TF 1.15, the behavior is the same as TF 2.0.

Other info / logs Here is a link to the StackOverflow question that I posted before I had confirmed that this is a TensorFlow bug.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Most upvoted comments

@michaelarfreed

Looks like the code is incomplete. Please help us with a simple, standalone example to reproduce the issue in our environment. It helps us localize the issue faster. Thanks!