keras: Keras freezing on last batch of first epoch (can't move to second epoch)

I’m using Keras 2.1.1, TensorFlow 1.4, Python 3.6, Windows 7.

I’m attempting transfer learning using the Inception model. The code is taken almost directly from the Keras Applications documentation, with just a few tweaks (using my own data).

Here is the code:

from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.inception_v3 import InceptionV3
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K
from keras import optimizers


img_width, img_height = 299, 299
train_data_dir = r'C:\Users\Moondra\Desktop\Keras Applications\data\train'
total_samples = 13581
batch_size = 3
epochs = 5


train_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    zoom_range=0.1,
    rotation_range=15)



train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')


# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- we have 12 classes
predictions = Dense(12, activation='softmax')(x)

# this is the model we will train
model = Model(input=base_model.input, output=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer=optimizers.SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# train the model on the new data for a few epochs
model.fit_generator(
    train_generator,
    steps_per_epoch=20,
    epochs=epochs)


# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
   print(i, layer.name)

# we choose to train the top 2 inception blocks, i.e. we will freeze
# the first 249 layers and unfreeze the rest:
for layer in model.layers[:249]:
   layer.trainable = False
for layer in model.layers[249:]:
   layer.trainable = True

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy', metrics = ['accuracy'])

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(
    train_generator,
    steps_per_epoch=25,
    epochs=epochs)


The output is:

Found 13581 images belonging to 12 classes.

Warning (from warnings module):
  File "C:\Users\Moondra\Desktop\Keras Applications\keras_transfer_learning_inception_problem_one_epoch.py", line 44
    model = Model(input=base_model.input, output=predictions)
UserWarning: Update your `Model` call to the Keras 2 API: `Model(inputs=Tensor("in..., outputs=Tensor("de...)`
Epoch 1/5

 1/20 [>.............................] - ETA: 38s - loss: 2.8652 - acc: 0.0000e+00
 3/20 [===>..........................] - ETA: 12s - loss: 2.6107 - acc: 0.1111    
 4/20 [=====>........................] - ETA: 8s - loss: 2.6454 - acc: 0.0833 
 5/20 [======>.......................] - ETA: 6s - loss: 2.6483 - acc: 0.0667
 6/20 [========>.....................] - ETA: 5s - loss: 2.6863 - acc: 0.0556
 7/20 [=========>....................] - ETA: 4s - loss: 2.6230 - acc: 0.0952
 8/20 [===========>..................] - ETA: 3s - loss: 2.6212 - acc: 0.0833
 9/20 [============>.................] - ETA: 3s - loss: 2.6192 - acc: 0.1111
10/20 [==============>...............] - ETA: 2s - loss: 2.6223 - acc: 0.1000
11/20 [===============>..............] - ETA: 2s - loss: 2.6626 - acc: 0.0909
12/20 [=================>............] - ETA: 2s - loss: 2.6562 - acc: 0.1111
13/20 [==================>...........] - ETA: 1s - loss: 2.6436 - acc: 0.1282
14/20 [====================>.........] - ETA: 1s - loss: 2.6319 - acc: 0.1190
15/20 [=====================>........] - ETA: 1s - loss: 2.6343 - acc: 0.1111
Warning (from warnings module):
  File "C:\Users\Moondra\AppData\Local\Programs\Python\Python36\lib\site-packages\keras\callbacks.py", line 116
    % delta_t_median)
UserWarning: Method on_batch_end() is slow compared to the batch update (0.102000). Check your callbacks.

16/20 [=======================>......] - ETA: 0s - loss: 2.6310 - acc: 0.1042
17/20 [========================>.....] - ETA: 0s - loss: 2.6207 - acc: 0.1176
18/20 [==========================>...] - ETA: 0s - loss: 2.6063 - acc: 0.1296
19/20 [===========================>..] - ETA: 0s - loss: 2.6056 - acc: 0.1228




It just hangs at the 19/20.

I already asked on Stack Overflow but got no help:

https://stackoverflow.com/questions/47382952/cant-get-past-first-epoch-just-hangs-keras-transfer-learning-inception


About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 27
  • Comments: 71 (5 by maintainers)

Most upvoted comments

This worked for me:

  1. set workers=1, and use_multiprocessing=False in self.keras_model.fit_generator in model.py
  2. Make sure that: steps_per_epoch = number of train samples//batch_size and validation_steps = number of validation samples//batch_size
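
A minimal sketch of those two steps on a plain fit_generator call (the generator names and sample counts below are placeholders, not code from this issue):

n_train, n_val, batch_size = 13581, 1500, 32   # hypothetical sample counts

model.fit_generator(
    train_generator,
    steps_per_epoch=n_train // batch_size,
    validation_data=val_generator,
    validation_steps=n_val // batch_size,
    epochs=5,
    workers=1,                   # single worker
    use_multiprocessing=False)   # no multiprocessing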

@minaMagedNaeem: same as @oliran, I had the same issue and resolved it after setting validation_steps=validation_size//batch_size.

history_ft = model.fit_generator(
    generator_train,                 # user-defined generator
    samples_per_epoch=4170,          # nb_train_samples
    # steps_per_epoch=10,            # samples visited per epoch
    validation_data=generator_test,  # user-defined generator
    nb_epoch=10,
    # verbose=0,
    validation_steps=530//64,
    # epochs=100
    # nb_val_samples=530
)

Changing validation_steps=validation_size//batch_size worked for me

This happens because you are giving validation data to Keras through a parameter in model.fit or model.fit_generator.

After each epoch, Keras takes the validation data and evaluates the model on it, which means one forward pass for every validation data point. This can take a long time and make it seem that Keras is stuck, but it is expected behaviour when training a model.

@NikeNano - make sure that your validation_steps is reasonable. I had a similar problem, but it turned out I had forgotten to divide by batch_size.

I also have the same issue, where the first epoch hangs on the last step. Using the latest Keras, GPU, Python 3.5, Windows 10.

This is likely due to changes in keras/utils/data_utils.py between 2.0.9 and 2.1.0. Specifically this: https://github.com/fchollet/keras/commit/612f5307b962fb140106efcc50932c292630fda3#diff-ba9d38600a2df565e5ae8757eb2b1b35

@Dref360 please take a look, this seems like a serious issue.

This worked for me:

  1. set workers=1, and use_multiprocessing=False in self.keras_model.fit_generator in model.py
  2. Make sure that: steps_per_epoch = number of train samples//batch_size and validation_steps = number of validation samples//batch_size

This response helped me solve the issue, especially the changes to workers and use_multiprocessing.

Same here. I have this problem with the code from Deep Learning with Python, Listing 6.37. I am on Ubuntu 18.04 with Keras 2.1.6 and tensorflow-gpu 1.8.0.

On Keras 2.2.4 I noticed that if I remove the validation_data generator argument from the fit_generator() call, it does get past the first epoch. I haven’t investigated yet whether it is a bug on my side or not. Hope this helps.

This problem can also occur when the path to the validation data is invalid, which was actually my case. I have two separate directories for training and validation, but the path to my validation set was incorrect. So at the end of the epoch, Keras could not load the validation data and it froze.

I think it would be better if Keras could raise an error like "file not found" or something like that.
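
A cheap way to fail fast instead of hanging is to check the directory yourself before building the validation generator. This is only a sketch: the path is a placeholder, and train_datagen, img_height, img_width and batch_size are reused from the code at the top of this issue:

import os

val_data_dir = r'C:\path\to\validation'   # placeholder path

# Raise a clear error up front instead of freezing at the end of the first epoch.
if not os.path.isdir(val_data_dir):
    raise FileNotFoundError('Validation directory not found: ' + val_data_dir)

validation_generator = train_datagen.flow_from_directory(
    val_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')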

Experiencing the same with Keras 2.2.0 and TensorFlow 1.8 on Ubuntu 16.04.

I had a similar issue with Python 3, Keras v2.1.6, TensorFlow v1.8.0, Ubuntu 16.04. I interrupted the process and saw that it was busy running self.sess.run([self.merged], feed_dict=feed_dict) in keras/callbacks.py. I guessed that it was related to histogram computations in TensorBoard, so I set histogram_freq=0 on TensorBoard object creation. For me that solved the issue, at the cost of losing TensorBoard histograms. I had previous versions of Keras and TensorFlow for which the histogram computation for TensorBoard did not take such a huge amount of time (unfortunately I do not recall for which versions it was OK).
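
In code, that workaround is just a different TensorBoard callback configuration; a minimal sketch, assuming the placeholder log directory './logs' and the training generator from the original post:

from keras.callbacks import TensorBoard

# histogram_freq=0 skips the per-epoch histogram computation that was hanging;
# './logs' is a placeholder log directory.
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0)

model.fit_generator(
    train_generator,
    steps_per_epoch=20,
    epochs=epochs,
    callbacks=[tensorboard])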

I agree that this is still very much an issue. However, depending on your setup there may be a workaround. I think the problem is that the validation_steps parameter is being ignored by Keras; Keras instead uses the length returned by the generator to determine how many batches should be run per epoch for the validation set. Since I am using a custom generator, I simply changed the __len__ method to return the value I would have passed as validation_steps.

While this workaround works for me, Keras should definitely look into resolving this issue.
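
As a rough illustration of that workaround: a custom keras.utils.Sequence can simply report the desired number of validation batches from __len__. The class and attribute names below are made up for the sketch:

from keras.utils import Sequence

class ValSequence(Sequence):
    # Hypothetical validation generator whose length caps how many batches Keras runs.
    def __init__(self, x, y, batch_size, steps):
        self.x, self.y = x, y
        self.batch_size = batch_size
        self.steps = steps           # the value you would have passed as validation_steps

    def __len__(self):
        return self.steps            # Keras runs this many validation batches per epoch

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]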

To confirm @vadapalliravikumar’s experience: if I remove the generator’s inheritance from keras.utils.Sequence or set use_multiprocessing=False, it works fine because it runs single-threaded. So it looks like a race condition when multiprocessing is enabled.

I think there is a bug with ImageDataGenerator. If I load my images from an HDF5 file (h5py) and use model.train_on_batch, I have no problems.
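
A bare-bones version of that approach might look like the sketch below; the file name 'data.h5' and the dataset keys 'images' and 'labels' are assumptions, and epochs and batch_size are reused from the code at the top of the issue:

import h5py

# Manual training loop over batches read straight from an HDF5 file,
# bypassing ImageDataGenerator entirely.
with h5py.File('data.h5', 'r') as f:
    n_samples = f['images'].shape[0]
    for epoch in range(epochs):
        for start in range(0, n_samples, batch_size):
            x_batch = f['images'][start:start + batch_size]
            y_batch = f['labels'][start:start + batch_size]
            loss, acc = model.train_on_batch(x_batch, y_batch)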

@leon-kwy Did you ever figure it out? I’m having the exact same problem with Mask RCNN.

I was stuck on this issue for about a day, but I found an elegant fix with this:

#train_generator = ...
#val_generator = ...
history = model.fit(
    train_generator,
    epochs=200,
    validation_data=val_generator,
    use_multiprocessing=True,
    workers=16,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,   # <-- here
    validation_steps=val_generator.samples // val_generator.batch_size,      # <-- here
    callbacks=callbacks)

The key for me is to define validation_steps and steps_per_epoch from the samples and batch_size attributes of the generators themselves, so there won’t be any discrepancies or mistakes.

I also solved this by removing the validation process entirely. I use Ubuntu 18.04 LTS, CUDA 10.0, cuDNN 7.6, Keras 2.3.1, and TensorFlow 1.14.

I confirm that the valid_generator was the problem. The problem was gone after I turned it off. But if the validation set is big, I still need the method. I would appreciate it if the Keras team could help with this!

I have updated Keras and I am still running into the same issues.

Dividing validation_steps by batch_size solved it for me.

Same here… However, it works when I remove validation_data=validation_generator.

Could you all update to master / 2.1.2, please? Pretty sure this has been fixed with: https://github.com/fchollet/keras/commit/2f3edf96078d78450b985bdf3bfffe7e0c627169#diff-299cfd5886683a4b012f286403769fc1