keras: Not able to resume training after loading model + weights
I work at an institute where running a workstation overnight is not allowed, so I had to split the training process across multiple days. I trained a model for 10 epochs, which took approximately 1 day, and saved the model + weights using the methods described in the Keras documentation, like this:
import sys

modelPath = './SegmentationModels/'
modelName = 'Arch_1_10'
sys.setrecursionlimit(10000)
json_string = model.to_json()
open(str(modelPath + modelName + '.json'), 'w').write(json_string)
model.save_weights(str(modelPath + modelName + '.h5'))
import cPickle as pickle
with open(str(modelPath + modelName + '_hist.pckl'), 'wb') as f:
    pickle.dump(history.history, f, -1)
and load the model the next day like this:
from keras.models import model_from_json

modelPath = './SegmentationModels/'
modelName = 'Arch_1_10'
model = model_from_json(open(str(modelPath + modelName + '.json')).read())
model.compile(loss='categorical_crossentropy', optimizer=optim_sgd)
model.load_weights(str(modelPath + modelName + '.h5'))
# import cPickle as pickle
# with open(str(modelPath + modelName + '_hist.pckl'), 'rb') as f:
#     history = pickle.load(f)
model.summary()
but when I restarted the training process, it initialized to the same training and validation loss that I had gotten at the 1st epoch the previous day! It should have started with an accuracy of 60%, the last best accuracy from the previous day, but it doesn't.
I have also tried to call model.compile() before and after load_weights, as well as leaving it out altogether, but that doesn’t work either.
Please help me in this regard. Thanks in advance.
Guys,
I fixed the problem by reducing the learning rate to 1e-5 (a small lr for Adam) when I fine-tune my pretrained model, which had been trained using Adadelta with a much higher starting lr. I think the issue is that the starting lr for Adam messes things up. For fine-tuning, just use a small lr for a new optimiser of your choice… Hope this helps.
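For illustration, a minimal sketch of that setup, assuming a model loaded as shown earlier in the thread and generic X_train/y_train data (both are placeholder names):

from keras.optimizers import Adam

# Recompile with a deliberately small learning rate before fine-tuning,
# so the freshly initialised Adam state does not wipe out the pretrained weights.
# (`lr` is the old Keras parameter name; newer tf.keras versions call it `learning_rate`.)
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=1e-5),
              metrics=['accuracy'])
model.fit(X_train, y_train)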
Nope. It doesn’t. Still starts with 20% accuracy as it did on the 1st epoch.
I think this issue has been fixed. Use model.save() to save the model, and load the saved model back with keras.models.load_model().
If you want to resume training it should work, as model.save() stores all the optimiser information as well.
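Roughly, that approach looks like this (the file name is arbitrary):

from keras.models import load_model

# Save architecture + weights + optimizer state in a single HDF5 file.
model.save('full_model.h5')

# ... later, in a new session ...
model = load_model('full_model.h5')  # no separate compile step needed
model.fit(X_train, y_train)          # training should continue from where it stopped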
Doesn't print. The weights are loaded successfully, I suppose. It's the training procedure that's problematic. After running this script (it didn't print anything), I ran
model.fit()
and it started with a loss 10x higher than it originally was at epoch 1, and with 20% accuracy again, sigh. It is sad that such a basic issue has still not been solved.
It is happening because model.save('filename.h5') does not save the state of the optimizer. So optimizers like Adam and RMSProp do not work, but SGD works, as mentioned in one of the previous comments (I verified this), since it is a stateless optimizer (the learning rate is fixed).
It is just sad that such a popular library has such basic/glaring/trivial bugs/problems 😦
Got the same issue. Fuck it, it's not solved. Spent 18 hours training a DenseNet on AWS to get to 89% accuracy on CIFAR-10; the connection was interrupted, but I thought I was safe because I had my model saved every 30 epochs. The truth is that it works for model.test(), but when I try model.fit(), it breaks and reverts to 10% accuracy when it was at 89%. I've lost a day of work due to this shitty issue.
Wait, I’ve been -extremely- stupid, please ignore.
For the interested, I was making a character-prediction RNN with a one-hot character encoding, but instead of pickling the map of characters to one-hot indices I was generating it in the code each time from a set of allowed characters using enumerate(). This of course meant that the mapping generated by enumerate() was different every time, because sets have no guaranteed order, which explains why everything worked fine until I restarted the script (and so regenerated the mapping).
This is embarrassingly obvious in retrospect.
UPDATE: Came to the institute this morning, built the model using original code and loaded the model weights saved using ModelCheckpoint callback. Started training and it still restarts from the beginning; no memory of past metrics. The performance is actually even worse than it was earlier when it started training the first epoch. In my case, normally the network starts at 20% accuracy and goes to around 70% in 60 epochs. But when I restart the training process using loaded weights, the network starts at 20% on epoch 1 and keeps going lower and lower until 16% at epoch 5. I have no idea what’s happening here.
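For context, a ModelCheckpoint setup along the lines described above looks roughly like this (the file path, monitored metric and callback options are illustrative, not the exact ones used):

from keras.callbacks import ModelCheckpoint

# Write the weights to disk whenever the monitored metric improves.
checkpoint = ModelCheckpoint('./SegmentationModels/Arch_1_{epoch:02d}.h5',
                             monitor='val_acc',
                             save_best_only=True,
                             save_weights_only=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint])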
UPDATE 2: When I try to evaluate the loaded model + weights on the same validation data, I get 60% accuracy, as intended. But if I do
model.fit()
then training starts from 20% and oscillates around it. So I can confirm that the weights are being loaded correctly, since the model can make predictions, but the model is not able to resume training. Please help! @NasenSpray
That's it, save_weights() doesn't overwrite existing files unless you also pass overwrite=True. It should have asked for user input, though.

@Rocketknight1 Thanks, your posts made me aware I was doing the same thing. A lot of people might have this issue because the code referenced in
https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/
gets exactly this wrong. The character-mapping section in its RNN_utils.py should be rewritten so that the char mapping is always the same when reading the same file in a new session.
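The exact RNN_utils.py snippet is omitted above, but the general shape of the fix is to sort the character set before enumerating it, so the index assignment no longer depends on set iteration order (`data` here stands for the training text):

# Non-deterministic: a set has no guaranteed order, so the same character
# can get a different index the next time the script is run.
chars = set(data)
char_to_ix = {ch: i for i, ch in enumerate(chars)}

# Deterministic: sort first, so the mapping depends only on the file contents.
chars = sorted(set(data))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
ix_to_char = {i: ch for i, ch in enumerate(chars)}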
Run this plz
Does it print?
hi,
I think I found an answer in a different post. It's about the implementation of Adam, RMSProp, etc. in TensorFlow. When the model finds good weights on the first day, it produces small losses, and Adam and other optimizers initialized with a given learning rate ignore the previously adapted (and probably smaller) effective step size and restart the learning with the issue described below (basically: small errors are handled with a small epsilon). So saving the adapted optimizer state could help too.
https://github.com/ibab/tensorflow-wavenet/issues/143
let me quote here:
Explanation
I’ve seen the behavior Zeta36 is describing in our test/test_model.py. When that test uses adam or rmsprop, I would see the loss drop and drop till a small number, then jump up to a large loss at some random time.
You can reproduce that problematic behavior if you change (what is, at the time of this writing, in master) MomentumOptimizer to AdamOptimizer, make the learning rate 0.002 and delete the momentum parameter. Uncomment the statements that print the loss.
If you run the test with
python test/test_model.py
every second or third time or so that you run the test, you will see the loss drop and then at some point jump up to a larger value, sometimes causing the test to fail. I “worked around” that problem by futzing with the learning rate and the number of training iterations we run in the test until it would reliably pass.
Anyway, I think I’ve found the cause. If you look at the tensorflow implementations of rmsprop and adam, you will see they compute the change to a weight by dividing by a sinister lag-filtered rate-of-change or error magnitude. When the error, or the rate of change of the error, gets small, or even zero, near the bottom of the error basin, the denominator gets close to zero. The only thing saving us from a NaN or Inf is that they add an epsilon in the denominator. That epsilon defaults to 1e-10 for rmsprop and 1e-8 for adam. That’s enough to make the change to our parameter a big number, presumably big enough to give us a large loss.
So in PR 128 I specified a larger epsilon for rmsprop, and in PR 147 for adam. I found that these changes fix the problem of randomly increasing loss during the tests in test_model.py.
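In Keras terms, that workaround amounts to passing a larger epsilon when constructing the optimizer; the 1e-4 value below is only an example and would need tuning:

from keras.optimizers import Adam, RMSprop

# The default epsilon is tiny (on the order of 1e-8 to 1e-10 depending on
# version/backend); a larger value keeps the update denominator away from zero.
optimizer = Adam(lr=0.001, epsilon=1e-4)
# or: optimizer = RMSprop(lr=0.001, epsilon=1e-4)

model.compile(loss='categorical_crossentropy', optimizer=optimizer)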
Hello everyone, I faced the same problem. But I think it’s solved in my case. First save the model and weights as in the code below…
# Save the final model
model_json = model.to_json()
mdl_save_path = 'model.json'
with open(mdl_save_path, "w") as json_file:
    json_file.write(model_json)

# Serialize the weights to HDF5
mdl_wght_save_path = 'model.h5'
model.save_weights(mdl_wght_save_path)
Then I started another session, completely closing all open Python files, before retraining. I also tried this by checkpointing the model while training. At the time of resuming training, I first loaded the model architecture from the .json file and then loaded the weights from the .h5 file using load_weights(). Then I compiled the model with model.compile() and fit it with model.fit().
N.B.: I used SGD both times, while training and while resuming training… It worked…
Though I did not check this thoroughly with other optimizers, I saw that if, at the time of retraining, I use an optimizer other than SGD (I used SGD in normal training), the issue persists. So I am pretty confident that using different optimizers during normal training and resumed training will cause you a problem.
This is what I am using (taken from the Keras docs) and it works without a problem on Keras 1.0:
I had one example with, say, 10 epochs and another example with save and load in a loop of 10 iterations with 1 epoch each, and the loss for both was decreasing similarly. Additionally, both resulting models were predicting fine.
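A rough sketch of that second experiment, assuming generic X_train/y_train data and an arbitrary checkpoint file name (not the commenter's exact code):

from keras.models import load_model

# Train one epoch at a time, saving and reloading the model between iterations.
# If saving/loading is lossless, the loss should keep decreasing just as it
# does in a single uninterrupted 10-epoch run.
for i in range(10):
    model.fit(X_train, y_train, nb_epoch=1)  # `epochs=1` in Keras 2+
    model.save('checkpoint.h5')
    model = load_model('checkpoint.h5')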
Have you tried to call model.load_weights before model.compile?

Actually, sorry for my last comment: all the architectures I save and all the weights I save have unique names, and yes, I know save_weights() asks for user input when overwriting a file, but in my case it doesn't, since the files do not exist. So we can safely rule out the possibility that the file was not overwritten.

After loading your weights, when you train your model, set the parameter initial_epoch to the last epoch you previously trained. E.g. if you trained your model for 100 epochs, saved the weights after each epoch via ModelCheckpoint, and want to resume training from the 101st epoch, you should do it like this:
model.load_weights('path_to_the_last_weights_file')
model.fit(initial_epoch=100)
Keep the other parameters the same.

I experienced this issue with Keras on both the MXNet and TensorFlow backends. My solution was to switch from keras to tensorflow.keras. This obviously only works with the TensorFlow backend. However, if you are already using the TensorFlow backend, it is just a matter of changing your import statements, as the functionality of tensorflow.keras is almost identical to keras. Since switching I have not experienced this annoying bug.
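The switch is usually just a change of import paths, for example:

# Before: standalone Keras on the TensorFlow backend
# from keras.models import load_model
# from keras.layers import Dense

# After: the Keras bundled with TensorFlow
from tensorflow.keras.models import load_model
from tensorflow.keras.layers import Dense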
I had the same problem on Keras 1.2.0. It was fixed in 1.2.1.
Does this also apply to partially pretrained models? For example, if you have a network with 5 convolutional layers and you take the weights for the first 3 layers from a pretrained network (transfer learning) and set trainable=False for those layers?

Concerning your question: as I wrote, I'm new to Keras and deep learning. I'm trying to get a feel for different techniques, so I'm playing around a bit, observing the resulting effects and trying to understand the behaviour.
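For reference, the partial-freeze setup asked about above looks roughly like this (layer indices and compile arguments are illustrative):

# Freeze the first 3 (pretrained) convolutional layers so only the rest train.
for layer in model.layers[:3]:
    layer.trainable = False

# The trainable flags must be set before compiling to take effect.
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])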
It is generally not advisable to retrain a pretrained model with an altogether different optimizer than the one it was trained with. This just doesn't make any sense. My question is: do you have a valid reason for this setting, where you want to train a pre-trained network using a different optimizer like RMSProp or Adam?
@carlthome Had the same problem. Didn’t check recently for the current status but now I use vanilla cPickle to pickle my trained model. Loading the pickled model and resuming training seems to be working just as expected. However I’m not sure about the JSON + h5 weight saving/loading functionality. If you are having the same problem then there must be something wrong.
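A minimal sketch of that pickle-based workaround, assuming the Python 2-era cPickle used elsewhere in this thread (whether a given Keras version's models pickle cleanly is not guaranteed):

import cPickle as pickle  # plain `pickle` on Python 3

# Pickle the entire trained model object...
with open('model.pckl', 'wb') as f:
    pickle.dump(model, f, -1)

# ...then load it back later and keep training.
with open('model.pckl', 'rb') as f:
    model = pickle.load(f)
model.fit(X_train, y_train)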
Grasping at straws here, but some optimizers are stateful, right? Are you just using SGD? I'm not familiar with this part of Keras, but perhaps the optimizers should be saved as well, because otherwise, when you reinitiate learning and start a new epoch with pretrained weights instead of your original weight initialization, training may diverge due to high learning rates.