pytorch-lightning: continuing training from a checkpoint seems broken (high loss values), while reasonable with .eval()
I tried to load my trained model from a checkpoint for fine-tune training. On the first "on_val_step()" the output seems OK; the loss scale is the same as at the end of pre-training. But on the first "on_train_step()" the output is totally different and very bad, just as if it were training from scratch.
That behavior happens both when I:
- stop a training run in the middle and then rerun the same training with "resume from checkpoint" (see the sketch after this list), and
- manually load the model from a checkpoint after pre-training has finished, as follows:
checkpoint = torch.load(config['pre_trained_weights_checkpoint'], map_location=lambda storage, loc: storage)
experiment.load_state_dict(checkpoint['state_dict'])
(where "experiment" is my pl.LightningModule)
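For reference, a minimal sketch of the "resume from checkpoint" path from the first bullet (trainer settings are placeholders; older Lightning versions take `resume_from_checkpoint` on the Trainer, newer ones take `ckpt_path` in `trainer.fit()`):

```python
import pytorch_lightning as pl

# "experiment" is the same pl.LightningModule used for pre-training, and
# config['pre_trained_weights_checkpoint'] is the .ckpt file written by Lightning.
trainer = pl.Trainer(
    gpus=1,                 # placeholder settings
    max_epochs=100,
    # Restores weights, optimizer states, LR schedulers and the
    # epoch / global-step counters saved in the checkpoint.
    resume_from_checkpoint=config['pre_trained_weights_checkpoint'],
)
trainer.fit(experiment)     # assumes the dataloaders are defined on the module

# On newer Lightning releases the same thing is done via fit():
# trainer = pl.Trainer(max_epochs=100)
# trainer.fit(experiment, ckpt_path=config['pre_trained_weights_checkpoint'])
```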
Am I doing something wrong? What is the best practice in PL for continuing training a model from the last weights it stopped at?
Thanks.
About this issue
- State: closed
- Created 4 years ago
- Comments: 17 (5 by maintainers)
I’m having the same problem. After resume_from_checkpoint, the loss is higher than at the last step before checkpointing. Maybe the trainer does not resume the learning rate?
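One way to check that is to open the checkpoint directly and look at what Lightning stored in it; a rough sketch, assuming the usual Lightning checkpoint layout (keys like `optimizer_states` and `lr_schedulers`) and a placeholder path:

```python
import torch

# Load the checkpoint on CPU and list the top-level keys Lightning wrote.
ckpt = torch.load("path/to/last.ckpt", map_location="cpu")
print(list(ckpt.keys()))  # e.g. epoch, global_step, state_dict, optimizer_states, lr_schedulers

# Learning rate(s) of each optimizer as they were at checkpoint time.
for opt_state in ckpt.get("optimizer_states", []):
    for group in opt_state["param_groups"]:
        print("saved lr:", group["lr"])

# State of any configured LR scheduler (step counts, last LR, ...).
for sched_state in ckpt.get("lr_schedulers", []):
    print("scheduler state:", sched_state)
```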
@awaelchli Thanks for the response!
Maybe I wasn’t precise enough. The problem is that on the very first forward pass, even without any training (no loss.backward() or optimizer.step()), I already get a loss that says my model is garbage when it is configured with model.train(). Everything is OK when I use model.eval(), with the exact same code, dataloader, etc. It’s as if calling train() made the model completely useless. I guess it has something to do with the BN layers (I used DDP with sync_batch_norm), but I can’t really find the exact problem.
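One way to narrow that down is to dump the BatchNorm buffers right after loading the checkpoint and compare them with the values at the end of pre-training; a rough sketch, reusing the `experiment` / `checkpoint` names from earlier in the thread:

```python
import torch

# After experiment.load_state_dict(checkpoint['state_dict']), print the
# BatchNorm buffers. Freshly initialized buffers (running_mean == 0,
# running_var == 1, num_batches_tracked == 0) would mean the normalization
# statistics were not actually restored.
bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,
            torch.nn.BatchNorm3d, torch.nn.SyncBatchNorm)
for name, module in experiment.named_modules():
    if isinstance(module, bn_types):
        print(
            name,
            "mean[:3] =", module.running_mean[:3].tolist(),
            "var[:3] =", module.running_var[:3].tolist(),
            "batches_tracked =", int(module.num_batches_tracked),
        )
```

Keep in mind that in train() mode BatchNorm normalizes with the current batch statistics rather than these running averages, so a very small per-GPU batch (or a sync_batch_norm setup issue) can also produce a train()/eval() gap on its own.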
Did anyone face the same issue?
I’m not sure if this is related, but I think I have the same issue here. I wanted to resume training today and the loss was a lot higher than when I finished training yesterday.
Disclaimer: I’ve just started using PyTorch Lightning (thank you guys for that awesome framework!!), so perhaps I did something wrong. This is how I tried to continue the training.
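Roughly, the resume here is just passing the checkpoint path to the Trainer; the sketch below uses placeholder paths and settings, with a LearningRateMonitor added on top (not part of the original setup) to make the learning rate visible across the resume boundary. It assumes an LR scheduler is configured in configure_optimizers.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateMonitor

# Resume from the last checkpoint and log the learning rate every step, so
# the value right after resuming can be compared with the value logged just
# before training was interrupted.
trainer = Trainer(
    gpus=1,                                       # placeholder settings
    max_epochs=20,
    callbacks=[LearningRateMonitor(logging_interval="step")],
    resume_from_checkpoint="path/to/last.ckpt",   # placeholder path
)
trainer.fit(model)  # "model" stands for the LightningModule being resumed
```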
@awaelchli Could you give an example of how I’d load the correct hyperparameters? I’m using LR scheduling, and I thought the whole point of passing resume_from_checkpoint to the trainer was that it would load the LR and other hyperparameters from the checkpoint.
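Not an authoritative answer, but as far as I understand the split is: resume_from_checkpoint (or ckpt_path in newer releases) restores the trainer state, i.e. optimizer and LR-scheduler state plus the epoch/step counters, while the hyperparameters passed to the module's __init__ are restored by load_from_checkpoint, provided the module called save_hyperparameters(). A minimal sketch with an illustrative module and a placeholder path:

```python
import torch
import pytorch_lightning as pl


class LitModel(pl.LightningModule):  # illustrative module, not the actual one
    def __init__(self, lr: float = 1e-3, hidden: int = 64):
        super().__init__()
        # Stores lr / hidden in the checkpoint under "hyper_parameters".
        self.save_hyperparameters()
        self.net = torch.nn.Linear(self.hparams.hidden, 1)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
        return [optimizer], [scheduler]


# Rebuilds the module with the hyperparameters and weights saved in the
# checkpoint (placeholder path); no need to pass lr/hidden again.
model = LitModel.load_from_checkpoint("path/to/last.ckpt")
print(model.hparams.lr)

# The optimizer/scheduler state and epoch counters are restored separately,
# by handing the same path to the Trainer (resume_from_checkpoint) or to
# trainer.fit(..., ckpt_path=...) on newer versions.
```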