pytorch-lightning: continuing training from a checkpoint seems broken (high loss values), while reasonable with .eval()

I tried to load my trained model from a checkpoint for fine-tune training. On the first “on_val_step()” the output seems OK and the loss scale is the same as at the end of pre-training, but on the first “on_train_step()” the output is totally different and very bad, just like training from scratch.

That behavior happens both when I:

  1. stop a training run in the middle and then relaunch the same run with “resume from checkpoint” (see the sketch after this list)

  2. manually load the model from the checkpoint after pre-training finished, as follows:

     checkpoint = torch.load(config['pre_trained_weights_checkpoint'], map_location=lambda storage, loc: storage)
     experiment.load_state_dict(checkpoint['state_dict'])

     (where “experiment” is my pl.LightningModule)
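For reference, a minimal sketch of what I mean by option 1, using the same checkpoint path and “experiment” module as in option 2 (the resume_from_checkpoint argument is the one from the Lightning version I'm on; newer releases use trainer.fit(ckpt_path=...) instead):

import pytorch_lightning as pl

# pass the checkpoint to the Trainer so it restores the weights together with
# the epoch, optimizer state and LR-scheduler state
trainer = pl.Trainer(resume_from_checkpoint=config['pre_trained_weights_checkpoint'])
trainer.fit(experiment)  # "experiment" is the pl.LightningModule from option 2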

Am I doing something wrong? What is the best practice in PL for continuing training from the last weights a model stopped at?

Thanks.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 17 (5 by maintainers)

Most upvoted comments

I’m having the same problem. After resume_from_checkpoint, the loss is higher than at the last step before checkpointing. Maybe the trainer does not resume the learning rate?
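One way to check that (a sketch, assuming a Lightning version that ships the LearningRateMonitor callback and still accepts resume_from_checkpoint) is to log the learning rate on every step and see whether it jumps back to the initial value after resuming:

import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor

# log the current LR each step so a reset after resuming shows up in the logs
lr_monitor = LearningRateMonitor(logging_interval="step")
trainer = pl.Trainer(resume_from_checkpoint="last.ckpt", callbacks=[lr_monitor])
trainer.fit(model)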

@awaelchli Thanks for the response!

Maybe I wasn’t precise enough: the problem is that on the very first forward pass, without any training (no loss.backward() or optimizer.step()), I already get a loss that says my model is garbage when it is configured with model.train(). Everything is OK when I use model.eval() (for the exact same code, dataloader, etc.). It’s as if calling train() made my model completely useless. I guess it has something to do with the BN layers (I used DDP with sync_batch_norm), but I can’t really find the exact problem.
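This is roughly the check I run to reproduce the train()/eval() gap and to look at the BatchNorm running statistics after loading the checkpoint (Experiment, train_dataloader and loss_fn are placeholders for my own code):

import torch

# load the Lightning checkpoint and restore only the weights
checkpoint = torch.load("last.ckpt", map_location="cpu")
model = Experiment()  # construct the LightningModule the same way as for training
model.load_state_dict(checkpoint["state_dict"])

batch, target = next(iter(train_dataloader))

# eval(): BatchNorm normalises with the running statistics restored from the checkpoint
model.eval()
with torch.no_grad():
    loss_eval = loss_fn(model(batch), target)

# train(): BatchNorm normalises with the statistics of the current batch instead
model.train()
with torch.no_grad():
    loss_train = loss_fn(model(batch), target)

print("eval-mode loss:", loss_eval.item(), "train-mode loss:", loss_train.item())

# sanity-check the restored running statistics of every (Sync)BatchNorm layer
bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d, torch.nn.SyncBatchNorm)
for name, module in model.named_modules():
    if isinstance(module, bn_types):
        print(name, module.running_mean.abs().mean().item(), module.running_var.mean().item())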

Did anyone face the same issue?

I’m not sure if this is related, but I think I have the same issue here. I wanted to resume training today and the loss was a lot higher than when I finished training yesterday:

[screenshot: loss curve with a clear jump after resuming training]

Disclaimer: I’ve just started using PyTorch Lightning (thank you guys for that awesome framework!!), so perhaps I did something wrong. This is how I tried to continue the training.

import pytorch_lightning as pl

# restore the weights, then let the Trainer restore epoch/optimizer/scheduler state from the same checkpoint
model = MyModel.load_from_checkpoint(chkpt)
trainer = pl.Trainer(resume_from_checkpoint=chkpt, gpus=parser.gpus, distributed_backend=parser.distributed_backend)
trainer.fit(model, datamodule=data)

@awaelchli Could you give an example of how I’d load the correct hyperparameters? I’m using LR scheduling, and I thought the whole point of passing resume_from_checkpoint to the trainer was that it would load the LR and other hyperparameters from the checkpoint.
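In the meantime, this is what I use to peek at what the checkpoint actually stores (key names as written by the Lightning version I'm on; they may differ between versions, and the path is a placeholder):

import torch

ckpt = torch.load("my_run.ckpt", map_location="cpu")

# top-level keys; in my checkpoints these include 'state_dict', 'optimizer_states' and 'lr_schedulers'
print(list(ckpt.keys()))

# the learning rate each optimizer param group was at when the checkpoint was written
for opt_state in ckpt.get("optimizer_states", []):
    for group in opt_state["param_groups"]:
        print("lr:", group["lr"])

# raw LR-scheduler state (e.g. last_epoch for step/epoch-based schedulers)
print(ckpt.get("lr_schedulers"))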