ignite: TrainsSaver doesn't respect Checkpoint's n_saved

🐛 Bug description

As the title says, it seems that TrainsSaver bypasses the Checkpoint n_saved parameter. That means that all models are saved and never updated / deleted.

Consider this simple example:

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))

The above saves every checkpoint. You end-up with

epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...

Now if we do, the same with DiskSaver:

        task.phases['train'].add_event_handler(
            Events.EPOCH_COMPLETED(every=1),
            Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
                       global_step_transform=global_step_from_engine(task.phases['train'])))

We get only:

epoch_checkpoint_3.pt

as expected.

Same behaviour if we save only best models using score_function, i.e. TrainsSaver saves every best model.

Environment

PyTorch Version: 1.3.1
Ignite Version: 0.4.0.dev20200519 (EDIT: update to latest nightly, issue still exists)
OS: Linux
How you installed Ignite: pip nightly
Python version: 3.6
Any other relevant information: trains version: 0.14.3

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 34 (17 by maintainers)

Commits related to this issue

Change callback structure, access thorough designated class WeightsFileHandler.ModelInfo See https://github.com/pytorch/ignite/issues/1056 — committed to allegroai/clearml by deleted user 4 years ago

Most upvoted comments

Thanks @vfdev-5 ! kudos for the quick PR! We will add it into TrainsServer (probably needs to add a bit of support in Trains as well) I’ll update here once the PR is ready

bmartinn on May 23, 2020

I was also thinking about if we can pass more info to save_handler in addition to object_to_save and filename: https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21

For example, we can opt to pass some meta-info about the checkpoint to save:

class BaseSaveHandler(metaclass=ABCMeta):
    """Base class for save handlers"""

    @abstractmethod
    def __call__(self, checkpoint: Mapping, filename: str, metadata=None) -> None:
        pass

and in metadata we can pass prefix, name and all scores which compose the filename.

This certainly requires minor API change for DiskSaver and other savers. However, we recently introduced BaseSaveHandler as base class for savers, so we still can change thing now…

vfdev-5 on May 22, 2020

@bmartinn, @vfdev-5 I’ll get on it 😃

jkhenning on Jun 4, 2020

Would it be possible to instead upload the contents of the /tmp directory after the task is finished?

@achigeor this can be possible, but we should keep in mind that we can loose everything if script crashes before the task is finished and we didn’t catch that.

vfdev-5 on May 21, 2020

Hi @vfdev-5

So, it remains only the switch of save/remove order if we would like to have n_saved instead of n_saved+1 checkpoints.

Indeed 😃

@jkhenning done !

vfdev-5 on Jun 12, 2020

@vfdev-5 I’ve updated my fork’s branch to match the changes we made in Trains to support this feature - we plan to release an RC after this weekend, at which point I’ll open a PR 😃

jkhenning on Jun 10, 2020

@jkhenning could you help with a PR fix?

bmartinn on Jun 4, 2020

@vfdev-5 Great idea, and 👍 for pushing things so quickly - once we’ll add the required support in trains I’ll open a PR 😄

jkhenning on May 23, 2020

@bmartinn PR with metadata is landed. So, you may try to incorporate it into TrainsSaver and give a feedback on that…

vfdev-5 on May 23, 2020

Yes, for a cloud disk savers or other more complex savers where we can set some metadata it can be helpful and avoid parsing the filename 😃

so for example, if the model score is lower then the previous, it can only store it locally and choose to skip the upload

Normally, it is the role of Checkpoint with score function etc.

So, let us create a FR for that.

vfdev-5 on May 22, 2020

@vfdev-5 I agree, I think hard-coded is probably safer, but we should have a way to differentiate between two models being stored, for example GAN training. We brings me back to parsing… How about we inherit Checkpoint, that way we know what’s the prefix and can easily make those decisions. What do you think?

bmartinn on May 22, 2020

@vfdev-5 yes exactly! a minor detail though:

on Trains UI we have a good number of stored artifacts and their meta-data contain the name we intended to store ?

Trains separates artifacts from Models. Basically artifacts you store once, and are tied with an experiment, where as Models you store multiple times (e.g. snapshots) and have their own entity. This means we can easily track models between experiments. Unfortunately Artifacts already received the meta-data feature, and Models are still missing it, but I’m hoping it is coming soon 😉

To the point, models have names that are unrelated to the experiment that created them, (kind of like Artifacts). We can basically use these names as the meta-data. Notice that a user can always choose to rename a model, even after the experiment is completed, with no effect on the actual stored file. This gives you the possibility of renaming your model (in the UI) from “best_model_val_score=<0.XYZ>.pt” to “this is the chosen one! score 0.999.pt”

bmartinn on May 21, 2020

Thanks for the report @achigeor ! Let us reproduce the issue and have deeper look on where is the problem.

cc @jkhenning

vfdev-5 on May 20, 2020