ignite: TrainsSaver doesn't respect Checkpoint's n_saved
🐛 Bug description
As the title says, it seems that `TrainsSaver` bypasses the `Checkpoint` `n_saved` parameter. That means that all models are saved and never updated / deleted.
Consider this simple example:
```python
task.phases['train'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(to_save, TrainsSaver(output_uri=output_uri), 'epoch', n_saved=1,
               global_step_transform=global_step_from_engine(task.phases['train'])))
```
The above saves every checkpoint. You end up with:

```
epoch_checkpoint_1.pt
epoch_checkpoint_2.pt
epoch_checkpoint_3.pt
...
```

Now, if we do the same with `DiskSaver`:
```python
task.phases['train'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(to_save, DiskSaver(dirname=dirname), 'epoch', n_saved=1,
               global_step_transform=global_step_from_engine(task.phases['train'])))
```
We get only `epoch_checkpoint_3.pt`, as expected.
The behaviour is the same if we save only the best models using `score_function`, i.e. `TrainsSaver` saves every best model.
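For context on where this probably goes wrong: as far as I understand the ignite 0.4 API, `Checkpoint` enforces `n_saved` by calling the save handler's `remove()` on the oldest checkpoint, so a handler whose `remove()` is missing or a no-op will accumulate files exactly as shown above. A minimal sketch of that contract (purely illustrative, not ignite's actual `DiskSaver`):

```python
# Toy saver showing the save/remove contract that n_saved relies on
# (illustrative only; not the real DiskSaver implementation).
import os

import torch
from ignite.handlers.checkpoint import BaseSaveHandler


class ToyDiskSaver(BaseSaveHandler):
    def __init__(self, dirname):
        self.dirname = dirname
        os.makedirs(dirname, exist_ok=True)

    def __call__(self, checkpoint, filename):
        # Persist the checkpoint dict under the filename chosen by Checkpoint.
        torch.save(checkpoint, os.path.join(self.dirname, filename))

    def remove(self, filename):
        # Called by Checkpoint once more than n_saved checkpoints exist;
        # if this does nothing, old files pile up as reported above.
        os.remove(os.path.join(self.dirname, filename))
```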
Environment
- PyTorch Version: 1.3.1
- Ignite Version: 0.4.0.dev20200519 (EDIT: updated to the latest nightly, the issue still exists)
- OS: Linux
- How you installed Ignite: pip nightly
- Python version: 3.6
- Any other relevant information: trains version: 0.14.3
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (17 by maintainers)
Commits related to this issue
- Change callback structure, access thorough designated class WeightsFileHandler.ModelInfo See https://github.com/pytorch/ignite/issues/1056 — committed to allegroai/clearml by deleted user 4 years ago
Thanks @vfdev-5 ! Kudos for the quick PR! We will add it into TrainsServer (probably needs to add a bit of support in Trains as well). I'll update here once the PR is ready.

I was also thinking about whether we can pass more info to `save_handler` in addition to `object_to_save` and `filename`: https://github.com/pytorch/ignite/blob/c012166f93e56f8e9538741f5745a5010983ba38/ignite/handlers/checkpoint.py#L21 For example, we can opt to pass some meta-info about the checkpoint to save, and in `metadata` we can pass `prefix`, `name` and all the scores which compose the filename. This certainly requires a minor API change for `DiskSaver` and other savers. However, we recently introduced `BaseSaveHandler` as a base class for savers, so we can still change things now…
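To make the proposal concrete, a rough sketch of what a metadata-aware saver could look like from the saver's side (the keys shown are hypothetical examples, not a final API):

```python
# Hypothetical sketch of the proposed metadata argument; the dict keys are
# only examples of what Checkpoint might pass, not a settled interface.
from ignite.handlers.checkpoint import BaseSaveHandler


class MetadataAwareSaver(BaseSaveHandler):
    def __call__(self, checkpoint, filename, metadata=None):
        # metadata could carry e.g. {"prefix": "best", "name": "model", "score": 0.97},
        # letting a backend such as Trains label the stored model without
        # having to parse `filename`.
        ...

    def remove(self, filename):
        ...
```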
@bmartinn, @vfdev-5 I'll get on it 😃
@achigeor this can be possible, but we should keep in mind that we can lose everything if the script crashes before the task is finished and we don't catch that.

@jkhenning done!
@vfdev-5 I’ve updated my fork’s branch to match the changes we made in Trains to support this feature - we plan to release an RC after this weekend, at which point I’ll open a PR 😃
@jkhenning could you help with a PR fix?
@vfdev-5 Great idea, and 👍 for pushing things so quickly - once we add the required support in Trains I'll open a PR 😄
@bmartinn the PR with metadata has landed, so you may try to incorporate it into `TrainsSaver` and give feedback on that…

Yes, for cloud disk savers or other more complex savers where we can set some metadata, it can be helpful and avoids parsing the filename 😃
Normally, that is the role of `Checkpoint` with a score function etc. So, let us create a FR for that.
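For reference, a minimal sketch of the Checkpoint-with-score-function pattern being discussed (everything here is illustrative; `model` and `evaluator` are assumed to exist); with a `filename_prefix` it produces filenames like the `best_model_val_score=...` one mentioned below:

```python
# Illustrative only: keep the single best checkpoint according to a validation
# metric; the score becomes part of the filename, e.g. best_model_val_score=0.9712.pt.
from ignite.engine import Events
from ignite.handlers import Checkpoint, DiskSaver

best_handler = Checkpoint(
    {"model": model},                      # `model` is assumed to exist
    DiskSaver(dirname="checkpoints"),
    filename_prefix="best",
    n_saved=1,
    score_function=lambda engine: engine.state.metrics["val_score"],
    score_name="val_score",
)
# `evaluator` is assumed to be the validation engine that computes val_score.
evaluator.add_event_handler(Events.COMPLETED, best_handler)
```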
@vfdev-5 I agree, I think hard-coded is probably safer, but we should have a way to differentiate between two models being stored, for example in GAN training. Which brings me back to parsing… How about we inherit `Checkpoint`? That way we know what the prefix is and can easily make those decisions. What do you think?
@vfdev-5 yes exactly! a minor detail though:
Trains separates artifacts from Models. Basically, artifacts you store once and they are tied to an experiment, whereas Models you store multiple times (e.g. snapshots) and they have their own entity. This means we can easily track models between experiments. Unfortunately, Artifacts already received the meta-data feature and Models are still missing it, but I'm hoping it is coming soon 😉
To the point, models have names that are unrelated to the experiment that created them (kind of like Artifacts). We can basically use these names as the meta-data. Notice that a user can always choose to rename a model, even after the experiment is completed, with no effect on the actual stored file. This gives you the possibility of renaming your model (in the UI) from "best_model_val_score=<0.XYZ>.pt" to "this is the chosen one! score 0.999.pt".

Thanks for the report @achigeor ! Let us reproduce the issue and have a deeper look at where the problem is.
cc @jkhenning