MONAI: Add interrupt/terminate option to Trainers & Evaluators
Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling the API (rather than pressing Ctrl+C). This would be helpful when the Trainer is executed remotely, such as in federated learning (FL) scenarios.
Describe the solution you’d like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to continue later from where it was aborted by calling run() again.
For example, an ignite-based Trainer that supports abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL’s MonaiAlgo class; private repo, contact me if you need access):
from ignite.engine import Events  # needed for Events.EPOCH_COMPLETED below

def abort(self):
    self.trainer.terminate()
    # save the current dataloader iterator so the next round can continue from it
    setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)
    if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
        # if the current iteration is the end of an epoch, manually trigger the epoch completed event
        self.trainer._fire_event(Events.EPOCH_COMPLETED)

def finalize(self):
    self.trainer.terminate()
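The difference between the two calls can be illustrated with a minimal, library-free sketch. Everything here is hypothetical (`ToyTrainer`, its attributes and the `on_batch` callback are invented for illustration and are not MONAI or Ignite APIs). It only shows the requested semantics: abort() keeps the data iterator so that run() continues with the next batch, while finalize() discards it and refuses to resume.

```python
class ToyTrainer:
    """Hypothetical stand-in for a trainer; not MONAI or Ignite code."""

    def __init__(self, data, max_epochs):
        self.data = data
        self.max_epochs = max_epochs
        self.epoch = 0            # number of completed epochs
        self.iteration = 0        # number of processed batches
        self.seen = []            # batches processed so far, in order
        self._data_iter = None    # kept across abort() so run() can resume
        self._stop = False
        self.finalized = False

    def abort(self):
        # Pause after the current batch; iterator and counters are kept.
        self._stop = True

    def finalize(self):
        # Stop for good: drop the iterator so nothing can be resumed.
        self._stop = True
        self.finalized = True
        self._data_iter = None

    def run(self, on_batch=None):
        if self.finalized:
            raise RuntimeError("trainer was finalized and cannot be resumed")
        self._stop = False
        while self.epoch < self.max_epochs:
            if self._data_iter is None:
                self._data_iter = iter(self.data)  # fresh epoch
            for batch in self._data_iter:
                self.iteration += 1
                self.seen.append(batch)
                if on_batch is not None:
                    on_batch(self, batch)
                if self._stop:
                    return  # iterator stays where it is (unless finalized)
            # iterator exhausted: the epoch is complete
            self._data_iter = None
            self.epoch += 1


def stop_at_third_batch(trainer, batch):
    if trainer.iteration == 3:
        trainer.abort()


trainer = ToyTrainer(data=["a", "b", "c", "d"], max_epochs=2)
trainer.run(on_batch=stop_at_third_batch)  # aborted in the middle of epoch 1
first_pass = list(trainer.seen)
trainer.run()                              # continues with batch "d"
```

If abort() lands exactly on the last batch of an epoch, the exhausted iterator is kept and the next run() immediately closes that epoch, which is the case the snippet above handles by firing EPOCH_COMPLETED manually.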
Describe alternatives you’ve considered n/a
Additional context n/a
About this issue
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 29 (24 by maintainers)
Commits related to this issue
- Use Ignite's interrupt api in MonaiAlgo (#5071) Fixes #4554. Requires latest Ignite RC. Also improves ClientAlgo docstrings formatting. ### Description A few sentences describing the chang... — committed to Project-MONAI/MONAI by holgerroth 2 years ago
- Use Ignite's interrupt api in MonaiAlgo (#5071) Fixes #4554. Requires latest Ignite RC. Also improves ClientAlgo docstrings formatting. ### Description A few sentences describing the chang... — committed to yashika-git/MONAI by holgerroth 2 years ago
Hi @Nic-Ma
We will try to land the code this week and it will be on master and nightly release. Next, we’ll schedule our regular 0.4.10 release where this feature will be present if tests on nightly from your side can confirm that we are good.
Thanks
@Nic-Ma allocated memory can be released (context memory maybe not):
@holgerroth I have a few questions about Engine.interrupt (== abort from MonaiAlgo). More details on this question:
Engine.run, then before interrupting the engine we see the following events: where numbers (1) indicate epoch and iteration indices.
Engine.interrupt() from a handler it would trigger INTERRUPT event and exit Engine.run method while properly storing dataloader_iter for resuming. For an interruption in a middle of epoch, it would be
Engine.run, should we see once again the events STARTED, EPOCH_STARTED? I would say that we may want to skip these events: STARTED, EPOCH_STARTED (12).
Any particular reason to store dataloader_iter in state when engine is interrupted? About the code: https://github.com/Project-MONAI/monai-fl/blob/ab28be402d48687ed9d42f6c8afa1c0cda7e70b2/monaifl/monai_algo.py#L175-L177
Can’t we instead call engine.interrupt() directly on Events.EPOCH_COMPLETED such that all necessary handlers were already executed? What do you think?
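To make the event-ordering question concrete, here is a toy sketch of the behaviour proposed above, where a resumed run skips STARTED and EPOCH_STARTED. `ToyEngine`, `interrupt_at` and the event log are hypothetical names invented for this illustration. The code does not use Ignite and makes no claim about what Ignite's Engine actually fires.

```python
class ToyEngine:
    """Hypothetical event-logging engine; not the Ignite Engine."""

    def __init__(self, data, max_epochs, interrupt_at=None):
        self.data = data
        self.max_epochs = max_epochs
        self.interrupt_at = interrupt_at  # iteration index that triggers an interrupt
        self.epoch = 0
        self.iteration = 0
        self.log = []                     # (event name, epoch, iteration)
        self._iter = None
        self._resuming = False

    def _fire(self, name):
        self.log.append((name, self.epoch, self.iteration))

    def run(self):
        resuming = self._resuming
        self._resuming = False
        if not resuming:
            self._fire("STARTED")
        while resuming or self.epoch < self.max_epochs:
            if not resuming:
                self.epoch += 1
                self._fire("EPOCH_STARTED")
                self._iter = iter(self.data)
            resuming = False  # only the first loop pass is a resume
            for _ in self._iter:
                self.iteration += 1
                self._fire("ITERATION_COMPLETED")
                if self.iteration == self.interrupt_at:
                    self.interrupt_at = None  # interrupt only once
                    self._resuming = True
                    self._fire("INTERRUPT")
                    return
            self._fire("EPOCH_COMPLETED")
        self._fire("COMPLETED")


engine = ToyEngine(data=[0, 1, 2], max_epochs=1, interrupt_at=2)
engine.run()
first_run = [name for name, _, _ in engine.log]
engine.run()
second_run = [name for name, _, _ in engine.log[len(first_run):]]
```

In this sketch the second run() only logs the remaining ITERATION_COMPLETED, then EPOCH_COMPLETED and COMPLETED, so handlers attached to STARTED or EPOCH_STARTED are not executed twice for the same epoch.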
Hi @vfdev-5 ,
When you have the first RC release, please let @holgerroth and me know so that we can test it ASAP.
Thanks in advance.
not yet, but soon 😃
sounds good 👍
Sorry for the delayed reply @Nic-Ma, our plan is to release around the end of August.
@Nic-Ma let us try to include this feature in our next release
Hi @vfdev-5, thanks for the consideration! In mid-September we plan to have a major milestone and ideally we could include this feature from pytorch-ignite.
Unfortunately, the Engine refactoring is too complicated to release quickly. It has been ongoing for some time already and I feel that we have to change the strategy for adopting such large changes… Maybe we could just add the feature you would like first and do the refactoring a bit later. I’ll let you know once we have a concrete schedule.
Hi @Nic-Ma, this feature is already in the plan for the Engine refactoring. We have to push forward a bit more on that.
Hi everyone, thanks for pinging.
Yes, an interrupt feature may be nice to have in ignite. Today, it is possible to resume training from a stopped trainer (under certain assumptions). I’ll give an example below.
As for pausing and resuming the current training iteration, this can be a bit tricky regarding the input dataloader. Basically, we have to store the iterator and make sure that resuming continues iterating over new batches and not from the beginning…
Let me prioritize this feature from ignite side.
Run/Resume Ignite Engine logic:
Thanks @holgerroth and @Nic-Ma for bringing this topic for the discussion.
This is one of the desired features we are waiting for.
The interrupt calls should be supported by Ignite trainer for MONAI apps.
Considering complete training or evaluation cycles on remote machines in FL environments, the following interrupt calls could be handy:
NVIDIA, MONAI, and Ignite developers can suggest better and more compliant naming conventions for these functionalities.