MONAI: Add interrupt/terminate option to Trainers & Evaluators

Is your feature request related to a problem? Please describe. It should be possible to abort/finalize a running Trainer by calling an API (rather than pressing Ctrl+C). This would be helpful when the Trainer needs to be executed remotely, such as in federated learning (FL) scenarios.

Describe the solution you’d like Add abort() and finalize() functions to the Trainer class (or potentially its base class). Note that finalize() should terminate the training completely, while abort() should allow training to continue later from where it was aborted, by calling run() again.

For example, an ignite-based Trainer supporting abort() and finalize() calls could be implemented as follows (currently used in MONAI-FL’s MonaiAlgo class; private repo - contact me if you need access):

    # Note: self.trainer is an ignite Engine; Events is ignite.engine.Events.
    def abort(self):
        # stop the engine after the current iteration
        self.trainer.terminate()
        # save the current dataloader iterator so the next round can resume
        setattr(self.trainer.state, "dataloader_iter", self.trainer._dataloader_iter)

        if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
            # if the current iteration ends an epoch, manually trigger the
            # epoch-completed event so its handlers still run
            self.trainer._fire_event(Events.EPOCH_COMPLETED)

    def finalize(self):
        # terminate training completely; a later run() restarts rather than resumes
        self.trainer.terminate()
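
A minimal usage sketch of this pattern (algo and train_loader are illustrative names, not part of the code above; whether the second run() actually resumes mid-epoch depends on the Engine honoring the stored dataloader_iter, which is what this issue asks for):

    from ignite.engine import Events

    # algo is a hypothetical wrapper exposing the abort()/finalize() methods above
    @algo.trainer.on(Events.ITERATION_COMPLETED(once=100))
    def stop_early(engine):
        algo.abort()  # terminate() takes effect after the current iteration

    algo.trainer.run(train_loader, max_epochs=10)  # exits early at iteration 100
    algo.trainer.run(train_loader)                 # intended to resume, not restart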

Describe alternatives you’ve considered n/a

Additional context n/a

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 29 (24 by maintainers)

Most upvoted comments

Hi @Nic-Ma

We will try to land the code this week; it will be on master and in the nightly release. Next, we’ll schedule our regular 0.4.10 release, where this feature will be present if tests on the nightly from your side confirm that we are good.

Thanks

@Nic-Ma allocated memory can be released (context memory maybe not):

    import torch

    print("-", torch.cuda.memory_allocated() / 1024 / 1024)    # - 0.0

    x = torch.rand(100, 100, 100, 100, device="cuda")
    print("--", torch.cuda.memory_allocated() / 1024 / 1024)   # -- 382.0

    x = None
    print("---", torch.cuda.memory_allocated() / 1024 / 1024)  # --- 0.0
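
As a follow-up sketch (continuing the same session as above): the caching allocator may still hold the freed blocks as reserved memory; torch.cuda.empty_cache() returns them to the driver, while the CUDA context itself stays resident:

    print(torch.cuda.memory_reserved() / 1024 / 1024)   # cached blocks may remain
    torch.cuda.empty_cache()                            # release the cache to the driver
    print(torch.cuda.memory_reserved() / 1024 / 1024)   # close to 0 afterwards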

@holgerroth I have a few questions about Engine.interrupt (== abort from MonaiAlgo).

  1. While resuming from an interrupted state (e.g., in the middle of an epoch), do we expect to see Events.STARTED and Events.EPOCH_STARTED again?

More details on this question:

  • When we first call Engine.run and before interrupting the engine, we see the following events:
STARTED, EPOCH_STARTED (1), ..., ITERATION_STARTED (1), ITERATION_COMPLETED (1), ..., EPOCH_COMPLETED (1), ...

where numbers (1) indicate epoch and iteration indices.

  • When we call Engine.interrupt() from a handler, it would trigger an INTERRUPT event and exit the Engine.run method while properly storing dataloader_iter for resuming. For an interruption in the middle of an epoch, it would be
EPOCH_STARTED (12) ... ITERATION_STARTED (123), ITERATION_COMPLETED (123), INTERRUPT
  • When we resume the run by calling Engine.run again, should we see the events STARTED, EPOCH_STARTED once more?
STARTED, EPOCH_STARTED (12), ITERATION_STARTED (124), ITERATION_COMPLETED (124), ...

I would say that we may want to skip these events: STARTED, EPOCH_STARTED (12).

  2. Is there any particular reason to store dataloader_iter in the state when the engine is interrupted?

  3. About the code: https://github.com/Project-MONAI/monai-fl/blob/ab28be402d48687ed9d42f6c8afa1c0cda7e70b2/monaifl/monai_algo.py#L175-L177

            if self.trainer.state.iteration % self.trainer.state.epoch_length == 0:
                # if current iteration is end of 1 epoch, manually trigger epoch completed event
                self.trainer._fire_event(Events.EPOCH_COMPLETED)

Can’t we instead call engine.interrupt() directly on Events.EPOCH_COMPLETED, so that all the necessary handlers have already been executed?

What do you think ?
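
For illustration, a minimal sketch of that alternative (Engine.interrupt() is the proposed API under discussion in this thread, and stop_requested is a hypothetical flag set by the FL controller):

    from ignite.engine import Engine, Events

    @trainer.on(Events.EPOCH_COMPLETED)
    def maybe_interrupt(engine: Engine):
        # interrupt only at epoch boundaries, so EPOCH_COMPLETED handlers
        # (metrics, checkpointing, ...) have already run
        if stop_requested:
            engine.interrupt()  # proposed method, not yet released at this point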

Hi @vfdev-5 ,

When you have the first RC release, please let @holgerroth and me know and test ASAP.

Thanks in advance.

Is the interrupt / resume feature already in the dev branch of ignite?

not yet, but soon 😃

I think maybe @holgerroth can help test it before you release it.

sounds good 👍

Sorry for the delayed reply @Nic-Ma, our plan is to release around the end of August.

@Nic-Ma let us try to include this feature to our next upcoming release

Hi @vfdev-5, thanks for the consideration! In mid-September we plan to have a major milestone, and ideally we could include this feature from pytorch-ignite.

Unfortunately, the Engine refactoring may be too complicated to release quickly. It has been ongoing for some time already, and I feel that we have to change our strategy for adopting such large changes… Maybe we could just add the feature you would like first and do the refactoring a bit later. I’ll let you know once we have a concrete schedule.

Hi @Nic-Ma , this feature has already been part of the plan for the Engine refactoring. We have to push forward a bit more on that.

Hi everyone, thanks for pinging.

Yes, an interrupt feature may be nice to have in ignite. Today, it is already possible to resume training from a stopped trainer (under certain assumptions). I’ll give an example below.

As for pausing and resuming the current training iteration, this can be a bit tricky regarding the input dataloader. Basically, we have to store the iterator and make sure that resuming continues iterating over new batches and not from the beginning…
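
A toy sketch of that iterator concern, with plain Python standing in for a DataLoader iterator:

    it = iter(range(10))   # stands in for iter(dataloader)

    for batch in it:
        if batch == 4:     # simulate an interruption mid-epoch
            break

    print(next(it))        # 5 -- the stored iterator resumes, not restarts at 0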

Let me prioritize this feature from ignite side.

Run/Resume Ignite Engine logic:

# (re)start from 0 to 5
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# continue from 5 to 7
engine.run(data, max_epochs=7) -> Engine run resuming from iteration 50, epoch 5 until 7 epochs => state.epoch=7

# error
engine.run(data, max_epochs=4) -> ValueError: Argument max_epochs should be larger than the start epoch

# restart from 0 to 7 (since state.epoch == max_epochs(=7), a restart is the expected behavior: it matches the usual pattern of calling evaluator.run(data) with no other instructions)
engine.run(data, max_epochs=7) -> Engine run starting with max_epochs=7 => state.epoch=7

# forced restart from 0 to 5
engine.state.max_epochs = None
engine.run(data, max_epochs=5) -> Engine run starting with max_epochs=5 => state.epoch=5

# forced restart from 0 to 9, instead of continue from state.epoch=7
engine.state.max_epochs = None
engine.run(data, max_epochs=9) -> Engine run starting with max_epochs=9 => state.epoch=9
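
A runnable sketch of these semantics with a trivial process function (assumes pytorch-ignite is installed):

    from ignite.engine import Engine

    data = list(range(10))
    engine = Engine(lambda e, batch: None)  # no-op process function

    engine.run(data, max_epochs=5)   # fresh start: state.epoch == 5
    engine.run(data, max_epochs=7)   # resumes from epoch 5, runs to 7

    engine.state.max_epochs = None   # force a restart from scratch
    engine.run(data, max_epochs=9)   # fresh start: state.epoch == 9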

Thanks @holgerroth and @Nic-Ma for bringing this topic up for discussion.

This is one of the desired features we are waiting for.

The interrupt calls should be supported by the Ignite trainer for MONAI apps.

Considering complete training or evaluation cycles on remote machines in FL environments, the following interrupt calls could be handy (a hypothetical sketch follows after the list):

  1. trainer_pause --> to pause the current training iteration
  2. trainer_resume --> to resume from last paused training iteration
  3. trainer_abort --> to terminate current iteration and discard any update in the model_state_dict
  4. evaluator_abort --> to terminate current evaluation and discard any computed metric

NVIDIA, MONAI, and Ignite developers can suggest better and more compliant naming conventions for these functionalities.
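
For reference, a hypothetical sketch of how these calls could sit on top of the abort()/finalize() pattern shown earlier (all names are illustrative, not an agreed API):

    class FLTrainerControl:
        """Hypothetical wrapper; trainer is an ignite Engine."""

        def __init__(self, trainer):
            self.trainer = trainer

        def trainer_pause(self):
            # like abort() above: stop, but keep the dataloader position
            self.trainer.terminate()

        def trainer_resume(self, data):
            # call run() again; mid-epoch resume needs the interrupt/resume
            # support discussed in this issue
            self.trainer.run(data)

        def trainer_abort(self):
            # terminate and discard updates, e.g. by reloading the last
            # global model_state_dict afterwards (not shown)
            self.trainer.terminate()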