pytorch-lightning: Allow users to provide custom exception handling

🚀 Feature

Allow users to provide custom exception handling via a new callback hook, similar to on_keyboard_interrupt.

Motivation

Users should be able to implement their own error handling if they want.

Pitch

Create a new callback hook here: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/trainer.py#L507-L515
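The linked lines are the trainer's exception handling around the training run. A minimal sketch of how such a hook could be dispatched is below. It is a toy model, not Lightning's actual code: `MiniTrainer`, this `Callback` base class, and the `on_exception` name and signature are all assumptions made for illustration.

```python
class Callback:
    """Base class with no-op hooks (hypothetical; only mirrors the shape of a callback API)."""

    def on_keyboard_interrupt(self, trainer):
        pass

    def on_exception(self, trainer, exception):  # hypothetical new hook
        pass


class MiniTrainer:
    """Toy stand-in for a trainer; not Lightning's Trainer."""

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])
        self.interrupted = False

    def fit(self, train_fn):
        try:
            train_fn()
        except KeyboardInterrupt:
            # Existing behaviour in this sketch: notify callbacks and shut down quietly.
            self.interrupted = True
            for cb in self.callbacks:
                cb.on_keyboard_interrupt(self)
        except Exception as exc:
            # Proposed behaviour: let every callback see the error, then re-raise it
            # so the failure is still visible to the caller.
            for cb in self.callbacks:
                cb.on_exception(self, exc)
            raise
```

Re-raising after the callbacks run keeps the current failure semantics while giving user code a chance to react first.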

Alternatives

Additional context



About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 17 (14 by maintainers)

Most upvoted comments

I was in the middle of creating my own ticket for this when I saw yours, so I will add the complementary information I would have included there.

Proposed refactoring or deprecation

feat: add a callback hook that runs whenever a crash happens. This could be implemented in the trainer in the same way as the keyboard interrupt callback.

The teardown callback does not appear to be called when training crashes, only when it completes. This is probably because the error is re-raised, which halts the script before teardown is reached. A separate callback feels more appropriate than forcing the use of teardown.

Motivation

The motivation behind this is to allow running code on failure of training. Use cases:

  • Add Telegram/Slack notifications on training failure
  • Print useful debugging information through a debugging callback
  • Specific MLFlow/WANDB/etc. logging added on crash
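As an illustration of the notification use case, here is a small sketch of what a user-side callback might look like. The `on_exception` hook name and its `(trainer, pl_module, exception)` signature are assumptions, `FailureNotifier` and `send` are hypothetical names, and the class is kept free of any Lightning import so it runs on its own. In real use it would subclass the library's callback base class.

```python
import traceback


class FailureNotifier:
    """Hypothetical callback: forwards a crash report to a user-supplied sender."""

    def __init__(self, send):
        # `send` is any callable taking a string, e.g. a wrapper around a
        # Slack or Telegram client (not shown here).
        self.send = send

    def on_exception(self, trainer, pl_module, exception):  # assumed hook name and signature
        report = "".join(
            traceback.format_exception(type(exception), exception, exception.__traceback__)
        )
        self.send(f"Training failed with {type(exception).__name__}: {exception}\n{report}")


# Usage: collect messages in a list instead of calling a real messaging API.
messages = []
notifier = FailureNotifier(messages.append)
try:
    raise RuntimeError("CUDA out of memory")
except RuntimeError as err:
    notifier.on_exception(trainer=None, pl_module=None, exception=err)
```

Passing the sender in as a callable keeps the callback independent of any particular messaging or logging service, so the same class could cover the debugging and experiment-tracking cases as well.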

@aurelien-clu: @daniellepintz has dibs because she opened the feature request, next in line is @yopknopixx, but y'all can collaborate with discussion, testing and whoever makes the PR ❤️ thanks for your interest in this issue

Hi @yopknopixx I think this might be a good issue for you - https://github.com/PyTorchLightning/pytorch-lightning/issues/8313 LMK what you think! Feel free to start working on it even though it is assigned to me

Hi @yopknopixx I've already started so I'd prefer to finish this one. But I'm sure we can find you another issue to work on! @ananthsub do you happen to know of any good issues? Meanwhile I will look for one

Hey @daniellepintz,

Assigned this ticket to you and added to the current sprint.

Best, T.C

that sounds good to me!