SoftTeacher: Model training stops after validation at 4000 iterations

After training for 4000 iterations, validation runs, and then training stops with the following error:

raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
         tools/train.py FAILED         
=======================================
Root Cause:
[0]:
  time: 2021-09-22_05:54:53
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2210236)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
  <NO_OTHER_FAILURES>
***************************************

I am training with 2 GPUs. Do you have any insight into why this error is being thrown?

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 19

Most upvoted comments

The script does only two things: 1) it prepares the data split for the partial setting on COCO, and 2) it converts image_info_unlabeled2017.json to instances_unlabeled2017.json. So it makes no sense to run it when adding any other dataset.
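
To make those two steps concrete, here is a rough Python sketch of what they might look like. This is not the repository's actual script: the function names `split_partial` and `to_instances_format` are hypothetical, and it assumes that the conversion amounts to giving the image-info file an empty `annotations` list plus the `categories` of a labeled COCO file.

```python
import copy
import random


def split_partial(anno, percent, seed=0):
    """Split a COCO-style dict into (labeled, unlabeled) parts.

    Hypothetical helper: `percent` of the images are kept as labeled,
    the rest are treated as unlabeled.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    image_ids = sorted(img["id"] for img in anno["images"])
    n_labeled = int(len(image_ids) * percent / 100)
    labeled_ids = set(rng.sample(image_ids, n_labeled))

    def subset(keep_labeled):
        images = [
            img for img in anno["images"]
            if (img["id"] in labeled_ids) == keep_labeled
        ]
        keep = {img["id"] for img in images}
        return {
            "images": images,
            "annotations": [
                a for a in anno["annotations"] if a["image_id"] in keep
            ],
            "categories": copy.deepcopy(anno["categories"]),
        }

    return subset(True), subset(False)


def to_instances_format(image_info, categories):
    """Give an image-info dict the keys a COCO 'instances' file has.

    Assumption: the unlabeled file only lacks `annotations` and
    `categories`, so we add an empty list and borrow the categories.
    """
    out = copy.deepcopy(image_info)
    out["categories"] = copy.deepcopy(categories)
    out["annotations"] = []
    return out
```

Both steps depend on COCO's own files and category list, which is why running the script on a different dataset has nothing useful to do.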