SoftTeacher: Model training stops after validation after 4000 iterations
After 4000 training iterations, validation runs, and then training stops with the following error:
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
tools/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-09-22_05:54:53
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 2210236)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
I am training with 2 GPUs. Do you have any insight into why this error is being thrown?
About this issue
- State: closed
- Created 3 years ago
- Comments: 19
The script only does two things: 1) it prepares the data split for the partially labeled setting on COCO, and 2) it converts image_info_unlabeled2017.json to instances_unlabeled2017.json. So it makes no sense to run it when adding any other dataset.
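For anyone wondering what step 2 amounts to, here is a minimal sketch of that kind of conversion. It is not the repository's actual script: the function names `to_instances_format` and `convert_file` are hypothetical, and it assumes the conversion only needs to add an empty "annotations" list and a "categories" list so that the image-only file has the same top-level keys as a COCO instances file.

```python
import json
from pathlib import Path


def to_instances_format(image_info: dict, categories: list) -> dict:
    """Wrap image-only COCO metadata in an instances-style dictionary.

    Assumption: an instances file needs "images", "annotations" and
    "categories"; unlabeled images simply get an empty annotation list.
    """
    converted = dict(image_info)       # shallow copy, keeps any other top-level keys
    converted["annotations"] = []      # unlabeled data has no boxes
    converted["categories"] = categories
    return converted


def convert_file(src: Path, dst: Path, categories: list) -> None:
    """Read an image_info-style JSON file and write the converted version."""
    with src.open() as f:
        image_info = json.load(f)
    with dst.open("w") as f:
        json.dump(to_instances_format(image_info, categories), f)
```

In practice the category list would come from a labeled annotation file for the same dataset. A custom dataset already in instances format has nothing to convert here, which fits the point above that the script is COCO-specific.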