apex: Training gets stuck when using SyncBN
DistributedDataParallel works great for me. But when I use it together with synchronized batch normalization (either the Python version or the optimized version), training gets stuck after a few iterations and the code prints the following warning:
/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))
Any idea how I should debug it?
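For context, the setup looks roughly like the sketch below. This is not my actual training script: the model, hyperparameters, and argument handling are placeholders, and it assumes the script is launched with `torch.distributed.launch` so the usual environment variables are set.

```python
# Rough sketch of the setup (placeholder model; assumes launch via
# `python -m torch.distributed.launch --nproc_per_node=N train.py`).
import argparse
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel, convert_syncbn_model

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).cuda()
model = convert_syncbn_model(model)   # BatchNorm2d -> apex SyncBatchNorm
model = DistributedDataParallel(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):              # gets stuck after a few iterations
    x = torch.randn(4, 3, 32, 32).cuda()
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```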
About this issue
- State: open
- Created 6 years ago
- Comments: 19 (2 by maintainers)
I think this issue is related to `process_group = group_creator()` in `optimized_sync_batchnorm_kernel.py`. In `parallel/__init__.py`, you set `group_creator` to `new_group` if `get_default_group` is not available. However, I don't think that's a good idea. `get_default_group` is not available in PyTorch 1.0, so that creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway. We may not need that line.

After I removed that line and `process_group` in both `torch.distributed.all_reduce` and `torch.distributed.all_gather`, the training now works, even with `tqdm`.
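To illustrate the idea (this is not apex's actual kernel code, just a minimal self-contained sketch with a hypothetical `sync_mean` helper, run as a single process with the gloo backend): the problematic pattern creates a new process group on every call, whereas the workaround simply omits the `group` argument so the collectives use the default group set up by `init_process_group`.

```python
import os
import torch
import torch.distributed as dist

def sync_mean(x):
    # Problematic pattern (what the removed line effectively did):
    #   group = dist.new_group()            # new group on every forward call;
    #   dist.all_reduce(x, group=group)     # new_group is itself a collective
    #                                       # and can hang across ranks.
    # Workaround described above: omit the group argument so the default
    # process group is used.
    total = x.clone()
    dist.all_reduce(total)                                   # default group
    gathered = [torch.empty_like(x) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, x)                             # default group
    return total / dist.get_world_size()

if __name__ == "__main__":
    # Single-process setup purely for illustration.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(sync_mean(torch.ones(4)))
    dist.destroy_process_group()
```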