apex: Training gets stuck when using SyncBN
`DistributedDataParallel` works great for me. But when I use it together with synchronized batch normalization (either the Python version or the optimized version), training gets stuck after a few iterations and the code gives the following warning:
```
/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
```
Any idea how I should debug it?
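
For context, my setup is roughly like the following minimal sketch (the toy model, NCCL backend, and hyperparameters here are just stand-ins for my actual code):

```python
import torch
import torch.nn as nn
from apex.parallel import DistributedDataParallel, convert_syncbn_model

# Launched with torch.distributed.launch, so RANK/WORLD_SIZE come from the environment.
torch.distributed.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(torch.distributed.get_rank() % torch.cuda.device_count())

# Toy model standing in for the real network.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.BatchNorm2d(8),   # replaced by apex SyncBatchNorm below
    nn.ReLU(),
).cuda()

model = convert_syncbn_model(model)       # swap nn.BatchNorm* for SyncBatchNorm
model = DistributedDataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(100):
    x = torch.randn(4, 3, 32, 32).cuda()
    loss = model(x).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```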
I think this issue is related to `process_group = group_creator()` in `optimized_sync_batchnorm_kernel.py`. In `parallel/__init__.py`, you set `group_creator` to `new_group` if `get_default_group` is not available. However, I don't think that's a good idea: `get_default_group` is not available in PyTorch 1.0, so this creates a new group every time we call the sync BN forward function! It looks like we are using the default group anyway, so we may not need that line.

After I removed that line and the `process_group` argument in both `torch.distributed.all_reduce` and `torch.distributed.all_gather`, the training now works, even with `tqdm`. A rough sketch of the change is below.
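
For reference, this is approximately what the change looks like. The function below is a simplified, hypothetical stand-in for the reduction code in `optimized_sync_batchnorm_kernel.py`, not a verbatim patch:

```python
import torch
import torch.distributed as dist

def sync_mean_var(local_mean, local_var, world_size):
    # Before the change, each forward call built a fresh group, e.g.:
    #   process_group = group_creator()   # resolves to new_group() on PyTorch 1.0
    #   dist.all_reduce(local_mean, dist.ReduceOp.SUM, process_group)
    #   dist.all_reduce(local_var, dist.ReduceOp.SUM, process_group)
    # After the change, the collectives simply use the default process group:
    dist.all_reduce(local_mean, op=dist.ReduceOp.SUM)
    dist.all_reduce(local_var, op=dist.ReduceOp.SUM)
    # Average the summed statistics across ranks.
    return local_mean / world_size, local_var / world_size
```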