apex: Training gets stuck when using SyncBN

DistributedDataParallel works great for me, but when I use it together with synchronized batch normalization (either the Python version or the optimized version), training gets stuck after a few iterations and the code prints the following warning:

/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))

Any idea how I should debug it?
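
For reference, a minimal sketch of the kind of setup involved, assuming apex's convert_syncbn_model and DistributedDataParallel are used; build_model and local_rank are placeholders supplied by the training script and launcher, not names from this issue:

    import torch
    import apex.parallel

    # one process per GPU, launched e.g. via torch.distributed.launch
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)                   # local_rank: placeholder from the launcher

    model = build_model().cuda()                        # build_model: placeholder
    model = apex.parallel.convert_syncbn_model(model)   # swap BatchNorm layers for SyncBatchNorm
    model = apex.parallel.DistributedDataParallel(model)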

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 19 (2 by maintainers)

Most upvoted comments

I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. I don't think that's a good idea: get_default_group is not available in PyTorch 1.0, so a new group gets created every time the sync BN forward function is called! It looks like we are using the default group anyway, so we may not need that line at all.
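
Roughly, the pattern being described is the following (a reconstruction from the comment above, not apex's exact code; the identifiers are the ones named in the comment):

    import torch.distributed as dist

    # parallel/__init__.py (as described): prefer the default group if available,
    # otherwise fall back to creating a new one.
    if hasattr(dist, "get_default_group"):
        group_creator = dist.get_default_group     # reuses the existing default group
    else:
        group_creator = dist.new_group             # allocates a brand-new process group

    # optimized_sync_batchnorm_kernel.py (as described), run on every forward pass:
    process_group = group_creator()                # with the new_group fallback, every
                                                   # iteration creates another group and
                                                   # requires all ranks to join it, which
                                                   # can easily hang the training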

After I removed that line and dropped the process_group argument from both torch.distributed.all_reduce and torch.distributed.all_gather, the training now works, even with tqdm.
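
For illustration, a minimal sketch of the kind of change described, assuming the collectives simply fall back to the default (world) group; reduce_bn_stats, mean and var_biased are hypothetical names, not apex's:

    import torch.distributed as dist

    def reduce_bn_stats(mean, var_biased, world_size):
        # Sum the per-rank batch statistics across all processes on the default
        # (world) group; no explicit process_group argument is passed any more.
        dist.all_reduce(mean, op=dist.ReduceOp.SUM)
        dist.all_reduce(var_biased, op=dist.ReduceOp.SUM)
        mean.div_(world_size)
        var_biased.div_(world_size)
        return mean, var_biased

The same idea applies to torch.distributed.all_gather: calling it without a group argument also runs on the default group instead of a freshly created one.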