apex: Training gets stuck when using SyncBN

DistributedDataParallel works great for me, but when I use it together with synchronized batch normalization (either the Python version or the optimized version), training gets stuck after a few iterations and the code prints the following warning:

/home/heilaw/.conda/envs/CornerNet/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown len(cache))

Any idea how I should debug it?
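
For reference, a minimal sketch of the kind of setup involved, assuming apex's convert_syncbn_model and DistributedDataParallel are used; build_model and local_rank are placeholders supplied by the training script and launcher, not names from this issue:

    import torch
    import apex.parallel

    # one process per GPU, launched e.g. via torch.distributed.launch
    torch.distributed.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)                   # local_rank: placeholder from the launcher

    model = build_model().cuda()                        # build_model: placeholder
    model = apex.parallel.convert_syncbn_model(model)   # swap BatchNorm layers for SyncBatchNorm
    model = apex.parallel.DistributedDataParallel(model)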

About this issue

  • State: open
  • Created 6 years ago
  • Comments: 19 (2 by maintainers)

Most upvoted comments

I think this issue is related to process_group = group_creator() in optimized_sync_batchnorm_kernel.py. In parallel/__init__.py, you set group_creator to new_group if get_default_group is not available. I don't think that's a good idea: get_default_group is not available in PyTorch 1.0, so a new group gets created every time the sync BN forward function is called! It looks like we are using the default group anyway, so we may not need that line at all.
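
Roughly, the pattern being described is the following (a reconstruction from the comment above, not apex's exact code; the identifiers are the ones named in the comment):

    import torch.distributed as dist

    # parallel/__init__.py (as described): prefer the default group if available,
    # otherwise fall back to creating a new one.
    if hasattr(dist, "get_default_group"):
        group_creator = dist.get_default_group     # reuses the existing default group
    else:
        group_creator = dist.new_group             # allocates a brand-new process group

    # optimized_sync_batchnorm_kernel.py (as described), run on every forward pass:
    process_group = group_creator()                # with the new_group fallback, every
                                                   # iteration creates another group and
                                                   # requires all ranks to join it, which
                                                   # can easily hang the training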

After I removed that line and dropped the process_group argument from both torch.distributed.all_reduce and torch.distributed.all_gather, the training now works, even with tqdm.
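
For illustration, a minimal sketch of the kind of change described, assuming the collectives simply fall back to the default (world) group; reduce_bn_stats, mean and var_biased are hypothetical names, not apex's:

    import torch.distributed as dist

    def reduce_bn_stats(mean, var_biased, world_size):
        # Sum the per-rank batch statistics across all processes on the default
        # (world) group; no explicit process_group argument is passed any more.
        dist.all_reduce(mean, op=dist.ReduceOp.SUM)
        dist.all_reduce(var_biased, op=dist.ReduceOp.SUM)
        mean.div_(world_size)
        var_biased.div_(world_size)
        return mean, var_biased

The same idea applies to torch.distributed.all_gather: calling it without a group argument also runs on the default group instead of a freshly created one.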