pytorch-metric-learning: Error in computing similarity with multiple GPUs
Hi Kevin, thank you for providing this wonderful library for metric learning. I am facing a weird issue for which I seek your guidance. When I use a simple contrastive loss to train my network on multiple GPUs, I get the following error:
File "../pytorch_metric_learning/distances/base_distance.py", line 26, in forward
query_emb, ref_emb, query_emb_normalized, ref_emb_normalized
File "../pytorch_metric_learning/distances/base_distance.py", line 74, in set_default_stats
self.get_norm(query_emb)
RuntimeError: CUDA error: device-side assert triggered
However, this error does not occur when I train my network on a single GPU. Could you please let me know what might be causing this and how to fix it? It is not a PyTorch version issue, since I am on torch==1.7.1 and torchvision==0.8.2, and I am using the latest version of your code.
Thanks.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 21 (13 by maintainers)
Maybe the embeddings are spread across multiple devices before they are passed to the loss function. You could try moving them to a single device:
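Something along these lines (a sketch only: `embeddings`, `labels`, and `loss_func` stand for whatever tensors and loss object you are currently using):

```python
import torch

device = torch.device("cuda:0")

# Move everything the loss needs onto one device before calling it.
embeddings = embeddings.to(device)
labels = labels.to(device)
loss = loss_func(embeddings, labels)
```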
Hi Kevin, sorry for the delay in response.
So the problem is not with the library code itself, but with the way the similarities of positive and negative pairs are computed across multiple GPUs. In my case, the similarity matrix and the indices of the positive and negative pairs get split across the GPUs in such a way that the indexing goes out of bounds; I also notice that the matrix is not split evenly between the GPUs. I think the solution is to collect everything on one GPU and then compute the loss there.
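Something like the following is what I have in mind (a toy sketch rather than my actual code: the embedding net, data, and optimizer below are stand-ins):

```python
import torch
import torch.nn as nn
from pytorch_metric_learning import losses

# Toy stand-ins for my actual network and data.
embedding_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
model = nn.DataParallel(embedding_net).cuda()
loss_func = losses.ContrastiveLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(64, 3, 32, 32).cuda()
labels = torch.randint(0, 10, (64,)).cuda()

# DataParallel splits the batch for the forward pass but gathers the
# embeddings back onto cuda:0, so the similarity matrix and the
# positive/negative pair indices are computed on a single device.
embeddings = model(images)
loss = loss_func(embeddings, labels)
loss.backward()
optimizer.step()
```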