pytorch-metric-learning: Error in computing similarity with multiple GPUs
Hi Kevin, thank you for providing this wonderful library for metric learning. I am facing a weird issue for which I seek your guidance. When I use a simple contrastive loss to train my network on multiple GPUs, I get the following error:
File "../pytorch_metric_learning/distances/base_distance.py", line 26, in forward
query_emb, ref_emb, query_emb_normalized, ref_emb_normalized
File "../pytorch_metric_learning/distances/base_distance.py", line 74, in set_default_stats
self.get_norm(query_emb)
RuntimeError: CUDA error: device-side assert triggered
However, this error does not occur when I train my network on a single GPU. Could you please let me know what might be causing this and how to fix it? It is not a PyTorch version issue, since I am on torch==1.7.1 and torchvision==0.8.2, and I am using the latest version of your code.
Thanks.
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 21 (13 by maintainers)
Maybe the embeddings are spread across multiple devices before they are passed to the loss function. You could try moving them to a single device:
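Something along these lines (a sketch only: `embeddings`, `labels`, and `loss_func` stand for whatever tensors and loss object you are currently using):

```python
import torch

device = torch.device("cuda:0")

# Move everything the loss needs onto one device before calling it.
embeddings = embeddings.to(device)
labels = labels.to(device)
loss = loss_func(embeddings, labels)
```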
Hi Kevin, sorry for the delay in response.
So the problem is not with the library code itself, but with the way the similarities of positive and negative pairs are computed across multiple GPUs. In my case, the similarity matrix and the indices of the positive and negative pairs get split across the GPUs in such a way that the indexing goes out of bounds; I also notice that the matrix is not split evenly between the GPUs. I think the solution is to collect everything on one GPU and then compute the loss there.
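Something like the following is what I have in mind (a toy sketch rather than my actual code: the embedding net, data, and optimizer below are stand-ins):

```python
import torch
import torch.nn as nn
from pytorch_metric_learning import losses

# Toy stand-ins for my actual network and data.
embedding_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
model = nn.DataParallel(embedding_net).cuda()
loss_func = losses.ContrastiveLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(64, 3, 32, 32).cuda()
labels = torch.randint(0, 10, (64,)).cuda()

# DataParallel splits the batch for the forward pass but gathers the
# embeddings back onto cuda:0, so the similarity matrix and the
# positive/negative pair indices are computed on a single device.
embeddings = model(images)
loss = loss_func(embeddings, labels)
loss.backward()
optimizer.step()
```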