pytorch_sparse: RuntimeError: CUDA error: an illegal memory access was encountered

  File "examples/sem_seg_sparse/train.py", line 142, in <module>
    main()
  File "examples/sem_seg_sparse/train.py", line 61, in main
    train(model, train_loader, optimizer, scheduler, criterion, opt)
  File "examples/sem_seg_sparse/train.py", line 79, in train
    out = model(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/drive/My Drive/deep_gcns_torch/examples/sem_seg_sparse/architecture.py", line 69, in forward
    feats.append(self.gunet(feats[-1],edge_index=edge_index ,batch=batch))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 83, in forward
    x.size(0))
  File "/usr/local/lib/python3.6/dist-packages/torch_geometric/nn/models/graph_unet.py", line 120, in augment_adj
    num_nodes)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/spspmm.py", line 30, in spspmm
    C = matmul(A, B)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 107, in matmul
    return spspmm(src, other, reduce)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 95, in spspmm
    return spspmm_sum(src, other)
  File "/usr/local/lib/python3.6/dist-packages/torch_sparse/matmul.py", line 83, in spspmm_sum
    rowptrA, colA, valueA, rowptrB, colB, valueB, K)
RuntimeError: CUDA error: an illegal memory access was encountered (launch_kernel at /pytorch/aten/src/ATen/native/cuda/Loops.cuh:103)

Hi, I'm integrating the GraphU-Net and another model on Google Colab, but I'm hitting this bug. Could you help me? Thanks.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 20 (8 by maintainers)

Most upvoted comments

The error seems to stem from the fact that cuSPARSE cannot handle duplicated edges in edge_index: with duplicates present, it fails to compute the correct number of output edges. In your case, it may well be that your graph contains some initial self-loop edges, which should be removed before calling add_self_loops. I think your fix for augment_adj is correct, and I added it to the GraphUNet model in PyG.
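A minimal sketch of the failure mode described above, in pure Python with hypothetical stand-ins for the PyG helpers (the real functions operate on tensors): add_self_loops appends one loop per node unconditionally, so a graph that already contains a self-loop ends up with a duplicate entry, which is exactly what cuSPARSE's spspmm cannot handle.

```python
def add_self_loops(edges, num_nodes):
    # Mirrors the behaviour at issue: blindly append (i, i) for every node.
    return edges + [(i, i) for i in range(num_nodes)]

def remove_self_loops(edges):
    return [(r, c) for (r, c) in edges if r != c]

edges = [(0, 1), (1, 1)]  # node 1 already has a self-loop

naive = add_self_loops(edges, 2)
# (1, 1) now appears twice in the edge list -- a duplicated edge.

fixed = add_self_loops(remove_self_loops(edges), 2)
# Stripping self-loops first leaves exactly one (1, 1) entry.
```

This is why the fix below calls remove_self_loops before add_self_loops.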

@vthost @rusty1s Hi, I also hit this error when training Graph-UNet on my own dataset. The error occurred randomly when using the GPU but never when using the CPU. I changed the augment_adj function to call remove_self_loops first, and the problem was solved. But I don’t know why.

# Requires: from torch_geometric.utils import remove_self_loops, add_self_loops, sort_edge_index
#           from torch_sparse import spspmm
def augment_adj(self, edge_index, edge_weight, num_nodes):
    # Strip pre-existing self-loops so add_self_loops does not create duplicates.
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    edge_index, edge_weight = add_self_loops(edge_index, edge_weight, num_nodes=num_nodes)
    edge_index, edge_weight = sort_edge_index(edge_index, edge_weight, num_nodes)
    # Sparse-sparse matmul (A @ A) augments the adjacency with two-hop edges.
    edge_index, edge_weight = spspmm(edge_index, edge_weight, edge_index, edge_weight,
                                     num_nodes, num_nodes, num_nodes)
    edge_index, edge_weight = remove_self_loops(edge_index, edge_weight)
    return edge_index, edge_weight

I don’t think that’s related to the above issue. You may have a memory leak somewhere, or one of your graphs in your dataset is too large that it can not be handled in a full-batch fashion.

I now have this also with ASAPool 😦 (screenshot attached)

I am using ogbg-code. The example code for that dataset adds two types of edges to the graph in utils.augment_edge, so there may be several edges between the same pair of nodes. I tried passing coalesced=True as an argument to spspmm in graph_unet.augment_adj, but the error stays the same; it seems that spspmm interprets the coalesced argument as “sorted”. Just as an update: after I added the following at the beginning of graph_unet.forward (after the initialization of the edge weights), it runs for 74/143 epochs, and then the error comes again. If I add it in graph_unet.augment_adj instead, the training runs through, but I get the same error in the evaluation, in remove_self_loops, because the mask does not fit edge_attr[mask].

edge_index, edge_weight = coalesce(edge_index, edge_weight, x.shape[0], x.shape[0])
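Conceptually, coalescing is what resolves the multi-edge situation described above: sort the edges and merge duplicates by summing their weights, so spspmm sees each (row, col) pair exactly once. A pure-Python sketch (the real coalesce operates on tensors and takes the matrix dimensions as extra arguments):

```python
def coalesce(edges, weights):
    # Merge duplicate (row, col) pairs, summing their weights.
    merged = {}
    for e, w in zip(edges, weights):
        merged[e] = merged.get(e, 0.0) + w
    keys = sorted(merged)  # row-major sorted order, as spspmm expects
    return keys, [merged[k] for k in keys]

# ogbg-code-style situation: two edge types between the same node pair.
edges = [(0, 1), (0, 1), (1, 2)]
weights = [1.0, 1.0, 1.0]
edges, weights = coalesce(edges, weights)
# The duplicate (0, 1) entries collapse into one edge with weight 2.0.
```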

Can you show me some example code? My GraphU-Net script, for instance, runs just fine. Note that you need to pass coalesced=True if your edge_index is not sorted.