pytorch_geometric: Split Error in RandomLinkSplit

🐛 Bug

When using RandomLinkSplit to split the MovieLens dataset, I found that the resulting split sizes are wrong.

To Reproduce

The link prediction task is as follows:

train_data, val_data, test_data = T.RandomLinkSplit(
        num_val=0.1,
        num_test=0.1,
        neg_sampling_ratio=0.0,
        edge_types=[('user', 'rates', 'movie')],
        rev_edge_types=[('movie', 'rev_rates', 'user')],
    )(data)

I get the following result:

train: 80670 (correct), val: 80670 (wrong), test: 90753 (wrong)

Expected behavior

The number of ('user', 'rates', 'movie') edges in this dataset is 100836. With the ratio (0.8, 0.1, 0.1), the split should be:

train: 80670, val: 10083, test: 10083
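
The expected counts above follow from simple arithmetic on the edge total. This is a minimal sketch of that calculation; the truncating-rounding used here is an assumption about how RandomLinkSplit resolves fractional num_val/num_test:

```python
# Sketch of the expected 80/10/10 split arithmetic for this dataset.
# Rounding behavior (int truncation) is an assumption, not PyG's exact code.
num_edges = 100836                          # ('user', 'rates', 'movie') edges
num_val = int(0.1 * num_edges)              # 10083
num_test = int(0.1 * num_edges)             # 10083
num_train = num_edges - num_val - num_test  # 80670
print(num_train, num_val, num_test)         # 80670 10083 10083
```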

Environment

  • PyG version (torch_geometric.__version__): 2.0.2
  • PyTorch version: (torch.__version__): 1.10.0
  • OS (e.g., Linux): macOS
  • Python version (e.g., 3.9): 3.8
  • CUDA/cuDNN version: CPU
  • How you installed PyTorch and PyG (conda, pip, source): pip
  • Any other relevant information (e.g., version of torch-scatter): None

Additional context

Reviewing the source code, I found that the error may come from line 176 of RandomLinkSplit, which appears to be called with the wrong parameters.

About this issue

  • Original URL
  • State: open
  • Created 3 years ago
  • Reactions: 5
  • Comments: 34 (17 by maintainers)

Most upvoted comments

  1. “Training message edges” are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. “Training supervision edges” are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.

  2. This depends on the model and validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.

    In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the data leakage capabilities of the model are limited as well.
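
The message-edge vs. supervision-edge distinction above can be sketched in plain Python. The toy "GNN" and all function names here are hypothetical stand-ins, not PyG APIs; the point is only that the encoder sees message edges while the decoder/loss sees supervision edges:

```python
# Hypothetical sketch: message edges feed the encoder, supervision
# edges feed the link predictor. No PyG involved; scalar embeddings.
def encode(x, message_edges):
    # toy "GNN": each target node's embedding = its feature + sum of
    # its incoming neighbors' features
    emb = dict(x)
    for u, v in message_edges:
        emb[v] = emb[v] + x[u]
    return emb

def predict_link(emb, u, v):
    # dot-product decoder on scalar embeddings
    return emb[u] * emb[v]

x = {0: 1.0, 1: 2.0, 2: 3.0}
message_edges = [(0, 1), (1, 2)]   # exchanged during message passing
supervision_edges = [(0, 2)]       # used only to compute the loss

emb = encode(x, message_edges)
score = predict_link(emb, 0, 2)    # the supervision edge never enters encode()
```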

I think this is totally correct. It seems like you are looking at the shapes of edge_index, while you may want to look at the shapes of edge_label and edge_label_index (which correctly model a 80/10/10 split ratio). Here, edge_index is solely used for message passing, i.e.,

  • for training, we exchange messages on all training edges
  • for validation, we exchange messages on all training edges
  • for testing, we exchange messages on all training and validation edges
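
The three bullets above explain the reported numbers (80670 / 80670 / 90753 = 80670 + 10083): edge_index grows across splits while the supervision edges stay disjoint. A plain-Python sketch of this assignment, assuming the message/supervision convention described above:

```python
# Sketch (no PyG): how an 80/10/10 split assigns message-passing edges
# (edge_index) vs. supervision edges (edge_label_index) per split.
edges = list(range(100))         # stand-in for 100 edges
train_sup = edges[:80]           # 80% training supervision edges
val_sup = edges[80:90]           # 10% validation supervision edges
test_sup = edges[90:]            # 10% test supervision edges

# edge_index of each returned Data object:
train_msg = train_sup            # train: messages on training edges
val_msg = train_sup              # val:   messages on training edges
test_msg = train_sup + val_sup   # test:  messages on train + val edges

print(len(train_msg), len(val_msg), len(test_msg))  # 80 80 90
```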

Let me know if this resolves your concerns 😃

It is not the same. If you sample negative training edges in RandomLinkSplit, these negative samples will be fixed for the whole training procedure. Negative sampling on-the-fly instead guarantees that we see a different set of negative samples in every epoch, which (in general) provides a better learning signal.
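
On-the-fly negative sampling can be sketched as follows. This is an illustrative stdlib-only implementation, not PyG's torch_geometric.utils.negative_sampling; the function name and signature are made up for the example:

```python
import random

# Hypothetical sketch: each call draws a fresh set of node pairs that
# are not existing edges, so every epoch trains on different negatives.
def sample_negative_edges(edge_set, num_nodes, num_samples, rng=random):
    negatives = set()
    while len(negatives) < num_samples:
        u = rng.randrange(num_nodes)
        v = rng.randrange(num_nodes)
        if u != v and (u, v) not in edge_set:
            negatives.add((u, v))
    return list(negatives)

edges = {(0, 1), (1, 2), (2, 3)}
# called once per epoch -> different negatives each time
neg = sample_negative_edges(edges, num_nodes=10, num_samples=5)
```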

Sorry for hijacking the thread, but does RandomLinkSplit perform splits on edge_attr and the label tensor y too? If so, how do I access the edge attributes? BTW, my output after splitting is:

split_transform = RandomLinkSplit(num_test = 0.2, num_val = 0.1, is_undirected=False)
train_data, val_data, test_data = split_transform(data)

print(train_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[3959028], edge_label_index=[2, 3959028])

print(val_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[565574], edge_label_index=[2, 565574])

print(test_data)

Data(x=[19129, 1], edge_index=[2, 2262301], edge_attr=[2262301, 80], y=[2262301], is_directed=True, edge_label=[1131150], edge_label_index=[2, 1131150])

I am sorry but I am having a hard time interpreting the output of the RandomLinkSplit function.

This is likely due to the is_undirected option, since it will only return the upper half of the edges for supervision. Is your graph really undirected?
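
The "upper half" effect can be illustrated like this. It is a simplified sketch of the idea, assuming an undirected graph is stored with both edge directions and that supervision keeps only one direction per pair:

```python
# Sketch: with is_undirected=True, only one direction of each reciprocal
# edge pair is kept for supervision, halving the supervision count.
directed_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # stored both ways
upper = [(u, v) for (u, v) in directed_edges if u < v]
print(len(directed_edges), len(upper))  # 4 2
```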

Yes, this is correct. Validation and test edges need to always be disjoint.