pytorch_geometric: Split Error in RandomLinkSplit
🐛 Bug
When using `RandomLinkSplit` to split the MovieLens dataset, I found that the resulting split is wrong.
To Reproduce
The split for the link prediction task is created as follows:

```python
import torch_geometric.transforms as T

train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)
```
I get the following result:
train: 80670 (this is right), val: 80670 (wrong), test: 90753 (wrong)
Expected behavior
The number of ('user', 'rates', 'movie') edges in this dataset is 100836. According to the (0.8, 0.1, 0.1) split ratio, we should get:
train: 80670, val: 10083, test: 10083
Environment
- PyG version (`torch_geometric.__version__`): 2.0.2
- PyTorch version (`torch.__version__`): 1.10.0
- OS (e.g., Linux): macOS
- Python version (e.g., 3.9): 3.8
- CUDA/cuDNN version: CPU
- How you installed PyTorch and PyG (`conda`, `pip`, source): pip
- Any other relevant information (e.g., version of `torch-scatter`): None
Additional context
I reviewed the source code and found that the error may be caused by line 176 of `RandomLinkSplit`, which is called with wrong parameters.
About this issue
- Original URL
- State: open
- Created 3 years ago
- Reactions: 5
- Comments: 34 (17 by maintainers)
“Training message edges” are the edges used in the GNN part of your model: the edges you use to exchange neighborhood information and enhance your node representations. “Training supervision edges” are then used to train your final link predictor: given a training supervision edge, you take the source and destination node representations obtained from the GNN and use them as input to predict the probability of a link.
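A minimal sketch of how the two edge sets enter a link prediction model (a homogeneous two-layer GCN with a dot-product decoder is assumed here purely for illustration; it is not the model from this thread):

```python
import torch
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def forward(self, x, edge_index, edge_label_index):
        # Message passing runs over the training *message* edges:
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        # The link predictor scores the training *supervision* edges
        # from the resulting node representations:
        src, dst = edge_label_index
        return (h[src] * h[dst]).sum(dim=-1)  # dot-product decoder
```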
This depends on the model and on validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.
In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the data leakage capabilities of the model are limited as well.
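In `RandomLinkSplit`, this relationship is controlled by the `disjoint_train_ratio` argument; a sketch (the 0.3 value is illustrative, and `data` is assumed to be the graph from the example above):

```python
import torch_geometric.transforms as T

# disjoint_train_ratio=0.0 (default): supervision and message edges
# coincide (GAE-style). Larger values hold out that fraction of the
# training edges for supervision only, removing them from message
# passing; 1.0 makes the two sets fully disjoint (SEAL-style).
transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    disjoint_train_ratio=0.3,
)
train_data, val_data, test_data = transform(data)
```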
I think this is totally correct. It seems like you are looking at the shapes of `edge_index`, while you may want to look at the shapes of `edge_label` and `edge_label_index` (which correctly model an 80/10/10 split ratio). Here, `edge_index` is solely used for message passing. Let me know if this resolves your concerns 😃
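For example, the split from the original report can be checked like this (a sketch; the attribute access follows `RandomLinkSplit`'s heterogeneous output, and the counts follow from the 100836 edges above):

```python
for name, split in [('train', train_data), ('val', val_data), ('test', test_data)]:
    store = split['user', 'rates', 'movie']
    print(name,
          '| message edges:', store.edge_index.size(1),
          '| supervision edges:', store.edge_label_index.size(1))

# Supervision edges follow the 80/10/10 split: 80670 / 10083 / 10083.
# Message edges accumulate across splits (val reuses the training
# message edges, test additionally sees the validation edges), which
# is exactly the 80670 / 80670 / 90753 reported above.
```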
It is not the same. If you sample negative training edges in `RandomLinkSplit`, these negative samples will be fixed for the whole training procedure. Negative sampling on-the-fly instead guarantees that we always see a different set of negative samples during training, thus providing a better learning signal (in general).
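A sketch of such on-the-fly negative sampling (assuming a homogeneous `train_data` and a hypothetical `num_epochs`; `negative_sampling` is the standard utility from `torch_geometric.utils`):

```python
from torch_geometric.utils import negative_sampling

for epoch in range(num_epochs):
    # Draw a fresh set of negative edges every epoch, instead of fixing
    # them once via neg_sampling_ratio in RandomLinkSplit:
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index,
        num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1),
    )
    # ... concatenate with the positive edge_label_index and train ...
```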
Sorry for hijacking the thread, but does `RandomLinkSplit` perform splits on `edge_attr` and the label tensor `y` too? If yes, how do I access the edge attributes? BTW, I am having a hard time interpreting the output I get after splitting.
This is likely due to the `is_undirected` option, since it will only return the upper half of the edges for supervision. Is your graph really undirected?

Yes, this is correct. Validation and test edges always need to be disjoint.
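A small self-contained check of the `is_undirected` behavior (synthetic toy graph; all names are illustrative):

```python
import torch
import torch_geometric.transforms as T
from torch_geometric.data import Data

# An undirected 4-cycle, stored with both directions of every edge:
edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 0],
                           [1, 0, 2, 1, 3, 2, 0, 3]])
data = Data(edge_index=edge_index, num_nodes=4)

train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.25, num_test=0.25, is_undirected=True,
)(data)

# With is_undirected=True, each supervision edge appears only once
# (one direction, i.e., the "upper half" of the edges):
print(val_data.edge_label_index)
```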