pytorch_geometric: Interface for datasets that are too large to use `InMemoryDataset`

🚀 The feature, motivation and pitch

There are several molecular property prediction datasets where each individual graph easily fits in memory, but the dataset as a whole is too large for the InMemoryDataset interface. One solution is to save each example in its own .pt file, but this introduces significant filesystem overhead for accessing each example.

A better solution is to partition the data so that many graphs are serialised within a single .pt file. The number of graphs per file can be treated as a chunk_size parameter that is independent of the training batch_size. This ChunkedDataset interface should scale to arbitrarily large datasets while avoiding the significant overhead of one file per graph.

The design idea is roughly:

  • ChunkedDataset inherits from the PyG Dataset interface
  • Accepts a chunk_size argument
  • Has an abstract method process_chunk that accepts a list of Data objects to be processed and saved as a single .pt file (a rough sketch follows this list)
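
As a very rough illustration (not an existing PyG API), the interface could look something like the sketch below; example_iterator() is a hypothetical placeholder for however a subclass yields raw Data objects, and the chunk_*.pt naming is arbitrary:

```python
import os.path as osp
from abc import abstractmethod

import torch
from torch_geometric.data import Dataset


class ChunkedDataset(Dataset):
    """Sketch: saves chunk_size graphs per processed .pt file."""

    def __init__(self, root, chunk_size=1024, transform=None, pre_transform=None):
        self.chunk_size = chunk_size  # set before super().__init__, which may trigger process()
        super().__init__(root, transform, pre_transform)

    @abstractmethod
    def process_chunk(self, data_list):
        """Turn a list of Data objects into whatever gets saved as one chunk."""
        raise NotImplementedError

    def process(self):
        chunk, chunk_idx = [], 0
        # example_iterator() is a placeholder for however the subclass produces
        # raw Data objects (e.g. parsed from files in self.raw_dir).
        for data in self.example_iterator():
            chunk.append(data)
            if len(chunk) == self.chunk_size:
                torch.save(self.process_chunk(chunk),
                           osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))
                chunk, chunk_idx = [], chunk_idx + 1
        if chunk:  # flush the final, partially filled chunk
            torch.save(self.process_chunk(chunk),
                       osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))
```

Subclasses would still provide the usual raw_file_names/processed_file_names properties, and len()/get() could be implemented along the lines sketched after the considerations below.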

Other considerations:

  • The training batch_size should not depend on the chunk_size, so the dataset should support batches that span chunk boundaries
  • ChunkedDataset should support splitting reads across parallel workers as well as random shuffling (see the sketch after this list)
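
One way to keep the batch size independent of the chunk size is to keep get() example-level and cache a few decoded chunks, so any DataLoader batch size (and random shuffling) works at the cost of extra chunk reads. A minimal sketch, assuming fixed-size chunks saved as plain lists of Data objects and a _num_examples count recorded during processing (both assumptions, not existing PyG API):

```python
import os.path as osp
from functools import lru_cache

import torch
from torch_geometric.data import Dataset


class ChunkedDataset(Dataset):  # continuation of the sketch above
    def len(self):
        return self._num_examples  # e.g. counted and saved during process()

    @lru_cache(maxsize=4)  # keep a handful of decoded chunks resident
    def _load_chunk(self, chunk_idx):
        return torch.load(osp.join(self.processed_dir, f'chunk_{chunk_idx}.pt'))

    def get(self, idx):
        chunk_idx, offset = divmod(idx, self.chunk_size)
        return self._load_chunk(chunk_idx)[offset]  # assumes each chunk is a list of Data
```

With parallel DataLoader workers, each worker process keeps its own small chunk cache, so a sampler that hands each worker mostly-contiguous index ranges would keep chunk reads efficient.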

Alternatives

No response

Additional context

No response

About this issue

  • State: open
  • Created 2 years ago
  • Reactions: 6
  • Comments: 23 (15 by maintainers)

Most upvoted comments

Hi, I’d like to build this ChunkedDataset support if no one else is doing it. 😃 My own project also needs it.

My current thinking is that users write the chunking logic in process() and store the results in data structures such as self.chunked_data and self.chunked_slices (both lists for now, compared with the single data/slices pair in InMemoryDataset), so that ChunkedDataset can load them in its len() and get() methods (either on demand or with prefetching). Does that sound right?
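
A rough sketch of that bookkeeping (the cumulative-count scheme, file layout, and class/attribute names are illustrative only, not part of any existing implementation):

```python
import bisect
import os.path as osp

import torch


class ChunkedLookup:
    """Illustrative index -> (chunk file, offset) mapping that allows chunks of
    varying sizes, in the spirit of chunked_data / chunked_slices."""

    def __init__(self, processed_dir, chunk_files, chunk_lengths):
        self.processed_dir = processed_dir
        self.chunked_data = list(chunk_files)    # one .pt file name per chunk
        self.chunked_slices = []                 # cumulative example counts
        total = 0
        for n in chunk_lengths:
            total += n
            self.chunked_slices.append(total)

    def len(self):
        return self.chunked_slices[-1] if self.chunked_slices else 0

    def get(self, idx):
        chunk_idx = bisect.bisect_right(self.chunked_slices, idx)
        prev = self.chunked_slices[chunk_idx - 1] if chunk_idx > 0 else 0
        data_list = torch.load(osp.join(self.processed_dir,
                                        self.chunked_data[chunk_idx]))
        return data_list[idx - prev]             # loaded on demand; could be cached or prefetched
```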

I have implemented a hard-coded version of ChunkedDataset (it only applies to my dataset) and it works. I will work on retrofitting my code into a generic version.

I agree; we probably need to require a process_example method and handle the logic of creating chunks internally. WDYT?

The best approach I see for this is to add support for PyTorch’s IterableDataset by creating an IterableDataset version of PyG’s Dataset class. Then, to support chunked loading, provide examples of processing large graph datasets into shards and loading them on the fly via implementations of that PyG IterableDataset class. As noted above, this builds on PyTorch’s own solution to this problem.

Using IterableDataset sidesteps some of the issues around matching batch size to chunk size, fetching index-specified samples from out-of-memory chunks, and implementing reusable logic for chunk saving and loading, which will likely depend on the user’s specific application.
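
For example, a minimal sketch of that approach, assuming each shard is a .pt file containing a plain list of Data objects (the chunk_*.pt naming and directory layout are just illustrative):

```python
import glob
import os.path as osp

import torch
from torch.utils.data import IterableDataset, get_worker_info
from torch_geometric.loader import DataLoader


class ShardedGraphDataset(IterableDataset):
    """Streams Data objects from .pt shard files, one shard in memory at a time."""

    def __init__(self, shard_dir):
        self.shard_paths = sorted(glob.glob(osp.join(shard_dir, 'chunk_*.pt')))

    def __iter__(self):
        worker = get_worker_info()
        # Split shards across DataLoader workers so each shard is read exactly once.
        paths = (self.shard_paths if worker is None
                 else self.shard_paths[worker.id::worker.num_workers])
        for path in paths:
            for data in torch.load(path):
                yield data


# The batch size is decoupled from the shard size: the loader collates
# consecutive yielded examples into Batch objects.
loader = DataLoader(ShardedGraphDataset('processed'), batch_size=32, num_workers=2)
```

The trade-off is that IterableDataset gives up random access, so shuffling has to happen at the shard level (shuffle the shard list each epoch and/or keep a small shuffle buffer of examples).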

Any thoughts or comments?

You might be able to connect your Neo4j graph via the remote backend interface PyG provides, see here.

@hatemhelal PyG already has some good support for data pipes, see https://github.com/pyg-team/pytorch_geometric/blob/master/examples/datapipe.py. Perhaps this could be useful for implementing it.

Is anyone working on this?

For my own purposes, I’m currently implementing logic to save processed data to disk in chunks and load it on the fly into a dataset for training.

How should the dataset’s batch size be handled to work with on-the-fly chunk loading? It would defeat the purpose if all chunks were loaded into memory at any given time, so batching needs to align with chunk indices to some extent, and newly loaded chunks need to reuse the same memory space.

@LiuHaolan it would be amazing if you wanted to get this going, as it would take me a few weeks before I could work on it. Happy to help with any PR reviews.

@LiuHaolan Amazing, let me know if we can help in any way.