dgl: P-GNN running error: MemoryError from multiprocessing pool workers

❓ Questions and Help

When I installed the dependencies according to the prompts and ran the link prediction example, the following error occurred:

Process SpawnPoolWorker-6:
Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\pool.py", line 131, in worker
    put((job, i, result))
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\queues.py", line 362, in put
    obj = _ForkingPickler.dumps(obj)
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\pool.py", line 133, in worker
    wrapped = MaybeEncodingError(e, result[1])
  File "D:\ProgramData\Anaconda3\envs\GNN\lib\multiprocessing\pool.py", line 86, in __init__
    self.value = repr(value)
MemoryError

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 23 (13 by maintainers)

Most upvoted comments

@limaoSure Given that you haven't told me the size of your physical memory, I'll assume you have a standard 16 GB. As mentioned above, sampling the anchor sets for every epoch of an experiment up front consumes roughly 12-20 GB of memory. Here are two solutions I can suggest:

  1. You can modify the anchor sampling function so that it samples anchor sets for only N epochs at a time, and call it every N epochs to reduce peak memory consumption (a minimal sketch follows this list).
  2. You can rewrite the anchor sampling function as a PyTorch dataset and use a DataLoader to prefetch the next epoch's sample.
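
A minimal sketch of solution 1, reusing the `get_anchors`, `get_dist_max` and `get_a_graph` helpers from the P-GNN example; the `resample_every` parameter and the loop structure here are illustrative assumptions rather than the example's actual code:

# Keep only N epochs' worth of anchor sets in memory and resample when exhausted.
resample_every = 20  # illustrative value; trades memory for resampling frequency
data = get_dataset(args)
anchor_sets = []

for epoch in range(args.epoch_num):
    if epoch % resample_every == 0:
        # Re-sample a small batch of anchor sets instead of all epochs at once.
        anchor_sets = [get_anchors(data["num_nodes"]) for _ in range(resample_every)]

    anchor_set = anchor_sets[epoch % resample_every]
    dists_max, dists_argmax = get_dist_max(anchor_set, data["dists"])
    g, anchor_eid, edge_weight = get_a_graph(dists_max, dists_argmax)
    # ... build the DGL graph and run one training epoch as in the original example ...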

@limaoSure Here is sample code for solution 2:

import dgl
import torch
from torch.utils.data import DataLoader, Dataset

# get_anchors, get_dist_max, get_a_graph, get_dataset and train_model are the
# helper functions from the P-GNN example (not shown here).

class PGNNDataset(Dataset):
    def __init__(self, data, args):
        self.data = data
        # Pre-sample one anchor set per training epoch; the IDs themselves are cheap to keep.
        self.anchor_set_ids = [get_anchors(self.data["num_nodes"]) for _ in range(args.epoch_num)]

    def __len__(self):
        return len(self.anchor_set_ids)

    def __getitem__(self, idx):
        # Build the distance features and the graph for a single epoch on demand.
        anchor_set = self.anchor_set_ids[idx]
        dists_max, dists_argmax = get_dist_max(anchor_set, self.data["dists"])
        g, anchor_eid, edge_weight = get_a_graph(dists_max, dists_argmax)

        return g, anchor_eid, dists_max, edge_weight

def collate_single(samples):
    # Each "batch" holds a single epoch's sample; a named function is used because
    # the spawn-based DataLoader workers on Windows cannot pickle a lambda.
    return samples[0]

# On Windows, run this setup under `if __name__ == "__main__":` so the spawned
# DataLoader workers can re-import the module safely.
data = get_dataset(args)
dataset = PGNNDataset(data, args)
dataloader = DataLoader(dataset, batch_size=1, shuffle=False, num_workers=2, prefetch_factor=2,
                        collate_fn=collate_single)

for epoch, (g, anchor_eid, dist_max, edge_weight) in enumerate(dataloader):
    # Decay the learning rate once, at epoch 200.
    if epoch == 200:
        for param_group in optimizer.param_groups:
            param_group["lr"] /= 10

    # Build this epoch's DGL graph, attach node features and edge weights, and move it to the device.
    g = dgl.graph(g)
    g.ndata["feat"] = torch.FloatTensor(data["feature"])
    g.edata["sp_dist"] = torch.FloatTensor(edge_weight)
    g_data = {
        "graph": g.to(device),
        "anchor_eid": anchor_eid,
        "dists_max": dist_max,
    }

    train_model(data, model, loss_func, optimizer, device, g_data)
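
With `batch_size=1`, `num_workers=2` and `prefetch_factor=2`, the DataLoader keeps only a few per-epoch samples in flight at any time instead of materializing all `args.epoch_num` graphs up front, so peak memory should stay far below the 12-20 GB figure quoted above.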