dgl: Hanging in Distributed GNN training
🐛 Bug
Hi, I’m running a distributed GNN training demo on AWS, but it failed and I found the process hanging.
To Reproduce
I set up two machines and use this script: https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/experimental/train_dist.py
Steps to reproduce the behavior:
- Run this script with dgl/tools/launch.py (the assumed ip_config.txt layout is sketched right after the command):
python3 ~/gnn/dgl/tools/launch.py \
--workspace ~/graphsage/ \
--num_trainers 1 \
--num_samplers 1 \
--num_servers 1 \
--part_config cora.json \
--ip_config ip_config.txt \
"/home/xinchen/miniconda3/envs/py37/bin/python dgl_code/train_dist.py --graph_name cora --ip_config ip_config.txt --num_servers 1 --num_epochs 30 --batch_size 1000 --num_workers 1 | tee /home/xinchen/graphsage/output.txt
Expected behavior
The log is shown below:
All are prepared, start to train
Using backend: pytorch
Using backend: pytorch
Namespace(batch_size=1000, batch_size_eval=10000, close_profiler=False, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='cora', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, num_servers=1, num_workers=1, part_config=None, standalone=False)
Namespace(batch_size=1000, batch_size_eval=10000, close_profiler=False, dataset=None, dropout=0.5, eval_every=5, fan_out='10,25', graph_name='cora', id=None, ip_config='ip_config.txt', local_rank=0, log_every=20, lr=0.003, n_classes=None, num_clients=None, num_epochs=30, num_gpus=-1, num_hidden=16, num_layers=2, num_servers=1, num_workers=1, part_config=None, standalone=False)
Machine (0) client (2) connect to server successfuly!
Machine (1) client (1) connect to server successfuly!
rank: 0
rank: 1
part 1, train: 70 (local: 70), val: 250 (local: 238), test: 500 (local: 463)
part 0, train: 70 (local: 70), val: 250 (local: 250), test: 500 (local: 500)
#labels: 7
#labels: 7
/home/xinchen/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/xinchen/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
out=out, **kwargs)
/home/xinchen/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Part 1 | Epoch 00000 | Step 00000 | Loss 2.0028 | Train Acc 0.0571 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.031 s
/home/xinchen/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Part 1 | Epoch 00000 | Step 00000 | Loss 2.0028 | Train Acc 0.0571 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.031 s
/home/xinchen/miniconda3/envs/py37/lib/python3.7/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
Part 1, Epoch Time(s): 0.1003, sample+data_copy: 0.0673, forward: 0.0248, backward: 0.0037, update: 0.0020, #seeds: 70, #inputs: 785
Part 0 | Epoch 00000 | Step 00000 | Loss 1.9378 | Train Acc 0.2286 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.037 s
Part 0, Epoch Time(s): 0.1008, sample+data_copy: 0.0613, forward: 0.0262, backward: 0.0082, update: 0.0018, #seeds: 70, #inputs: 784
Part 0 | Epoch 00001 | Step 00000 | Loss 1.9284 | Train Acc 0.2286 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.033 s
Part 1 | Epoch 00001 | Step 00000 | Loss 1.9879 | Train Acc 0.0571 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.029 s
Part 1, Epoch Time(s): 0.0867, sample+data_copy: 0.0570, forward: 0.0239, backward: 0.0030, update: 0.0011, #seeds: 70, #inputs: 798
Part 0, Epoch Time(s): 0.0863, sample+data_copy: 0.0523, forward: 0.0237, backward: 0.0073, update: 0.0012, #seeds: 70, #inputs: 781
Part 0 | Epoch 00002 | Step 00000 | Loss 1.9186 | Train Acc 0.2286 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.030 s
Part 1 | Epoch 00002 | Step 00000 | Loss 1.9771 | Train Acc 0.0571 | Speed (samples/sec) nan | GPU 0.0 MiB | time 0.033 s
Part 1, Epoch Time(s): 0.0691, sample+data_copy: 0.0350, forward: 0.0241, backward: 0.0081, update: 0.0011, #seeds: 70, #inputs: 795
Part 0, Epoch Time(s): 0.0688, sample+data_copy: 0.0384, forward: 0.0257, backward: 0.0028, update: 0.0011, #seeds: 70, #inputs: 799
Part 0 | Epoch 00003 | Step 00000 | Loss 1.9033 | Train Acc 0.2286 | Speed (samples/sec) 2401.8626 | GPU 0.0 MiB | time 0.029 s
Part 1 | Epoch 00003 | Step 00000 | Loss 1.9675 | Train Acc 0.0571 | Speed (samples/sec) 2167.8230 | GPU 0.0 MiB | time 0.032 s
Part 0, Epoch Time(s): 0.0646, sample+data_copy: 0.0348, forward: 0.0251, backward: 0.0029, update: 0.0010, #seeds: 70, #inputs: 800
Part 1, Epoch Time(s): 0.0660, sample+data_copy: 0.0331, forward: 0.0213, backward: 0.0088, update: 0.0011, #seeds: 70, #inputs: 797
Part 0 | Epoch 00004 | Step 00000 | Loss 1.8955 | Train Acc 0.2286 | Speed (samples/sec) 2259.0683 | GPU 0.0 MiB | time 0.033 s
Part 1 | Epoch 00004 | Step 00000 | Loss 1.9576 | Train Acc 0.0571 | Speed (samples/sec) 2172.3315 | GPU 0.0 MiB | time 0.032 s
Part 0, Epoch Time(s): 0.0682, sample+data_copy: 0.0345, forward: 0.0232, backward: 0.0077, update: 0.0011, #seeds: 70, #inputs: 783
Part 1, Epoch Time(s): 0.0682, sample+data_copy: 0.0355, forward: 0.0230, backward: 0.0070, update: 0.0011, #seeds: 70, #inputs: 837
|V|=2708, eval batch size: 10000
|V|=2708, eval batch size: 10000
1it [00:00, 12.14it/s]
1it [00:00, 7.94it/s]
|V|=2708, eval batch size: 10000
|V|=2708, eval batch size: 10000
1it [00:00, 18.02it/s]
Machine (1) client (0) connect to server successfuly!
Using backend: pytorc
And then it hangs and does not release its CPU resources.
Environment
- DGL Version (e.g., 1.0): 0.5.2
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.6.0 (CPU only)
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): conda
- Build command you used (if compiling from source): /
- Python version: python 3.7.0
- CUDA/cuDNN version (if applicable): /
- GPU models and configuration (e.g. V100): /
- Any other relevant information:
Additional context
I suspect it is because the line Machine (0) client (3) connect to server successfuly! never appears in the output.
I have two questions:
- Is this why the program hangs?
- In launch.py, tot_num_client = args.num_trainers * (1 + args.num_samplers) * len(hosts). Would you please tell me why the +1? (My reading of the formula is sketched after this list.)
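For context, here is a minimal sketch of my reading of that formula (the variable name tot_num_client comes from launch.py; the per-role breakdown is my assumption): each trainer process seems to count as a client itself, and each sampler it spawns is one more, so this run should produce 4 client connections in total, i.e. client (0) through client (3).

```python
# Minimal sketch (my assumption, not authoritative) of the client count
# computed as tot_num_client in dgl/tools/launch.py for this run.
num_trainers = 1   # --num_trainers
num_samplers = 1   # --num_samplers
num_hosts = 2      # number of lines in ip_config.txt

# Assumption: each trainer is itself a client (the "+ 1"), and every sampler
# it spawns registers as an additional client.
clients_per_host = num_trainers * (1 + num_samplers)
tot_num_client = clients_per_host * num_hosts

print(tot_num_client)  # 4, so I expect "client (0)" .. "client (3)" to connect
```

If that reading is right, the missing Machine (0) client (3) connect to server successfuly! line would mean one sampler client never registered, which would fit the hang.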
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 15 (5 by maintainers)
It seems there are still some problems with the process pool. We’ll try to fix it in the next release.