hydra: [Bug] I can't see any log with pytorch DDP
🐛 Bug
Description
When I use PyTorch’s DistributedDataParallel, I can’t see any logs on stdout, and the log file is empty except for a wandb line like the one below.
[2020-11-06 01:32:08,629][wandb.internal.internal][INFO] - Internal process exited
Checklist
- I checked on the latest version of Hydra
- I created a minimal repro
To reproduce
You can reproduce this by running this repo (https://github.com/ryul99/pytorch-project-template/tree/a80f0284c5b22fba2d4892bb906a9bc2b6075838) with python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1 (DDP with one GPU).
I’m sorry that this is not very minimal code 😅
Stack trace/error message
❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning:
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning:
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
warnings.warn(message=msg, category=UserWarning)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!
Expected Behavior
If you run with python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0
(i.e., without DDP), you can see many logs like this:
❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning:
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning:
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
warnings.warn(message=msg, category=UserWarning)
[2020-11-06 01:48:33,990][trainer.py][INFO] - Config:
train:
  name: First_training
  working_dir: /home/ryul99/Workspace/pytorch-project-template
  data:
    train_dir: dataset/meta/train
    test_dir: dataset/meta/test
    file_format: '*.file_extension'
    use_background_generator: true
    divide_dataset_per_gpu: true
  train:
    random_seed: 3750
    num_epoch: 10000
    num_workers: 4
    batch_size: 64
    optimizer:
      mode: adam
      adam:
        lr: 0.001
        betas:
        - 0.9
        - 0.999
    dist:
      master_addr: localhost
      master_port: '12355'
      mode: nccl
      gpus: 0
      timeout: 30
  test:
    num_workers: 4
    batch_size: 64
  model:
    device: cuda
  log:
    use_tensorboard: true
    use_wandb: false
    wandb_init_conf:
      name: ${train.name}
      entity: null
      project: null
    summary_interval: 1
    chkpt_interval: 10
    chkpt_dir: chkpt
  load:
    wandb_load_path: null
    network_chkpt_path: null
    strict_load: false
    resume_state_path: null
[2020-11-06 01:48:33,991][trainer.py][INFO] - Set up train process
[2020-11-06 01:48:33,991][trainer.py][INFO] - BackgroundGenerator is turned off when Distributed running is on
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!
[2020-11-06 01:48:38,886][trainer.py][INFO] - Making train dataloader...
[2020-11-06 01:48:38,905][trainer.py][INFO] - Making test dataloader...
[2020-11-06 01:48:40,366][trainer.py][INFO] - Starting new training run.
[2020-11-06 01:48:40,467][train_model.py][INFO] - Train Loss 2.3010 at step 1
[2020-11-06 01:48:40,473][train_model.py][INFO] - Train Loss 2.3133 at step 2
System information
- Hydra Version : 1.0.3
- Python version : 3.8.5
- Virtual environment type and version : pyenv 1.2.21
- Operating system : Ubuntu server 18.04
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 34 (5 by maintainers)
@ryul99 When Hydra is initialized, it configures the Python logging module (with the configuration given by the Hydra defaults or user configs). In that case, just calling logging.getLogger() returns a properly configured logger that works fine (like in your main function).
However, the logging module in the training subprocesses (spawned by mp.spawn) is not properly configured, since Hydra is not initialized in those subprocesses. My solution was to write a util function that initializes the logging module there.
By the way, I’m currently also working on enabling Hydra and DDP support for a pytorch project template. You can check this branch for my full implementation for DDP.
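A minimal sketch of that idea (not the code from the linked branch; the helper and worker names here are made up). It re-creates console and file handlers inside the spawned worker, using the same log format that appears in the job logs above:

```python
# Sketch only: re-initialize logging inside each DDP worker, since processes started
# by mp.spawn do not inherit the logging configuration Hydra set up in the main process.
import logging

import torch.multiprocessing as mp


def get_logger(name, log_file_path=None):
    # Hypothetical helper: rebuild a console (and optional file) handler with the same
    # format used in the job logs above, then hand back a named logger.
    handlers = [logging.StreamHandler()]
    if log_file_path is not None:
        handlers.append(logging.FileHandler(log_file_path))
    logging.basicConfig(
        level=logging.INFO,
        format="[%(asctime)s][%(name)s][%(levelname)s] - %(message)s",
        handlers=handlers,
        force=True,  # Python 3.8+: replace any handlers configured earlier in this process
    )
    return logging.getLogger(name)


def train_worker(rank, world_size):
    # Runs in the spawned subprocess, where Hydra never configured logging.
    logger = get_logger(f"rank{rank}", log_file_path=f"train_rank{rank}.log")
    logger.info("Logging is configured inside the DDP worker (rank %d/%d)", rank, world_size)


if __name__ == "__main__":
    world_size = 2
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size)
```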
I’ll create a separate issue to address this.
After looking at #1005, I am not sure anymore. Regardless, this issue has become a hub of unrelated discussions, and I could not find a minimal repro. Happy to look at this if someone produces a minimal example using DDP with Hydra that suffers from logging issues. It should be minimal (a file or two, see the example in #1005 for a good minimal repro). If you can produce it, please file a new issue.
Yes, but it will need to be tomorrow, as I just started a long-running wav2vec fit using 1.0.4. As soon as it finishes, I’ll test 1.0.6 from the repo and report back.
Finally, I solved this issue in the following way:
- Define a get_logger function. In get_logger, the logger is configured the same way as Hydra’s job_logging config: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/utils/utils.py#L22-L28
- Add Hydra’s job logging config to the user’s config in trainer.py: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/trainer.py#L143-L144
- In a unit test, I failed to load Hydra’s job logging config with HydraConfig.get(), so I used @SunQpark’s way: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/tests/test_case.py#L33-L58
- Config files: https://github.com/ryul99/pytorch-project-template/tree/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/config

This works because we configure the same logger (file and stdout) in both the main process (by Hydra) and the subprocess (by get_logger). Note that with this approach you should keep hydra/job_logging in your custom config, because the subprocess needs to know how to configure its logger. A rough sketch of the first two steps is shown after this list.
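A condensed sketch of those first two steps (the function names and nprocs value are assumptions here, not the template’s exact code): resolve Hydra’s job_logging node into a plain dict in the main process, hand it to the spawned workers, and re-apply it there with logging.config.dictConfig:

```python
# Sketch only: pass Hydra's job_logging configuration into the DDP workers and
# re-apply it there, so the subprocesses log to the same stdout/file handlers.
import logging
import logging.config

import hydra
import torch.multiprocessing as mp
from hydra.core.hydra_config import HydraConfig
from omegaconf import DictConfig, OmegaConf


def train_loop(rank, job_logging_cfg):
    # Re-configure logging inside the spawned process with the dict captured in main().
    logging.config.dictConfig(job_logging_cfg)
    logging.getLogger(__name__).info("rank %d: job_logging re-applied in subprocess", rank)


@hydra.main(config_name="config")  # assumes a config.yaml next to this script
def main(cfg: DictConfig) -> None:
    # Resolve hydra.job_logging into a plain, picklable dict before spawning workers.
    job_logging_cfg = OmegaConf.to_container(HydraConfig.get().job_logging, resolve=True)
    mp.spawn(train_loop, args=(job_logging_cfg,), nprocs=2)


if __name__ == "__main__":
    main()
```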
Could you share what errors you get when trying to pickle? We actually pickle the Singleton state in one of our plugins.
Thanks @SunQpark. Instead of reading the Hydra config from the file system, try to use the HydraConfig singleton to access the Hydra configuration like here. I suspect this will not work out of the box because the singletons need to be initialized in the spawned process.
You can do that by calling this on the main process and passing the object down to the spawned function,
and then initializing the Singletons from that state in the spawned process’s function:
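A minimal sketch of that hand-off (not the original snippet from the thread; the worker name and nprocs value are assumptions), using Hydra’s Singleton.get_state() / Singleton.set_state():

```python
# Sketch only: ship Hydra's singleton state to the spawned workers so that
# HydraConfig.get() also works inside the DDP subprocesses.
import hydra
import torch.multiprocessing as mp
from hydra.core.hydra_config import HydraConfig
from hydra.core.singleton import Singleton
from omegaconf import DictConfig


def train_worker(rank, singleton_state):
    # Restore Hydra's singletons in this process before touching HydraConfig.
    Singleton.set_state(singleton_state)
    job_logging = HydraConfig.get().job_logging  # now available in the subprocess
    print(f"rank {rank}: job_logging handlers = {list(job_logging.handlers.keys())}")


@hydra.main(config_name="config")  # assumes a config.yaml next to this script
def main(cfg: DictConfig) -> None:
    # Capture the state of all Hydra singletons (including HydraConfig) in the main process.
    singleton_state = Singleton.get_state()
    mp.spawn(train_worker, args=(singleton_state,), nprocs=2)


if __name__ == "__main__":
    main()
```

Whether the singleton state pickles cleanly through mp.spawn is exactly the pickling question raised earlier in the thread.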