hydra: [Bug] I can't see any logs with PyTorch DDP

🐛 Bug

Description

When I use PyTorch's DistributedDataParallel, I can't see any logs on stdout, and the log file is empty except for a wandb line like the one below.

[2020-11-06 01:32:08,629][wandb.internal.internal][INFO] - Internal process exited

Checklist

  • I checked on the latest version of Hydra
  • I created a minimal repro

To reproduce

You can reproduce this by running this repo (https://github.com/ryul99/pytorch-project-template/tree/a80f0284c5b22fba2d4892bb906a9bc2b6075838) with python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1 (DDP with one GPU). I'm sorry that this is not a minimal repro 😅

Stack trace/error message

❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=1
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning: 
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
  warnings.warn(message=msg, category=UserWarning)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!

Expected Behavior

If you run python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0 (not using DDP), you can see many logs like this:

❯ python trainer.py train.working_dir=$(pwd) train.train.dist.gpus=0
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/core/utils.py:204: UserWarning: 
Using config_path to specify the config name is deprecated, specify the config name via config_name
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes
  warnings.warn(category=UserWarning, message=msg)
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/hydra/plugins/config_source.py:190: UserWarning: 
Missing @package directive train/default.yaml in file:///home/ryul99/Workspace/pytorch-project-template/config.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
  warnings.warn(message=msg, category=UserWarning)
[2020-11-06 01:48:33,990][trainer.py][INFO] - Config:
train:
  name: First_training
  working_dir: /home/ryul99/Workspace/pytorch-project-template
  data:
    train_dir: dataset/meta/train
    test_dir: dataset/meta/test
    file_format: '*.file_extension'
    use_background_generator: true
    divide_dataset_per_gpu: true
  train:
    random_seed: 3750
    num_epoch: 10000
    num_workers: 4
    batch_size: 64
    optimizer:
      mode: adam
      adam:
        lr: 0.001
        betas:
        - 0.9
        - 0.999
    dist:
      master_addr: localhost
      master_port: '12355'
      mode: nccl
      gpus: 0
      timeout: 30
  test:
    num_workers: 4
    batch_size: 64
  model:
    device: cuda
  log:
    use_tensorboard: true
    use_wandb: false
    wandb_init_conf:
      name: ${train.name}
      entity: null
      project: null
    summary_interval: 1
    chkpt_interval: 10
    chkpt_dir: chkpt
  load:
    wandb_load_path: null
    network_chkpt_path: null
    strict_load: false
    resume_state_path: null

[2020-11-06 01:48:33,991][trainer.py][INFO] - Set up train process
[2020-11-06 01:48:33,991][trainer.py][INFO] - BackgroundGenerator is turned off when Distributed running is on
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz
100.1%Extracting dataset/meta/MNIST/raw/train-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz
113.5%Extracting dataset/meta/MNIST/raw/train-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz
100.4%Extracting dataset/meta/MNIST/raw/t10k-images-idx3-ubyte.gz to dataset/meta/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz
180.4%Extracting dataset/meta/MNIST/raw/t10k-labels-idx1-ubyte.gz to dataset/meta/MNIST/raw
Processing...
/home/ryul99/.pyenv/versions/LWD/lib/python3.8/site-packages/torchvision/datasets/mnist.py:480: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:141.)
  return torch.from_numpy(parsed.astype(m[2], copy=False)).view(*s)
Done!
[2020-11-06 01:48:38,886][trainer.py][INFO] - Making train dataloader...
[2020-11-06 01:48:38,905][trainer.py][INFO] - Making test dataloader...
[2020-11-06 01:48:40,366][trainer.py][INFO] - Starting new training run.
[2020-11-06 01:48:40,467][train_model.py][INFO] - Train Loss 2.3010 at step 1
[2020-11-06 01:48:40,473][train_model.py][INFO] - Train Loss 2.3133 at step 2

System information

  • Hydra Version : 1.0.3
  • Python version : 3.8.5
  • Virtual environment type and version : pyenv 1.2.21
  • Operating system : Ubuntu server 18.04

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 34 (5 by maintainers)

Most upvoted comments

@ryul99 when Hydra is initialized, it configures the Python logging module (with the configuration given by Hydra's defaults or by user configs). In that case, just calling logging.getLogger() returns a properly configured logger that works fine (like in your main function).

However, the logging module in the training subprocesses (spawned by mp.spawn) is not configured, since Hydra is not initialized in those subprocesses. My solution was to make a util function that initializes the logging module there:

import logging
import logging.config

import torch.distributed as dist
from omegaconf import OmegaConf

def is_master():
    # Only rank 0 (or a non-distributed run) should configure logging.
    return not dist.is_initialized() or dist.get_rank() == 0

def get_logger(name=None):
    if is_master():
        # TODO: also configure logging for sub-processes (not master)
        # Re-apply Hydra's job_logging config, which Hydra writes to
        # .hydra/hydra.yaml inside the run's output directory.
        hydra_conf = OmegaConf.load('.hydra/hydra.yaml')
        logging.config.dictConfig(
            OmegaConf.to_container(hydra_conf.hydra.job_logging, resolve=True)
        )
    return logging.getLogger(name)

def train_worker(config):
    logger = get_logger('train')  # this should be a local variable
    # setup data_loader instances
    ...
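
For context, here is a minimal sketch of how such a worker might be spawned with torch.multiprocessing; the rendezvous address/port, backend, and the spawned_worker name below are assumptions for illustration, not code from the repo.

import os

import torch.distributed as dist
import torch.multiprocessing as mp

def spawned_worker(rank, world_size, config):
    # mp.spawn passes the process rank as the first argument.
    os.environ["MASTER_ADDR"] = "localhost"  # assumed rendezvous address
    os.environ["MASTER_PORT"] = "12355"      # assumed rendezvous port
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    logger = get_logger("train")  # only rank 0 re-applies Hydra's job_logging config
    logger.info("worker %d started", rank)
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(spawned_worker, args=(world_size, {}), nprocs=world_size)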

By the way, I'm currently also working on enabling Hydra and DDP support for a PyTorch project template. You can check this branch for my full DDP implementation.

I’ll create a separate issue to address this.

After looking at #1005, I am not sure anymore. Regardless, this issue has become a hub of unrelated discussions, and I could not find a minimal repro. Happy to look at this if someone produces a minimal example using DDP with Hydra that suffers from logging issues. It should be minimal (a file or two, see the example in #1005 for a good minimal repro). If you can produce it, please file a new issue.

Yes, but it will need to be tomorrow, as I just started a long-running wav2vec fit using 1.0.4. As soon as it finishes, I'll test 1.0.6 from the repo and report back.

Finally, I solved this issue in the following way.

Define a get_logger function. In get_logger, the logger is configured with the same settings as Hydra's job_logging config: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/utils/utils.py#L22-L28

import logging
import logging.config

from omegaconf import OmegaConf

def get_logger(cfg, name=None):
    # log_file_path is used when unit testing
    if is_logging_process():  # analogous to is_master() above; defined in the repo's utils
        logging.config.dictConfig(
            OmegaConf.to_container(cfg.job_logging_cfg, resolve=True)
        )
        return logging.getLogger(name)

Add Hydra's job logging config to the user's config in trainer.py: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/trainer.py#L143-L144

import hydra
from hydra.core.hydra_config import HydraConfig
from omegaconf import open_dict

@hydra.main(config_path="config", config_name="default")
def main(hydra_cfg):
    hydra_cfg.device = hydra_cfg.device.lower()
    with open_dict(hydra_cfg):
        # copy Hydra's job_logging config into the user config for the spawned processes
        hydra_cfg.job_logging_cfg = HydraConfig.get().job_logging
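
To make the flow concrete, here is a hedged sketch of how the enriched config could then reach a spawned worker; the train_loop and launch names are illustrative, not taken from the repo.

import torch.multiprocessing as mp

def train_loop(rank, cfg):
    # cfg already contains cfg.job_logging_cfg, added in main() above,
    # so get_logger(cfg) can re-apply the same logging configuration here.
    logger = get_logger(cfg, name=f"train_rank{rank}")
    if logger is not None:
        logger.info("rank %d is logging to console and file", rank)

def launch(cfg, world_size):
    mp.spawn(train_loop, args=(cfg,), nprocs=world_size)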

In a unit test, I failed to load Hydra's job logging config with HydraConfig.get(), so I used @SunQpark's way: https://github.com/ryul99/pytorch-project-template/blob/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/tests/test_case.py#L33-L58

        # load job_logging_cfg
        project_root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
        hydra_conf = OmegaConf.load(
            os.path.join(project_root_path, "config/default.yaml")
        )
        job_logging_name = None
        for job_logging in hydra_conf.defaults:
            job_logging_name = job_logging.get("hydra/job_logging")
            if job_logging_name is not None:
                break
        job_logging_cfg_path = os.path.join(
            project_root_path,
            "config/hydra/job_logging",
            str(job_logging_name) + ".yaml",
        )
        if os.path.exists(job_logging_cfg_path):
            job_logging_cfg = OmegaConf.load(job_logging_cfg_path)
        else:
            job_logging_cfg = dict()
        with open_dict(self.cfg):
            self.cfg.job_logging_cfg = job_logging_cfg
        self.cfg.job_logging_cfg.handlers.file.filename = str(
            (self.working_dir / "trainer.log").resolve()
        )
        # set logger
        self.logger = get_logger(self.cfg, os.path.basename(__file__))

Config file: this approach works because we configure the same logger (file and stdout handlers) in both the main process (by Hydra) and the subprocesses (by get_logger). For it to work, you should override hydra/job_logging with a custom config (because we need to know exactly how the logger is configured in order to reproduce it in the subprocesses): https://github.com/ryul99/pytorch-project-template/tree/d8feb7fbc9635ae7803cdd3f9575cab7b15673b9/config
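
For reference, the dictionary that ends up in cfg.job_logging_cfg looks roughly like Hydra 1.0's default job_logging. This is an approximation: the format string matches the log lines earlier in this issue, but the filename and levels below are assumptions.

import logging.config

# Rough approximation of Hydra's default job_logging config as a plain dict.
job_logging_cfg = {
    "version": 1,
    "formatters": {
        "simple": {"format": "[%(asctime)s][%(name)s][%(levelname)s] - %(message)s"}
    },
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "stream": "ext://sys.stdout",
        },
        "file": {
            "class": "logging.FileHandler",
            "formatter": "simple",
            "filename": "trainer.log",  # Hydra normally derives this from the job name
        },
    },
    "root": {"level": "INFO", "handlers": ["console", "file"]},
    "disable_existing_loggers": False,
}

logging.config.dictConfig(job_logging_cfg)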

Thank you for your suggestion, @omry. I tried that solution, but failed to pass singleton state to the spawned processes. Singleton state seems to be a non-picklable object which can’t be passed through torch.multiprocessing.spawn.

Could you share what errors you get when trying to pickle? We actually pickle the Singleton state in one of our plugins.

Thanks @SunQpark. Instead of reading the Hydra config from the file system, try to use the HydraConfig singleton to access the Hydra configuration, like here. I suspect this will not work out of the box because the singletons need to be initialized in the spawned process.

You can do that by calling this in the main process and passing the object down to the spawned function:

singleton_state = Singleton.get_state()

And then initializing the Singletons from that state in the spawned process's function:

Singleton.set_state(singleton_state)
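
Putting the maintainer's suggestion together, a minimal sketch of this pattern might look as follows; the worker/launch names and the use of HydraConfig inside the worker are my assumptions about how the pieces fit, not code from Hydra's docs.

import logging
import logging.config

import torch.multiprocessing as mp
from hydra.core.hydra_config import HydraConfig
from hydra.core.singleton import Singleton
from omegaconf import OmegaConf

def worker(rank, singleton_state):
    # Restore Hydra's singletons (including HydraConfig) in the spawned process.
    Singleton.set_state(singleton_state)
    job_logging = HydraConfig.get().job_logging
    logging.config.dictConfig(OmegaConf.to_container(job_logging, resolve=True))
    logging.getLogger(__name__).info("rank %d configured logging via HydraConfig", rank)

def launch(world_size):
    # Capture the singleton state in the main process, where Hydra is initialized...
    singleton_state = Singleton.get_state()
    # ...and pass it down to the spawned workers (it must be picklable for this to work).
    mp.spawn(worker, args=(singleton_state,), nprocs=world_size)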