pytorch-lightning: ValueError: bad value(s) in fds_to_keep when attempting DDP

I can’t get DDP working; every attempt fails with the following error:

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    main(config)
  File "train.py", line 41, in main
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep

What I have tried that didn’t work:

  • Python 3.7
  • Python 3.6
  • pytorch 1.1.0
  • pytorch 1.2.0
  • Downgrading scikit-learn, which reportedly helped in unrelated projects according to Google results
  • Lightning 0.5.3.2
  • Lightning master version
  • Lightning 0.5.2.1
  • CUDA 9.0
  • CUDA 9.2
  • CUDA 10.0
  • Removing visdom from the project

The error occurs on both servers I tried: one with 4 Titan X cards and one with 8 Tesla V100s, both running Ubuntu 18.04.3 LTS.

I suspect that something in my model is triggering it and would appreciate ideas, although I cannot share the source code. The model works in dp and single-GPU mode.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 34 (18 by maintainers)

Most upvoted comments

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy(): https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.
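
For concreteness, a minimal sketch of that switch, assuming Lightning 1.6 or newer where DDPStrategy lives in pytorch_lightning.strategies; the device count and the find_unused_parameters value are placeholders:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Pass an explicit DDPStrategy instance instead of the "ddp_spawn" string,
# so each GPU gets its own process without torch.multiprocessing.spawn.
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=False),
)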

Changing the strategy worked for me.

Here’s the code from the docs for anyone struggling with this like me.

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)

@williamFalcon mrshenli’s comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought, as I have not been able to confirm exactly what model.cuda(gpu_idx) does.
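
One quick way to verify what model.cuda(gpu_idx) does, as a standalone check on a machine with at least one GPU:

import torch

model = torch.nn.Linear(4, 4)
returned = model.cuda(0)
# nn.Module.cuda() applies the move in place and returns the same object,
# so the parameters are relocated rather than deep-copied.
print(returned is model)                # True
print(next(model.parameters()).device)  # cuda:0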

Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, to use the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model to the process on gpus[0], the model’s parameters may be automatically resolved back to cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week …
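
A rough sketch of the queue idea, under a couple of assumptions: train_worker, result_queue, and the toy Linear model are all hypothetical, at least one GPU is available, and the weights are converted to numpy before the put() so the transfer does not rely on torch’s shared-memory tensor handoff (which can be fragile once the sending process has exited).

import torch
import torch.multiprocessing as mp


def train_worker(rank, model, result_queue):
    # Hypothetical DDP worker body: move the model to this rank's GPU,
    # run the training loop, then hand the weights back from rank 0.
    model = model.cuda(rank)
    # ... wrap in DistributedDataParallel and train here ...
    if rank == 0:
        # Convert to numpy so the payload is pickled by value instead of
        # going through torch's shared-memory handoff for tensors on queues.
        result_queue.put({k: v.cpu().numpy() for k, v in model.state_dict().items()})


if __name__ == "__main__":
    smp = mp.get_context("spawn")
    result_queue = smp.SimpleQueue()
    model = torch.nn.Linear(10, 10)
    ctx = mp.spawn(train_worker, nprocs=torch.cuda.device_count(),
                   args=(model, result_queue), join=False)
    # Read the result before joining so a large payload cannot block the
    # worker's final queue write.
    trained = {k: torch.from_numpy(v) for k, v in result_queue.get().items()}
    while not ctx.join():
        pass
    # The trained weights are now available in the spawning process.
    model.load_state_dict(trained)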

@jeffling could you check?

Just in case people want to use the snippet before we support this officially:

You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to ddp.
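
For reference, a minimal sketch of that Trainer configuration using the argument names from that era (the GPU count is a placeholder):

from pytorch_lightning import Trainer

# Pre-"strategy" API: one process per GPU, distributed backend set to "ddp".
trainer = Trainer(gpus=4, distributed_backend="ddp")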

In your module, you need the following overrides as well (the snippet assumes os and torch are imported at module level, along with LightningDistributedDataParallel from your Lightning version):

    def configure_ddp(self, model, device_ids):
        """
        Configure to use a single GPU set on local rank.

        Must return model.
        :param model:
        :param device_ids:
        :return: DDP wrapped model
        """
        device_id = f"cuda:{os.environ['LOCAL_RANK']}"

        model = LightningDistributedDataParallel(
            model,
            device_ids=[device_id],
            output_device=device_id,
            find_unused_parameters=True,
        )

        return model

    def init_ddp_connection(self, proc_rank, world_size):
        """
        Connect all procs in the world using the env:// init
        Use the first node as the root address
        """

        import torch.distributed as dist

        dist.init_process_group("nccl", init_method="env://")

        # Explicitly setting the seed to make sure that models created in the two
        # processes start from the same random weights and biases.
        # FIXED_SEED is an integer constant defined elsewhere in this code.
        # TODO(jeffling): I'm pretty sure we need to set other seeds as well?
        print(f"Setting torch manual seed to {FIXED_SEED} for DDP.")
        torch.manual_seed(FIXED_SEED)


FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn’t seem likely that it’s Lightning. Will add more if I learn more.

Update: I do not believe this is pytorch-lightning. I have recently minted models that are virtually identical and do NOT show this problem. It’s not clear what is causing it … almost certainly file related, as the specific error is a multiprocessing/POSIX complaint about file descriptors that do not have appropriate values.