pytorch-lightning: ValueError: bad value(s) in fds_to_keep when attempting DDP

I can’t get DDP working; every attempt fails with the following error:

Traceback (most recent call last):
  File "train.py", line 86, in <module>
    main(config)
  File "train.py", line 41, in main
    trainer.fit(model)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
    cmd, self._fds)
  File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
    False, False, None)
ValueError: bad value(s) in fds_to_keep

What I have tried that didn’t work:

  • Python 3.7
  • Python 3.6
  • pytorch 1.1.0
  • pytorch 1.2.0
  • Downgrading scikit-learn, which reportedly helped in unrelated projects according to Google results
  • Lightning 0.5.3.2
  • Lightning master version
  • Lightning 0.5.2.1
  • CUDA 9.0
  • CUDA 9.2
  • CUDA 10.0
  • Removing visdom from the project

The error occurs on both servers I tried: one with 4 Titan X cards and one with 8 Tesla V100s, both running Ubuntu 18.04.3 LTS.

I suspect that something in my model is triggering it and would appreciate ideas, although I cannot share the source code. The model works in dp and single-GPU mode.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 34 (18 by maintainers)

Most upvoted comments

For anyone still struggling with this, the issue was fixed for me by switching strategy from ddp_spawn to DDPStrategy(): https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html

I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.
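
For concreteness, a minimal sketch of that switch, assuming Lightning 1.6 or newer where DDPStrategy lives in pytorch_lightning.strategies; the device count and the find_unused_parameters value are placeholders:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Pass an explicit DDPStrategy instance instead of the "ddp_spawn" string,
# so each GPU gets its own process without torch.multiprocessing.spawn.
trainer = Trainer(
    accelerator="gpu",
    devices=4,
    strategy=DDPStrategy(find_unused_parameters=False),
)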

Changing the strategy worked for me.

Here’s the code from the docs for anyone struggling with this like me.

# Training with the DistributedDataParallel strategy on 4 GPUs
trainer = Trainer(strategy="ddp", accelerator="gpu", devices=4)

@williamFalcon mrshenli’s comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought, as I have not been able to confirm exactly what model.cuda(gpu_idx) does.
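
One quick way to verify what model.cuda(gpu_idx) does, as a standalone check on a machine with at least one GPU:

import torch

model = torch.nn.Linear(4, 4)
returned = model.cuda(0)
# nn.Module.cuda() applies the move in place and returns the same object,
# so the parameters are relocated rather than deep-copied.
print(returned is model)                # True
print(next(model.parameters()).device)  # cuda:0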

Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, to use the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model to the process on gpus[0], the model’s parameters may be automatically resolved back to cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week …
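
A rough sketch of the queue idea, under a couple of assumptions: train_worker, result_queue, and the toy Linear model are all hypothetical, at least one GPU is available, and the weights are converted to numpy before the put() so the transfer does not rely on torch’s shared-memory tensor handoff (which can be fragile once the sending process has exited).

import torch
import torch.multiprocessing as mp


def train_worker(rank, model, result_queue):
    # Hypothetical DDP worker body: move the model to this rank's GPU,
    # run the training loop, then hand the weights back from rank 0.
    model = model.cuda(rank)
    # ... wrap in DistributedDataParallel and train here ...
    if rank == 0:
        # Convert to numpy so the payload is pickled by value instead of
        # going through torch's shared-memory handoff for tensors on queues.
        result_queue.put({k: v.cpu().numpy() for k, v in model.state_dict().items()})


if __name__ == "__main__":
    smp = mp.get_context("spawn")
    result_queue = smp.SimpleQueue()
    model = torch.nn.Linear(10, 10)
    ctx = mp.spawn(train_worker, nprocs=torch.cuda.device_count(),
                   args=(model, result_queue), join=False)
    # Read the result before joining so a large payload cannot block the
    # worker's final queue write.
    trained = {k: torch.from_numpy(v) for k, v in result_queue.get().items()}
    while not ctx.join():
        pass
    # The trained weights are now available in the spawning process.
    model.load_state_dict(trained)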

@jeffling could you check?

Just in case people want to use the snippet before we support this officially:

You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to ddp.
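
For reference, a minimal sketch of that Trainer configuration using the argument names from that era (the GPU count is a placeholder):

from pytorch_lightning import Trainer

# Pre-"strategy" API: one process per GPU, distributed backend set to "ddp".
trainer = Trainer(gpus=4, distributed_backend="ddp")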

In your module, you need the following overrides as well (the snippet assumes os and torch are imported at module level, along with LightningDistributedDataParallel from your Lightning version):

    def configure_ddp(self, model, device_ids):
        """
        Configure to use a single GPU set on local rank.

        Must return model.
        :param model:
        :param device_ids:
        :return: DDP wrapped model
        """
        device_id = f"cuda:{os.environ['LOCAL_RANK']}"

        model = LightningDistributedDataParallel(
            model,
            device_ids=[device_id],
            output_device=device_id,
            find_unused_parameters=True,
        )

        return model

    def init_ddp_connection(self, proc_rank, world_size):
        """
        Connect all procs in the world using the env:// init
        Use the first node as the root address
        """

        import torch.distributed as dist

        dist.init_process_group("nccl", init_method="env://")

        # Explicitly setting the seed to make sure that models created in the two
        # processes start from the same random weights and biases.
        # FIXED_SEED is an integer constant defined elsewhere in this code.
        # TODO(jeffling): I'm pretty sure we need to set other seeds as well?
        print(f"Setting torch manual seed to {FIXED_SEED} for DDP.")
        torch.manual_seed(FIXED_SEED)


FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn’t seem likely that it’s Lightning. Will add more if I learn more.

Update: I do not believe this is pytorch-lightning. I have recently minted models that are virtually identical and do NOT show this problem. It’s not clear what is causing it … almost certainly file related, as the specific error is a multiprocessing/POSIX complaint about file descriptors that do not have appropriate values.