pytorch-lightning: ValueError: bad value(s) in fds_to_keep, when attempting DDP
I can’t get DDP working without getting the following error:
Traceback (most recent call last):
File "train.py", line 86, in <module>
main(config)
File "train.py", line 41, in main
trainer.fit(model)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
process.start()
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 59, in _launch
cmd, self._fds)
File "/home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/multiprocessing/util.py", line 417, in spawnv_passfds
False, False, None)
ValueError: bad value(s) in fds_to_keep
What I have tried that didn't work:
- Python 3.7
- Python 3.6
- pytorch 1.1.0
- pytorch 1.2.0
- Downgrading scikit-learn, which according to results on Google had helped in unrelated projects
- Lightning 0.5.3.2
- Lightning master version
- Lightning 0.5.2.1
- CUDA 9.0
- CUDA 9.2
- CUDA 10.0
- Removing visdom from the project
The error occurs on both servers I tried: one with 4 Titan X cards and one with 8 Tesla V100s, running Ubuntu 18.04.3 LTS.
I suspect that something in my model is triggering it and would appreciate ideas, though I cannot share the source code. The model works in dp and single-GPU mode.
About this issue
- State: closed
- Created 5 years ago
- Comments: 34 (18 by maintainers)
Commits related to this issue
- Maybe this will help? https://github.com/PyTorchLightning/pytorch-lightning/issues/538 — committed to BatsResearch/taglets by stephenbach 4 years ago
Changing the strategy worked for me.
Here's the code from the docs for anyone struggling with this like me.
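A minimal sketch of that configuration, assuming a recent Lightning release where the Trainer accepts a strategy argument (class and argument names are taken from the strategy docs linked later in this thread, not from the original comment):

```python
# Sketch only: use subprocess-launched DDP instead of the default "ddp_spawn".
# Assumes pytorch-lightning >= 1.6, where `accelerator`, `devices` and
# `strategy` are Trainer arguments.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                  # number of GPUs on the node
    strategy=DDPStrategy(),     # or simply strategy="ddp"
)
# trainer.fit(model) then launches DDP workers without going through mp.spawn
```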
@williamFalcon mrshenli's comments in pytorch do raise a question for me - he points out that something similar could happen when the model is passed as an arg to a ddp process. I think ptl is probably okay because model.cuda(gpu_idx) must effectively make a deep copy - but just food for thought, as I have not been able to confirm exactly what model.cuda(gpu_idx) does.
Also, he inadvertently partially demonstrates something I have been meaning to try for bringing a model back to the spawning process from ddp - that is, using the special way in which pytorch handles tensors/models on queues. I suspect that if we used a queue() to pass the model back to the process on gpus[0], the model's parameters may be automatically resolved back to cpu - and thus the trained model would be available without any special effort. I will try to get to this in the next week …
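A rough sketch of that queue idea, outside of lightning and with a placeholder model, could look like the following. Moving the state_dict to CPU before putting it on the queue is the conservative variant; whether that step happens automatically is exactly the open question above.

```python
# Sketch only: return trained weights from a spawned worker to the parent
# process via a torch.multiprocessing queue (tensors travel through shared
# memory). The model, training loop and world size are placeholders.
import torch
import torch.multiprocessing as mp


def worker(rank, queue, ack):
    model = torch.nn.Linear(4, 2).cuda(rank)      # stand-in for the real model
    # ... DDP setup and training would happen here ...
    if rank == 0:
        state = {k: v.cpu() for k, v in model.state_dict().items()}
        queue.put(state)
        ack.get()     # keep the producer alive until the parent has the weights


if __name__ == "__main__":
    world_size = torch.cuda.device_count()        # assumes CUDA GPUs are present
    ctx = mp.get_context("spawn")
    queue, ack = ctx.SimpleQueue(), ctx.SimpleQueue()
    procs = mp.spawn(worker, args=(queue, ack), nprocs=world_size, join=False)
    trained_state = queue.get()                   # weights back in the parent
    ack.put(None)                                 # let rank 0 exit
    procs.join()
```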
@jeffling, could you check?
Just in case people want to use the snippet before we support this officially:
You need to set Trainer.gpus to your world_size, and Trainer.distributed_backend to "ddp". In your module, you need the following overrides as well:
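The module overrides from that snippet are not reproduced in this copy of the thread; as a rough sketch of just the Trainer-side settings described above (argument names from the pre-1.0 API this issue was filed against, so they may differ in current releases):

```python
# Sketch only: the Trainer-side configuration described in the comment above.
# `model` is your LightningModule; the module-level overrides from the
# original snippet are not shown here.
import pytorch_lightning as pl

world_size = 4                        # total number of GPUs to train on

trainer = pl.Trainer(
    gpus=world_size,                  # Trainer.gpus set to the world size
    distributed_backend="ddp",        # Trainer.distributed_backend set to "ddp"
)
trainer.fit(model)
```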
For anyone still struggling with this, the issue was fixed for me by switching the strategy from ddp_spawn to DDPStrategy(): https://pytorch-lightning.readthedocs.io/en/stable/extensions/strategy.html
I was only seeing the issue when including a validation step in trainer.fit(). DDPStrategy() resolved the issue.
FWIW: Have you recently updated Ubuntu? I just started experiencing this in the last hour - and I am using a local fork that has not changed in a few weeks - so it doesn't seem likely that it's lightning. Will add more if I learn more.
Update: I do not believe this is pytorch-lightning. I have recently minted models that are virtually identical and do NOT show this problem. It is not clear what is causing it … almost certainly file related, as the specific error is a multiprocessing/posix complaint about file descriptors that do not have appropriate values.